Add purls (Package URLs) to `PackageRecord` #63

baszalmstra · 2023-11-23T14:01:10Z

This CEP describes a change to the PackageRecord format and the corresponding repodata.json file to include purls (Package URLs) of repackaged packages to identify packages across multiple ecosystems.

rendered

wolfv · 2023-11-23T20:37:05Z

Awesome CEP! :)

xhochy · 2023-11-24T10:05:12Z

cep-purls.md

+
+## Abstract
+
+This CEP describes a change to the `PackageRecord` format and the corresponding `repodata.json` file to include `purls` (Package Urls) of repackaged packages to identify packages across multiple ecosystems.


Can you add a link to the definition of a PackageRecord? I struggle to find an authoritative source for it.

Unfortunately, I believe that atm there is no actual "authoritative" source.

There is this relatively old definition of a RepoDataRecord: https://github.com/conda/schemas/blob/main/repodata-record-1.schema.json

There is this new effort to document the schemas better (conda/schemas#26) where it's also called RepoDataRecord: https://github.com/conda/schemas/blob/b143c82a71833570fbe9be2313368b33c0e84726/conda_models/package_record.py#L23

And we have the definition in rattler: https://docs.rs/rattler_conda_types/latest/rattler_conda_types/struct.PackageRecord.html

In rattler (and I believe in conda as well), there is this distinction:

- PackageRecord: contains all the fields for a single entry in the repodata.json
- RepoDataRecord: inherits all fields from PackageRecord and adds fields to identify the origin of the data (channel, url, etc.)
- PrefixRecord: inherits all fields from RepoDataRecord and additionally stores information about how the package was installed.
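
For illustration only (this sketch is not part of the thread): the three-level distinction described above can be modelled with Python dataclasses. The field names below are a small, hypothetical subset chosen for readability; the real definitions in conda and rattler carry many more fields and may be structured differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PackageRecord:
    # One entry of repodata.json (only a few illustrative fields).
    name: str
    version: str
    build: str
    # The field this CEP proposes to add.
    purls: List[str] = field(default_factory=list)


@dataclass
class RepoDataRecord(PackageRecord):
    # Adds where the record came from.
    channel: str = ""
    url: str = ""


@dataclass
class PrefixRecord(RepoDataRecord):
    # Adds how the package was installed into a prefix.
    files: List[str] = field(default_factory=list)


record = PrefixRecord(
    name="django",
    version="1.10.1",
    build="py35_0",
    purls=["pkg:pypi/django@1.10.1"],
    channel="conda-forge",
)
```

Because each level extends the previous one, a `purls` field added to `PackageRecord` would be visible on all three record kinds.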

Yea, I think the most "official" source for this is https://github.com/conda/conda/blob/e783377439ed1c413c6bffb9b785ae1d79c2392a/conda/models/records.py#L247. That module also offers some sort of definition in the top-level docstring.

Implementation of conda/ceps#63

This PR adds support for checking the satisfiability of a lock-file that includes pypi-dependencies. Purls have been added to the lock-file (conda/rattler#414; see also conda/ceps#63). This lets us determine which conda packages install which PyPI packages without querying the network, so we can still check quickly whether a lock-file is up to date. I did not profile this code, but I think there are a lot of places where we can improve the performance. That's for a later PR. I also didn't add tests. I think we should, but we can also do that in another PR. Closes #467

Co-authored-by: Ruben Arts
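
For illustration only: a rough sketch of the offline check the commit message describes, namely using the purls stored on locked conda records to see which PyPI requirements are already covered. This is not pixi's implementation; the function names and the record layout (a dict with a `purls` list) are assumptions, and versions are ignored for brevity.

```python
from typing import Dict, Iterable, List, Set


def normalize(name: str) -> str:
    # Simplified PyPI name normalization (lowercase, "_" -> "-").
    return name.lower().replace("_", "-")


def pypi_names_provided(conda_records: Iterable[Dict]) -> Set[str]:
    # Collect the PyPI package names that locked conda packages claim via purls.
    provided = set()
    for record in conda_records:
        for purl in record.get("purls", []):
            if purl.startswith("pkg:pypi/"):
                rest = purl[len("pkg:pypi/"):]
                name = rest.split("?", 1)[0].split("@", 1)[0]
                provided.add(normalize(name))
    return provided


def missing_pypi_requirements(required: Iterable[str], conda_records: Iterable[Dict]) -> List[str]:
    # PyPI requirements that no locked conda package provides.
    provided = pypi_names_provided(conda_records)
    return sorted(n for n in map(normalize, required) if n not in provided)


locked = [
    {"name": "django", "purls": ["pkg:pypi/django@1.10.1"]},
    {"name": "openssl", "purls": []},
]
missing = missing_pypi_requirements(["Django", "requests"], locked)
```

No network access is needed because everything required for the comparison is already in the lock-file.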

jaimergp · 2023-11-26T19:57:57Z

cep-purls.md

+}
+```
+
+PURL is already supported by dependency-related tooling like SPDX (see [External Repository Identifiers in the SPDX 2.3 spec](https://spdx.github.io/spdx-spec/v2.3/external-repository-identifiers/#f35-purl)), the [Open Source Vulnerability format](https://ossf.github.io/osv-schema/#affectedpackage-field), and the [Sonatype OSS Index](https://ossindex.sonatype.org/doc/coordinates); not having to wait years before support in such tooling arrives is valuable.


I would also mention PEP-725 (WIP).

The Discourse thread has examples showing how the Spack community wants to use this kind of thing: https://discuss.python.org/t/pep-725-specifying-external-dependencies-in-pyproject-toml/31888/31

jaimergp · 2023-11-26T19:59:17Z

cep-purls.md

+* We can keep this information close to the conda package description.
+* We can incrementally add `purls` through repodata patches.
+
+The downside is that the (already large) repodata.json file will grow.


What if we add a separate-yet-adjacent purls.json like we did with run_exports.json in CEP-12?

jaimergp

I like the idea and I will be supportive. Having this metadata readily available would allow us to be listed in repology.org, for example! It would also play nicely with the (draft) PEP-725 for external metadata in PyPI.

However, I think this CEP right now is talking about serving metadata before we have discussed how to source it, define it and store it.

Whatever ends up in the repodata.json comes, in part, from the info/index.json metadata inside the conda artifact. Then this is augmented with things like sha256 and final size by conda-index (because they cannot be known when the package is being archived).
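
For illustration only: a minimal sketch of the augmentation step described above, where the metadata from info/index.json is extended with values that are only known once the archive exists. This is not conda-index's actual code; the function name and the sample data are hypothetical.

```python
import hashlib


def augment_index_entry(index_json: dict, artifact_bytes: bytes) -> dict:
    # Copy the in-artifact metadata, then add fields derived from the
    # finished archive (they cannot be stored inside the archive itself).
    entry = dict(index_json)
    entry["sha256"] = hashlib.sha256(artifact_bytes).hexdigest()
    entry["size"] = len(artifact_bytes)
    return entry


index_json = {"name": "django", "version": "1.10.1", "build": "py35_0"}
artifact = b"pretend this is the .conda archive"
entry = augment_index_entry(index_json, artifact)
```

Any purl information stored inside the artifact would flow through the same path, which is why the question of where it lives in the recipe comes first.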

So before we speak about repodata, we should discuss where in the inner artifact metadata we will store the PURL info. To answer that, we must answer where in the conda-build recipe we will include that information :D

IOW, I'd like to know your thoughts about:

- Where in the current meta.yaml we should define the PURLs. about seems to be the most obvious one, which means this will probably end up in info/about.json.
- Whether to serve the PURLs separately in a purls.json or not. I honestly don't think putting it in repodata.json is a good idea. I get that it makes sense if you want to have a canonical link between PyPI and conda-forge so Pixi can solve things nicely. It might also be served in channeldata.json (since most of the time PURLs are tied to the source, not the platform-dependent target artifact).

jakirkham · 2024-05-08T17:43:24Z

Would this also help us address Repology's needs for supporting Conda packages ( repology/repology-updater#518 )?

Edit: Nvm missed Jaime has the same idea

ytausch · 2024-10-14T17:00:19Z

> Where in the current meta.yaml we should define the PURLs. about seems to be the most obvious one, which means this will probably end up in info/about.json.

I agree that about makes the most sense. However, this adds the redundancy of defining the upstream package twice in the recipe. A more sophisticated solution would be adding a new purl source type for the source section, which gets resolved to a PyPI tarball URL by conda build. The purls for a package could then be automatically inferred from the sources it has been built from. In all cases, a manual option to define the purls likely has to remain for some specific use cases.

While this would facilitate simplicity, avoid redundancy, and avoid errors in the recipe, I see the following downsides with that solution:

- different package outputs or variants may not actually use all sources that are available, requiring manually overriding the purls or another clever solution for that
- how to verify the hash of a source tarball is more evident if the source URL is stated explicitly in the recipe
- to avoid complexity, we could only support a subset of purl types. PyPI is by far the most important IMO. It could confuse people if only a subset of purls are valid sources.
- introducing a new source type is simply more work than introducing a new about field - especially in related tooling such as cf-scripts or conda-smithy
- backward compatibility?
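
For illustration only: the thread proposes resolving a purl source to a tarball URL. The sketch below goes the other way and infers a PyPI purl from an explicit source URL, which is the variant that keeps the URL (and its hash) visible in the recipe. The function name, the accepted hosts, and the URL pattern are assumptions, not an existing conda-build feature.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumed layout: /packages/source/<first letter>/<name>/<name>-<version>.<ext>
_SDIST_PATH = re.compile(r"^/packages/source/[^/]+/(?P<name>[^/]+)/(?P<file>[^/]+)$")
_HOSTS = {"pypi.io", "pypi.org", "files.pythonhosted.org"}


def infer_pypi_purl(source_url: str) -> Optional[str]:
    parsed = urlparse(source_url)
    if parsed.hostname not in _HOSTS:
        return None
    match = _SDIST_PATH.match(parsed.path)
    if match is None:
        return None
    filename = match.group("file")
    for ext in (".tar.gz", ".tar.bz2", ".zip"):
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        return None
    # The version is taken to be everything after the last "-" in the stem.
    _, sep, version = stem.rpartition("-")
    if not sep or not version:
        return None
    name = match.group("name").lower().replace("_", "-")
    return f"pkg:pypi/{name}@{version}"


purl = infer_pypi_purl("https://pypi.io/packages/source/d/django/Django-1.10.1.tar.gz")
```

Sources that do not match (GitHub archives, for example) return `None`, which is where a manual override in the recipe would still be needed.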

> Whether to serve the PURLs separately in a purls.json or not. I honestly don't think putting it in repodata.json is a good idea. I get that it makes sense if you want to have a canonical link between PyPI in conda-forge so Pixi can solve things nicely. It might also be served in channeldata.json (since most of the time PURLs are tied to the source not the platform-dependent, target artifact).

I do not have a strong opinion here since I am not too involved with the tools that would need to process that data.

bollwyvl · 2024-10-23T01:55:18Z

I think a broader question is whether package-url can be adopted more directly by the ecosystem. Relying on "the URL where you get your package" or "what it's called on disk" isn't as effective as an agreed-upon grammar for identifying packages, especially in the "is this CVE relevant to me" scenario.

Brief aside, and likely worth including in the text
A purl is a URL composed of seven components:
 scheme:type/namespace/name@version?qualifiers#subpath
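
For illustration only: a simplified parser that splits a purl into the seven components listed above. It skips percent-decoding and most validation, so it is a sketch of the grammar rather than a spec-complete implementation; the `Purl` type and `parse_purl` function are hypothetical names.

```python
from typing import Dict, NamedTuple, Optional


class Purl(NamedTuple):
    scheme: str
    type: str
    namespace: Optional[str]
    name: str
    version: Optional[str]
    qualifiers: Dict[str, str]
    subpath: Optional[str]


def parse_purl(purl: str) -> Purl:
    # Peel components off the ends first: #subpath, then ?qualifiers.
    rest, _, subpath = purl.partition("#")
    rest, _, query = rest.partition("?")
    scheme, sep, rest = rest.partition(":")
    if not sep or scheme != "pkg":
        raise ValueError(f"not a purl: {purl!r}")
    type_, sep, rest = rest.lstrip("/").partition("/")
    if not sep or not rest:
        raise ValueError(f"missing type or name: {purl!r}")
    version = ""
    if "@" in rest:
        rest, _, version = rest.rpartition("@")
    # Everything before the last "/" is the namespace, the remainder the name.
    namespace, _, name = rest.rpartition("/")
    qualifiers = {}
    for pair in query.split("&"):
        key, sep, value = pair.partition("=")
        if sep:
            qualifiers[key.lower()] = value
    return Purl(scheme, type_.lower(), namespace or None, name,
                version or None, qualifiers, subpath or None)


p = parse_purl("pkg:conda/conda-forge/django@1.10.1?subdir=win-32&label=main&build=py35_0")
```

Here the namespace slot carries the channel, as suggested further down in this comment.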

To put this in the context of the above, a given .conda package might claim a PURL as a proxy in one or more other types, but by existing, it should claim one in the conda type. Indeed, a subset is already part of the spec. For example, an old version of django:

It might make sense to advocate for some changes to the conda part of the spec (and test data), namely:

- update the default channel from r.a.com -> c.a.org
  - then the namespace part of the url would encode the conda channel (e.g. pkg:conda/conda-forge/django)
  - I don't know what the "new null" would be, but pretty sure it can't be defaults
- use label for e.g. main (not channel, as in the example)

While i don't think much can be done about "where you got the source tarball" (because GitHub sources, etc), I don't think a recipe author should have to calculate all these things... but certainly could given the available data today:

# meta.yaml
{% set version = "1.10.1" %}
package: 
  name: django
  version: {{ version }}
# ...
about:
  # ...
  purls:
    - pkg:pypi/django@{{ version }}
    # this should be fully automated, either at build time (weird?) or trivially-derivable
    - pkg:conda/{{ channel_targets.split(" ")[0] }}/django@{{ version }}?subdir={{ target_platform }}&label={{ channel_targets.split(" ")[1] }}&build=py{{ py }}_{{ build_number }}

So the above full purls might expand to

purls:
- pkg:pypi/django@1.10.1
- pkg:conda/conda-forge/django@1.10.1?subdir=win-32&label=main&build=py35_0
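
For illustration only: the same expansion written as plain Python, to show that both purls are derivable from data a build tool already has. The helper names are hypothetical. Qualifiers are kept in the order used in this comment, whereas the purl spec's canonical form sorts them by key.

```python
from urllib.parse import quote


def pypi_purl(name: str, version: str) -> str:
    # Simplified PyPI name normalization (lowercase, "_" -> "-").
    return f"pkg:pypi/{name.lower().replace('_', '-')}@{version}"


def conda_purl(channel: str, name: str, version: str, **qualifiers: str) -> str:
    # The channel goes in the namespace slot; everything else is a qualifier.
    query = "&".join(
        f"{key}={quote(str(value), safe='')}" for key, value in qualifiers.items()
    )
    purl = f"pkg:conda/{channel}/{name}@{version}"
    return f"{purl}?{query}" if query else purl


purls = [
    pypi_purl("django", "1.10.1"),
    conda_purl("conda-forge", "django", "1.10.1",
               subdir="win-32", label="main", build="py35_0"),
]
```

With the values above, `purls` holds the PyPI and conda purls for django 1.10.1 on win-32.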

purls CEP

17c9c76

tiny improvements to spelling and sentences

f9d65ff

baszalmstra mentioned this pull request Nov 24, 2023

feat: add purls to PackageRecord and lockfile conda/rattler#414

Merged

xhochy reviewed Nov 24, 2023

baszalmstra added a commit to conda/rattler that referenced this pull request Nov 24, 2023

feat: add purls to PackageRecord and lockfile (#414)

d7ecd1f

Implementation of conda/ceps#63

baszalmstra mentioned this pull request Nov 24, 2023

feat: implement lock-file satisfiability with pypi dependencies prefix-dev/pixi#494

Merged

jaimergp reviewed Nov 26, 2023

jaimergp requested changes Nov 26, 2023

ruben-arts mentioned this pull request Aug 9, 2024

Proxy settings for a project prefix-dev/pixi#474

Open

bollwyvl mentioned this pull request Oct 23, 2024

Tame the PyPI / Conda mapping chaos conda/grayskull#564

Open
