Kickstart the schemas revamp (WIP) #26

jaimergp · 2023-05-08T17:58:17Z

Description

Checklist - did you ...

Add a file to the news directory (using the template) for the next release's release notes?
Add / update necessary tests?
Add / update outdated documentation?

Tasks

Item	Model	Tests	Docs
`channeldata.json`	X
`conda_build_config.yaml`
`conda.lock`
`condarc`	X
`constructor.yml`
`environment.yml`	X
`conda-meta/history`	X
`MatchSpec`	X
`menuinst.json`
`meta.yaml`	X
`info/*`
`PackageRecord`	X
`conda-meta/record.json`
`patch_instructions.json`	X
`repodata.state`
`repodata.json`	X
`requirements.txt`	X
`Version`
`types`	X
`exceptions`

Co-authored-by: AlbertDeFusco <[email protected]>

Co-authored-by: travishathaway <[email protected]>

baszalmstra · 2023-05-09T21:24:30Z

conda_models/repodata_patch.py

+class RepodataPatchInstructions(ExtrasForbiddenModel):
+    packages: dict[TarBz2PackageFileNameStr, AllOptionalRepodataRecord]
+    "The .tar.bz2 packages in the repodata that will be patched"
+    packages_conda: dict[CondaPackageFileNameStr, AllOptionalRepodataRecord] = Field(


Note that it doesnt make sense to patch all fields from a RepodataRecord. Changing the name of a package doesnt make sense and changing the hash might break cache implementations. In Rattler we explicity limit the fields that can be patched for that reason. We also didnt encounter fields not covered by our implementation. Might be worth considering!

Agreed that it should be prevented and it's A Bad Idea (tm).

I don't see anywhere in the Anaconda code where this is properly enforced though. In principle only these fields, but then there are some exceptions right below... e.g. timestamps, build_number... 😬)

In conda-forge, anything goes, even new filenames?

So while the file format seems to accept everything, there are some guidelines place for "reasonable" use and it looks like this is mostly addressed in code reviews, but sometimes we have exceptions. 🤷

I've been thinking whether we want to get descriptive or prescriptive in this PR, and I have mixed feelings about it. For scalars, I've been trying to be more prescriptive, but with structs... just reflecting what (I think) is in use these days. I am not saying I won't change my mind about it, just writing down what the approach have been so far :D

Thanks for the feedback!

Also, are we sure we have a packages.conda field? Maybe the readme is outdated, and the seed dict is too?

I've been thinking whether we want to get descriptive or prescriptive in this PR, and I have mixed feelings about it. For scalars, I've been trying to be more prescriptive, but with structs... just reflecting what (I think) is in use these days. I am not saying I won't change my mind about it, just writing down what the approach have been so far :D

I understand where you're coming from. With rattler we definitely choose the prescriptive approach because I feel like there is way too much "legacy" code and supporting all those exceptions is simply harder. But the same logic does not necessarily hold for this project.

The packages.conda field is definitely used by the conda-forge-repodata-patches package. The readme does indeed seem to be outdated. Here is a small excerpt from the latest linux-64 patches:

"packages.conda": { "4ti2-1.6.9-hbc9de56_1.conda": { "license_family": "GPL" }, "acctaudem-1.0.1-h69042ef_3.conda": { "license_family": "GPL" }, "actions-runner-2.299.1-he0ac6c6_0.conda": { "license_family": "MIT" }, "actions-runner-2.300.0-h0cdce71_0.conda": { "license_family": "MIT" },

Looking at the current state, I thin we should be descriptive in the initial implementation: The first step in this effort is to capture what is used currently so that people and projects can refer to the schema to ensure that their software is implemented correctly.

In further steps, we could get more and more prescriptive.

This brings me to a second thing that we'll need to have: versioning. As with everything, at some point, the schema will change.

To enable implementations to reference "the schema" they're using, we'll need to version the schemas. If we don't, we effectively can't change them without breaking something (be that software or user trust).

Edit: Is the repodata_version the version of the schema?

barabo · 2023-11-03T16:44:41Z

conda_models/package_record.py

+    subdir: SubdirStr
+    "The subdirectory of the channel this package is in"
+    timestamp: Optional[PositiveInt] = None
+    "The date this entry was created"


Should this be specified at seconds since epoch UTC? Also, is the intent to track the date the package was built, the date the record was created, or something else?

It is already a unix timestamp (from the unix epoch which is UTC) but in milliseconds

Some older packages specify a timestamp in seconds. Would definitely be good to document this!

Weird system. If the seconds would be in the far future, divide by 1000. But that's how it works. I prefer seconds.

Sure is!

This is what we do in rattler.

While building http://prefix.dev we ran into more of these slight inconsistencies. Everything we encountered we added to the rattler library.

barabo · 2023-11-03T16:56:44Z

conda_models/package_record.py

+    "The md5 hash of the package archive"
+    name: PackageNameStr
+    "The name of the package"
+    noarch: NoarchStr


Should this be Optional? Many current package records on .ORG do not include this field.

barabo · 2023-11-03T17:01:40Z

conda_models/package_record.py

Given the present state of package records in anaconda.org, there are several commonly used fields included in package records that are not accounted for here. To handle the pragmatics of removing them without breaking anyone downstream, I'm wondering if there should be a new deprecated field which would be an object containing the deprecated values.

Consider this example package current listed in .ORG:

"cuda-11.3.0-h3b286be_0.tar.bz2": { "arch": "x86_64", "binstar": { "channel": "main", "owner_id": "5be2119d5c4b9a1104ccdc8f", "package_id": "60ef382c6bd0f2f84f7c23e1" }, "build": "h3b286be_0", "build_number": 0, "depends": [ "cuda-runtime >=11.3.0", "cuda-toolkit >=11.3.0" ], "has_prefix": false, "machine": "x86_64", "md5": "2685f7931e7576b43ce597a87ae78091", "name": "cuda", "operatingsystem": "linux", "platform": "linux", "requires": [], "size": 2030, "subdir": "linux-64", "target-triplet": "x86_64-any-linux", "timestamp": 1627403602120, "version": "11.3.0" }

With a deprecated section it might look like this:

"cuda-11.3.0-h3b286be_0.tar.bz2": { "arch": "x86_64", "build": "h3b286be_0", "build_number": 0, "depends": [ "cuda-runtime >=11.3.0", "cuda-toolkit >=11.3.0" ], "deprecated": { "binstar": { "removal_date": "2024-09-20", "value": { "channel": "main", "owner_id": "5be2119d5c4b9a1104ccdc8f", "package_id": "60ef382c6bd0f2f84f7c23e1" } }, "has_prefix": { "removal_date": "2024-01-01", "value": false }, "machine": { "removal_date": "2024-01-01", value: "x86_64" }, "operatingsystem": { "removal_date": "2024-09-20", "value": "linux" }, "requires": { "note": "consider using the 'constrains' attribute.", "removal_date": "2024-01-01", "value": [] }, "target-triplet": { "removal_date": "2024-01-01", "value": "x86_64-any-linux" } }, "md5": "2685f7931e7576b43ce597a87ae78091", "name": "cuda", "platform": "linux", "size": 2030, "subdir": "linux-64", "timestamp": 1627403602120, "version": "11.3.0" }

Also, WRT the RepodataRecord resource, is the intent that it's a write-once resource? I assume this is true because of the existence of hotfixing, but it might be worth clarifying that somewhere.

If there is strong intent that the resource IS an immutable record, it might also be reasonable to define a canonicalized record hash (or signature), which could be used to verify the record is complete.

While I don't love stuffing a verifier into an object that must first be removed for it to be verified, I do like that it would be 'close' to the record, and not in a separate section (like the signatures section used in conda-content-trust).

Anyway, if there's no appetite for such a feature, I'd be happy with a comment somewhere clarifying whether the resource is or isn't intended to be immutable.

I would not include deprecated fields an a deprecated field, but rather include them as they are used today. That will make the schema more usable for e.g. autocompletion.

However, we do need a mechanism to deprecate and track deprecations.

chenghlee · 2023-11-03T20:24:56Z

IMO, @barabo's comments bring up an excellent point: what we should consider "canonical", as things stand today? I've been operating under the [pseudo-]assumption that "canonical" means whatever conda/conda-build/conda-index/etc. do, but those clearly don't quite match the ecosystem's package hosting provider (anaconda.org) does...except in the cases on the "major" channels like Anaconda, conda-forge, bioconda, etc. do. 🙃

baszalmstra · 2023-11-03T21:35:42Z

We’ve had a similar discussion above: #26 (comment)

jaimergp · 2023-11-08T13:07:39Z

FTR, @morremeyer and I talked yesterday about this. One idea that came up was to "canonicalize" the classic implementation in conda/conda-build/conda-info, but leave the schema open for extra fields (which seems to be something Anaconda.org uses). Those extra fields can be further documented in derivative schemas (à la Python subclass) for those non-canonical channels.

Once we have schemas and versions for those, then we can start deprecating and migrating implementations over to a unified standard.

We also discussed how the fields in depends and constrains are basically free strings, and are only validated by chance during the test section in conda-build (but for example, an unused run_constrained item with a bad version will not ring any alarms until a user tries to install something that brings it to the solver!). Ideally the syntax in these strings is also standardized with a grammar or at least a decent regex:

package_name [version [build_string]]

, where package_name has its rules, and so do version and build_string.

baszalmstra · 2023-11-08T13:28:16Z

All of that sounds perfect! 👌

Kickstart the schemas revamp

23b74de

conda-bot added the cla-signed [bot] added once the contributor has signed the CLA label May 8, 2023

jaimergp and others added 11 commits May 9, 2023 12:16

move to archive/

e2e0109

add pre-commit

f4af07c

add channeldata

c8ea81e

interrogate

25719b2

add GHA workflow

9473efc

matchspec

6aec2e5

add environment.yml

a68c6a3

Co-authored-by: AlbertDeFusco <[email protected]>

add condarc

2018671

Co-authored-by: travishathaway <[email protected]>

make optional explicit

aa20f40

add repodata patch

a60f75c

simpler approach

a0f3f02

baszalmstra reviewed May 9, 2023

View reviewed changes

jaimergp added 7 commits May 10, 2023 10:44

add history

6f3140e

add requirements.txt

ce971ce

add meta yaml draft

bd57b51

mark revoke(d) as deprecated

231ced7

simplify environment file schema

9761791

simplify condarc a bit

bbb643e

typo

b143c82

barabo reviewed Nov 3, 2023

View reviewed changes

baszalmstra mentioned this pull request Nov 24, 2023

Add purls (Package URLs) to PackageRecord conda/ceps#63

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kickstart the schemas revamp (WIP) #26

Kickstart the schemas revamp (WIP) #26

jaimergp commented May 8, 2023 •

edited

Loading

baszalmstra May 9, 2023

jaimergp May 10, 2023

baszalmstra May 10, 2023

morremeyer Sep 27, 2023 •

edited

Loading

barabo Nov 3, 2023

dholth Nov 3, 2023

baszalmstra Nov 3, 2023

dholth Nov 3, 2023

baszalmstra Nov 3, 2023

barabo Nov 3, 2023

baszalmstra Nov 3, 2023

barabo Nov 3, 2023 •

edited

Loading

barabo Nov 3, 2023 •

edited

Loading

morremeyer Nov 6, 2023

chenghlee commented Nov 3, 2023

baszalmstra commented Nov 3, 2023

jaimergp commented Nov 8, 2023

baszalmstra commented Nov 8, 2023

Kickstart the schemas revamp (WIP) #26

Are you sure you want to change the base?

Kickstart the schemas revamp (WIP) #26

Conversation

jaimergp commented May 8, 2023 • edited Loading

Description

Checklist - did you ...

Tasks

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

morremeyer Sep 27, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

barabo Nov 3, 2023 • edited Loading

Choose a reason for hiding this comment

barabo Nov 3, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chenghlee commented Nov 3, 2023

baszalmstra commented Nov 3, 2023

jaimergp commented Nov 8, 2023

baszalmstra commented Nov 8, 2023

jaimergp commented May 8, 2023 •

edited

Loading

morremeyer Sep 27, 2023 •

edited

Loading

barabo Nov 3, 2023 •

edited

Loading

barabo Nov 3, 2023 •

edited

Loading