-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kickstart the schemas revamp (WIP) #26
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: AlbertDeFusco <[email protected]>
Co-authored-by: travishathaway <[email protected]>
conda_models/repodata_patch.py
Outdated
class RepodataPatchInstructions(ExtrasForbiddenModel): | ||
packages: dict[TarBz2PackageFileNameStr, AllOptionalRepodataRecord] | ||
"The .tar.bz2 packages in the repodata that will be patched" | ||
packages_conda: dict[CondaPackageFileNameStr, AllOptionalRepodataRecord] = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that it doesnt make sense to patch all fields from a RepodataRecord. Changing the name of a package doesnt make sense and changing the hash might break cache implementations. In Rattler we explicity limit the fields that can be patched for that reason. We also didnt encounter fields not covered by our implementation. Might be worth considering!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that it should be prevented and it's A Bad Idea (tm).
- I don't see anywhere in the Anaconda code where this is properly enforced though. In principle only these fields, but then there are some exceptions right below... e.g. timestamps, build_number... 😬)
- In conda-forge, anything goes, even new filenames?
So while the file format seems to accept everything, there are some guidelines place for "reasonable" use and it looks like this is mostly addressed in code reviews, but sometimes we have exceptions. 🤷
I've been thinking whether we want to get descriptive or prescriptive in this PR, and I have mixed feelings about it. For scalars, I've been trying to be more prescriptive, but with structs... just reflecting what (I think) is in use these days. I am not saying I won't change my mind about it, just writing down what the approach have been so far :D
Thanks for the feedback!
Also, are we sure we have a packages.conda
field? Maybe the readme is outdated, and the seed dict is too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking whether we want to get descriptive or prescriptive in this PR, and I have mixed feelings about it. For scalars, I've been trying to be more prescriptive, but with structs... just reflecting what (I think) is in use these days. I am not saying I won't change my mind about it, just writing down what the approach have been so far :D
I understand where you're coming from. With rattler we definitely choose the prescriptive approach because I feel like there is way too much "legacy" code and supporting all those exceptions is simply harder. But the same logic does not necessarily hold for this project.
The packages.conda
field is definitely used by the conda-forge-repodata-patches
package. The readme does indeed seem to be outdated. Here is a small excerpt from the latest linux-64
patches:
"packages.conda": {
"4ti2-1.6.9-hbc9de56_1.conda": {
"license_family": "GPL"
},
"acctaudem-1.0.1-h69042ef_3.conda": {
"license_family": "GPL"
},
"actions-runner-2.299.1-he0ac6c6_0.conda": {
"license_family": "MIT"
},
"actions-runner-2.300.0-h0cdce71_0.conda": {
"license_family": "MIT"
},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the current state, I thin we should be descriptive in the initial implementation: The first step in this effort is to capture what is used currently so that people and projects can refer to the schema to ensure that their software is implemented correctly.
In further steps, we could get more and more prescriptive.
This brings me to a second thing that we'll need to have: versioning. As with everything, at some point, the schema will change.
To enable implementations to reference "the schema" they're using, we'll need to version the schemas. If we don't, we effectively can't change them without breaking something (be that software or user trust).
Edit: Is the repodata_version
the version of the schema?
subdir: SubdirStr | ||
"The subdirectory of the channel this package is in" | ||
timestamp: Optional[PositiveInt] = None | ||
"The date this entry was created" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be specified at seconds since epoch UTC? Also, is the intent to track the date the package was built, the date the record was created, or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is already a unix timestamp (from the unix epoch which is UTC) but in milliseconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some older packages specify a timestamp in seconds. Would definitely be good to document this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird system. If the seconds would be in the far future, divide by 1000. But that's how it works. I prefer seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure is!
This is what we do in rattler.
While building http://prefix.dev we ran into more of these slight inconsistencies. Everything we encountered we added to the rattler library.
"The md5 hash of the package archive" | ||
name: PackageNameStr | ||
"The name of the package" | ||
noarch: NoarchStr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be Optional
? Many current package records on .ORG do not include this field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jep!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the present state of package records in anaconda.org, there are several commonly used fields included in package records that are not accounted for here. To handle the pragmatics of removing them without breaking anyone downstream, I'm wondering if there should be a new deprecated
field which would be an object containing the deprecated values.
Consider this example package current listed in .ORG:
"cuda-11.3.0-h3b286be_0.tar.bz2": {
"arch": "x86_64",
"binstar": {
"channel": "main",
"owner_id": "5be2119d5c4b9a1104ccdc8f",
"package_id": "60ef382c6bd0f2f84f7c23e1"
},
"build": "h3b286be_0",
"build_number": 0,
"depends": [
"cuda-runtime >=11.3.0",
"cuda-toolkit >=11.3.0"
],
"has_prefix": false,
"machine": "x86_64",
"md5": "2685f7931e7576b43ce597a87ae78091",
"name": "cuda",
"operatingsystem": "linux",
"platform": "linux",
"requires": [],
"size": 2030,
"subdir": "linux-64",
"target-triplet": "x86_64-any-linux",
"timestamp": 1627403602120,
"version": "11.3.0"
}
With a deprecated
section it might look like this:
"cuda-11.3.0-h3b286be_0.tar.bz2": {
"arch": "x86_64",
"build": "h3b286be_0",
"build_number": 0,
"depends": [
"cuda-runtime >=11.3.0",
"cuda-toolkit >=11.3.0"
],
"deprecated": {
"binstar": {
"removal_date": "2024-09-20",
"value": {
"channel": "main",
"owner_id": "5be2119d5c4b9a1104ccdc8f",
"package_id": "60ef382c6bd0f2f84f7c23e1"
}
},
"has_prefix": {
"removal_date": "2024-01-01",
"value": false
},
"machine": {
"removal_date": "2024-01-01",
value: "x86_64"
},
"operatingsystem": {
"removal_date": "2024-09-20",
"value": "linux"
},
"requires": {
"note": "consider using the 'constrains' attribute.",
"removal_date": "2024-01-01",
"value": []
},
"target-triplet": {
"removal_date": "2024-01-01",
"value": "x86_64-any-linux"
}
},
"md5": "2685f7931e7576b43ce597a87ae78091",
"name": "cuda",
"platform": "linux",
"size": 2030,
"subdir": "linux-64",
"timestamp": 1627403602120,
"version": "11.3.0"
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, WRT the RepodataRecord
resource, is the intent that it's a write-once resource? I assume this is true because of the existence of hotfixing, but it might be worth clarifying that somewhere.
If there is strong intent that the resource IS an immutable record, it might also be reasonable to define a canonicalized record hash (or signature), which could be used to verify the record is complete.
While I don't love stuffing a verifier into an object that must first be removed for it to be verified, I do like that it would be 'close' to the record, and not in a separate section (like the signatures
section used in conda-content-trust).
Anyway, if there's no appetite for such a feature, I'd be happy with a comment somewhere clarifying whether the resource is or isn't intended to be immutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not include deprecated fields an a deprecated
field, but rather include them as they are used today. That will make the schema more usable for e.g. autocompletion.
However, we do need a mechanism to deprecate and track deprecations.
IMO, @barabo's comments bring up an excellent point: what we should consider "canonical", as things stand today? I've been operating under the [pseudo-]assumption that "canonical" means whatever conda/conda-build/conda-index/etc. do, but those clearly don't quite match the ecosystem's package hosting provider (anaconda.org) does...except in the cases on the "major" channels like Anaconda, conda-forge, bioconda, etc. do. 🙃 |
We’ve had a similar discussion above: #26 (comment) |
FTR, @morremeyer and I talked yesterday about this. One idea that came up was to "canonicalize" the classic implementation in conda/conda-build/conda-info, but leave the schema open for extra fields (which seems to be something Anaconda.org uses). Those extra fields can be further documented in derivative schemas (à la Python subclass) for those non-canonical channels. Once we have schemas and versions for those, then we can start deprecating and migrating implementations over to a unified standard. We also discussed how the fields in
, where |
All of that sounds perfect! 👌 |
Description
xref conda/ceps#52
Checklist - did you ...
news
directory (using the template) for the next release's release notes?Tasks
channeldata.json
conda_build_config.yaml
conda.lock
condarc
constructor.yml
environment.yml
conda-meta/history
MatchSpec
menuinst.json
meta.yaml
info/*
PackageRecord
conda-meta/record.json
patch_instructions.json
repodata.state
repodata.json
requirements.txt
Version
types
exceptions