Add support for conda channels conda-forge and bioconda #518
Thanks for the suggestion, looks doable.
I don't like that it contains a lot of Python modules that are not distinguishable from other software. We already have this problem with some other repos.
Well, they would almost all be recognizable by having pypi in the source URL. To get that, you'd have to parse the recipes, though: at least conda-forge and bioconda are built from recipes maintained in GitHub repositories. Each recipe is a meta.yaml file, and code for parsing these recipes already exists.
(I'm guessing that parsing the source URL is the most reliable way to distinguish packages named differently in various distros. Sets of versions would give additional info, or keywords.)
FYI: conda also has a somewhat queer feature called "features". For each package-version-architecture combination, there may still be more than one binary; these are "tracking features". You can see the features in both the package name and the "build" value in the repodata.json. The "build" value is an underscore-separated list of features followed by the build number.
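For illustration, a minimal sketch of decomposing such a build string. The sample strings and the assumption that the build number is always the last underscore-separated field are mine, not from the thread:

```python
def parse_build_string(build: str) -> tuple[list[str], int]:
    """Split a conda build string into tracking features and build number.

    Assumes the format described above: an underscore-separated list of
    features followed by the build number, e.g. "py27_nomkl_0".
    """
    *features, build_number = build.split("_")
    return features, int(build_number)


# Hypothetical build strings; real ones vary per channel and package.
print(parse_build_string("py27_nomkl_0"))  # (['py27', 'nomkl'], 0)
print(parse_build_string("0"))             # ([], 0)
```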
Yes, I've thought of using the source URL. Could also try parsing recipes; I didn't go this way because I've seen separate repos for each recipe, but if there's a single repo which may be used to check them all out at once, it may be worth trying.
They also should not be present for pure Python modules marked as noarch.
You can get all conda-forge recipes here: https://github.com/conda-forge/feedstocks - it just needs a lengthy recursive checkout. To get all information, you may have to parse both the recipes and the repodata.
I would love for this to be bumped to a higher priority. FreeCAD relies heavily on conda for its development builds.
Here's a little more info on the URL template: we've got the channel, which is effectively a distribution unto itself (e.g. bioconda). In the repodata, the packages are keyed by filename; of most interest are the name and version fields of each entry. The keys 'arch' and 'platform' I'd ignore, as they aren't consistently available and aren't used by conda itself. To get more information you'd have to parse the recipes.
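For illustration, a hedged sketch of walking one such repodata.json and collecting names and versions. The channel and subdir in the URL are arbitrary examples, and the top-level "packages" key follows the standard repodata layout; treat this as an assumption-laden sketch, not Repology code:

```python
import json
import urllib.request
from collections import defaultdict

# Arbitrary example endpoint built from the URL template discussed here.
URL = "https://conda.anaconda.org/bioconda/linux-64/repodata.json"

with urllib.request.urlopen(URL) as resp:
    repodata = json.load(resp)

# Each entry under "packages" is keyed by filename and carries
# name, version, build and depends fields.
versions = defaultdict(set)
for filename, info in repodata.get("packages", {}).items():
    versions[info["name"]].add(info["version"])

print(len(versions), "distinct package names")
```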
The code is committed, but I'm not enabling it. For instance, 65.0% of packages in bioconda are unique (i.e. not matched with other repos). Also, I'm now hesitant to add repositories which do not provide information on project homepages, as this information is crucial for distinguishing similarly named projects, which are becoming more numerous with each added repo. A compilation of preprocessed recipe data would help here.
Thanks!
That's pretty much expected, at least for Bioconda. It exists because a lot of software relevant to bioinformatics is not packaged by the major distros, and even when it is, the packages are too old. The language-specific repos (CRAN, PyPI, CPAN, ...), OTOH, don't work that nicely for things that require compilation; plus there is the issue of software depending on things written in Perl/Python/R that isn't handled by either of those (which was the motivation for conda).
You mean resolution of package names, right? Yes, that would be a problem. I honestly don't know how to resolve this.
Perhaps we can get someone at anaconda.org to help with this. What information specifically would you need to have in the repodata? The only other thing I can offer is Bioconda-specific: we parse all our recipes and could share the extracted data.
Yes. Since conda-forge hosts its build recipes each in a separate git repo, while Bioconda hosts everything in one big repo, the process would have to be different, though. If you are interested, I can provide you with the necessary code or data from the Bioconda side; e.g. we could publish something in a format that suits you.
Yes and no. Basically, the recipe is templated pseudo-YAML, so it can't simply be loaded with a YAML parser. That said, we can parse the raw pseudo-YAML recipe to extract the information you need here. Would you prefer doing that with your own scripts, with scripts provided by the channel, or by accessing data we keep online for you?
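To illustrate why the raw recipe can't just be fed to a YAML parser, here is a hedged sketch that pulls a few scalar fields out of a Jinja-templated meta.yaml with regular expressions instead. The sample recipe text and the extract_field helper are illustrative assumptions, not code from either project:

```python
import re

# A typical Jinja-templated recipe header; not valid YAML as-is.
META_YAML = """\
{% set version = "2.1.1" %}
package:
  name: base58
  version: {{ version }}
about:
  home: https://github.com/keis/base58
  license: MIT
"""

def extract_field(text: str, key: str) -> str | None:
    """Grab a scalar value for `key`, ignoring the Jinja templating."""
    match = re.search(rf"^\s*{key}:\s*(.+)$", text, re.MULTILINE)
    return match.group(1).strip() if match else None

print(extract_field(META_YAML, "name"))     # base58
print(extract_field(META_YAML, "home"))     # https://github.com/keis/base58
print(extract_field(META_YAML, "version"))  # {{ version }} - still templated
```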
Accessing data kept online for us is actually the only option: with 200+ repos to maintain, I don't have the ability to maintain additional repository-specific code, and using external utilities would hinder Repology's portability. With more detailed data on hand, maybe I'll be able to find a way to separate Python modules.
Here's elaborate documentation of what Repology expects: https://repology.org/addrepo.
I wonder whether conda-forge's own format of the repository at https://github.com/regro/libcfgraph would provide better access to the conda-forge repository than the repodata. There is also a mapping for all Python packages from PyPI to the conda-forge names available at https://github.com/regro/cf-graph-countyfair/blob/master/mappings/pypi/grayskull_pypi_mapping.yaml.
@awvwgk, that looks good. It seems to meet the current Repology criteria.
There is also https://conda.anaconda.org/conda-forge/channeldata.json, which has records like this one:

"datalad": {
"activate.d": false,
"binary_prefix": false,
"deactivate.d": false,
"description": "DataLad aims to make data management and data distribution more accessible. To do that it stands on the shoulders of Git and Git-annex to deliver a decentralized system for data exchange. This includes automated ingestion of data from online portals, and exposing it in readily usable form as Git(-annex) repositories, so-called datasets. The actual data storage and permission management, however, remains with the original data providers.",
"dev_url": "https://github.com/datalad/datalad",
"doc_url": "http://datalad.readthedocs.io/",
"home": "http://datalad.org",
"license": "MIT",
"post_link": false,
"pre_link": false,
"pre_unlink": false,
"run_exports": {},
"source_url": "https://pypi.io/packages/source/d/datalad/datalad-0.17.6.tar.gz",
"subdirs": [
"linux-64",
"noarch",
"osx-64",
"win-64"
],
"summary": "data distribution geared toward scientific datasets",
"text_prefix": true,
"timestamp": 1664986525,
"version": "0.17.6"
},

where name is the key and the value carries the version and URLs (but PyPI on its own is disabled ATM) :-/

Somewhat related good news: packages do seem to be provided with PyPI information, e.g. https://repology.org/project/python:lazy-loader/versions and even our https://repology.org/project/python:datalad/versions -- although the correct thing would be to unite those entries.
This is definitely better than repodata, but the main problem remains.
Not reliable. Example:

"base58": {
"activate.d": false,
"binary_prefix": false,
"deactivate.d": false,
"home": "https://github.com/keis/base58",
"license": "MIT",
"post_link": false,
"pre_link": false,
"pre_unlink": false,
"run_exports": {},
"source_url": "https://github.com/keis/base58/archive/v2.1.1.tar.gz",
"subdirs": [
"noarch"
],
"summary": "Base58 and Base58Check implementation",
"text_prefix": false,
"timestamp": 1635724257,
"version": "2.1.1"
}
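For illustration, a sketch of the pypi-in-source_url heuristic this example defeats: it flags the datalad record above (pypi.io source) but misses base58, which is a Python module with a GitHub source_url. Field and key names follow the channeldata excerpts; the rest is an assumption, not Repology logic:

```python
import json
import urllib.request

CHANNELDATA = "https://conda.anaconda.org/conda-forge/channeldata.json"

def looks_like_python_module(record: dict) -> bool:
    """The heuristic under discussion: a pypi source_url marks a module."""
    return "pypi" in record.get("source_url", "")

with urllib.request.urlopen(CHANNELDATA) as resp:
    packages = json.load(resp)["packages"]

# datalad has a pypi.io source_url and is flagged; base58 points at
# GitHub despite being a Python module, so the heuristic misses it.
for name in ("datalad", "base58"):
    print(name, looks_like_python_module(packages[name]))
```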
Let me repeat myself: for Repology to support conda, there needs to be a single JSON file which makes Python modules reliably distinguishable. Multi-gigabyte (compressed!) repositories, third-party package name mappings, or fetching an additional file per package (which, on top of that, is non-self-contained templated YAML that cannot even be expanded) are absolutely not acceptable.
Conda repos are still not enabled because there's no way to distinguish Python modules.
This is what meta.yaml looks like after it has been rendered into JSON; it will be unique per package file (so name × version × platform copies of the same). Obviously not accessible to Repology, but easier to parse?
The format itself is parsable, but the distribution of these files would not necessarily be. The size and the number of entries could pose a problem. And I still see no markers of a Python module.
It looks like that one is a bit old; the newer ones have better about sections.
We'll have to see about the pypi link; it has a "run" dependency on python at least. More ordinary packages like sqlalchemy just link to pypi for their source code.
Still, neither of these is reliable.
conda-forge bots maintain a PyPI mapping at https://github.com/regro/cf-graph-countyfair/blob/master/mappings/pypi/grayskull_pypi_mapping.json. The logic is defined in this module. If a PyPI package is in conda-forge, it's contained in this mapping. Right now the logic is a bit too strict (it requires a PyPI source), but I am willing to extend this to do further checks if required. Would that be enough to identify Python modules (i.e. the package is already on PyPI) with sufficient accuracy? Thanks!
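A hedged sketch of consuming that mapping to flag Python modules. The raw URL and the per-record "conda_name" key are assumptions about the mapping file's layout, not confirmed by this thread:

```python
import json
import urllib.request

# Assumed raw view of the mapping file linked above.
MAPPING_URL = ("https://raw.githubusercontent.com/regro/cf-graph-countyfair/"
               "master/mappings/pypi/grayskull_pypi_mapping.json")

with urllib.request.urlopen(MAPPING_URL) as resp:
    mapping = json.load(resp)

# Assumed layout: one record per PyPI package, carrying the conda-forge
# package name under a "conda_name" key.
conda_names = {entry["conda_name"] for entry in mapping.values()}

def is_python_module(conda_package: str) -> bool:
    """A conda-forge package present in the mapping originated on PyPI."""
    return conda_package in conda_names

print(is_python_module("datalad"))  # expected True
```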
conda is a popular package manager in science. It installs packages into the user's home directory, supports "virtual environments" and custom "channels" (akin to PPAs). There are a few major channels that can be considered distributions in their own right, mainly bioconda and conda-forge.

Links:
https://conda.io/docs/
https://conda-forge.org/
https://bioconda.github.io/

Channel packages are hosted at anaconda.org, which also offers an API for querying available packages. A repo dump can be obtained here:

https://conda.anaconda.org/{channel}/{arch}-{bits}/repodata.json

where {arch} is one of linux, osx and win, and {bits} is one of 32 and 64. (Not sure how much 32 is used; bioconda e.g. builds only linux-64 and osx-64.)
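As a small illustration of that template, a sketch that expands it for one channel; which arch-bits combinations actually exist varies per channel, as noted above:

```python
from itertools import product

# Expand the repodata URL template for one channel. Which combinations
# actually exist varies per channel; bioconda, as noted, builds only
# linux-64 and osx-64.
CHANNEL = "bioconda"
ARCHES = ("linux", "osx", "win")
BITS = ("32", "64")

for arch, bits in product(ARCHES, BITS):
    print(f"https://conda.anaconda.org/{CHANNEL}/{arch}-{bits}/repodata.json")
```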