
Add API endpoint to get latest version of all projects #347

Open
dstufft opened this issue Jan 9, 2015 · 28 comments
@dstufft
Member

dstufft commented Jan 9, 2015

There are projects like https://release-monitoring.org/ which want to monitor PyPI to see when new versions of specific projects are released. Currently this requires doing 1 HTTP request per tracked project which can easily turn into hundreds or thousands of HTTP requests. Offering a JSON endpoint that simply lists the names and versions of all projects can make this take a single HTTP request.

@pypingou

\cc

@iambibhas

It'd also be helpful if it were possible to search through the list with some filters. I've been meaning to add an instant answer to DuckDuckGo for Python packages, but sadly it's not possible right now as there is no search API. I can also work on it. Just wondering why it doesn't exist yet.

@nlhkabu nlhkabu added the requires triaging maintainers need to do initial inspection of issue label Jul 2, 2016
@fungi

fungi commented Jul 13, 2016

Would it be possible to solve this need (the use case for release-monitoring.org) instead by streaming and parsing the PyPI update changelog, similar to how bandersnatch figures out what new updates to sync on a mirror?

@trofimander

It would also be useful for the PyCharm IDE, to notify about outdated packages in its integrated package manager.

@westurner

It could be efficient to also serve cached deltas of the whole catalog? e.g. with an optional from=iso8601datetime parameter?

  • Is there something like DRPMs (delta RPMs) or rsync over HTTP that could reasonably be added in a view and generated in a celery WarehouseTask?
  • etags, If-Modified-Since

@ewdurbin
Member

ewdurbin commented Jul 13, 2016

@Traff it seems that pulling current versions for all packages on the index may be a bit overzealous. getting a list from /simple should be sufficient for offering users a dialog of installable packages... the current/latest version identifier is arguably unimportant at that point.

perhaps after a user has identified a package by name, a cheap (cached at the edge) call to the JSON API would suffice for offering a list of installable versions: GET https://pypi.python.org/pypi/<package_name>/json

regarding offering updates, again the existing JSON API would support a call per package to get current versions, using the same call as above. these calls are cached at the edge, should be fast for end users, and are 100% fair game for community use.

i suppose my question is: how is one long (2-10s) request to obtain a list of all packages/versions better than N 200-500ms requests (which can be submitted concurrently) for providing information on current versions and available updates?
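The N-concurrent-requests approach described above can be sketched with the standard library alone. This is an illustrative sketch, not an official client: it assumes the per-package JSON API (GET /pypi/<name>/json, served today from pypi.org), and the package names and worker count are arbitrary examples.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

JSON_API = "https://pypi.org/pypi/{name}/json"  # edge-cached per-package endpoint


def api_url(name):
    """Build the per-package JSON API URL."""
    return JSON_API.format(name=name)


def latest_version(payload):
    """Extract the latest version string from a JSON API response body."""
    return payload["info"]["version"]


def fetch_latest(name):
    """Fetch one package's metadata and return (name, latest_version)."""
    with urllib.request.urlopen(api_url(name)) as resp:
        return name, latest_version(json.load(resp))


if __name__ == "__main__":
    # N small requests, submitted concurrently instead of serially.
    packages = ["pip", "requests", "virtualenv"]
    with ThreadPoolExecutor(max_workers=10) as pool:
        for name, version in pool.map(fetch_latest, packages):
            print(name, version)
```

Because each response is cached at the edge, the wall-clock cost is roughly one round trip rather than N.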

@ewdurbin
Member

to be clear, i'm not opposed to supporting the specific feature request in this issue. but am trying to aid in guiding PyCharm off of the index page they currently scrape for this information.

that page is incredibly expensive to generate and often causes congestion for the PyPI backends.

@trofimander

trofimander commented Jul 13, 2016

@ewdurbin Yes, that makes sense. The only doubt we had was that making N requests seemed worse than making one - but in this case, it may be the other way around.

@ewdurbin
Member

indeed, @dstufft did a great job of summarizing the state of the world:

here and here

@westurner

Relevant mailing list threads:

@brettcannon
Contributor

One thing an API like this could allow for is providing a JSON API for this info, instead of having to parse the HTML of the /simple page just to get a list of projects.

@westurner

So, what is the objective here:

  • "Add API endpoint to get latest version of all projects"
    • add versions to /simple
      • cache invalidate on every package upload and then JOIN all packages
      • (so that everyone re-downloads the whole catalog every time)
  • "Add API endpoint to get latest versions and package checksums of a
    specific subset of projects matching version and/or package and platform
    constraints"

On Aug 6, 2016 3:18 PM, "Brett Cannon" [email protected] wrote:

One thing like this could allow for is providing a JSON API for this info
instead of having to parse the HTML of the /simple page to just get a list
of projects.

So, caching and conditional requests (with etags) would probably be the
only way to afford this functionality (SELECT * FROM packages WHERE package
IN {pip, pytest, virtualenv, scripttest, mock, pretend})

Seemingly OT, but this may be the best guide to REST API cache management
(cache keys, etags, invalidation) I've ever read:
http://chibisov.github.io/drf-extensions/docs/#caching
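The conditional-request pattern mentioned above (ETags plus If-None-Match) can be sketched as follows. This is an illustrative pattern only, not an existing PyPI endpoint; the URL in the usage note is hypothetical.

```python
import urllib.error
import urllib.request


def build_request(url, etag=None):
    """Build a GET request, attaching If-None-Match when we hold a cached ETag."""
    req = urllib.request.Request(url)
    if etag is not None:
        req.add_header("If-None-Match", etag)
    return req


def conditional_get(url, etag=None):
    """Return (status, etag, body); body is None on 304 Not Modified."""
    try:
        with urllib.request.urlopen(build_request(url, etag)) as resp:
            return resp.status, resp.headers.get("ETag"), resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return 304, etag, None  # cached copy is still fresh; nothing transferred
        raise
```

A client would persist the ETag from the first full download of a hypothetical catalog endpoint, then poll cheaply: a 304 costs the server almost nothing, and only genuinely changed catalogs incur a full response.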



@wimglenn
Contributor

wimglenn commented Aug 9, 2016

Hi there, just to throw in my $0.02 here - I use a utility which checks a requirements.txt file's pinned versions to see if there are newer releases on PyPI. If there are N packages in the file, I currently make N requests to PyPI. I make a GET at, for example, https://pypi.python.org/pypi/django/json - which is a fairly large response (last time I checked, content-length: 128597), whilst the only data I'm actually interested in is under the key data['info'], for the latest version details. Most of the response bytes are in data['releases'], which I just ignore.

It would be very helpful if there were an API endpoint which could return me just this latest version info for the package. Or, even better, for a user-specified list of packages - so I could make just one request instead of N requests. Thanks!
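The requirements-checking workflow described above can be sketched like this. The naive pin parser is an illustration only: real requirements parsing (as in pip) also handles extras, environment markers, includes, and non-pinned specifiers.

```python
import re


def parse_pins(requirements_text):
    """Extract name==version pins from a requirements.txt body.

    Simplified on purpose: comments are stripped, and anything that is not
    an exact `name==version` pin is ignored.
    """
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        m = re.fullmatch(r"([A-Za-z0-9_.\-]+)==([A-Za-z0-9_.\-]+)", line)
        if m:
            pins[m.group(1)] = m.group(2)
    return pins


def outdated(pins, latest):
    """Compare pins against a {name: latest_version} mapping.

    Returns {name: (pinned, latest)} for every pin that differs from the
    latest known version (string comparison only, for illustration).
    """
    return {
        name: (pinned, latest[name])
        for name, pinned in pins.items()
        if name in latest and latest[name] != pinned
    }
```

With a bulk endpoint, `latest` could be filled from a single request; today it requires one JSON API call per pinned package.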

@jakirkham

Would love to be able to get checksums for the releases from this API somehow.

@dstufft
Member Author

dstufft commented Aug 30, 2016

@jakirkham Checksum for what? The files?

@jakirkham

Sorry for the delay. Yes, for the files.

@jakirkham

Added issue ( #1638 ) for getting the checksums via the API.

@AMDmi3

AMDmi3 commented Jul 19, 2017

+1 here; the same is needed for repology.org, from which I've had to remove PyPI support since the https://pypi.python.org/pypi/ page it used was deprecated.

So it'd be nice to have a (not necessarily realtime - regenerated hourly is OK for my purposes) machine-readable dump of all PyPI packages with versions, and preferably other metadata such as summaries and licenses.

Example of what I'd like to have for repology:

[
    {
        "name": "requests",
        "version": "2.18.1",
        "summary": "Python HTTP for Humans."
    },
    ...
]


@westurner

westurner commented Oct 17, 2017

It could be efficient to also serve cached deltas of the whole catalog? e.g. with an optional from=iso8601datetime parameter?

  • Is there something like DRPMs (delta RPMs) or rsync over HTTP that could reasonably be added in a view and generated in a celery WarehouseTask?
  • etags, If-Modified-Since

Zsync is like rsync over HTTP
http://zsync.moria.org.uk/

zsync provides transfers that are nearly as efficient as rsync -z or cvsup, without the need to run a special server application. All that is needed is an HTTP/1.1-compliant web server.

[...]

Single meta-file — zsync downloads are offered by building a .zsync file, which contains the meta-data needed by zsync. This file contains the precalculated checksums for the rsync algorithm; it is generated on the server, once, and is then used by any number of downloaders.

@brainwane
Contributor

I'm grateful for the discussion here and apologize for the slow response.

There's now an open issue #1478 for getting a regular dump of the PyPI database, and other open issues tagged as "APIs/feeds". And now, https://warehouse.readthedocs.io/api-reference/ has a bunch more guidance on how developers can use the Warehouse APIs (RSS feeds, JSON, /simple/ emulation of the legacy API, and XML-RPC methods) already available at https://pypi.org .

The folks working on Warehouse have gotten funding to concentrate on improving and deploying Warehouse, and have kicked off work towards our development roadmap -- the most urgent task is to improve Warehouse to the point where we can redirect pypi.python.org to pypi.org so the site is more sustainable and reliable, and shut down the legacy site. We discussed this issue in our core developers' meeting today. Since this feature isn't something that the legacy site has, I've moved it to a future milestone.

I would be pleased to add this issue to the list of things people work on at this year's PyCon sprints if folks are interested.

Thanks and sorry again for the wait.

@brainwane
Contributor

Folks who need this might want to check whether the Libraries.io API for https://libraries.io/pypi might suit their needs in the short term.

@AMDmi3

AMDmi3 commented Jun 6, 2018

It doesn't look suitable - I see no way to bulk-get information on all projects. The closest fit, the search endpoint, returns the right information, but it's only possible to request packages page by page, and the maximum page size is 100 items. Given the API rate limit of 60 requests/minute and a PyPI size of >136k packages (so 1,360+ pages), that means more than 20 minutes to get all the data - this is too much. There are other issues as well:

  • there's no option to get results sorted by package name (other sort orders may lead to lost or duplicate packages because of reordering during retrieval)
  • the API doesn't seem to be stable - I get 500 errors when requesting beyond the 100th page
  • the requirement to get an API key may be unacceptable to some users (including me)
  • the requirement to use a third-party site is also not good

@retr0h

retr0h commented Jul 16, 2018

Hi there, just to throw in my $0.02 here - I use a utility which checks a requirements.txt file's pinned versions to see if there are newer releases on PyPI. If there are N packages in the file, I currently make N requests to PyPI.

@wimglenn which util is this? I'm looking for this type of tool.

@brainwane
Contributor

I predict that work on this may depend on the progress of #284.

From December 2017 till the end of April 2018, PyPI had funding to get the new site up and running and perform the switchover. Then the grant ran out and we have, as far as I know, no one paid to work on PyPI; volunteers are maintaining and improving the software and infrastructure sides of things, but we need dedicated funding to add complex features. The Packaging Working Group is seeking donations and applying for further grants to fund more design work, more and faster development (including reviewing code contributed by volunteers), and requisite project management.

Sorry for the wait.

@wimglenn
Contributor

@retr0h This was a tool developed internally at $EMPLOYER, but I've since requested permission to open-source it and $EMPLOYER has agreed. It's on PyPI so you can pip install luddite, and the project homepage is now right here on GitHub.

@westurner

process_line() in https://github.com/pypa/pip/blob/master/src/pip/_internal/req/req_file.py may be helpful for parsing requirements files, though it does not solve the "API endpoint to get latest version of all projects" request.
