Add cursor to updates RSS feed. #9155

Closed
tiegz wants to merge 10 commits

tiegz
Contributor

@tiegz tiegz commented Mar 2, 2021

Greetings! I have two changes to propose for the rss/updates.xml endpoint:

  • Add an after cursor parameter, e.g. https://pypi.org/rss/updates.xml?after=1542745017. This would return only releases created after the given timestamp, in oldest-to-newest order, which should be safe since there's an index on Release#created.
  • Increase the page size from 40 to 100 releases per page. (I couldn't figure out why 40 was the original choice when this endpoint was introduced in Add RSS feeds #990.)

This should make the feed more useful to its regular followers, let them hit the endpoint less often, and give them a way to catch up using the cursor. The legacy changelog XML-RPC call also has a since parameter for pagination, so similar pagination in the RSS feed seemed useful.
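As a rough illustration of the proposed after cursor, here is a minimal in-memory sketch. It is not Warehouse's implementation: the `Release` tuple and the `updates_after` function are hypothetical names, and a plain list stands in for the indexed database query.

```python
from datetime import datetime, timezone
from typing import NamedTuple

PAGE_SIZE = 100  # page size proposed in this PR


class Release(NamedTuple):
    # Hypothetical stand-in for a release row; only the fields the feed needs.
    name: str
    created: datetime


def updates_after(releases, after_ts, page_size=PAGE_SIZE):
    """Return up to page_size releases created strictly after after_ts.

    after_ts is a Unix timestamp in seconds, as in ?after=1542745017.
    Results are ordered oldest to newest so a client can keep paging forward.
    """
    cutoff = datetime.fromtimestamp(after_ts, tz=timezone.utc)
    newer = [r for r in releases if r.created > cutoff]
    newer.sort(key=lambda r: r.created)
    return newer[:page_size]
```

A client catching up would presumably pass the created timestamp of the last release it saw as the next after value.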

@di
Member

di commented Mar 2, 2021

Thanks for the PR!

I don't see any reason not to increase the page size to 100 releases. I'm not sure there is a historical reason for the current limit of 40.

With regard to the after parameter, we need to consider how this will affect the size of PyPI's cache, and the likelihood that a given request will hit the cache instead of passing through to PyPI's backends.

Right now, we have a single cache entry for this page. It gets updated every time a new release is made, which happens every few seconds or so. Most requests hit the cache.

If we add the after parameter, we go from 1 cache entry to a potential maximum of 2³¹-1 cache entries. In addition, the increase in the number of cache entries reduces the likelihood that two or more given requests will have the same timestamp, and thus hit the cache instead of our backends.
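The effect on the hit rate can be sketched with a toy cache. This assumes the cache is keyed on the full URL including the query string, and it ignores TTLs and purges; `count_cache_hits` is a hypothetical helper, not part of PyPI's CDN setup.

```python
# Upper bound on distinct ?after= values mentioned above.
MAX_AFTER_VALUES = 2**31 - 1


def count_cache_hits(urls):
    """Count requests served from a toy cache keyed on the exact URL."""
    cache = set()
    hits = 0
    for url in urls:
        if url in cache:
            hits += 1
        else:
            cache.add(url)  # a miss goes to the backend and fills the cache
    return hits


BASE = "https://pypi.org/rss/updates.xml"

# Five clients requesting the plain feed share one cache entry.
without_cursor = [BASE] * 5

# Five clients each sending a slightly different timestamp share nothing.
with_cursor = [f"{BASE}?after={1542745017 + i}" for i in range(5)]
```

In this toy model the first list produces one miss followed by four hits, while every request in the second list is a miss.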

This is part of the consideration for #284 with regard to the "changelog" XML-RPC endpoint, which has the same issue: attempting to reproduce that endpoint as a JSON API runs into the same cache problem.

@tiegz
Contributor Author

tiegz commented Mar 8, 2021

@di Thanks, just broke out the page size change to #9177 to keep it separate.

That caching problem is indeed tricky. The only fix we can think of so far is:

  • after date param (e.g. after=2021-03-08): could use either created desc or created asc ordering
  • page param (e.g. page=3): used in conjunction with after, which lets you page through that day's releases 100 at a time.

That'd get us down to a worst case of 21,474,836 cache entries.
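The comment does not show how the 21,474,836 figure was derived. One reading, which is an assumption here, is that it is the 2³¹-1 bound divided by the page size of 100. The sketch below checks that arithmetic and shows what a cache key for the date-plus-page scheme might look like; `feed_cache_key` is a hypothetical helper.

```python
MAX_TIMESTAMP = 2**31 - 1  # bound on cache entries quoted earlier in the thread
PAGE_SIZE = 100            # proposed releases per page

# Assumed derivation of the worst case: one cache entry per page of 100.
worst_case_entries = MAX_TIMESTAMP // PAGE_SIZE


def feed_cache_key(after_date, page):
    """Build the URL path a cache would key on for a given day and page."""
    return f"/rss/updates.xml?after={after_date}&page={page}"
```

Because the key depends only on a calendar date and a page number, clients polling on the same day would tend to share entries instead of each sending a unique timestamp.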

@di
Member

di commented Mar 9, 2021

@tiegz One thing that's mentioned in #284 that would work for a JSON API but not for RSS would be including next/previous links in the response. This would let someone consuming the changelog/updates feed start at any point and consume up to the latest change. There would only be N cache entries (where N is the number of changes/releases/etc) and the cache would only ever be invalidated once: when there is a new "head".
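A minimal sketch of the next/previous idea, assuming one cacheable entry per change and integer change ids. The entry shape and the names `build_entries` and `consume_from` are hypothetical; #284 does not specify a format.

```python
def build_entries(change_ids):
    """Build one entry per change, each linking to its neighbours by id."""
    entries = {}
    for i, change_id in enumerate(change_ids):
        entries[change_id] = {
            "id": change_id,
            "previous": change_ids[i - 1] if i > 0 else None,
            # The newest entry (the "head") has no next link yet.
            "next": change_ids[i + 1] if i + 1 < len(change_ids) else None,
        }
    return entries


def consume_from(entries, start_id):
    """Follow next links from start_id up to the latest change."""
    seen = []
    current = start_id
    while current is not None:
        seen.append(current)
        current = entries[current]["next"]
    return seen


def changed_entries(before, after):
    """Ids whose cached entry would need invalidating after new changes."""
    return [cid for cid in before if before[cid] != after[cid]]
```

Comparing the entries for changes 1 to 3 with those for changes 1 to 4, only the old head (3) differs, because it gains a next link. That matches the point that each entry is invalidated at most once, when a new head appears.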

Given that, I think we'd rather put effort into bringing this new API into existence than into making the RSS feed more useful.

@tiegz
Contributor Author

tiegz commented Mar 10, 2021

@di Makes sense, that feature is definitely better suited for that JSON API. I'll close this and subscribe to that thread to follow future discussion.

@tiegz tiegz closed this Mar 10, 2021