Add a new check-feed mode for the backend. #300

Merged — ralphbean merged 4 commits into master from index-mode on May 11, 2016
Conversation

ralphbean (Contributor):

This is a rewrite of #239.

  • It adds a new check_feed method to every backend I could figure out how to support.
  • It adds tests for those methods.
  • It adds a new --check-feed option to the cronjob, which uses those methods instead of the regular approach.

Without this PR, we currently run the cronjob twice a day (once every 12 hours),
which means we may not find out about some new upstream releases for a "very
long time" (up to ~12 hours). We run it that infrequently because the cronjob
checks every project we know about for a new upstream release, which takes a
very long time.

This new mode for the cronjob checks only the projects listed in the RSS feeds
and APIs of public indexes like pypi, rubygems, cpan, etc. It runs very quickly
in comparison. I bet we could run it as a cronjob once every 5 minutes without
much worry (we should wrap it in a lock script, though, just to make sure runs
don't pile up). I'd like to get as close to realtime as possible.

In addition to scanning for new upstream releases, if the cronjob encounters a
new package in an RSS feed or API, it also adds that package to anitya's
database of projects (so the database will now grow automatically over time).
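
To make the shape of this concrete, here is a minimal sketch of what a check_feed implementation might look like (the feed URL, the exception class, and the yielded tuple shape are illustrative assumptions, not the PR's exact code):

```python
# Illustrative sketch only; names and the feed URL are assumptions.
import feedparser


class AnityaPluginException(Exception):
    '''Stand-in for anitya's real plugin exception class.'''


def check_feed():
    '''Yield (name, version) pairs for projects recently updated on an index.'''
    url = 'https://pypi.org/rss/updates.xml'  # hypothetical feed URL
    data = feedparser.parse(url)
    if not data.entries:
        raise AnityaPluginException('No feed returned by %s' % url)
    for entry in data.entries:
        # Feed item titles are assumed here to look like "name 1.2.3".
        name, _, version = entry.title.rpartition(' ')
        yield name, version
```

The --check-feed mode of the cronjob would then iterate over these tuples, recording new versions and creating entries for projects it has not seen before.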

```python
except Exception:  # pragma: no cover
    raise AnityaPluginException('No JSON returned by %s' % url)

for item in data[:40]:
```
Contributor:

Does the limit here risk missing updates if we don't poll frequently enough, or if a large batch of updates lands at the same time?

ralphbean (Author):

Yes. In fact, all of these backends share that same risk.

The saving grace is that in production we should schedule this --check-feed mode to run every 5 minutes, but continue to schedule the full-scan mode to run every 12 hours, which can catch anything that falls through the cracks (as well as catch updates from backends that do not support this check_feed stuff).
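
As a sketch of that schedule, assuming a hypothetical anitya-cron entry point (the real command name and paths may differ), the crontab could look like:

```
# Every 5 minutes: the fast feed check. flock -n skips the run if the
# previous one still holds the lock, so jobs don't pile up.
*/5 * * * *  flock -n /var/lock/anitya-feed.lock anitya-cron --check-feed

# Every 12 hours: the slow full scan, catching anything the feed missed.
0 */12 * * * flock -n /var/lock/anitya-full.lock anitya-cron
```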

voxik (Contributor):

"full scan"? There is nothing like full scan as far as I understand. The "full scan" looks just for known packages, i.e. if you miss newly introduced package, the record for that package is never created in Anitya and hence it won't be discovered until its new release is caught by this code again.

pypingou (Member):

full scan is what we are currently doing

voxik (Contributor):

Right, but then you miss some newly introduced packages ...

pypingou (Member):

I'm not sure I fully follow you; how would this PR change anything in that regard compared to the current approach?

ralphbean (Author):

> the record for that package is never created in Anitya and hence it won't be discovered until its new release is caught by this code again.

This is correct.

  • @pypingou, our current "full scan" mode scans for all of the projects that anitya already knows about, and it does that very well.
  • This new mode not only updates projects that anitya already knows about, but also creates entries for new projects that it discovers.

If we worry only about the first part, then it is OK to miss an announcement due to the aliasing effect at play here, since the full scan will pick up the release we missed. However, like @voxik says, we run the risk of failing to discover new projects if they get pushed off the list between our 5-minute scan windows.

pypingou (Member):

OK, I understand the issue. That could be a problem if people assume anitya knows about every project.

ncoghlan (Contributor):

This looks like a very nice improvement to me 👍

```python
## See http://hackage.haskell.org/api#recentPackages
## It should be possible to query this, but I can't figure out how to
## get it to give me non-html.
# url = 'http://hackage.haskell.org/packages/recent/revisions'
```
Member:

maybe @juhp would know (but that's something to look at later)

lubomir:

Just adding .rss seems to do it: http://hackage.haskell.org/packages/recent.rss
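
A quick way to confirm the feed is usable (parsing it with feedparser is an assumption here; the real backend may handle it differently):

```python
import feedparser

feed = feedparser.parse('http://hackage.haskell.org/packages/recent.rss')
for entry in feed.entries[:5]:
    # Entry titles name the package and version; inspect the exact
    # format before relying on it.
    print(entry.title)
```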

ralphbean (Author):

Doh! I tried passing all kinds of headers... but not that. Thanks @lubomir! Filed #301.

pypingou (Member):

👍 for me as well, nice change! :)

```python
@classmethod
def check_feed(cls):
    ''' Return a generator over the latest 50 uploads to rubygems.org
```

Contributor:

Will this be checked often enough to be sure that we don't miss anything? Should I open an upstream RFE asking them to implement some dynamic filter, e.g. "give me all updates since gem foo v1.2.3 was released" or alternatively "give me all projects since some timestamp"?

Member:

@ralphbean started to answer it at: #300 (comment)

But an RFE upstream would be cool nonetheless, as it would reduce the amount of data we ask them for.
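
For context, the kind of call under discussion looks roughly like this (the exact URL used in the PR may differ; rubygems.org's just_updated endpoint returns a fixed-size page of roughly 50 recently updated gems):

```python
import requests

url = 'https://rubygems.org/api/v1/activity/just_updated.json'
gems = requests.get(url, timeout=20).json()
for gem in gems:
    # Each entry describes one recently updated gem.
    print(gem['name'], gem['version'])
```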

Member:

Thanks! :)

ralphbean (Author):

The best case we could get would be to have all of the upstream forges enabled with webhooks, so they can ping us every time someone uploads a new release of anything.

I looked at all of them last night, and I think rubygems.org is the only one that can do this. If more of them supported it, I think it would be worth adding support for it to anitya. Then we could have real-real-time.
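
A hypothetical sketch of what such a webhook receiver could look like on anitya's side (nothing like this exists in the PR; the route and payload field names are assumptions):

```python
from flask import Flask, request

app = Flask(__name__)


@app.route('/api/webhook/rubygems', methods=['POST'])
def rubygems_webhook():
    payload = request.get_json(force=True)
    # rubygems webhook payloads describe the released gem; the exact
    # field names used here are assumptions.
    name = payload.get('name')
    version = payload.get('version')
    # ... look up (or create) the project and record the new version ...
    return '', 204
```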

pypingou (Member) commented May 11, 2016:

Maybe something to ping @dstufft about for next-gen pypi :)

Contributor:

@ralphbean yeah, that would be really awesome.

ralphbean closed this on May 11, 2016
pypingou deleted the index-mode branch on May 11, 2016 at 15:02
pypingou restored the index-mode branch on May 11, 2016 at 15:02
ralphbean reopened this on May 11, 2016
ralphbean (Author):

Thanks all. Merging this.

ralphbean merged commit ebf093d into master on May 11, 2016
ralphbean deleted the index-mode branch on May 11, 2016 at 16:59
juhp (Contributor) commented May 12, 2016:

Very nice!

```python
by querying a weird JSON endpoint.
'''

url = 'https://registry.npmjs.org/-/all/static/today.json'
```
Contributor:

You can even directly connect to their database and get a realtime feed:

```
curl -vL "https://skimdb.npmjs.com/registry/_changes?descending=true&include_docs=true&feed=continuous"
```

ralphbean (Author):

Say whaaaaat? That's awesome.

Contributor:

It's just that it's not super-stable. It can time out, return invalid responses, hang...
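
Given those caveats, here is a sketch of consuming that feed defensively (field names like doc['name'] follow the registry's document format, but treat them as assumptions):

```python
import json
import time

import requests

URL = ('https://skimdb.npmjs.com/registry/_changes'
       '?descending=true&include_docs=true&feed=continuous')

while True:
    try:
        resp = requests.get(URL, stream=True, timeout=60)
        for line in resp.iter_lines():
            if not line:
                continue  # CouchDB sends blank heartbeat lines
            change = json.loads(line)
            doc = change.get('doc') or {}
            print(doc.get('name'), doc.get('dist-tags', {}).get('latest'))
    except (requests.RequestException, ValueError):
        time.sleep(5)  # the feed can hang or garble; back off and reconnect
```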

TomasTomecek (Contributor) commented May 12, 2016:

Ralph, this is an awesome piece of work!

My only concern is this: some of the calls return a fixed number of recent releases -- it's pretty easy to miss something, e.g. when someone releases tens (or hundreds) of packages in a short period of time. That's Nick's first comment. I just slapped myself.
