Add a new check-feed mode for the backend. #300

Merged — ralphbean merged 4 commits into master from index-mode on May 11, 2016
Conversation

ralphbean (Contributor):

This is a rewrite of #239.

  • It adds a new check_feed method to every backend I could figure out how to support.
  • It adds tests for those methods.
  • It adds a new --check-feed option to the cronjob, which uses those methods instead of the regular approach.

Without this PR, we currently run the cronjob twice a day (once every 12 hours),
which means we may not find out about some new upstream releases for a "very
long time" (up to ~12 hours). We run it that infrequently because the cronjob
checks every project we know about for a new upstream release, which takes a
very long time.

This new mode for the cronjob checks only the projects listed in the RSS feeds
and APIs of public indexes like pypi, rubygems, cpan, etc. It runs very quickly
in comparison. I bet we could run it as a cronjob once every 5 minutes without
much worry (we should wrap it in a lock script, though, just to make sure runs
don't pile up). I'd like to get as close to realtime as possible.

In addition to scanning for new upstream releases, if the cronjob encounters a
new package in an RSS feed or API, it also adds that package to anitya's
database of projects (so the database will now grow automatically over time).
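
To make the shape of this concrete, here is a minimal sketch of what a check_feed implementation might look like (the feed URL, the exception class, and the yielded tuple shape are illustrative assumptions, not the PR's exact code):

```python
# Illustrative sketch only; names and the feed URL are assumptions.
import feedparser


class AnityaPluginException(Exception):
    '''Stand-in for anitya's real plugin exception class.'''


def check_feed():
    '''Yield (name, version) pairs for projects recently updated on an index.'''
    url = 'https://pypi.org/rss/updates.xml'  # hypothetical feed URL
    data = feedparser.parse(url)
    if not data.entries:
        raise AnityaPluginException('No feed returned by %s' % url)
    for entry in data.entries:
        # Feed item titles are assumed here to look like "name 1.2.3".
        name, _, version = entry.title.rpartition(' ')
        yield name, version
```

The --check-feed mode of the cronjob would then iterate over these tuples, recording new versions and creating entries for projects it has not seen before.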

```python
except Exception:  # pragma: no cover
    raise AnityaPluginException('No JSON returned by %s' % url)

for item in data[:40]:
```
Contributor:

Does the limit here risk missing updates if we don't poll frequently enough, or if a large batch of updates lands at the same time?

ralphbean (Author):

Yes. In fact, all of these backends share that same risk.

The saving grace is that in production we should schedule this --check-feed mode to run every 5 minutes, but continue to schedule the full-scan mode to run every 12 hours, which can catch anything that falls through the cracks (as well as catch updates from backends that do not support this check_feed stuff).
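
As a sketch of that schedule, assuming a hypothetical anitya-cron entry point (the real command name and paths may differ), the crontab could look like:

```
# Every 5 minutes: the fast feed check. flock -n skips the run if the
# previous one still holds the lock, so jobs don't pile up.
*/5 * * * *  flock -n /var/lock/anitya-feed.lock anitya-cron --check-feed

# Every 12 hours: the slow full scan, catching anything the feed missed.
0 */12 * * * flock -n /var/lock/anitya-full.lock anitya-cron
```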

voxik (Contributor):

"full scan"? There is nothing like full scan as far as I understand. The "full scan" looks just for known packages, i.e. if you miss newly introduced package, the record for that package is never created in Anitya and hence it won't be discovered until its new release is caught by this code again.

pypingou (Member):

full scan is what we are currently doing

voxik (Contributor):

Right, but then you miss some newly introduced packages ...

pypingou (Member):

I'm not sure I fully follow you; how would this PR change anything in that regard compared to the current approach?

ralphbean (Author):

> the record for that package is never created in Anitya and hence it won't be discovered until its new release is caught by this code again.

This is correct.

  • @pypingou, our current "full scan" mode scans for all of the projects that anitya already knows about, and it does that very well.
  • This new mode not only updates projects that anitya already knows about, but also creates entries for new projects that it discovers.

If we worry only about the first part, then it is OK to miss an announcement due to the aliasing effect at play here, since the full scan will pick up the release we missed. However, like @voxik says, we run the risk of failing to discover new projects if they get pushed off the list between our 5-minute scan windows.

pypingou (Member):

OK, I understand the issue. That could be a problem if people assume anitya knows about every project.

ncoghlan (Contributor):

This looks like a very nice improvement to me 👍

```python
## See http://hackage.haskell.org/api#recentPackages
## It should be possible to query this, but I can't figure out how to
## get it to give me non-html.
# url = 'http://hackage.haskell.org/packages/recent/revisions'
```
Member:

maybe @juhp would know (but that's something to look at later)

lubomir:

Just adding .rss seems to do it: http://hackage.haskell.org/packages/recent.rss
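
A quick way to confirm the feed is usable (parsing it with feedparser is an assumption here; the real backend may handle it differently):

```python
import feedparser

feed = feedparser.parse('http://hackage.haskell.org/packages/recent.rss')
for entry in feed.entries[:5]:
    # Entry titles name the package and version; inspect the exact
    # format before relying on it.
    print(entry.title)
```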

ralphbean (Author):

Doh! I tried passing all kinds of headers... but not that. Thanks @lubomir! Filed #301.

pypingou (Member):

👍 for me as well, nice change! :)

```python
@classmethod
def check_feed(cls):
    ''' Return a generator over the latest 50 uploads to rubygems.org
```

Contributor:

Will this be checked often enough to be sure that we don't miss anything? Should I open an upstream RFE asking them to implement some dynamic filter, e.g. "give me all updates since gem foo v1.2.3 was released" or alternatively "give me all projects since some timestamp"?

Member:

@ralphbean started to answer it at: #300 (comment)

But an RFE upstream would be cool nonetheless, as it would reduce the amount of data we ask them for.
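
For context, the kind of call under discussion looks roughly like this (the exact URL used in the PR may differ; rubygems.org's just_updated endpoint returns a fixed-size page of roughly 50 recently updated gems):

```python
import requests

url = 'https://rubygems.org/api/v1/activity/just_updated.json'
gems = requests.get(url, timeout=20).json()
for gem in gems:
    # Each entry describes one recently updated gem.
    print(gem['name'], gem['version'])
```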

Member:

Thanks! :)

ralphbean (Author):

The best case we could get would be to have all of the upstream forges enabled with webhooks, so they can ping us every time someone uploads a new release of anything.

I looked at all of them last night, and I think rubygems.org is the only one that can do this. If more of them supported it, I think it would be worth adding support for it to anitya. Then we could have real-real-time.
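
A hypothetical sketch of what such a webhook receiver could look like on anitya's side (nothing like this exists in the PR; the route and payload field names are assumptions):

```python
from flask import Flask, request

app = Flask(__name__)


@app.route('/api/webhook/rubygems', methods=['POST'])
def rubygems_webhook():
    payload = request.get_json(force=True)
    # rubygems webhook payloads describe the released gem; the exact
    # field names used here are assumptions.
    name = payload.get('name')
    version = payload.get('version')
    # ... look up (or create) the project and record the new version ...
    return '', 204
```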

pypingou (Member) commented May 11, 2016:

Maybe something to ping @dstufft about for next-gen pypi :)

Contributor:

@ralphbean yeah, that would be really awesome.

ralphbean closed this on May 11, 2016
pypingou deleted the index-mode branch on May 11, 2016 at 15:02
pypingou restored the index-mode branch on May 11, 2016 at 15:02
ralphbean reopened this on May 11, 2016
ralphbean (Author):

Thanks all. Merging this.

ralphbean merged commit ebf093d into master on May 11, 2016
ralphbean deleted the index-mode branch on May 11, 2016 at 16:59
juhp (Contributor) commented May 12, 2016:

Very nice!

```python
by querying a weird JSON endpoint.
'''

url = 'https://registry.npmjs.org/-/all/static/today.json'
```
Contributor:

You can even directly connect to their database and get a realtime feed:

```
curl -vL "https://skimdb.npmjs.com/registry/_changes?descending=true&include_docs=true&feed=continuous"
```

ralphbean (Author):

Say whaaaaat? That's awesome.

Contributor:

It's just that it's not super-stable. It can time out, return invalid responses, hang...
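
Given those caveats, here is a sketch of consuming that feed defensively (field names like doc['name'] follow the registry's document format, but treat them as assumptions):

```python
import json
import time

import requests

URL = ('https://skimdb.npmjs.com/registry/_changes'
       '?descending=true&include_docs=true&feed=continuous')

while True:
    try:
        resp = requests.get(URL, stream=True, timeout=60)
        for line in resp.iter_lines():
            if not line:
                continue  # CouchDB sends blank heartbeat lines
            change = json.loads(line)
            doc = change.get('doc') or {}
            print(doc.get('name'), doc.get('dist-tags', {}).get('latest'))
    except (requests.RequestException, ValueError):
        time.sleep(5)  # the feed can hang or garble; back off and reconnect
```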

TomasTomecek (Contributor) commented May 12, 2016:

Ralph, this is an awesome piece of work!

My only concern is this: some of the calls return a fixed number of recent releases -- it's pretty easy to miss something, e.g. when someone releases tens (or hundreds) of packages in a short period of time. That's Nick's first comment. I just slapped myself.
