Add a new check-feed mode for the backend. #300
Conversation
except Exception:  # pragma: no cover
    raise AnityaPluginException('No JSON returned by %s' % url)

for item in data[:40]:
Does the limit here risk missing updates if we don't poll frequently enough, or if a large batch of updates lands at the same time?
Yes. In fact, all of these backends share that same risk.
The saving grace is that, in production, we should schedule this --check-feed mode to run every 5 minutes while continuing to schedule the full-scan mode to run every 12 hours. The full scan can catch anything that falls through the cracks (as well as updates from backends which do not support this check_feed stuff).
"full scan"? There is nothing like full scan as far as I understand. The "full scan" looks just for known packages, i.e. if you miss newly introduced package, the record for that package is never created in Anitya and hence it won't be discovered until its new release is caught by this code again.
full scan is what we are currently doing
Right, but then you miss some newly introduced packages ...
I'm not sure I fully follow you; how would this PR change anything in that regard compared to the current approach?
> the record for that package is never created in Anitya and hence it won't be discovered until its new release is caught by this code again.
This is correct.
- @pypingou, our current "full scan" mode scans for all of the projects that anitya already knows about, and it does so very well.
- This new mode not only updates projects that anitya already knows about, but also creates entries for new projects that it discovers.
If we worry only about the first part, then it is OK to miss an announcement due to the aliasing effect at play here, since our full scan will pick up the release we missed. However, as @voxik says, we run the risk of failing to discover new projects if they get pushed off the list between our 5-minute scan windows.
Ok, I understand the issue. That could be a problem if people assume anitya knows about every project.
This looks like a very nice improvement to me 👍
## See http://hackage.haskell.org/api#recentPackages
## It should be possible to query this, but I can't figure out how to
## get it to give me non-html.
# url = 'http://hackage.haskell.org/packages/recent/revisions'
maybe @juhp would know (but that's something to look at later)
Just adding .rss seems to do it: http://hackage.haskell.org/packages/recent.rss
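For illustration, here is a minimal sketch (not part of this PR) of how a Hackage check_feed could consume that RSS feed. It assumes standard RSS <item><title> entries of the form "name-version", which appears to be how hackage formats them:

import xml.etree.ElementTree as ET

import requests


def check_feed():
    ''' Yield (name, version) pairs for recent uploads to hackage. '''
    url = 'http://hackage.haskell.org/packages/recent.rss'
    response = requests.get(url, timeout=30)
    root = ET.fromstring(response.content)
    for item in root.iterfind('./channel/item'):
        title = item.findtext('title', '').strip()
        # Titles look like "aeson-1.5.6.0"; split name from version
        # on the last hyphen.
        name, _, version = title.rpartition('-')
        if name and version:
            yield name, version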
👍 for me as well, nice change! :)
@classmethod
def check_feed(cls):
    ''' Return a generator over the latest 50 uploads to rubygems.org
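To make the shape of such a generator concrete, here is a minimal standalone sketch. The endpoint and its name/version fields are assumptions based on rubygems.org's public "just updated" API; the actual code in this PR may differ:

import requests


def check_feed():
    ''' Yield (name, version) pairs for the latest uploads to rubygems.org.

    Sketch only: endpoint and field names are assumed, not necessarily
    what this PR uses.
    '''
    url = 'https://rubygems.org/api/v1/activity/just_updated.json'
    data = requests.get(url, timeout=30).json()
    for item in data:
        yield item['name'], item['version']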
Will this be checked often enough to be sure that we don't miss anything? Should I open an upstream RFE asking them to implement some dynamic filter, e.g. "give me all updates since gem foo v1.2.3 was released" or, alternatively, "give me all projects since some timestamp"?
@ralphbean started to answer it at: #300 (comment)
But an upstream RFE would be cool nonetheless, as it would reduce the amount of data we ask them for.
Opened RFE: rubygems/rubygems.org#1266
Thanks! :)
The best case we could get would be to have all of the upstream forges support webhooks, so they could ping us every time someone uploads a new release of anything.
I looked at all of them last night, and I think rubygems.org is the only one that can do this. If more of them supported it, it would be worth adding support for it to anitya, I think. Then we could have real real-time.
Maybe something to ping @dstufft about for next-gen pypi :)
@ralphbean yeah, that would be really awesome.
Thanks all. Merging this.
Very nice!
    by querying a weird JSON endpoint.
    '''

    url = 'https://registry.npmjs.org/-/all/static/today.json'
You can even connect directly to their database and get a realtime feed: curl -vL "https://skimdb.npmjs.com/registry/_changes?descending=true&include_docs=true&feed=continuous"
Say whaaaaat? That's awesome.
It's just that it's not super-stable. It can time out, return invalid responses, hang...
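For reference, a rough sketch (assumed, not part of this PR) of consuming that _changes feed from Python with requests' streaming support. Given the stability caveats above, real code would want retries on top of the read timeout:

import json

import requests

url = 'https://skimdb.npmjs.com/registry/_changes'
params = {'descending': 'true', 'include_docs': 'true', 'feed': 'continuous'}
# stream=True keeps the connection open; the (connect, read) timeout
# guards against hangs.
response = requests.get(url, params=params, stream=True, timeout=(10, 60))
for line in response.iter_lines():
    if not line:
        continue  # skip keep-alive blank lines
    change = json.loads(line)
    doc = change.get('doc') or {}
    # For npm packages, 'dist-tags'/'latest' holds the newest version.
    print(doc.get('name'), (doc.get('dist-tags') or {}).get('latest'))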
Ralph, this is an awesome piece of work!
This is a rewrite of #239.
Without this PR, we run the cronjob twice a day (once every 12 hours), which means we don't find out about some new upstream releases for a "very long time" (up to ~12 hours). We run it that infrequently because the cronjob checks for a new upstream release of every project we know about, and that takes a very long time to run.
This new mode for the cronjob checks only the projects that are listed in the RSS feeds and APIs of public indexes like pypi, rubygems, cpan, etc. It runs very quickly in comparison. I bet we could run it as a cronjob once every 5 minutes without much worry (we should wrap it in a lock script, though, just to make sure runs don't pile up; see the sketch below). I'd like to get as close to realtime as possible.
In addition to scanning for new upstream releases, if the cronjob encounters a new package in an RSS feed or API, it also adds that package to anitya's database of projects (so the database will now grow over time, automatically).
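As a sketch of the locking idea mentioned above (the script name, lock path, and CLI invocation are all hypothetical), the 5-minute cronjob could take an exclusive, non-blocking file lock and simply exit if the previous run is still going:

#!/usr/bin/env python
# Hypothetical wrapper, e.g. scheduled as:
#   */5 * * * *  /usr/local/bin/check-feed-locked.py
import fcntl
import subprocess
import sys

LOCKFILE = '/var/lock/anitya-check-feed.lock'  # assumed path

with open(LOCKFILE, 'w') as lock:
    try:
        # Non-blocking exclusive lock: if another run holds it, bail out.
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        sys.exit(0)  # a previous run is still in progress
    # Stand-in for however the check-feed mode is actually invoked.
    subprocess.check_call(['anitya-cron', '--check-feed'])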