
Update projects on per-backend base #239

Closed
Conversation

MichaelMraka

This request adds an update_projects() method to the backend classes. The method updates projects in a much faster and more scalable way by calling the upstream server's (pypi.python.org, cpan.org, etc.) API or fetching its RSS feed of updated projects. So instead of polling all projects (many thousands) on every run, it polls only the projects that have reported modifications.

It also automatically adds new projects found in the feed.

There is a replacement for the current anitya_cron.py - an anitya_cron_backends.py script - which uses the backends' update_projects() method.

This is very useful, for example, for the automatic rebuilding of upstream packages as RPMs (in COPR), which I'm testing.
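To illustrate the idea, here is a hypothetical sketch of what a backend-specific override might look like. Only the feed URL is real; BaseBackend is the existing base class, while parse_feed_entries, get_or_create_project, and update_project_version are made-up helper names, not code from this pull request:

```python
import requests

class PypiBackend(BaseBackend):
    name = 'PyPI'

    @classmethod
    def update_projects(cls, session):
        # Ask upstream only for recently changed projects instead of
        # polling every registered project.
        feed = requests.get('https://pypi.org/rss/updates.xml', timeout=30)
        for name in parse_feed_entries(feed.text):       # hypothetical helper
            project = get_or_create_project(session, name, backend=cls.name)
            update_project_version(session, project)     # hypothetical helper
```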

session = anitya.app.SESSION
# Select the projects this backend should poll; backends may narrow this
# down to recently-updated projects only.
projects = self.list_recent_projects(session)
anitya.LOG.info(projects)
# Pool size comes from the CRON_POOL config option, defaulting to 10.
p = multiprocessing.Pool(anitya.app.APP.config.get('CRON_POOL', 10))
Contributor

Any thoughts on using multiprocessing.pool.ThreadPool here instead? Using threads in Python is no good when the work is CPU-bound -- the GIL kills performance. However, this is mostly I/O-bound, so we should be okay (fingers crossed).
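For reference, a minimal sketch of the suggested swap -- multiprocessing.pool.ThreadPool exposes the same interface as multiprocessing.Pool; update_project and project_ids are assumed names for illustration:

```python
from multiprocessing.pool import ThreadPool

import anitya.app

# Same map()/apply_async() interface as multiprocessing.Pool, but work
# items run in threads inside one process. The GIL is released while
# waiting on network I/O, so this is cheap for HTTP polling.
pool = ThreadPool(anitya.app.APP.config.get('CRON_POOL', 10))
results = pool.map(update_project, project_ids)
pool.close()
pool.join()
```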

Contributor

With a ThreadPool (instead of a multiprocessing Pool) you might be able to pass the session object into update_project and thereby avoid re-initializing it for every work item.
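A sketch of that idea, assuming the same hypothetical update_project worker and binding the shared session with functools.partial. One caveat worth noting: a plain SQLAlchemy session is not thread-safe, so a scoped_session (one session per thread) would be safer in practice:

```python
import functools
from multiprocessing.pool import ThreadPool

import anitya.app

# Threads share the parent's memory, so one session object can be bound
# to every work item instead of being re-created per worker process.
session = anitya.app.SESSION
pool = ThreadPool(10)
pool.map(functools.partial(update_project, session), project_ids)
```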

Author

Well, I simply reused the code from the current anitya_cron.py here. But yes, I do see cron job failures from time to time, so a different parallel execution module might be a better fit. I'll give it a try.

Contributor

Any luck with ThreadPool?

xsuchy

ThreadPool is an obsolete module and should not be used for new projects. @MichaelMraka, if multiprocessing causes you problems, you may try asyncio: https://docs.python.org/3/library/asyncio.html
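For completeness, a minimal asyncio sketch of concurrent polling. The aiohttp dependency is a third-party assumption (asyncio itself has no HTTP client), and feed_urls is an assumed list of upstream feed URLs:

```python
import asyncio

import aiohttp  # third-party; assumed available for this sketch

async def fetch(session, url):
    # While this request waits on the network, the event loop runs others.
    async with session.get(url) as resp:
        return await resp.text()

async def poll_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

feeds = asyncio.get_event_loop().run_until_complete(poll_all(feed_urls))
```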

Contributor

@xsuchy, can you point to some docs on the deprecation? I don't see any mention of it in the multiprocessing.pool.ThreadPool docstring.

xsuchy

@ralphbean https://pypi.python.org/pypi/threadpool -- but that is something different from multiprocessing.pool.ThreadPool, which is, on the other hand, barely documented.

@ralphbean
Contributor

Implementation aside for a moment, we currently run the full-scan cronjob about twice a day.

With something like this, we could run it much more frequently -- say, every hour or even more often? We could then still run the full-scan cronjob twice a day to catch anything that might have fallen through the cracks.

@MichaelMraka
Author

Thanks, Ralph, for the comments.

With something like this, we could run it much more frequently -- say, every hour or even more often? We could then still run the full-scan cronjob twice a day to catch anything that might have fallen through the cracks.

Exactly. Moreover, I have a testing instance with more than 15k packages (automatically added PyPI modules updated in the last 3 months): the current full-scan cron job runs for 25 minutes on it, while the new API scan finishes in 2-3 minutes.
So a simple linear extrapolation says that at around 400k projects, a full scan would take about 11 hours :).
Well, 400k might seem like an insane number now, but with automatic registration of new projects it could be reached in a couple of months (there are 70k modules on PyPI, 150k on CPAN, 200k on npmjs, ...).
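The extrapolation, spelled out (assuming the scan time scales linearly with the number of projects):

```python
# 25-minute full scan of 15k projects on the test instance.
minutes_per_project = 25.0 / 15000
print(400000 * minutes_per_project / 60)   # ~11.1 hours for 400k projects
```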

@@ -265,6 +266,30 @@ def call_url(self, url, insecure=False):

        return requests.get(url, headers=headers, verify=not insecure)

    @classmethod
    def list_recent_projects(self, session):
        return anitya.lib.model.Project.by_backend(session, self.name)
Contributor

Can you write an inline comment here explaining how child classes of the BaseBackend can override this?

Can you add to that comment some description of the default functionality -- i.e., what happens for the child classes that do not override this?
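For illustration, a hypothetical version of the requested comment, written as a docstring (the wording here is a sketch, not the author's):

```python
@classmethod
def list_recent_projects(cls, session):
    """Return the projects this backend will poll on the next cron run.

    Default behaviour: return every project registered for this backend,
    which reproduces the old full-scan behaviour. Child classes may
    override this to query the upstream server's "recently updated" feed
    and return only the projects that reported changes.
    """
    return anitya.lib.model.Project.by_backend(session, cls.name)
```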

@ralphbean
Contributor

Thinking about this some more -- this change is pretty major. It doesn't actually change a lot of the existing code, but it does add a whole new mode that the Backend classes will need to adopt over time.

Can you add a narrative blurb (one or two paragraphs) to the docs (maybe the README?) describing the two different modes and what methods the backends need to implement in order to take advantage of them?

@ralphbean
Contributor

Any updates here @MichaelMraka?

@MichaelMraka
Author

Unfortunately, I've had no luck finding a better solution than multiprocessing (one that would have eliminated all the cron job failures).
