Update projects on per-backend base #239
Conversation
session = anitya.app.SESSION
projects = self.list_recent_projects(session)
anitya.LOG.info(projects)
p = multiprocessing.Pool(anitya.app.APP.config.get('CRON_POOL', 10))
Any thoughts on using multiprocessing.pool.ThreadPool here instead? Using threads in Python is no good when the work done is CPU-bound -- the GIL kills performance. However, this is mostly an I/O-bound task, so we should be okay (fingers crossed).
With a thread pool (instead of a multiprocessing pool) you might be able to pass the session object into update_project and thereby avoid re-initializing it for every work item.
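As a hedged sketch of how the two suggestions above might combine (the helper name and the idea of binding the session with functools.partial are illustrative, not anitya's actual code):

```python
from functools import partial
from multiprocessing.pool import ThreadPool

def update_projects_threaded(session, projects, update_project, pool_size=10):
    """Run update_project(session, project) for each project on a thread pool.

    Threads share one process, so the session object can be passed in
    instead of being re-initialized per work item. Caveat: a plain
    SQLAlchemy Session is not thread-safe; real code would need a
    scoped_session or one session per thread.
    """
    pool = ThreadPool(pool_size)
    try:
        # partial binds the shared session; map fans the projects out
        # across the worker threads and preserves input order.
        return pool.map(partial(update_project, session), projects)
    finally:
        pool.close()
        pool.join()
```

Unlike multiprocessing.Pool, ThreadPool does not pickle its work items, which is what makes passing a live session object possible at all.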
Well, I simply reused the code from the current anitya_cron.py here. But yes, I do see cron job failures from time to time, so a different parallel execution module might suit better. I'll give it a try.
Any luck with ThreadPool?
ThreadPool is an obsolete module and should not be used for new projects. @MichaelMraka, if multiprocessing causes you problems, you may try asyncio: https://docs.python.org/3/library/asyncio.html
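A minimal sketch of what the asyncio route could look like (the coroutine and all names here are hypothetical; anitya's real update code is synchronous and would need rewriting or wrapping to benefit):

```python
import asyncio

async def update_project(name):
    # Stand-in for an HTTP request to the upstream API or RSS feed.
    await asyncio.sleep(0)
    return name

async def update_all(names, limit=10):
    # A semaphore caps concurrency, playing the role of the CRON_POOL
    # pool-size setting from the snippet above.
    sem = asyncio.Semaphore(limit)

    async def guarded(name):
        async with sem:
            return await update_project(name)

    # gather preserves the input order of results.
    return await asyncio.gather(*(guarded(n) for n in names))

results = asyncio.run(update_all(["proj-a", "proj-b"]))
```

Since everything runs in one thread, the session-sharing concern from the thread-pool discussion goes away, at the cost of converting the I/O code to coroutines.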
@xsuchy, can you point to some docs on the deprecation? I don't see any mention of it in the multiprocessing.pool.ThreadPool docstring.
@ralphbean https://pypi.python.org/pypi/threadpool -- but that is something different from multiprocessing.pool.ThreadPool, which is, on the other hand, barely documented.
Implementation aside for a moment: we currently run the full-scan cron job about twice a day. With something like this, we could run it much more frequently -- say, every hour or even more often. We could then still run the full-scan cron job twice a day to catch anything else that might have fallen through the cracks.
Thanks Ralph for the comments.
Exactly. Moreover, I have a testing instance with more than 15k packages (automatically added PyPI modules updated in the last 3 months); the current full-scan cron job takes 25 minutes on it, while the new API scan finishes in 2-3 minutes.
@@ -265,6 +266,30 @@ def call_url(self, url, insecure=False):

        return requests.get(url, headers=headers, verify=not insecure)

    @classmethod
    def list_recent_projects(self, session):
        return anitya.lib.model.Project.by_backend(session, self.name)
Can you write an inline comment here explaining how child classes of BaseBackend can override this? Can you also add to that comment a description of the default behavior -- i.e., what happens for child classes that do not override it?
Thinking about this some more -- this change is pretty major. It doesn't actually change a lot of the existing code, but it does add a whole new mode that the Backend classes need to update themselves for over time. Can you add a narrative blurb (one or two paragraphs) to the docs (maybe the README?) describing the two different modes and which methods the backends need to implement in order to take advantage of them?
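To illustrate the two modes this comment asks to document, here is a toy sketch; the in-memory "session" (a list of dicts) and the PyPI feed stub are invented for the example and are not anitya's real model:

```python
class BaseBackend:
    name = "base"

    @classmethod
    def list_recent_projects(cls, session):
        # Default mode: return every project registered for this backend,
        # i.e. a full scan, for backends without an upstream change feed.
        return [p for p in session if p["backend"] == cls.name]

class PyPIBackend(BaseBackend):
    name = "pypi"

    @classmethod
    def list_recent_projects(cls, session):
        # Override mode: consult the upstream feed (stubbed here) and
        # keep only projects that reported changes since the last run.
        recently_changed = {"requests"}
        return [p for p in super().list_recent_projects(session)
                if p["name"] in recently_changed]
```

A backend that overrides the method thus shrinks the work list from "everything registered" to "what upstream says changed", while non-overriding backends silently keep full-scan behavior.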
Any updates here @MichaelMraka?
Unfortunately I had no luck finding a better solution for the multiprocessing part (one which would have eliminated all the cron job failures).
This request adds an update_projects() method to the backend classes. The method updates projects in a much faster and more scalable way by calling the upstream server's (pypi.python.org, cpan.org, etc.) API or reading its RSS feed of updated projects. So instead of polling all projects (many thousands) on every update, it polls only the projects that reported some modifications.
It also automatically adds new projects found in the feed.
There is a replacement for the current anitya_cron.py -- an anitya_cron_backends.py script -- which uses the backends' update_projects() method.
This is very useful e.g. for automatic rebuilds of upstream packages to RPM (in COPR), which I'm testing.
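The anitya_cron_backends.py driver described above could be roughly this shape (a sketch; the function name and the dummy backend are assumptions, only the update_projects() method name comes from the PR text):

```python
def run_backend_cron(backends, session):
    # Ask each backend for its recently modified projects and update
    # only those, instead of polling every registered project.
    updated = []
    for backend in backends:
        updated.extend(backend.update_projects(session))
    return updated
```

The full-scan anitya_cron.py could then remain as a twice-daily safety net, as suggested earlier in the thread, while this driver runs hourly.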