
Update projects on per-backend base #239

Closed
Conversation

MichaelMraka

This request adds an update_projects() method to the backend classes. The method updates projects in a much faster and more scalable way by calling the upstream server's (pypi.python.org, cpan.org, etc.) API or fetching its RSS feed of updated projects. So instead of polling all projects (many thousands) on every run, it polls only the projects that have reported modifications.

It also automatically adds new projects found in the feed.

There is a replacement for the current anitya_cron.py - an anitya_cron_backends.py script - which uses the backends' update_projects() method.

This is very useful, for example, for the automatic rebuilding of upstream packages as RPMs (in COPR), which I'm testing.
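To illustrate the idea, here is a hypothetical sketch of what a backend-specific override might look like. Only the feed URL is real; BaseBackend is the existing base class, while parse_feed_entries, get_or_create_project, and update_project_version are made-up helper names, not code from this pull request:

```python
import requests

class PypiBackend(BaseBackend):
    name = 'PyPI'

    @classmethod
    def update_projects(cls, session):
        # Ask upstream only for recently changed projects instead of
        # polling every registered project.
        feed = requests.get('https://pypi.org/rss/updates.xml', timeout=30)
        for name in parse_feed_entries(feed.text):       # hypothetical helper
            project = get_or_create_project(session, name, backend=cls.name)
            update_project_version(session, project)     # hypothetical helper
```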

session = anitya.app.SESSION
# Select the projects this backend should poll; backends may narrow this
# down to recently-updated projects only.
projects = self.list_recent_projects(session)
anitya.LOG.info(projects)
# Pool size comes from the CRON_POOL config option, defaulting to 10.
p = multiprocessing.Pool(anitya.app.APP.config.get('CRON_POOL', 10))
Contributor

Any thoughts on using multiprocessing.pool.ThreadPool here instead? Using threads in Python is no good when the work is CPU-bound -- the GIL kills performance. However, this is mostly I/O-bound, so we should be okay (fingers crossed).
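For reference, a minimal sketch of the suggested swap -- multiprocessing.pool.ThreadPool exposes the same interface as multiprocessing.Pool; update_project and project_ids are assumed names for illustration:

```python
from multiprocessing.pool import ThreadPool

import anitya.app

# Same map()/apply_async() interface as multiprocessing.Pool, but work
# items run in threads inside one process. The GIL is released while
# waiting on network I/O, so this is cheap for HTTP polling.
pool = ThreadPool(anitya.app.APP.config.get('CRON_POOL', 10))
results = pool.map(update_project, project_ids)
pool.close()
pool.join()
```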

Contributor

With a ThreadPool (instead of a multiprocessing Pool) you might be able to pass the session object into update_project and thereby avoid re-initializing it for every work item.
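A sketch of that idea, assuming the same hypothetical update_project worker and binding the shared session with functools.partial. One caveat worth noting: a plain SQLAlchemy session is not thread-safe, so a scoped_session (one session per thread) would be safer in practice:

```python
import functools
from multiprocessing.pool import ThreadPool

import anitya.app

# Threads share the parent's memory, so one session object can be bound
# to every work item instead of being re-created per worker process.
session = anitya.app.SESSION
pool = ThreadPool(10)
pool.map(functools.partial(update_project, session), project_ids)
```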

Author

Well, I simply reused the code from the current anitya_cron.py here. But yes, I do see cron job failures from time to time, so a different parallel execution module might be a better fit. I'll give it a try.

Contributor

Any luck with ThreadPool?

xsuchy

ThreadPool is an obsolete module and should not be used for new projects. @MichaelMraka, if multiprocessing causes you problems, you may try asyncio: https://docs.python.org/3/library/asyncio.html
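For completeness, a minimal asyncio sketch of concurrent polling. The aiohttp dependency is a third-party assumption (asyncio itself has no HTTP client), and feed_urls is an assumed list of upstream feed URLs:

```python
import asyncio

import aiohttp  # third-party; assumed available for this sketch

async def fetch(session, url):
    # While this request waits on the network, the event loop runs others.
    async with session.get(url) as resp:
        return await resp.text()

async def poll_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

feeds = asyncio.get_event_loop().run_until_complete(poll_all(feed_urls))
```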

Contributor

@xsuchy, can you point to some docs on the deprecation? I don't see any mention of it in the multiprocessing.pool.ThreadPool docstring.

xsuchy

@ralphbean https://pypi.python.org/pypi/threadpool -- but that is something different from multiprocessing.pool.ThreadPool, which is, on the other hand, barely documented.

@ralphbean
Contributor

Implementation aside for a moment, we currently run the full-scan cronjob about twice a day.

With something like this, we could run it much more frequently -- say, every hour or even more often? We could then still run the full-scan cronjob twice a day to catch anything that might have fallen through the cracks.

@MichaelMraka
Author

Thanks, Ralph, for the comments.

With something like this, we could run it much more frequently -- say, every hour or even more often? We could then still run the full-scan cronjob twice a day to catch anything that might have fallen through the cracks.

Exactly. Moreover, I have a testing instance with more than 15k packages (automatically added PyPI modules updated in the last 3 months): the current full-scan cron job runs for 25 minutes on it, while the new API scan finishes in 2-3 minutes.
So a simple linear extrapolation says that at around 400k projects, a full scan would take about 11 hours :).
Well, 400k might seem like an insane number now, but with automatic registration of new projects it could be reached in a couple of months (there are 70k modules on PyPI, 150k on CPAN, 200k on npmjs, ...).
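The extrapolation, spelled out (assuming the scan time scales linearly with the number of projects):

```python
# 25-minute full scan of 15k projects on the test instance.
minutes_per_project = 25.0 / 15000
print(400000 * minutes_per_project / 60)   # ~11.1 hours for 400k projects
```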

@@ -265,6 +266,30 @@ def call_url(self, url, insecure=False):

        return requests.get(url, headers=headers, verify=not insecure)

    @classmethod
    def list_recent_projects(self, session):
        return anitya.lib.model.Project.by_backend(session, self.name)
Contributor

Can you write an inline comment here explaining how child classes of the BaseBackend can override this?

Can you add to that comment some description of the default functionality -- i.e., what happens for the child classes that do not override this?
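For illustration, a hypothetical version of the requested comment, written as a docstring (the wording here is a sketch, not the author's):

```python
@classmethod
def list_recent_projects(cls, session):
    """Return the projects this backend will poll on the next cron run.

    Default behaviour: return every project registered for this backend,
    which reproduces the old full-scan behaviour. Child classes may
    override this to query the upstream server's "recently updated" feed
    and return only the projects that reported changes.
    """
    return anitya.lib.model.Project.by_backend(session, cls.name)
```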

@ralphbean
Contributor

Thinking about this some more -- this change is pretty major. It doesn't actually change a lot of the existing code, but it does add a whole new mode that the Backend classes will need to adopt over time.

Can you add a narrative blurb (one or two paragraphs) to the docs (maybe the README?) describing the two different modes and what methods the backends need to implement in order to take advantage of them?

@ralphbean
Contributor

Any updates here @MichaelMraka?

@MichaelMraka
Author

Unfortunately, I've had no luck finding a better solution than multiprocessing (one that would have eliminated all the cron job failures).
