
Improvements for Queue#processStalledJobs #311

Closed

Conversation

xdc0
Contributor

@xdc0 xdc0 commented Jun 21, 2016

This PR makes Queue#processStalledJobs more flexible by implementing the following 2 items:

  1. Queue now takes a queueOptions hash that, currently, has only one recognized option: processStalledJobs. When set to false, the queue won't automatically process stalled jobs.
  2. Queue#processStalledJobs itself accepts an optional parameter, limit, which caps how many jobs are processed in a single call.
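Default handling for the new queueOptions hash could look roughly like this (a minimal sketch; parseQueueOptions is a hypothetical helper for illustration, not code from this PR):

```javascript
// Sketch: defaulting logic for the proposed queueOptions hash.
// processStalledJobs defaults to true so existing behavior is unchanged;
// passing { processStalledJobs: false } opts out of automatic processing.
function parseQueueOptions(queueOptions) {
  queueOptions = queueOptions || {};
  return {
    processStalledJobs: queueOptions.processStalledJobs !== false
  };
}

console.log(parseQueueOptions().processStalledJobs);                              // true
console.log(parseQueueOptions({ processStalledJobs: false }).processStalledJobs); // false
```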

```diff
@@ -47,7 +47,7 @@ var LOCK_RENEW_TIME = 5000; // 5 seconds is the renew time.
 var CLIENT_CLOSE_TIMEOUT_MS = 5000;
 var POLLING_INTERVAL = 5000;
 
-var Queue = function Queue(name, redisPort, redisHost, redisOptions){
+var Queue = function Queue(name, redisPort, redisHost, redisOptions, queueOptions){
```
Contributor

needs readme update?

Contributor Author

Bull's documentation is currently outdated in many regards. I'm addressing this in #309 by documenting everything in jsdoc format; if documentation lives closer to the code, it's much more likely to stay up to date. There are many helpers that generate nice-looking documentation; I particularly liked this one: https://camo.githubusercontent.com/724b9224844b6b4f2cd19b3bce8d25015fa54cfa/687474703a2f2f7075752e73682f674f794e652f363663336164636239372e706e67

In any case, yes, this needs a readme update

@manast
Member

manast commented Jun 21, 2016

thanks for the PR. I am still not convinced that adding this extra complexity is needed.
Could you elaborate on what problem we are trying to solve with this?

@xdc0
Contributor Author

xdc0 commented Jun 21, 2016

The processStalledJobs functionality is opinionated: it assumes it is always fine to restart a job, which is certainly not what you want in several use cases, especially those where the job handler has side effects that you don't want to accidentally repeat multiple times.

The limit option is useful when you do not want to clog your application with stalled-job processing: if there are hundreds of thousands of jobs to retry, a single call to processStalledJobs will otherwise handle all of them in one go.

@manast
Member

manast commented Jun 21, 2016

@chuym ok. but if you disable the processing of stalled jobs, the active queue will grow over time and you will not get any notification or explanation of why those jobs have not completed.
Regarding the limit parameter: even if processStalledJobs tries to process all the stalled jobs, it does so sequentially, so I don't think it will make any noticeable difference...

@manast
Member

manast commented Jun 21, 2016

If stalled jobs cannot be re-executed, I think it makes more sense to fail them with the reason "job stalled".
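That suggestion could be sketched like this (the plain job object and the failStalledJob helper are illustrative stand-ins, not Bull internals):

```javascript
// Sketch: instead of silently re-running a stalled job, fail it with an
// explicit reason so the user gets an explanation of what happened.
function failStalledJob(job) {
  job.state = 'failed';
  job.failedReason = 'job stalled';
  return job;
}

var job = failStalledJob({ id: 42, state: 'active' });
console.log(job.state + ': ' + job.failedReason); // failed: job stalled
```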

```diff
@@ -423,7 +430,10 @@ Queue.prototype.run = function(concurrency){
 var promises = [];
 var _this = this;
 
-return this.processStalledJobs().then(function(){
+// In case of connection loss, running `processStalledJobs` will repick jobs here.
+var start = this.opts.processStalledJobs ? this.processStalledJobs() : Promise.resolve();
```
Contributor

does this still need to run when the queue starts, instead of just letting it start after _this.LOCK_RENEW_TIME?

Contributor Author

Bull is expected to process them immediately on reconnection. Not calling it here results in many failures across the tests.

@xdc0
Contributor Author

xdc0 commented Jun 21, 2016

the active queue will grow overtime and you will not get any notification or explanation why these jobs have not completed.

Good point. Yet we do not have a periodic Queue#clean either, so completed jobs will also grow over time without user interaction, although in that case with a straightforward explanation.
Similarly to how users implement Queue#clean loops, users can implement their own Queue#processStalledJobs loop at their convenience.

An unbounded processStalledJobs is fine for low volume and a low number of parallel queues, but if you have thousands of jobs per second with many parallel queues, you can imagine how it may grow out of control, pulling thousands of repeated jobs many times while only a subset of those jobs are actually processed.
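The bound that a limit parameter puts on each call can be sketched like this (jobs are mocked as an in-memory array; the real implementation reads the active list from Redis):

```javascript
// Sketch: process at most `limit` stalled jobs per call instead of the
// entire backlog. Omitting `limit` keeps the current unbounded behavior.
function processStalledJobs(stalledJobs, limit) {
  var batch = typeof limit === 'number' ? stalledJobs.slice(0, limit) : stalledJobs.slice();
  return batch.map(function (job) {
    job.reprocessed = true; // stand-in for re-running the job handler
    return job;
  });
}

var jobs = [{ id: 1 }, { id: 2 }, { id: 3 }];
console.log(processStalledJobs(jobs, 2).length); // 2 (only `limit` jobs per call)
console.log(processStalledJobs(jobs).length);    // 3 (unbounded, as today)
```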

In fact, I think this can be further improved by moving this lrange to a script and calling the same function as in getStalledJobs, thus avoiding fetching jobs that might have been picked up by a parallel processStalledJobs call.

@manast
Member

manast commented Jun 21, 2016

Having an optimized getNextStalledJob should be better than an extra limit parameter that requires fine-tuning by the user; I can work on that.
Regarding the active queue growing over time: that is both more inconsistent behaviour than the completed/failed sets and more difficult to handle. If stalled jobs cannot be restarted, which is the reason for not having auto-process-stalled-jobs, then it will not matter whether the user does it manually or not. I think the proper way to solve this is to move the stalled jobs to the failed set...
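A one-job-at-a-time getNextStalledJob might look like this sketch (in-memory mock; the real version would need an atomic Redis script so parallel workers cannot claim the same job):

```javascript
// Sketch: claim a single stalled job from the active list. A job whose
// lock was not renewed in time is considered stalled.
function getNextStalledJob(activeList) {
  for (var i = 0; i < activeList.length; i++) {
    if (!activeList[i].lockRenewed) {
      return activeList.splice(i, 1)[0]; // remove and claim it (mocked atomicity)
    }
  }
  return null; // nothing stalled right now
}

var active = [{ id: 1, lockRenewed: true }, { id: 2, lockRenewed: false }];
console.log(getNextStalledJob(active).id); // 2
console.log(active.length);                // 1 (the healthy job stays active)
```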

@xdc0
Contributor Author

xdc0 commented Jun 21, 2016

My only reservation is the potential flakiness of processStalledJobs, given that Bull considers a job stalled if it fails to renew its lock on time, which is not a completely accurate way to tell whether the job handler in fact stalled.

I think processStalledJobs is currently dangerous, some reasons:

  1. If a lock fails to renew, processJob won't stop processing, and a second worker may work on the same job in parallel. This has major implications when users expect jobs to run exactly once.
  2. Moving jobs to completed is not safe, because a job can be moved to completed multiple times - while completed jobs are kept in a set, "returnvalue" is overwritten.
  3. If you have many thousands of jobs with high throughput, processStalledJobs will clog the queue.

Item 1 above is not addressable: there isn't a safe way to tell whether a handler effectively stalled. This is a limitation of JavaScript, and I don't think processStalledJobs will ever be 100% reliable; I would love to be wrong, though, since it is a nice feature.
Item 2 is an issue that can be fixed separately.
Item 3 is addressed by this PR by giving users control over the behavior of processStalledJobs.

I understand how useful processStalledJobs is, but users should have control over its behavior since it may not fit every use case.

@xdc0 force-pushed the chuy/parameters-for-stalled-jobs branch from 1d471ec to e3051f8 on June 22, 2016
@xdc0
Contributor Author

xdc0 commented Jun 22, 2016

Hey @manast, so what is your take on this? Should this be dropped in favor of improving and optimizing processStalledJobs?

@manast
Member

manast commented Jun 23, 2016

I would like to discuss this feature a bit more so that we are in sync on the best approach. There are a few things I would like to decide:

  • Have a configurable TTL for jobs. Jobs should fail if they take more time than the defined TTL.
  • When a worker that is running a job dies, should the job be restarted automatically by another worker or not? If not, it should be moved to the failed set. If yes, the retry could be implemented using the retry-failed jobs functionality we already have in place.
  • New arguments on the Queue constructor bloat the API. I would like instead to have one options argument where options for redis, the queue, etc. are well defined and organised.
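A hypothetical shape for that single options argument (all key names here are illustrative, not Bull's actual API):

```javascript
// Sketch: one well-organised options argument instead of growing the
// positional parameter list. Every key below is hypothetical.
var options = {
  redis: { port: 6379, host: '127.0.0.1' },
  queue: {
    processStalledJobs: false, // disable automatic stalled-job processing
    jobTTL: 30000              // fail jobs running longer than 30 seconds
  }
};

console.log(Object.keys(options)); // [ 'redis', 'queue' ]
```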

@manast
Member

manast commented Jun 23, 2016

btw, moving stalled/stuck jobs back to the wait queue as part of the retry mechanism would not only be an elegant approach but also better performance-wise when you have a lot of elements in the active list in a large setup.

@bradvogel
Contributor

Implemented in #359

@manast manast closed this Nov 9, 2016