
Improvements for Queue#processStalledJobs #311

Closed

Conversation

xdc0
Contributor

@xdc0 xdc0 commented Jun 21, 2016

This PR makes Queue#processStalledJobs more flexible by implementing the following 2 items:

  1. Queue now takes a queueOptions hash that, currently, has only one recognized option: processStalledJobs. When set to false, the queue won't automatically process stalled jobs.
  2. Queue#processStalledJobs itself accepts an optional parameter, limit, which caps how many jobs are processed in a single call.
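Default handling for the new queueOptions hash could look roughly like this (a minimal sketch; parseQueueOptions is a hypothetical helper for illustration, not code from this PR):

```javascript
// Sketch: defaulting logic for the proposed queueOptions hash.
// processStalledJobs defaults to true so existing behavior is unchanged;
// passing { processStalledJobs: false } opts out of automatic processing.
function parseQueueOptions(queueOptions) {
  queueOptions = queueOptions || {};
  return {
    processStalledJobs: queueOptions.processStalledJobs !== false
  };
}

console.log(parseQueueOptions().processStalledJobs);                              // true
console.log(parseQueueOptions({ processStalledJobs: false }).processStalledJobs); // false
```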

```diff
@@ -47,7 +47,7 @@ var LOCK_RENEW_TIME = 5000; // 5 seconds is the renew time.
 var CLIENT_CLOSE_TIMEOUT_MS = 5000;
 var POLLING_INTERVAL = 5000;
 
-var Queue = function Queue(name, redisPort, redisHost, redisOptions){
+var Queue = function Queue(name, redisPort, redisHost, redisOptions, queueOptions){
```
Contributor

needs readme update?

Contributor Author

Bull's documentation is currently outdated in many regards. I'm addressing this in #309 by documenting everything in jsdoc format; if documentation lives closer to the code, it's much more likely to stay up to date. There are many helpers that generate nice-looking documentation; I particularly liked this one: https://camo.githubusercontent.com/724b9224844b6b4f2cd19b3bce8d25015fa54cfa/687474703a2f2f7075752e73682f674f794e652f363663336164636239372e706e67

In any case, yes, this needs a readme update

@manast
Member

manast commented Jun 21, 2016

thanks for the PR. I am still not convinced that adding this extra complexity is needed.
Could you elaborate on what problem we are trying to solve with this?

@xdc0
Contributor Author

xdc0 commented Jun 21, 2016

The processStalledJobs functionality is opinionated: it assumes it is always fine to restart a job, which is certainly not what you want in several use cases, especially those where the job handler has side effects that you don't want to accidentally repeat multiple times.

The limit option is useful when you do not want to clog your application with stalled-job processing: if there are hundreds of thousands of jobs to retry, a single call to processStalledJobs will otherwise handle all of them in one go.

@manast
Member

manast commented Jun 21, 2016

@chuym ok. but if you disable the processing of stalled jobs, the active queue will grow over time and you will not get any notification or explanation of why those jobs have not completed.
Regarding the limit parameter: even if processStalledJobs tries to process all the stalled jobs, it does so sequentially, so I don't think it will make any noticeable difference...

@manast
Member

manast commented Jun 21, 2016

If stalled jobs cannot be re-executed, I think it makes more sense to fail them with the reason "job stalled".
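That suggestion could be sketched like this (the plain job object and the failStalledJob helper are illustrative stand-ins, not Bull internals):

```javascript
// Sketch: instead of silently re-running a stalled job, fail it with an
// explicit reason so the user gets an explanation of what happened.
function failStalledJob(job) {
  job.state = 'failed';
  job.failedReason = 'job stalled';
  return job;
}

var job = failStalledJob({ id: 42, state: 'active' });
console.log(job.state + ': ' + job.failedReason); // failed: job stalled
```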

```diff
@@ -423,7 +430,10 @@ Queue.prototype.run = function(concurrency){
 var promises = [];
 var _this = this;
 
-return this.processStalledJobs().then(function(){
+// In case of connection loss, running `processStalledJobs` will repick jobs here.
+var start = this.opts.processStalledJobs ? this.processStalledJobs() : Promise.resolve();
```
Contributor

does this still need to run when the queue starts, instead of just letting it start after _this.LOCK_RENEW_TIME?

Contributor Author

Bull is expected to process them immediately on reconnection. Not calling it here results in many failures across the tests.

@xdc0
Contributor Author

xdc0 commented Jun 21, 2016

the active queue will grow overtime and you will not get any notification or explanation why these jobs have not completed.

Good point. Yet we do not have a periodic Queue#clean either, so completed jobs will also grow over time without user interaction, although in that case with a straightforward explanation.
Similarly to how users implement Queue#clean loops, users can implement their own Queue#processStalledJobs loop at their convenience.

An unbounded processStalledJobs is fine for low volume and a low number of parallel queues, but if you have thousands of jobs per second with many parallel queues, you can imagine how it may grow out of control, pulling thousands of repeated jobs many times while only a subset of those jobs are actually processed.
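The bound that a limit parameter puts on each call can be sketched like this (jobs are mocked as an in-memory array; the real implementation reads the active list from Redis):

```javascript
// Sketch: process at most `limit` stalled jobs per call instead of the
// entire backlog. Omitting `limit` keeps the current unbounded behavior.
function processStalledJobs(stalledJobs, limit) {
  var batch = typeof limit === 'number' ? stalledJobs.slice(0, limit) : stalledJobs.slice();
  return batch.map(function (job) {
    job.reprocessed = true; // stand-in for re-running the job handler
    return job;
  });
}

var jobs = [{ id: 1 }, { id: 2 }, { id: 3 }];
console.log(processStalledJobs(jobs, 2).length); // 2 (only `limit` jobs per call)
console.log(processStalledJobs(jobs).length);    // 3 (unbounded, as today)
```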

In fact, I think this can be further improved by moving this lrange to a script and calling the same function as in getStalledJobs, thus avoiding fetching jobs that might have been picked up by a parallel processStalledJobs call.

@manast
Member

manast commented Jun 21, 2016

Having an optimized getNextStalledJob should be better than an extra limit parameter that requires fine-tuning by the user; I can work on that.
Regarding the active queue growing over time: that is both more inconsistent behaviour than the completed/failed sets and more difficult to handle. If stalled jobs cannot be restarted, which is the reason for not having auto-process-stalled-jobs, then it will not matter whether the user does it manually or not. I think the proper way to solve this is to move the stalled jobs to the failed set...
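A one-job-at-a-time getNextStalledJob might look like this sketch (in-memory mock; the real version would need an atomic Redis script so parallel workers cannot claim the same job):

```javascript
// Sketch: claim a single stalled job from the active list. A job whose
// lock was not renewed in time is considered stalled.
function getNextStalledJob(activeList) {
  for (var i = 0; i < activeList.length; i++) {
    if (!activeList[i].lockRenewed) {
      return activeList.splice(i, 1)[0]; // remove and claim it (mocked atomicity)
    }
  }
  return null; // nothing stalled right now
}

var active = [{ id: 1, lockRenewed: true }, { id: 2, lockRenewed: false }];
console.log(getNextStalledJob(active).id); // 2
console.log(active.length);                // 1 (the healthy job stays active)
```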

@xdc0
Contributor Author

xdc0 commented Jun 21, 2016

My only reservation is the potential flakiness of processStalledJobs, given that Bull considers a job stalled if it fails to renew its lock on time, which is not a completely accurate way to tell whether the job handler in fact stalled.

I think processStalledJobs is currently dangerous, some reasons:

  1. If a lock fails to renew, processJob won't stop processing, and a second worker may work on the same job in parallel. This has major implications when users expect jobs to run exactly once.
  2. Moving jobs to completed is not safe, because a job can be moved to completed multiple times - while completed jobs are kept in a set, "returnvalue" is overwritten.
  3. If you have many thousands of jobs with high throughput, processStalledJobs will clog the queue.

Item 1 above is not addressable: there isn't a safe way to tell whether a handler effectively stalled. This is a limitation of JavaScript, and I don't think processStalledJobs will ever be 100% reliable; I would love to be wrong, though, since it is a nice feature.
Item 2 is an issue that can be fixed separately.
Item 3 is addressed by this PR by giving users control over the behavior of processStalledJobs.

I understand how useful processStalledJobs is, but users should have control over its behavior since it may not fit every use case.

@xdc0 force-pushed the chuy/parameters-for-stalled-jobs branch from 1d471ec to e3051f8 on June 22, 2016
@xdc0
Contributor Author

xdc0 commented Jun 22, 2016

Hey @manast, so what is your take on this? Should this be dropped in favor of improving and optimizing processStalledJobs?

@manast
Member

manast commented Jun 23, 2016

I would like to discuss this feature a bit more so that we are in sync on the best approach. There are a few things I would like to decide:

  • Have a configurable TTL for jobs. Jobs should fail if they take more time than the defined TTL.
  • When a worker that is running a job dies, should the job be restarted automatically by another worker or not? If not, it should be moved to the failed set. If yes, the retry could be implemented using the retry-failed jobs functionality we already have in place.
  • New arguments on the Queue constructor bloat the API. I would like instead to have one options argument where options for redis, the queue, etc. are well defined and organised.
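A hypothetical shape for that single options argument (all key names here are illustrative, not Bull's actual API):

```javascript
// Sketch: one well-organised options argument instead of growing the
// positional parameter list. Every key below is hypothetical.
var options = {
  redis: { port: 6379, host: '127.0.0.1' },
  queue: {
    processStalledJobs: false, // disable automatic stalled-job processing
    jobTTL: 30000              // fail jobs running longer than 30 seconds
  }
};

console.log(Object.keys(options)); // [ 'redis', 'queue' ]
```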

@manast
Member

manast commented Jun 23, 2016

btw, moving stalled/stuck jobs back to the wait queue as part of the retry mechanism would not only be an elegant approach but also better performance-wise when you have a lot of elements in the active list in a large setup.

@bradvogel
Contributor

Implemented in #359

@manast manast closed this Nov 9, 2016