Rewrite the handling of process stalled jobs to be atomic so it doesn… #359
Conversation
…'t double-process jobs. See OptimalBits#356 for more on how this can happen. Additionally, this addresses a long-standing issue where a job can be considered "stalled" even though it was just moved to active and before a lock could be obtained by the worker that moved it (OptimalBits#258), by waiting a grace period (with a default of LOCK_RENEW_TIME) before considering a job as possibly stalled. This gives the (real) worker time to acquire its lock after moving it to active. Note that this includes a small API change: the 'stalled' event is now passed an array of jobs.
return Job.fromId(_this, jobId).then(_this.processStalledJob);
});
return scripts.processStalledJobs(this, grace).then(function(jobs){
_this.emit('stalled', jobs);
Maybe we should only emit this event if jobs.length > 0
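A minimal sketch of the suggested guard, using the names from the snippet above:

```js
// Sketch: only notify listeners when at least one stalled job was actually found.
return scripts.processStalledJobs(this, grace).then(function(jobs){
  if(jobs.length > 0){
    _this.emit('stalled', jobs);
  }
});
```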
' local jobKey = ARGV[2] .. job',
' if(redis.call("EXISTS", jobKey .. ":lock") == 0) then',
' local jobTS = redis.call("HGET", jobKey, "timestamp")',
' if(jobTS and jobTS < ARGV[1]) then',
I am not sure about this. Isn't timestamp the time when the job was inserted into the queue? Wouldn't the timestamp we are interested in be the one that represents the moment when the job was moved to active? (I see a chicken-and-egg problem here already, unless I am missing something...)
Yes, you're right. This was only a hack to cover the majority case: where jobs were just created and are about to be processed by the worker that moved them.
I did a brief code review and also some thinking about this issue. Firstly, I think we need to properly define what a "stalled" job is. For me it is a job that has started to be processed, but for some reason the worker did not manage to keep the lock until the job finally completed or failed.

So the question is: what does it matter whether the worker that moved the job to the active list is actually the one starting the job? If another worker was faster in starting the job from the active queue, then that job should not be considered stalled at all, since it has never been started. What we need to take care of is that, independently of who moved the job to the active list, the job is only processed once, and that it is re-processed if it gets stalled. We should probably rethink the way we process jobs and divide it into two separate parts: moving the job to active, and actually processing the job. It is just a happy "coincidence" that the worker that moves the job is often the one able to start it faster than anybody else, but if somebody else does it, that should also be fine.

When a job starts and the lock has been taken, the worker could mark the job as started (this could be done atomically when getting the lock). If in the future that job is picked up by another worker, because the job is in active and not locked anymore, then it can test whether the job has the "started" status flag on; if so, it knows the job was stalled and can emit the stalled event, which in this case will be completely right.
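As a sketch of the "mark as started atomically with the lock" idea, written in the same Lua-in-JavaScript string style as bull's other scripts. The "started" field name here is hypothetical (the diff below ends up using a lockAcquired field), and the key/argument layout is assumed:

```js
// Sketch only: take the lock and record that the job has been picked up at least
// once, in one atomic step. KEYS[1] = lock key, KEYS[2] = job hash key,
// ARGV[1] = lock token, ARGV[2] = lock duration in ms.
var takeLockAndMarkStarted = [
  'if(redis.call("SET", KEYS[1], ARGV[1], "PX", ARGV[2], "NX")) then',
  // A later pass over the active list can use this flag to tell a truly
  // stalled job from one that was merely moved to active and never started.
  '  redis.call("HSET", KEYS[2], "started", "1")',
  '  return 1',
  'end',
  'return 0'
].join('\n');
```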
Yes, I agree that the definition of 'stalled' should be "has been started once". So there are two types of jobs this function should be concerned with: (1) jobs that were just moved to 'active' but have not been started by a worker yet, and (2) jobs that were started but whose worker lost the lock before they completed.
I agree that we should only emit the 'stalled' event for type (2), and to distinguish between them I can add a 'started' flag to the job. But if I were to remove the grace period, then this will pick up many jobs that fit (1), and those jobs will just get bounced back to 'wait'.
I like the idea of moving back to wait. That means we have two different tasks: one (task A) is to process the jobs, the other (task B) is to check the active list for jobs that may have stalled and put them back in the wait queue.

We can assume that the normal case (let's say 99% of cases or more) is that a job is moved from wait to active and then starts being processed by that same worker. The remaining case is that task B finds a job in the active queue that is not locked; it should then move it to wait, in the hope that it will be processed later. It could be moved to the beginning of the queue, since that job has already been in active and it would be unfair to make it wait for the whole queue before it can be processed again. It could also happen that task B finds a job that is not locked but has its status === 'started'; in that case it also moves the job to wait, but additionally emits the stalled event.

The only remaining question I can think of right now is whether it is possible to end up in an endless loop where a job is moved from active to wait forever. If we have a counter that atomically counts every time the job is moved back to wait, then we could fail the job with a proper reason once that counter exceeds a maximum configurable value. Does this sound reasonable to you?
and there is the degenerate case where the worker that moved the job from wait to active did not manage to start it faster than task B, which would result in the job being moved back to wait for no reason. This degenerate case should happen seldom enough not to be a big problem. And maybe in this case we should not increase the counter, although there is a very small risk that we could end up in an endless loop.
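For illustration, a minimal sketch of the counter idea proposed above, in the same Lua-in-JavaScript style used elsewhere in this PR. The stalledCounter field and the max-count argument are hypothetical names, not part of the actual change:

```js
// Sketch only: count how often a job bounces from active back to wait, and fail
// it once a configurable limit is exceeded. KEYS[1] = job hash key,
// KEYS[2] = wait list key, ARGV[1] = max allowed count, ARGV[2] = job id.
var requeueOrFailStalledJob = [
  'local stalledCount = redis.call("HINCRBY", KEYS[1], "stalledCounter", 1)',
  'if(stalledCount > tonumber(ARGV[1])) then',
  // Exceeded the configurable limit: record a failure reason instead of requeuing.
  '  redis.call("HSET", KEYS[1], "failedReason", "job stalled more than allowable limit")',
  'else',
  // Push the job back to the wait list so it is picked up again soon.
  '  redis.call("RPUSH", KEYS[2], ARGV[2])',
  'end'
].join('\n');
```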
…job in the active queue on every cycle (looking at the oldest active jobs first).
Just pushed another proposed change. After thinking about it, you're right that it doesn't matter if the worker who processes the active job isn't the worker who moved it to 'active'. We just need to distinguish jobs that have previously been started, so the 'stalled' event is only emitted for those. So how about we make the task of "looking through the active list for an unlocked job - whether it's stalled or freshly moved to 'active'" part of the processJobs run loop? This should address the concerns above.
The downside to this approach is that it's much more expensive, since it iterates the active job list after processing every job. One way we can mitigate this is to only run it for the first call to processJobs.
I agree with most of what you propose. But I do not think we can run this stalled-checker in the processJobs run loop; I think it needs to be a separate loop that runs at a different interval. For example, it could run at the same interval as LOCK_RENEW_TIME (I do not think it makes sense to run it more often than that). This is both for performance and for the case where you have a queue that processes only sparse jobs: you could end up with stalled jobs that never get processed because no new jobs have been added to the queue. The problem with this approach is of course that you could end up with more parallel jobs being executed than you would like, but if we process the stalled jobs serially it should be no more than one extra concurrent job beyond what you have specified in your concurrency level.
"the case where you have a queue that does process sparse jobs" -> this is actually handled already because the So we're basically saying that for every job we process, AND on a minimum interval This current approach is also nice because it respects the desired concurrency setting, as it will only ever process one job (either regular or 'stalled') at one time per concurrency. If we're concerned about performance, we could have only the first call to |
ok. I mostly agree with everything you wrote. One thing, just to be 100% clear on the solution: in most cases a stalled job will be a real stalled job, i.e. a job that has started but stalled before completion, and in this case we should not process it, just move it back to the wait queue. Only the jobs that are picked up by the stalled-checker but are not really stalled should be executed directly. So basically the active queue is most of the time clean of stalled jobs, with only real active ones left. That also means that even if you call the stalled-checker only once per LOCK_RENEW_TIME it will be fine, since all the stalled jobs will be moved to the wait queue in one sweep, almost immediately. No risk of the active list growing ad infinitum.
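As a rough sketch of the separate checker loop discussed above (not bull's actual implementation; the moveUnlockedJobsToWait name follows the later review comments, and the wiring of Queue and LOCK_RENEW_TIME is assumed):

```js
// Sketch: run the stalled-job check on its own timer, at the same interval as
// LOCK_RENEW_TIME, independently of the regular job-processing loop.
Queue.prototype.startStalledJobsCheck = function(){
  var _this = this;
  this.stalledJobsInterval = setInterval(function(){
    _this.moveUnlockedJobsToWait().catch(function(err){
      // Label the log so it is clear which task produced it (see the review note below).
      console.error('Error while moving unlocked jobs to wait', err);
    });
  }, LOCK_RENEW_TIME);
};
```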
…ted' property needs to be written).
Why not just process the (real) stalled jobs directly in the run loop? The problem with moving them back to 'wait' is that they just get bounced between the two lists. An advantage to having a single codepath process both types of stalled jobs - real stalled jobs and unlocked active jobs - is that it makes it very elegant to emit the 'stalled' event (see 417ba9d for a demonstration, though I would need to add the code to write the 'started' property when getting the first lock). Otherwise the event would have to be emitted from more than one place. My concern in a previous comment about not putting the stalled jobs back into 'wait'
I see two worrisome issues with not moving the un-stalled jobs back to wait. Let me argue the point. First, let me make clear one assumption I am making: a stalled job is (or should be, with a well-constructed process function) a quite rare event, and even more rare - but certainly proportional to how often you check for stalled jobs - is the event where a job that has never started is picked up by the stalled-checker. Finally, I am not sure I understand the solution you mentioned to avoid too many concurrent jobs, by starting them after every turn of the event loop. What happens in the likely event that a job takes more time to complete than one turn of the event loop?
Ok, so if I understand correctly, the solution you're proposing is:
Ok, this sounds reasonable and I can work on it tomorrow morning, unless you have time to get to it first. I'm still slightly concerned that jobs will needlessly get bounced between 'active' and 'wait'.
sorry, I meant
btw, since you are spending so many resources on this project, how would you feel about putting the mixmax logo on the readme as a sponsor?
per OptimalBits#359 (comment) - hopefully more companies are willing to contribute to this wonderful project and get their logo up there!
Sorry for the delay on this. Will finish this tomorrow.
* Note that there is no way to know for _certain_ if a job is "stalled"
* (since the process that moved it to active might just be slow to get the
* lock on it), so this function takes a 'grace period' parameter to ignore
* jobs that were created in the recent X milliseconds.
I guess some things in this comment regarding the grace period are not relevant anymore.
Oh yes, fixed!
}
return null;
}).catch(function(err){
console.error(err);
I think it could be useful to add some text to this log message to explain that it is related to moveUnlockedJobsToWait.
Fixed.
' if(redis.call("SET", jobKey .. ":lock", ARGV[2], "PX", ARGV[3], "NX")) then',
' return job',
' end',
'local stalled = {}',
Just a thought: maybe it would be interesting to have a configurable LRANGE here. For example, with very large queues, the active list could be too big to traverse too often. I have to double-check, but if the oldest jobs are at the end of the list, limiting each call to a maximum number of elements may work well. I am also thinking of exposing the other constants that we have, for better fine-tuning. We do not need to change more in this PR.
Yeah, good idea.
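As an illustration of the configurable LRANGE idea (a sketch only; the ARGV[4] limit argument is hypothetical, and the "oldest jobs at the end" assumption is the one that would need the double-check mentioned above):

```js
// Sketch: scan only a bounded slice of the active list per sweep. KEYS[1] is the
// active list; ARGV[4] is a hypothetical, configurable max number of jobs to check.
// Negative indices make LRANGE return the last N elements, i.e. the oldest jobs
// if new jobs are pushed onto the head of the active list.
var boundedScan = [
  'local activeJobs = redis.call("LRANGE", KEYS[1], -1 * tonumber(ARGV[4]), -1)',
  'local stalled = {}',
  'for _, job in ipairs(activeJobs) do',
  '  -- existing per-job lock / stalled checks from the script above go here',
  'end'
].join('\n');
```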
' redis.call("HSET", jobKey, "failedReason", "job stalled more than allowable limit")',
' else',
// Move the job back to the wait queue, to immediately be picked up by a waiting worker.
' redis.call("RPUSH", KEYS[2], job)',
if the queue is a LIFO (check options), we need to do an LPUSH here instead. LIFO should also imply a new name for the script hash (since we could have different queues (LIFO/FIFO) in the same redis instance).
Isn't LIFO an option for the job, not the queue? Also, if we're reprocessing jobs that already made it to 'active', don't we always want to make them LIFO? Otherwise they'd be unfairly penalized by waiting through the entire wait queue again.
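Whether the direction should come from a queue-level option, the job's own LIFO option, or always favor requeued jobs is exactly what's being debated here. Purely as an illustration, assuming a hypothetical queue-level `lifo` option, the push command could be templated into the script text so the two variants naturally get different script hashes, as suggested above:

```js
// Sketch only: pick the Redis list command from a hypothetical queue-level
// `lifo` option and bake it into the script, so LIFO and FIFO queues end up
// with different script hashes.
function buildMoveBackToWaitLine(queueOpts){
  var pushCommand = queueOpts.lifo ? 'LPUSH' : 'RPUSH';
  return ' redis.call("' + pushCommand + '", KEYS[2], job)';
}
```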
' local lockAcquired = redis.call("HGET", jobKey, "lockAcquired")',
' if(lockAcquired) then',
// If it was previously locked then we consider it 'stalled' (Case A above).
' table.insert(stalled, job)',
if this job fails because it has stalled too many times, should we still emit it as an event?
You're right. I fixed this and changed the 'stalled' event to emit just the single job, so no more API breakage.
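For reference, a sketch of what the per-job emission could look like on the JavaScript side. The moveUnlockedJobsToWait name follows the earlier review comment; that it resolves with the ids of the genuinely stalled jobs is an assumption:

```js
// Sketch: emit 'stalled' once per affected job instead of once with an array,
// so existing listeners keep receiving a single job argument.
return scripts.moveUnlockedJobsToWait(this).then(function(stalledJobIds){
  return Promise.all(stalledJobIds.map(function(jobId){
    return Job.fromId(_this, jobId).then(function(job){
      _this.emit('stalled', job);
    });
  }));
});
```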
great work. When this is merged I will take a look at the failing tests, and after that I think I should release a new bull version (1.1.0 maybe).
Ok, addressed comments. Let me know what you think. I can also do a pass on fixing unit tests, and writing new ones for this, if you'd like me to.
I saw that most test failures are actually because they try to call the old stalled-job handling methods.
I'll fix these up.
@manast I made a few small fixes, but I need your help with the tests. They're very flaky for me, giving me different results on each run (even on master).
No problem. I'll do a pass on the tests later today, as well as fix eslint. Just wanted to make sure the overall strategy was approved by you before writing tests.
I will merge this now and I can take a look at the tests after that.
Thanks and let me know if you need anything from me!
Thanks for the fix! 👍
Could you please update the package version? :) Thanks!
per OptimalBits/bull#359 (comment) - hopefully more companies are willing to contribute to this wonderful project and get their logo up there!