Jobs can be processed and left in the wait state #370

Comments
To fix this, perhaps we should only ensure the job is in active when taking the lock while processing?
I would need to check the code, but is there any reason to allow locking a job that is not in the active set? If not, then we can just have that constraint in takeLock; it could fail the same way it does when trying to get a lock on an already locked job...
#371 describes a more dangerous case: a job can get double processed because a worker is allowed to take the lock while the job is in completed.
@bradvogel I think I understand how this could happen. Do you have a test case to reproduce this issue? It doesn't seem to be covered by the test suite. Also, is there a reason not to keep the job locked from the moment it leaves wait until it is failed/completed, rather than releasing the lock between the move to active and when processing starts?
I don't have a test case yet. It's a bit difficult to write since this is a subtle race condition. Your test case in #371 (comment) should cover it, though. We can't lock the job atomically while it's being moved from wait to active; we use the Redis operation BRPOPLPUSH for that move. We also don't want to lock the job in 'wait' (prior to the move), because that would require polling the wait queue to try to lock the first job.
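To make that window concrete, here is a minimal sketch, assuming an ioredis-style client and simplified key names; the `getNextJob` shape below illustrates the ordering only and is not Bull's actual code:

```js
const Redis = require('ioredis');
const client = new Redis();

// The wait -> active move and the lock are two separate Redis round
// trips, so there is a window where the job is active but unlocked.
async function getNextJob(queue, token) {
  // Atomically pop from wait and push onto active (blocks up to 5s).
  const jobId = await client.brpoplpush(`${queue}:wait`, `${queue}:active`, 5);
  if (!jobId) return null;

  // Window: the job is in active but NOT yet locked. If the stalled-job
  // sweep runs right now, it sees an unlocked active job and moves it
  // back to wait, while this worker carries on processing it anyway.

  // Take the lock; NX means this fails if another worker holds it.
  const locked = await client.set(`${queue}:${jobId}:lock`, token, 'PX', 30000, 'NX');
  return locked ? jobId : null;
}
```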
Double-processing happens when two workers find out about the same job at the same time via `getNextJob`. One worker takes the lock, processes the job, and moves it to completed before the second worker can even try to get the lock. When the second worker finally gets around to trying to get the lock, the job is already in the completed state. But it processes it anyways, since it got the lock.

So the fix here is for the takeLock script to ensure the job is in the active queue prior to taking the lock. That will make sure jobs that are in wait, completed, or even removed from the queue altogether don't get double processed. Per the discussion in OptimalBits#370 though, takeLock is parameterized to only require the job to be in active when taking the lock while processing the job. There are other cases, such as job.remove(), where the job might be in a different state but we still want to be able to lock it.

This fixes the existing broken unit test "should process each job once". This also prevents hazard OptimalBits#370.
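For illustration, a parameterized takeLock along those lines might look like the following sketch; the key layout, argument order, and script body are assumptions here, not Bull's actual Lua script:

```js
// Lua runs atomically in Redis: check active membership, then SET NX.
const takeLockScript = `
  -- KEYS[1] = lock key, KEYS[2] = active list
  -- ARGV[1] = token, ARGV[2] = job id,
  -- ARGV[3] = "1" to require the job to be in active, ARGV[4] = lock TTL (ms)
  if ARGV[3] == "1" then
    local found = false
    for _, id in ipairs(redis.call("LRANGE", KEYS[2], 0, -1)) do
      if id == ARGV[2] then
        found = true
        break
      end
    end
    if not found then
      return 0
    end
  end
  if redis.call("SET", KEYS[1], ARGV[1], "PX", ARGV[4], "NX") then
    return 1
  end
  return 0
`;

// ioredis: eval(script, numKeys, ...keys, ...argv). Workers pass
// requireActive = true; callers like job.remove() pass false.
function takeLock(client, queue, jobId, token, requireActive) {
  return client.eval(
    takeLockScript, 2,
    `${queue}:${jobId}:lock`, `${queue}:active`,
    token, jobId, requireActive ? '1' : '0', '30000'
  );
}
```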
Ok, I was able to reliably reproduce this bug. I have fixed the test case by creating a single atomic operation that includes the commands to check the active list, set the lock, and set the lockAcquired counter, similar to your fix in #377. Still a bit more work to get the rest of the tests passing, but major progress 👍
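A reproduction in the spirit of the "should process each job once" test might look like this sketch (the queue name, job count, timing, and the `job.jobId` property are illustrative assumptions):

```js
const Queue = require('bull');
const assert = require('assert');

it('should process each job once', function (done) {
  // Two queue instances act as two competing workers on the same queue.
  const workerA = new Queue('double-process-test');
  const workerB = new Queue('double-process-test');
  const processedCount = {};

  function handler(job, jobDone) {
    processedCount[job.jobId] = (processedCount[job.jobId] || 0) + 1;
    jobDone();
  }

  workerA.process(handler);
  workerB.process(handler);

  for (let i = 0; i < 50; i++) {
    workerA.add({ i: i });
  }

  // Give both workers time to drain the queue, then assert that no job
  // was handled more than once.
  setTimeout(function () {
    Object.keys(processedCount).forEach(function (id) {
      assert.equal(processedCount[id], 1, 'job ' + id + ' was processed ' + processedCount[id] + ' times');
    });
    done();
  }, 2000);
});
```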
Sounds great. File a PR (even if it's still a work in progress) so we can take a look.
@bradvogel Can do, filed as #379.
@manast Following the discussion in #359, I found a hazard that I think we should address before releasing a new version: a job can be processed and still left in wait. Here's how:
1. Process A moves a job from wait to active and starts processing it (as discussed above, the lock is only taken after the move).
2. moveUnlockedJobsToWait happens to run and picks up the job in step 1, and moves it back to wait.

If there is no other process around to move the job back to active, then Process A will complete the job and leave it in wait. This can lead to a data inconsistency.
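To illustrate the other side of the race, here is a simplified sketch of what a stalled-job sweep like moveUnlockedJobsToWait does; the key names and the recovery logic are assumptions for illustration, not Bull's actual implementation:

```js
// Sketch: scan active jobs and move any that hold no lock back to wait.
// If a worker is sitting in the window between its wait -> active move
// and its takeLock call, its job looks stalled and gets moved back to
// wait, yet that worker will still process it and leave it in wait.
async function moveUnlockedJobsToWait(client, queue) {
  const activeIds = await client.lrange(`${queue}:active`, 0, -1);
  for (const jobId of activeIds) {
    const lock = await client.get(`${queue}:${jobId}:lock`);
    if (!lock) {
      await client
        .multi()
        .lrem(`${queue}:active`, 0, jobId)
        .rpush(`${queue}:wait`, jobId)
        .exec();
    }
  }
}
```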