Jobs can be processed and left in the wait state #370

Closed
bradvogel opened this issue Nov 9, 2016 · 10 comments

@bradvogel
Contributor

@manast Following the discussion in #359, I found a hazard that I think we should address before releasing a new version: a job can be processed and still left in wait. Here's how:

| Time | Process A | Process B |
| ---- | --------- | --------- |
| 1 | In the regular Bull run loop, `getNextJob` moves a job from wait to active. | |
| 2 | | `moveUnlockedJobsToWait` happens to run, picks up the job from step 1, and moves it back to wait. |
| 3 | Process A takes the lock on the job (which is now back in 'wait') and begins to process it. | |

If there is no other process around to move the job back to active then Process A will complete the job and leave it in wait. This can lead to a data inconsistency.
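
For concreteness, here is a minimal sketch of the two competing operations, assuming a plain ioredis client and simplified key names (`bull:myqueue:wait`, `bull:myqueue:active`, a per-job lock key); the real Bull implementation differs in detail:

```js
const Redis = require('ioredis');
const client = new Redis();

// Process A, step 1: getNextJob atomically moves a job id from wait to active,
// blocking until one is available.
async function getNextJob() {
  return client.brpoplpush('bull:myqueue:wait', 'bull:myqueue:active', 0);
}

// Process B, step 2: moveUnlockedJobsToWait scans active for jobs with no lock and
// moves them back to wait, including the job A just picked up but has not locked yet.
async function moveUnlockedJobsToWait() {
  const jobIds = await client.lrange('bull:myqueue:active', 0, -1);
  for (const jobId of jobIds) {
    const lock = await client.get(`bull:myqueue:${jobId}:lock`);
    if (!lock) {
      await client.lrem('bull:myqueue:active', 1, jobId);
      await client.rpush('bull:myqueue:wait', jobId);
    }
  }
}

// Process A, step 3: takeLock succeeds even though the job is back in wait, so A
// processes and completes a job that Redis still lists as waiting.
```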

@bradvogel
Contributor Author

bradvogel commented Nov 9, 2016

To fix this, what if `takeLock` takes a parameter (perhaps `ensureActive`) that ensures the job is in the active state? Then `processJob` would use that parameter, so it can only ever hold a lock while the job is in active, while other users of the lock (e.g. when `job.remove` is called) can still acquire the lock in any state.

To keep it efficient, perhaps we only check that the job is in active when `renew !== true`, so the check is only done the first time the job is locked.
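
A minimal sketch of what that could look like, assuming ioredis and illustrative key names (the `ensureActive` flag and argument layout here are hypothetical, not Bull's actual script):

```js
const Redis = require('ioredis');
const client = new Redis();

client.defineCommand('takeLock', {
  numberOfKeys: 2,
  lua: `
    -- KEYS[1] = job lock key, KEYS[2] = active list
    -- ARGV[1] = lock token, ARGV[2] = lock TTL (ms),
    -- ARGV[3] = job id, ARGV[4] = "1" when the job must be in the active list
    if ARGV[4] == "1" then
      local found = false
      for _, id in ipairs(redis.call("LRANGE", KEYS[2], 0, -1)) do
        if id == ARGV[3] then found = true; break end
      end
      if not found then return 0 end
    end
    if redis.call("SET", KEYS[1], ARGV[1], "PX", ARGV[2], "NX") then
      return 1
    end
    return 0
  `
});

// processJob would pass ensureActive = true on the first lock attempt only; callers
// like job.remove() would pass false so a job in any state can still be locked.
function takeLock(jobId, token, ensureActive, renew) {
  return client.takeLock(
    `bull:myqueue:${jobId}:lock`, 'bull:myqueue:active',
    token, 30000, jobId, ensureActive && !renew ? '1' : '0'
  );
}
```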

@manast
Member

manast commented Nov 9, 2016

I would need to check the code, but is there any reason to allow locking a job that is not in the active set? If not, then we can just enforce that constraint in `takeLock`; it could fail the same way it does when trying to take a lock on an already-locked job...

manast added the bug label Nov 9, 2016
@bradvogel
Contributor Author

`job.remove()` tries to get the lock before removing, and I assume we still want to be able to call `remove()` on jobs that aren't in the active queue.

@bradvogel
Contributor Author

#371 describes a more dangerous case: a job can get double-processed because a worker is allowed to take the lock while the job is in completed. cc @doublerebel

@doublerebel
Contributor

@bradvogel I think I understand how this could happen. Do you have a test case to reproduce this issue? It doesn't seem to be covered by the test suite.

Also, is there a reason not to keep the job locked from the moment it leaves wait until it reaches failed/completed, rather than releasing the lock between the move to active and the start of processing?

@bradvogel
Contributor Author

I don't have a test case yet. It's a bit difficult to write since this is a subtle race condition. Your test case in #371 (comment) should cover it though.

We can't lock the job atomically while it's being moved from wait to active. We use the Redis command `brpoplpush` to move the job, but due to Redis limitations (described in #258), `brpoplpush` can't be used inside the Lua script we'd need in order to lock the job atomically while moving it.

We also don't want to lock the job while it's still in 'wait' (prior to the move), because that would require polling the wait queue to try to lock the first job. `brpoplpush` lets us keep a poll-free design.
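
A sketch of why the move and the lock end up as two separate round trips (key names illustrative, assuming ioredis):

```js
const Redis = require('ioredis');
const client = new Redis();

async function getNextJob() {
  // 1. Atomic, blocking move from wait to active. BRPOPLPUSH is what makes the
  //    design poll-free, but Redis forbids blocking commands inside Lua scripts,
  //    so this step cannot be fused with the lock below.
  const jobId = await client.brpoplpush('bull:myqueue:wait', 'bull:myqueue:active', 0);

  // 2. Only now can the worker try to take the lock, in a separate command. This is
  //    the window in which moveUnlockedJobsToWait (or another worker) can interfere.
  const locked = await client.set(`bull:myqueue:${jobId}:lock`, 'worker-token', 'PX', 30000, 'NX');
  return { jobId, locked: locked === 'OK' };
}
```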

bradvogel added a commit to mixmaxhq/bull that referenced this issue Nov 13, 2016
Double-processing happens when two workers find out about the same job at the same time via `getNextJob`. One worker takes the lock, processes the job, and moves it to completed before the second worker can even try to get the lock. When the second worker finally gets around to trying to take the lock, the job is already in the completed state, but it processes it anyway since it got the lock.

So the fix here is for the takeLock script to ensure the job is in the active queue prior to taking the lock. That makes sure jobs that are in wait, completed, or even removed from the queue altogether don't get double-processed. Per the discussion in OptimalBits#370 though, takeLock is parameterized to only require the job to be in active when taking the lock while processing the job. There are other cases, such as job.remove(), where the job might be in a different state but we still want to be able to lock it.

This fixes the existing broken unit test "should process each job once".

This also prevents hazard OptimalBits#370.
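
On the worker side, the behaviour this commit message describes amounts to roughly the following (a hedged sketch reusing the `takeLock` shape from earlier in the thread; `processor` and the job methods are placeholders rather than Bull's exact API):

```js
async function processJob(job, processor) {
  // The first lock attempt requires the job to still be in the active list, so a job
  // that was moved back to wait, or already completed or removed, is never processed.
  const locked = await takeLock(job.id, job.lockToken, true /* ensureActive */, false /* renew */);
  if (!locked) {
    return; // not in active, or someone else holds the lock: skip it
  }
  try {
    const result = await processor(job); // user-supplied processing function
    await job.moveToCompleted(result);
  } finally {
    await job.releaseLock();
  }
}
```
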
@bradvogel
Contributor Author

#377

@doublerebel
Contributor

doublerebel commented Nov 14, 2016

OK, I was able to reliably reproduce this bug. I have fixed the test case by creating a single atomic operation that includes the commands to check the active list, set the lock, and set the lockAcquired counter, similar to your fix in #377.
https://travis-ci.org/nextorigin/bull/builds/175582853

Still a bit more work to get the rest of the tests passing, but major progress 👍
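
A rough sketch of the shape that single atomic operation could take (the key names and the location of the lockAcquired field are assumptions based on this comment, not the exact script in the linked build):

```js
// Registered like any other Redis Lua script (e.g. via EVAL/EVALSHA or ioredis defineCommand).
const atomicTakeLockScript = `
  -- KEYS[1] = job lock key, KEYS[2] = active list, KEYS[3] = job hash
  -- ARGV[1] = lock token, ARGV[2] = lock TTL (ms), ARGV[3] = job id
  for _, id in ipairs(redis.call("LRANGE", KEYS[2], 0, -1)) do
    if id == ARGV[3] then
      if redis.call("SET", KEYS[1], ARGV[1], "PX", ARGV[2], "NX") then
        redis.call("HINCRBY", KEYS[3], "lockAcquired", 1)
        return 1
      end
      return 0
    end
  end
  return 0
`;
```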

@bradvogel
Contributor Author

Sounds great. File a PR (even if it's still a work in progress) so we can take a look.

@doublerebel
Contributor

@bradvogel Can do, filed as #379.

manast closed this as completed in 6ccb860 Nov 16, 2016
duyenddd added a commit to duyenddd/bull that referenced this issue Jul 28, 2024