Jobs can be processed and left in the wait state #370

Closed
bradvogel opened this issue Nov 9, 2016 · 10 comments

@bradvogel
Contributor

@manast Following the discussion in #359, I found a hazard that I think we should address before releasing a new version: a job can be processed and still left in wait. Here's how:

| Time | Process A | Process B |
| ---- | --------- | --------- |
| 1 | In the regular Bull run loop, `getNextJob` moves a job from wait to active. | |
| 2 | | `moveUnlockedJobsToWait` happens to run, picks up the job from step 1, and moves it back to wait. |
| 3 | Process A takes the lock on the job (which is now back in 'wait') and begins to process it. | |

If there is no other process around to move the job back to active then Process A will complete the job and leave it in wait. This can lead to a data inconsistency.
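
For concreteness, here is a minimal sketch of the two competing operations, assuming a plain ioredis client and simplified key names (`bull:myqueue:wait`, `bull:myqueue:active`, a per-job lock key); the real Bull implementation differs in detail:

```js
const Redis = require('ioredis');
const client = new Redis();

// Process A, step 1: getNextJob atomically moves a job id from wait to active,
// blocking until one is available.
async function getNextJob() {
  return client.brpoplpush('bull:myqueue:wait', 'bull:myqueue:active', 0);
}

// Process B, step 2: moveUnlockedJobsToWait scans active for jobs with no lock and
// moves them back to wait, including the job A just picked up but has not locked yet.
async function moveUnlockedJobsToWait() {
  const jobIds = await client.lrange('bull:myqueue:active', 0, -1);
  for (const jobId of jobIds) {
    const lock = await client.get(`bull:myqueue:${jobId}:lock`);
    if (!lock) {
      await client.lrem('bull:myqueue:active', 1, jobId);
      await client.rpush('bull:myqueue:wait', jobId);
    }
  }
}

// Process A, step 3: takeLock succeeds even though the job is back in wait, so A
// processes and completes a job that Redis still lists as waiting.
```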

@bradvogel
Contributor Author

bradvogel commented Nov 9, 2016

To fix this, what if `takeLock` takes a parameter (perhaps `ensureActive`) that ensures the job is in the active state? Then `processJob` would use that parameter, so it can only ever hold a lock while the job is in active, while other users of the lock (e.g. when `job.remove` is called) can still acquire the lock in any state.

To keep it efficient, perhaps we only check that the job is in active when `renew !== true`, so the check is only done the first time the job is locked.
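
A minimal sketch of what that could look like, assuming ioredis and illustrative key names (the `ensureActive` flag and argument layout here are hypothetical, not Bull's actual script):

```js
const Redis = require('ioredis');
const client = new Redis();

client.defineCommand('takeLock', {
  numberOfKeys: 2,
  lua: `
    -- KEYS[1] = job lock key, KEYS[2] = active list
    -- ARGV[1] = lock token, ARGV[2] = lock TTL (ms),
    -- ARGV[3] = job id, ARGV[4] = "1" when the job must be in the active list
    if ARGV[4] == "1" then
      local found = false
      for _, id in ipairs(redis.call("LRANGE", KEYS[2], 0, -1)) do
        if id == ARGV[3] then found = true; break end
      end
      if not found then return 0 end
    end
    if redis.call("SET", KEYS[1], ARGV[1], "PX", ARGV[2], "NX") then
      return 1
    end
    return 0
  `
});

// processJob would pass ensureActive = true on the first lock attempt only; callers
// like job.remove() would pass false so a job in any state can still be locked.
function takeLock(jobId, token, ensureActive, renew) {
  return client.takeLock(
    `bull:myqueue:${jobId}:lock`, 'bull:myqueue:active',
    token, 30000, jobId, ensureActive && !renew ? '1' : '0'
  );
}
```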

@manast
Member

manast commented Nov 9, 2016

I would need to check the code, but is there any reason to allow locking a job that is not in the active set? If not, then we can just enforce that constraint in `takeLock`; it could fail the same way it does when trying to take a lock on an already-locked job...

manast added the bug label Nov 9, 2016
@bradvogel
Contributor Author

`job.remove()` tries to get the lock before removing, and I assume we still want to be able to call `remove()` on jobs that aren't in the active queue.

@bradvogel
Contributor Author

#371 describes a more dangerous case: a job can get double-processed because a worker is allowed to take the lock while the job is in completed. cc @doublerebel

@doublerebel
Contributor

@bradvogel I think I understand how this could happen. Do you have a test case to reproduce this issue? It doesn't seem to be covered by the test suite.

Also, is there a reason not to keep the job locked from the moment it leaves wait until it reaches failed/completed, rather than releasing the lock between the move to active and the start of processing?

@bradvogel
Contributor Author

I don't have a test case yet. It's a bit difficult to write since this is a subtle race condition. Your test case in #371 (comment) should cover it though.

We can't lock the job atomically while it's being moved from wait to active. We use the Redis command `brpoplpush` to move the job, but due to Redis limitations (described in #258), `brpoplpush` can't be used inside the Lua script we'd need in order to lock the job atomically while moving it.

We also don't want to lock the job while it's still in 'wait' (prior to the move), because that would require polling the wait queue to try to lock the first job. `brpoplpush` lets us keep a poll-free design.
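
A sketch of why the move and the lock end up as two separate round trips (key names illustrative, assuming ioredis):

```js
const Redis = require('ioredis');
const client = new Redis();

async function getNextJob() {
  // 1. Atomic, blocking move from wait to active. BRPOPLPUSH is what makes the
  //    design poll-free, but Redis forbids blocking commands inside Lua scripts,
  //    so this step cannot be fused with the lock below.
  const jobId = await client.brpoplpush('bull:myqueue:wait', 'bull:myqueue:active', 0);

  // 2. Only now can the worker try to take the lock, in a separate command. This is
  //    the window in which moveUnlockedJobsToWait (or another worker) can interfere.
  const locked = await client.set(`bull:myqueue:${jobId}:lock`, 'worker-token', 'PX', 30000, 'NX');
  return { jobId, locked: locked === 'OK' };
}
```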

bradvogel added a commit to mixmaxhq/bull that referenced this issue Nov 13, 2016
Double-processing happens when two workers find out about the same job at the same time via `getNextJob`. One worker takes the lock, processes the job, and moves it to completed before the second worker can even try to get the lock. When the second worker finally gets around to trying to take the lock, the job is already in the completed state, but it processes it anyway since it got the lock.

So the fix here is for the takeLock script to ensure the job is in the active queue prior to taking the lock. That makes sure jobs that are in wait, completed, or even removed from the queue altogether don't get double-processed. Per the discussion in OptimalBits#370 though, takeLock is parameterized to only require the job to be in active when taking the lock while processing the job. There are other cases, such as job.remove(), where the job might be in a different state but we still want to be able to lock it.

This fixes the existing broken unit test "should process each job once".

This also prevents hazard OptimalBits#370.
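
On the worker side, the behaviour this commit message describes amounts to roughly the following (a hedged sketch reusing the `takeLock` shape from earlier in the thread; `processor` and the job methods are placeholders rather than Bull's exact API):

```js
async function processJob(job, processor) {
  // The first lock attempt requires the job to still be in the active list, so a job
  // that was moved back to wait, or already completed or removed, is never processed.
  const locked = await takeLock(job.id, job.lockToken, true /* ensureActive */, false /* renew */);
  if (!locked) {
    return; // not in active, or someone else holds the lock: skip it
  }
  try {
    const result = await processor(job); // user-supplied processing function
    await job.moveToCompleted(result);
  } finally {
    await job.releaseLock();
  }
}
```
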
@bradvogel
Contributor Author

#377

@doublerebel
Contributor

doublerebel commented Nov 14, 2016

OK, I was able to reliably reproduce this bug. I have fixed the test case by creating a single atomic operation that includes the commands to check the active list, set the lock, and set the lockAcquired counter, similar to your fix in #377.
https://travis-ci.org/nextorigin/bull/builds/175582853

Still a bit more work to get the rest of the tests passing, but major progress 👍
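
A rough sketch of the shape that single atomic operation could take (the key names and the location of the lockAcquired field are assumptions based on this comment, not the exact script in the linked build):

```js
// Registered like any other Redis Lua script (e.g. via EVAL/EVALSHA or ioredis defineCommand).
const atomicTakeLockScript = `
  -- KEYS[1] = job lock key, KEYS[2] = active list, KEYS[3] = job hash
  -- ARGV[1] = lock token, ARGV[2] = lock TTL (ms), ARGV[3] = job id
  for _, id in ipairs(redis.call("LRANGE", KEYS[2], 0, -1)) do
    if id == ARGV[3] then
      if redis.call("SET", KEYS[1], ARGV[1], "PX", ARGV[2], "NX") then
        redis.call("HINCRBY", KEYS[3], "lockAcquired", 1)
        return 1
      end
      return 0
    end
  end
  return 0
`;
```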

@bradvogel
Contributor Author

Sounds great. File a PR (even if it's still a work in progress) so we can take a look.

@doublerebel
Contributor

@bradvogel Can do, filed as #379.

manast closed this as completed in 6ccb860 Nov 16, 2016
duyenddd added a commit to duyenddd/bull that referenced this issue Jul 28, 2024