Rewrite the handling of process stalled jobs to be atomic so it doesn… #359
Changes from all commits
```diff
@@ -281,13 +281,49 @@ var scripts = {
     return execScript.apply(scripts, args);
   },
 
+  /**
+   * Takes a lock
+   */
+  takeLock: function(queue, job, token, renew){
+    var lockCall;
+    if (renew){
+      lockCall = 'redis.call("SET", KEYS[1], ARGV[1], "PX", ARGV[2])';
+    } else {
+      lockCall = 'redis.call("SET", KEYS[1], ARGV[1], "PX", ARGV[2], "NX")';
+    }
+
+    var script = [
+      'if(' + lockCall + ') then',
+      // Mark the job as having been locked at least once. Used to determine if the job was stalled.
+      ' redis.call("HSET", KEYS[2], "lockAcquired", "1")',
+      ' return 1',
+      'else',
+      ' return 0',
+      'end'
+    ].join('\n');
+
+    var args = [
+      queue.client,
+      'takeLock' + (renew ? 'Renew' : ''),
+      script,
+      2,
+      job.lockKey(),
+      queue.toKey(job.jobId),
+      token,
+      queue.LOCK_RENEW_TIME
+    ];
+
+    return execScript.apply(scripts, args);
+  },
```
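For context, the acquisition that the `takeLock` script performs boils down to Redis's `SET key value PX ttl NX` pattern. The sketch below is not part of the PR; it uses the `ioredis` client purely for illustration, and the `lockKey`/`token` names are assumptions:

```js
// Illustrative only: the SET-with-PX/NX lock semantics used by takeLock above.
var Redis = require('ioredis');
var redis = new Redis();

function takeLock(lockKey, token, ttlMs, renew) {
  // On first acquisition, "NX" makes SET fail if the key already exists,
  // so only one worker can hold the lock. On renewal we omit "NX" because
  // the key (our own lock) already exists and we only want to extend its TTL.
  var args = [lockKey, token, 'PX', ttlMs];
  if (!renew) {
    args.push('NX');
  }
  return redis.set.apply(redis, args).then(function(reply) {
    return reply === 'OK'; // a null reply means another worker holds the lock
  });
}
```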
```diff
   releaseLock: function(job, token){
     var script = [
       'if redis.call("get", KEYS[1]) == ARGV[1]',
       'then',
-      'return redis.call("del", KEYS[1])',
+      ' return redis.call("del", KEYS[1])',
       'else',
-      'return 0',
+      ' return 0',
       'end'].join('\n');
 
     var args = [
```
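The compare-and-delete in `releaseLock` has to run as a single Lua script: a plain `GET` followed by `DEL` issued from the client would leave a window in which another worker re-acquires the lock and then has it deleted out from under it. A hedged sketch of invoking such a script directly, with the `ioredis` client and variable names as assumptions for illustration:

```js
// Atomic compare-and-delete: only delete the lock if we still hold it.
var releaseScript = [
  'if redis.call("get", KEYS[1]) == ARGV[1]',
  'then',
  ' return redis.call("del", KEYS[1])',
  'else',
  ' return 0',
  'end'
].join('\n');

function releaseLock(redis, lockKey, token) {
  // EVAL runs the script atomically inside Redis, closing the GET/DEL race.
  return redis.eval(releaseScript, 1, lockKey, token).then(function(deleted) {
    return deleted === 1; // 0 means we no longer held the lock
  });
}
```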
```diff
@@ -355,26 +391,65 @@ var scripts = {
   },
 
-  /**
-   * Gets a stalled job by locking it and checking it is not already completed.
-   * Returns a "OK" if the job was locked and not in completed set.
-   */
-  getStalledJob: function(queue, job, token){
+  /**
+   * Looks for unlocked jobs in the active queue. There are two circumstances in which a job
+   * would be in 'active' but NOT have a job lock:
+   *
+   * Case A) The job was being worked on, but the worker process died and it failed to renew the lock.
+   * We call these jobs 'stalled'. This is the most common case. We resolve these by moving them
+   * back to wait to be re-processed. To prevent jobs from cycling endlessly between active and wait
+   * (e.g. if the job handler keeps crashing), we limit the number of stalled job recoveries to
+   * MAX_STALLED_JOB_COUNT.
+   *
+   * Case B) The job was just moved to 'active' from 'wait' and the worker that moved it hasn't gotten
+   * a lock yet, or died immediately before getting the lock (note that due to Redis limitations, the
+   * worker can't move the job and get the lock atomically - https://github.com/OptimalBits/bull/issues/258).
+   * For this case we also move the job back to 'wait' for reprocessing, but don't consider it 'stalled'
+   * since the job had never been started. This case is much rarer than Case A due to the very small
+   * timing window in which it must occur.
+   */
+  moveUnlockedJobsToWait: function(queue){
     var script = [
-      'if redis.call("sismember", KEYS[1], ARGV[1]) == 0 then',
-      ' return redis.call("set", KEYS[2], ARGV[2], "PX", ARGV[3], "NX")',
-      'return 0'].join('\n');
+      'local MAX_STALLED_JOB_COUNT = tonumber(ARGV[1])',
+      'local activeJobs = redis.call("LRANGE", KEYS[1], 0, -1)',
+      'local stalled = {}',
+      'local failed = {}',
+      'for _, job in ipairs(activeJobs) do',
+      ' local jobKey = ARGV[2] .. job',
+      ' if(redis.call("EXISTS", jobKey .. ":lock") == 0) then',
+      // Remove from the active queue.
+      '  redis.call("LREM", KEYS[1], 1, job)',
+      '  local lockAcquired = redis.call("HGET", jobKey, "lockAcquired")',
+      '  if(lockAcquired) then',
+      // If it was previously locked then we consider it 'stalled' (Case A above). If this job
+      // has been stalled too many times, such as if it crashes the worker, then fail it.
+      '   local stalledCount = redis.call("HINCRBY", jobKey, "stalledCounter", 1)',
+      '   if(stalledCount > MAX_STALLED_JOB_COUNT) then',
+      '    redis.call("SADD", KEYS[3], job)',
+      '    redis.call("HSET", jobKey, "failedReason", "job stalled more than allowable limit")',
+      '    table.insert(failed, job)',
+      '   else',
+      // Move the job back to the wait queue, to immediately be picked up by a waiting worker.
+      '    redis.call("RPUSH", KEYS[2], job)',
+      '    table.insert(stalled, job)',
+      '   end',
+      '  else',
+      // Move the job back to the wait queue, to immediately be picked up by a waiting worker.
+      '   redis.call("RPUSH", KEYS[2], job)',
+      '  end',
+      ' end',
+      'end',
+      'return {failed, stalled}'
+    ].join('\n');
 
     var args = [
       queue.client,
-      'getStalledJob',
+      'moveUnlockedJobsToWait',
       script,
-      2,
-      queue.toKey('completed'),
-      job.lockKey(),
-      job.jobId,
-      token,
-      queue.LOCK_RENEW_TIME
+      3,
+      queue.toKey('active'),
+      queue.toKey('wait'),
+      queue.toKey('failed'),
+      queue.MAX_STALLED_JOB_COUNT,
+      queue.toKey('')
     ];
 
     return execScript.apply(scripts, args);
```

Review comment (inline on the `RPUSH` back to 'wait' in the stalled branch): if the queue is a LIFO (check options), we need to do an LPUSH here instead. LIFO should also imply a new name for the script hash (since we could have different queues (LIFO/FIFO) in the same Redis instance).

Reply: Isn't LIFO an option for the job, not the queue? Also, if we're reprocessing jobs that already made it to 'active', don't we always want to make them LIFO? Otherwise they'd be unfairly penalized by waiting the entire …
Review comment: Just a thought: maybe it would be interesting to have a configurable LRANGE here. For example, with very large queues the active list could be too big to traverse this often. I have to double-check, but if the oldest jobs are at the end of the queue, limiting each call to a maximum number of elements may work well. I am also thinking of exposing the other constants we have, for better fine-tuning. We do not need to change more in this PR.

Reply: Yeah, good idea.
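A hedged sketch of that suggestion: if the oldest entries really do sit at the tail of 'active' (as the comment tentatively assumes), a negative LRANGE start index bounds the scan to the last N elements. `MAX_CHECK_COUNT` and the extra ARGV slot are hypothetical names, not part of this PR:

```js
// Hypothetical variant of the script's opening lines only; everything else
// in moveUnlockedJobsToWait would stay the same.
var scriptHead = [
  'local MAX_STALLED_JOB_COUNT = tonumber(ARGV[1])',
  'local MAX_CHECK_COUNT = tonumber(ARGV[3])', // new, configurable constant
  // LRANGE with a negative start returns only the last MAX_CHECK_COUNT
  // elements of the list (or the whole list if it is shorter), so each
  // sweep inspects at most a bounded slice of 'active'.
  'local activeJobs = redis.call("LRANGE", KEYS[1], -MAX_CHECK_COUNT, -1)'
].join('\n');
```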