Jobs go back into the queued state when a worker is killed #821
@TAGraves that's interesting! TIL that other queue adapters have that setting. I recently added #794, which is semi-related. You're correct that the current behavior is intended; I am open to adding other modes. Is this discard-on-interrupt (naming?) behavior something you'd expect to configure at the global level for all jobs, or at the individual job/job-class level (since some jobs might be safely idempotent)? If the former, I'd implement it as a GoodJob configuration setting; if the latter, I'm imagining it would be an ActiveJob extension. Either way, I'm imagining it could be as simple as checking during dequeue whether a job had been interrupted (see `good_job/app/models/good_job/execution.rb`, lines 232 to 235 at 430a6a0).
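To make the "check during dequeue" idea concrete, here is a minimal standalone sketch of what such a check could look like. This is not GoodJob's actual code: the `Execution` struct, `interrupted?`, and `dequeue` helpers are all hypothetical, and the fields simply mirror the common pattern of stamping a start time when a worker picks up a job and a finish time when it completes.

```ruby
require "time"

# Minimal model of an execution row: performed_at is stamped when a worker
# starts the job; finished_at when the job completes. Names are illustrative,
# not GoodJob's real schema.
Execution = Struct.new(:job_class, :performed_at, :finished_at, keyword_init: true)

# Hypothetical dequeue-time check: a row that was started but never finished
# was interrupted (e.g. the worker was OOM-killed mid-run).
def interrupted?(execution)
  !execution.performed_at.nil? && execution.finished_at.nil?
end

def dequeue(execution, discard_on_interrupt: true)
  if interrupted?(execution) && discard_on_interrupt
    :discarded # don't re-run a possibly half-executed job
  else
    :performed
  end
end

fresh       = Execution.new(job_class: "SendEmailJob", performed_at: nil, finished_at: nil)
interrupted = Execution.new(job_class: "SendEmailJob", performed_at: Time.now, finished_at: nil)

dequeue(fresh)       # => :performed
dequeue(interrupted) # => :discarded
```

The key design point is that no extra bookkeeping is needed at crash time: an interrupted job is recognizable after the fact purely from the started-but-never-finished state it leaves behind.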
Hmm, it's a good question. I'm not sure, but I do know we would configure it for all of our jobs. I'd guess you'd want to configure it for all jobs and then add a `retry_on GoodJob::DirtyExit` (e.g.) for individual jobs that you know are idempotent. Glad to hear this should be simple to implement, as it's a highly valuable feature for us 😄.
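The proposal above is "discard by default, opt individual classes back into retrying." A runnable sketch of those semantics, in plain Ruby so it needs no Rails: `DirtyExit` is the hypothetical error discussed in this thread (not a real GoodJob class today), and `retry_on` here is a stand-in for ActiveJob's declaration of the same name.

```ruby
# Hypothetical error a worker's death would be surfaced as.
class DirtyExit < StandardError; end

class BaseJob
  # Each job class tracks which errors it is willing to retry on.
  def self.retryable_errors
    @retryable_errors ||= []
  end

  # Stand-in for ActiveJob's retry_on declaration.
  def self.retry_on(error_class)
    retryable_errors << error_class
  end

  # Global default: discard interrupted jobs unless the class opted in.
  def self.handle_interrupt
    retryable_errors.include?(DirtyExit) ? :retried : :discarded
  end
end

class ReportJob < BaseJob; end    # default: discard on interrupt

class IdempotentSyncJob < BaseJob # known-safe job opts back into retrying
  retry_on DirtyExit
end

ReportJob.handle_interrupt         # => :discarded
IdempotentSyncJob.handle_interrupt # => :retried
```

This mirrors how ActiveJob's real `retry_on`/`discard_on` hooks layer per-class policy on top of an adapter-wide default, which is why a global GoodJob setting plus a per-job opt-in could compose cleanly.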
@bensheldon Do you have some time this week to meet virtually and talk through this with me? We got bitten by it again at the end of last week and it's a major pain point for us. I'd be happy to take a stab at implementing it, but I have a few questions about how to implement it correctly. My email address is in my GH profile if you want to coordinate over email. Thanks!
Today, if a process is running a job and is killed (e.g. by the OOM killer, a kill signal, etc.), the job will go back into the queued state and another worker will pick it up. This can cause jobs to be partially executed multiple times. In the case of an OOM caused by something going wrong in the job itself, it can be particularly devastating, since that job might just keep being picked up by more workers and causing them to die 😱.
I'm pretty sure this is known and intended behavior, but in case you want a repro:
Create a job like
Enqueue the job via a Rails console and then run `bundle exec good_job start`.
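The crashing job itself isn't shown above, but the failure mode it triggers can be demonstrated standalone: a "worker" process dies from an uncatchable signal mid-job, so the job never records completion. This sketch (the `CrashyJob` class name and the forked-child setup are illustrative, not good_job's architecture) simulates that with `fork` on a POSIX system.

```ruby
# Hypothetical stand-in for a job whose perform kills its own worker process,
# e.g. what an OOM kill looks like from the outside.
class CrashyJob
  def perform
    # SIGKILL cannot be rescued, so no ensure/rescue cleanup ever runs:
    # the job dies mid-execution without recording success or failure.
    Process.kill("KILL", Process.pid)
  end
end

# Run the job in a forked child, standing in for a worker process.
pid = fork { CrashyJob.new.perform }
_, status = Process.wait2(pid)

status.signaled? # => true: the worker died from a signal, not a clean exit
status.termsig   # => 9 (SIGKILL)
```

Because the death is observable only from outside the worker (via the exit status here, or the vanished advisory lock in good_job's case), the queue has no record that the job ever ran, which is exactly why it goes back to "queued" and gets picked up again.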
Is it possible within good_job's current architecture to make these jobs get discarded instead of re-enqueued (well, I realize "enqueued" really just means "no advisory lock")? Resque has `DirtyExit`, which I've found really useful in the past. We try to make our jobs idempotent and also not have them crash the process that's running them, but in the case of someone making a mistake it can be really devastating to have the whole queue become practically inoperable.