
Jobs go back into the queued state when a worker is killed #821

Closed
TAGraves opened this issue Jan 31, 2023 · 3 comments

Comments

@TAGraves
Contributor

Today, if a process is running a job and is killed (e.g. by OOM, a kill signal, etc.), the job will go back into the queued state and another worker will pick it up. This can cause jobs to be partially executed multiple times. In the case of something like an OOM caused by a bug, this can be particularly devastating, since that job might just keep getting picked up by more workers and killing them too 😱.

I'm pretty sure this is known and intended behavior, but in case you want a repro:
Create a job like

def perform
  `kill -9 #{Process.pid}`
end

Enqueue the job via a Rails console and then run bundle exec good_job start.
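
For reference, a complete, self-contained version of that repro might look like the following. The class name KillSelfJob is illustrative (not from the original report), and Process.kill is used in place of shelling out to kill -9; the effect is the same.

# app/jobs/kill_self_job.rb (illustrative path and class name)
class KillSelfJob < ApplicationJob
  queue_as :default

  def perform
    # Hard-kill the worker process, simulating an OOM kill or external SIGKILL.
    Process.kill("KILL", Process.pid)
  end
end

# From a Rails console:
#   KillSelfJob.perform_later
# Then, in a shell:
#   bundle exec good_job start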

Is it possible within good_job's current architecture to make these jobs get discarded instead of re-enqueued (well, I realize "enqueued" really just means "no advisory lock")? Resque has DirtyExit, which I've found really useful in the past. We try to make our jobs idempotent and also not have them crash the process that's running them, but when someone does make a mistake it can be really devastating to have the whole queue become practically inoperable.

@bensheldon
Owner

@TAGraves that's interesting! TIL other queue adapters have that setting. I recently added #794, which is semi-related.

You're correct that the current behavior is intended; I am open to adding other modes.

Is this discard-on-interrupt (naming?) behavior something you'd expect to be configured at the global level for all jobs, or would it be more on an individual job/job-class level (as some jobs might be safely idempotent)? If the former, I'd implement it as a GoodJob configuration setting; if the latter, I'm imagining it would be an ActiveJob extension. Either way, I'm imagining it could be as simple as checking during dequeue whether a job had been interrupted (performed_at would already be present) and then cleaning up the record (setting finished_at and setting a custom exception):

unfinished.dequeueing_ordered(parsed_queues).only_scheduled.limit(1).with_advisory_lock(unlock_session: true, select_limit: queue_select_limit) do |executions|
  execution = executions.first
  break if execution.blank?
  break :unlocked unless execution&.executable?
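
To make that concrete, here is a hedged sketch of what the interrupt check described above might look like inside that block. This is not GoodJob's actual implementation: the GoodJob::InterruptError name and the error string are placeholders, and the columns used (performed_at, finished_at, error) follow the description in the preceding paragraph.

# Sketch only: detect an execution that a previous worker started but never finished.
if execution.performed_at.present? && execution.finished_at.blank?
  execution.update!(
    finished_at: Time.current,
    # Placeholder error value; a real feature would record a dedicated error class,
    # e.g. something like the GoodJob::DirtyExit proposed in this issue.
    error: "GoodJob::InterruptError: interrupted by a killed worker"
  )
  break :unlocked
end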

@TAGraves
Contributor Author

TAGraves commented Feb 1, 2023

Is this discard-on-interrupt (naming?) behavior something you'd expect to be configured at the global level for all jobs or would it be more on an individual job/job-class level (as some jobs might be safely idempotent)?

Hmm, it's a good question. I'm not sure, but I do know we would configure it for all of our jobs. I'd guess you'd want to configure it for all jobs and then add retry_on GoodJob::DirtyExit (for example) to individual jobs that you know are idempotent.
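
As a concrete illustration of that opt-in, an idempotent job might look like the following. GoodJob::DirtyExit is the hypothetical error class proposed in this thread (it does not exist in good_job today); retry_on itself is standard ActiveJob.

class IdempotentReportJob < ApplicationJob
  # Retry only when a previous attempt was interrupted by a killed worker.
  # GoodJob::DirtyExit is hypothetical, per the proposal above.
  retry_on GoodJob::DirtyExit, wait: 5.seconds, attempts: 3

  def perform(report_id)
    # Idempotent work goes here; safe to run more than once.
  end
end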

Glad to hear this should be simple to implement, as it's a highly valuable feature for us 😄.

@TAGraves
Contributor Author

TAGraves commented Feb 6, 2023

@bensheldon Do you have some time this week to meet virtually and talk through this with me? We got bitten by it again at the end of last week and it's a major pain point for us. I'd be happy to take a stab at implementing it but have a few questions about how to implement it correctly.

Email address is in my GH profile if you want to coordinate over email. Thanks!
