
Jobs go back into the queued state when a worker is killed #821

Closed
TAGraves opened this issue Jan 31, 2023 · 3 comments

Comments

@TAGraves
Contributor

Today, if a process is running a job and is killed (e.g. by OOM, a kill signal, etc.), the job will go back into the queued state and another worker will pick it up. This can cause jobs to be partially executed multiple times. In the case of something like an OOM caused by a bug, this can be particularly devastating, since that job might just keep getting picked up by more workers and killing them too 😱.

I'm pretty sure this is known and intended behavior, but in case you want a repro:
Create a job like

def perform
  `kill -9 #{Process.pid}`
end

Enqueue the job via a Rails console and then run bundle exec good_job start.
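
For reference, a complete, self-contained version of that repro might look like the following. The class name KillSelfJob is illustrative (not from the original report), and Process.kill is used in place of shelling out to kill -9; the effect is the same.

# app/jobs/kill_self_job.rb (illustrative path and class name)
class KillSelfJob < ApplicationJob
  queue_as :default

  def perform
    # Hard-kill the worker process, simulating an OOM kill or external SIGKILL.
    Process.kill("KILL", Process.pid)
  end
end

# From a Rails console:
#   KillSelfJob.perform_later
# Then, in a shell:
#   bundle exec good_job start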

Is it possible within good_job's current architecture to make these jobs get discarded instead of re-enqueued (well, I realize "enqueued" really just means "no advisory lock")? Resque has DirtyExit, which I've found really useful in the past. We try to make our jobs idempotent and also not have them crash the process that's running them, but when someone does make a mistake it can be really devastating to have the whole queue become practically inoperable.

@bensheldon
Owner

@TAGraves that's interesting! TIL other queue adapters have that setting. I recently added #794, which is semi-related.

You're correct that the current behavior is intended; I am open to adding other modes.

Is this discard-on-interrupt (naming?) behavior something you'd expect to be configured at the global level for all jobs, or would it be more on an individual job/job-class level (as some jobs might be safely idempotent)? If the former, I'd implement it as a GoodJob configuration setting; if the latter, I'm imagining it would be an ActiveJob extension. Either way, I'm imagining it could be as simple as checking during dequeue whether a job had been interrupted (performed_at would already be present) and then cleaning up the record (setting finished_at and setting a custom exception):

unfinished.dequeueing_ordered(parsed_queues).only_scheduled.limit(1).with_advisory_lock(unlock_session: true, select_limit: queue_select_limit) do |executions|
  execution = executions.first
  break if execution.blank?
  break :unlocked unless execution&.executable?
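
To make that concrete, here is a hedged sketch of what the interrupt check described above might look like inside that block. This is not GoodJob's actual implementation: the GoodJob::InterruptError name and the error string are placeholders, and the columns used (performed_at, finished_at, error) follow the description in the preceding paragraph.

# Sketch only: detect an execution that a previous worker started but never finished.
if execution.performed_at.present? && execution.finished_at.blank?
  execution.update!(
    finished_at: Time.current,
    # Placeholder error value; a real feature would record a dedicated error class,
    # e.g. something like the GoodJob::DirtyExit proposed in this issue.
    error: "GoodJob::InterruptError: interrupted by a killed worker"
  )
  break :unlocked
end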

@TAGraves
Contributor Author

TAGraves commented Feb 1, 2023

Is this discard-on-interrupt (naming?) behavior something you'd expect to be configured at the global level for all jobs or would it be more on an individual job/job-class level (as some jobs might be safely idempotent)?

Hmm, it's a good question. I'm not sure, but I do know we would configure it for all of our jobs. I'd guess you'd want to configure it for all jobs and then add retry_on GoodJob::DirtyExit (for example) to individual jobs that you know are idempotent.
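
As a concrete illustration of that opt-in, an idempotent job might look like the following. GoodJob::DirtyExit is the hypothetical error class proposed in this thread (it does not exist in good_job today); retry_on itself is standard ActiveJob.

class IdempotentReportJob < ApplicationJob
  # Retry only when a previous attempt was interrupted by a killed worker.
  # GoodJob::DirtyExit is hypothetical, per the proposal above.
  retry_on GoodJob::DirtyExit, wait: 5.seconds, attempts: 3

  def perform(report_id)
    # Idempotent work goes here; safe to run more than once.
  end
end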

Glad to hear this should be simple to implement, as it's a highly valuable feature for us 😄.

@TAGraves
Contributor Author

TAGraves commented Feb 6, 2023

@bensheldon Do you have some time this week to meet virtually and talk through this with me? We got bitten by it again at the end of last week and it's a major pain point for us. I'd be happy to take a stab at implementing it but have a few questions about how to implement it correctly.

Email address is in my GH profile if you want to coordinate over email. Thanks!
