Document reliability guarantees #59

Closed · dv opened this issue Jul 25, 2020 · 8 comments

Comments

@dv

dv commented Jul 25, 2020

What reliability guarantees is good_job built for? Is there a chance of jobs getting "lost" in certain (edge) cases, such as workers being killed by the host or network connections dying, or is there a guarantee of at-least-once or exactly-once execution?

Your README right now mentions "run-once safety" but I'm not sure if you mean "exactly once" or "at least once", or if that's not a guarantee at all, since it's not explicit.

Also, it'd be great to know what behaviour to expect when a worker dies. Does the in-progress job become available immediately for another worker to pick up once the worker goes away? Is there a timeout involved? Does it require a cleanup process, etc.?

In general, a paragraph in your documentation on the reliability of the jobs would be great. As it stands, I'm a bit hesitant to use it in production without having a clear idea of what behaviour to expect in failure cases.

@bensheldon
Owner

Thanks for flagging this. That's important to document better. I'll write out the way it currently works here; maybe we can have some back and forth, and then I'll capture that in the README. And I'll almost definitely discover some things to be improved along the way.

GoodJob currently

An ActiveJob job enqueued with GoodJob can be in one of three states:

  • Enqueued/Unfinished. The job will be queried and picked up by the next available execution thread, or after the polling interval if no execution threads are awake.
  • Locked/Executing/Being performed. This is when the job is being executed. Jobs are locked with a session-level Advisory Lock, which should (theoretically, as far as I know) prevent the job from being executed more than once concurrently by multiple execution threads. Once the job has finished executing successfully, the job record is updated with a finished_at or is deleted, and it is then unlocked. (A rough sketch of that locking pattern follows this list.)
  • Finished/Discarded/Deleted. This is when the job has "successfully" finished executing.
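
Illustrative sketch only: the raw SQL and key derivation below are not GoodJob's actual implementation, just the session-level advisory-lock pattern it relies on (job_id is assumed to be the job record's id).

require "zlib"

conn = ActiveRecord::Base.connection
key  = Zlib.crc32(job_id) # any stable integer key derived from the job; illustrative

if conn.select_value("SELECT pg_try_advisory_lock(#{key})")
  begin
    # ...perform the job while holding the lock; a second session calling
    # pg_try_advisory_lock with the same key gets false and skips the job...
  ensure
    conn.execute("SELECT pg_advisory_unlock(#{key})")
  end
end

Because the lock is session-level, it is also released when the worker's database connection goes away, so a crashed worker's lock eventually disappears with its session.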

The recommendation in the readme is to add this to the base ApplicationJob:

retry_on StandardError, wait: :exponentially_longer, attempts: Float::INFINITY
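
In context, that looks something like this (ApplicationJob being the conventional Rails base job class):

class ApplicationJob < ActiveJob::Base
  # Retry forever with exponential backoff, so errors are handled by
  # ActiveJob's retry machinery instead of reaching the backend.
  retry_on StandardError, wait: :exponentially_longer, attempts: Float::INFINITY
end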

This has the effect of rescuing any StandardErrors and results in ActiveJob enqueuing a new GoodJob job; to GoodJob, a retry_on (or a discard_on) appears as if the initial job ran successfully. That core GoodJob::Job#perform wrapper lives here:

begin
  result = ActiveJob::Base.execute(params)
rescue StandardError => e
  error = e
end

# When retry_on/discard_on handles an error, ActiveJob returns the handled
# exception from execute rather than raising it; record it as the error here.
if error.nil? && result.is_a?(Exception)
  error = result
  result = nil
end

error_message = "#{error.class}: #{error.message}" if error
self.error = error_message
self.finished_at = Time.current

if destroy_after
  destroy!
else
  save!
end

When errors do bubble up to the GoodJob adapter:

  • When a StandardError bubbles all the way up to the GoodJob adapter, the job will be discarded (marked as finished, or destroyed). (I am committed to changing this. See notes below.)
  • When a non-StandardError Exception bubbles all the way up to the GoodJob adapter, the job will not be marked as finished/deleted, the exception will bubble up through the Advisory Lock and unlock it, and the record will be left in a state where another execution thread can pick it up immediately. (The contrast is sketched below.)
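
For example, with two hypothetical jobs that have no retry_on configured:

class NoRescueJob < ActiveJob::Base
  def perform
    # A StandardError reaching the adapter: today the record is marked
    # finished (or destroyed), i.e. effectively discarded.
    raise StandardError, "boom"
  end
end

class InterruptedJob < ActiveJob::Base
  def perform
    # A non-StandardError Exception reaching the adapter: the record stays
    # unfinished, the advisory lock is released, and another execution
    # thread can pick the job up immediately.
    raise SignalException, "SIGTERM"
  end
end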

ActiveJob is complicated

The core challenge here is that the interaction of ActiveJob and the backend can be ambiguous.

When the retry_on attempts: are exhausted (e.g. when retry_on StandardError, attempts: 5 is used and the job fails a 6th time), ActiveJob will re-raise the error to the backend. This creates an unclear situation for the backend, one that I've seen challenge other adapters too.
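
Concretely, with a hypothetical job like this:

class FlakyApiJob < ActiveJob::Base
  # Attempts 1 through 5 are rescued and re-enqueued by ActiveJob itself, so
  # the backend only ever sees executions that appear to finish cleanly. On
  # the 6th failure ActiveJob re-raises, and the raw StandardError lands on
  # the backend instead.
  retry_on StandardError, attempts: 5

  def perform
    SomeFlakyApi.call # hypothetical external call that sometimes raises
  end
end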

Where to go from here

Having written all of this, I think I've changed my mind about how GoodJob should work in the default state.

  • When a StandardError bubbles all the way up to the GoodJob adapter, the job will be preserved and re-run (rather than discarded/marked as finished or destroyed, as it is today), unless GoodJob.standard_error_results_in_finished_job is set (maybe with some more word-smithing on that name 😅). A sketch of what that could look like follows.
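
Under that proposal, opting back into today's behavior might look something like this in an initializer (using the working name above; the final name could differ):

# config/initializers/good_job.rb
# Proposed default: a StandardError preserves the job so it can be re-run.
# Setting this flag would opt back into marking the job finished/discarded.
GoodJob.standard_error_results_in_finished_job = true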

Your issue also triggered some more research, and it seems that ActionMailer::MailDeliveryJob (which is used when calling deliver_later on a Mailer) currently inherits from ActiveJob::Base, which means that mailer errors may be incorrectly marked as finished/discarded. Yikes! I'll write a separate bug ticket for that.

https://github.com/rails/rails/blob/a9336a67b0b1f25784f5304b4477235352fc53c5/actionmailer/lib/action_mailer/delivery_job.rb#L10

What needs more details?

  • The behavior should change to be maximally safe by default
  • How much of GoodJob's documentation should explain Ruby exception handling practices?
  • How much of GoodJob's documentation needs to explain the ins and outs of ActiveJob?

@bensheldon
Owner

@dv I've updated the README and made the behavior safer by default. Please let me know if it answers your questions. Thanks again!

@DannyBen

I have a question. The README states "GoodJob guarantees at-least-once performance of jobs".

What if I want "once and once only"?

I have jobs that send messages, and I never want them to be performed more or less than once.

@bensheldon
Owner

@DannyBen that's a very good question that unfortunately falls into the gap between the backend (like GoodJob, Sidekiq, Que, etc) and how you write your own job's perform logic. In other words, you point out that "job" is ambiguous in our documentation.

GoodJob guarantees that a completely performed job (e.g. no exception is raised in the middle of execution) will run once and only once. But it can only guarantee that any particular step of the business logic within your job will be performed "at-least-once".

"Idempotency" is something all backends try to answer:

@DannyBen

DannyBen commented Jul 28, 2020

Thanks for replying.

You point out that "job" is ambiguous in our documentation.

I am not sure it is. At least to me, a job means all the code invoked by the Job's perform method.

GoodJob guarantees that a completely performed job ... will run once and only once

That is great to know. This perhaps is the part that can be added to the README. I already assumed this was the case, but the text about "at least once" confused me.

As for other job "steps" that might run "at least once" - that is completely understandable, assuming I understand it correctly: If a job fails at "line 3" of its perform method, then lines 1 and 2 will be executed again (and again, until the break condition is met).

@bensheldon
Owner

@DannyBen thank you for the feedback! I’ll integrate that into the Readme.

@dv
Author

dv commented Aug 3, 2020

Thank you @bensheldon, you've answered my questions and more. I'll let you close this issue in case you want to continue this discussion with the other participants.

@bensheldon
Owner

Thank you everyone for the feedback and discussion! 🙌 This has been very helpful for improving GoodJob's documentation and my own understanding.
