What's the best way to stop and discard a running job? #625

Closed
pbcoronel opened this issue Jun 22, 2022 · 7 comments · Fixed by #1073

@pbcoronel

We recently had an issue where a couple of our jobs got stuck in the running state due to an unrelated bug. These jobs would take up workers, which would cause other jobs not to run.

We needed a way to kill and discard these jobs so they wouldn't get retried, but couldn't find one. The discard button is disabled while they are in the running state, and trying to discard them via the Rails console using discard_job would raise a GoodJob::Lockable::RecordAlreadyAdvisoryLockedError.

As a workaround we ended up commenting out the code in these jobs and deploying that, so the triggered retry would no-op. However, we do need a reliable way to kill these jobs in the future in case something like this happens again. Do you have any suggestions on how to do that?

@bensheldon
Owner

The best way to stop and discard a running job is to restart your worker or web-async processes.

A job marked "running" in the Dashboard means that its record is Advisory Locked; that implies it has an active database connection, which in turn implies that the thread is still actively performing the job. And it's inherently unsafe to kill threads in Ruby.

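If it helps to verify that, here is a small sketch (assuming a Rails console with GoodJob loaded) that lists which job records currently hold an advisory lock; it reuses the joins_advisory_locks scope from the snippet below, and the pid values come from Postgres's pg_locks view:

# Sketch only: show which job records are currently advisory-locked, i.e.
# which ones the Dashboard shows as "running". Rows with a nil pid are just
# unlocked records picked up by the join against pg_locks, so they're skipped.
GoodJob::ActiveJobJob
  .joins_advisory_locks
  .pluck(:active_job_id, :pid)
  .reject { |_active_job_id, pid| pid.nil? }
  .each { |active_job_id, pid| puts "#{active_job_id} is locked by Postgres backend #{pid}" }
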
That said, I understand there is a sequencing problem here. It sounds like what you want to do is:

  1. Unlock and discard the job
  2. At this point, the job is still executing in the thread. We can't change that from the outside, so we have to...
  3. Restart the process

To unlock and discard the job as an external actor to the executing thread, you can do this:

ACTIVE_JOB_ID = "29cea3a0-2573-4d51-b3eb-6232ea57fc48"

# Get the Postgres Connection ID that owns the advisory lock
connection_pid = GoodJob::ActiveJobJob.where(active_job_id: ACTIVE_JOB_ID).joins_advisory_locks.pluck(:pid).first

# Terminate that database connection, thus releasing the lock
GoodJob::ActiveJobJob.connection.exec_query("SELECT pg_terminate_backend(#{connection_pid}) AS terminated").first['terminated']

# Discard the job
GoodJob::ActiveJobJob.find(ACTIVE_JOB_ID).discard_job("Discarded running job")

...and now you must restart the processes; otherwise the jobs will continue running.

Please let me know if that works or if you have concerns.

Discussion: I could expose a dangerous_force_disconnect_advisory_unlock! method on the job that would make it simpler to do those first two steps of terminating the database connection. But I would really, heavily want to emphasize that unlocking the job does not stop its execution and violates a lot of core GoodJob contracts.

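Purely as an illustration of that idea (this is not GoodJob's API, just a hypothetical sketch reusing the same pg_terminate_backend approach as above), it might look something like:

# Hypothetical sketch only; dangerous_force_disconnect_advisory_unlock! does
# not exist in GoodJob. It terminates the Postgres backend holding this
# record's advisory lock. The job's thread keeps executing; only the lock
# (and that database connection) goes away.
module GoodJob
  class ActiveJobJob
    def dangerous_force_disconnect_advisory_unlock!
      pid = self.class.where(active_job_id: active_job_id).joins_advisory_locks.pluck(:pid).first
      return false if pid.nil?

      self.class.connection.exec_query(
        "SELECT pg_terminate_backend(#{self.class.connection.quote(pid)}) AS terminated"
      ).first["terminated"]
    end
  end
end
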
Question: Could you say more about what is causing these jobs to become unresponsive?

@bensheldon
Owner

Also, thinking about longer-term features (not to distract from your current incident), there is discussion in #509 about a feature to kill a process from the Dashboard.

@pbcoronel
Author

Thanks for that, @bensheldon; that would actually solve our issue in an emergency (which is what I'm most concerned about, looking forward). The job was getting stuck due to a bug causing a loop. We've fixed that already, but I want to be ready the next time something like this happens, so we can kill problematic jobs and keep everything else running smoothly.

With the understanding that the process must still be restarted, a console-only method such as you describe may not be a bad idea. Of course, if this could be wrapped into a first-class feature, that would be even better.

Thanks again for taking a look!

@edouard

edouard commented Dec 1, 2022

Ran into this issue a couple of times since I deployed GoodJob (I moved from DelayedJob a couple of weeks ago). Some jobs occasionally get stuck. It can be due to bad code that contains an infinite loop or bloat, or users generating jobs that are bigger than what we can handle. Workers spend hours working these jobs off, and other smaller jobs are not processed. When you try to discard these jobs from the web interface in order to unclog the queue, you hit that GoodJob::Lockable::RecordAlreadyAdvisoryLockedError exception.

Stopping good_job doesn't help: the jobs are still locked and impossible to discard.

It would be good to have a way to kill these bad jobs in case of emergency.

@jgrau
Contributor

jgrau commented Sep 7, 2023

Hiya. I don't want to bump an old issue, but I was wondering if there's been any development on stopping and discarding a running job? If you have a proposal for how you'd like it to work, I could be up for making a PR. Currently our "process" is to shut down the workers (scale our Kubernetes worker deployment to 0) and then use the UI to discard the problematic job, but I'd love it if there was a better way.

@bensheldon
Owner

@jgrau no worries, the issue is still open and it's good to know more about the systems experiencing it 😓

I think the fairly simple solution would be to have a method (and a button in the Dashboard), something like discard!, force_discard, or unsafe_discard, that differs from the current discard in one key aspect: it wouldn't do a lock check before marking the job as finished. That would solve the question of how to prevent a runaway job from continually being restarted.

def discard_job(message)
  with_advisory_lock do
    # ...

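Roughly, the lock-skipping variant might look something like the sketch below. Treat it as a sketch only: the finished_at and error column names are assumptions here, not GoodJob's actual implementation, and the real version would also need the usual callbacks and instrumentation.

# Sketch only: mark the record finished without taking the advisory lock
# first, so a runaway job can no longer be retried once its process is
# eventually restarted.
def force_discard_job(message)
  update(
    finished_at: Time.current,
    error: message
  )
end
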
I'd love it if someone wanted to work on that; otherwise I can put it near the top of the backlog. Leave a comment if you do.

An ignore-lock-and-discard would not solve the problem of how to stop the current/active execution of a runaway job, but that also seems like it would be much harder, as it would involve interprocess communication and aborting threads, and hopefully we can defer that scenario 🤞🏻

@jgrau
Contributor

jgrau commented Sep 8, 2023

> I'd love it if someone wanted to work on that; otherwise I can put it near the top of the backlog. Leave a comment if you do.

Let me give it a shot, but don't shoot me if it turns stale :)
