What's the best way to stop and discard a running job? #625

Closed
pbcoronel opened this issue Jun 22, 2022 · 7 comments · Fixed by #1073

@pbcoronel

We recently had an issue where a couple of our jobs got stuck in the running state due to an unrelated bug. These jobs would take up workers, which would cause other jobs not to run.

We needed a way to kill and discard these jobs so they wouldn't get retried, but couldn't find one. The discard button is disabled while they are in the running state, and trying to discard them via the Rails console using discard_job would raise a GoodJob::Lockable::RecordAlreadyAdvisoryLockedError.

As a workaround we ended up commenting out the code in these jobs and deploying that, so the triggered retry would no-op. However, we do need a reliable way to kill these jobs in the future in case something like this happens again. Do you have any suggestions on how to do that?

@bensheldon
Owner

The best way to stop and discard a running job is to restart your worker or web-async processes.

A job marked "running" in the Dashboard means that its record is Advisory Locked; that implies it has an active database connection, which in turn implies that the thread is still actively performing the job. And it's inherently unsafe to kill threads in Ruby.

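If it helps to verify that, here is a small sketch (assuming a Rails console with GoodJob loaded) that lists which job records currently hold an advisory lock; it reuses the joins_advisory_locks scope from the snippet below, and the pid values come from Postgres's pg_locks view:

# Sketch only: show which job records are currently advisory-locked, i.e.
# which ones the Dashboard shows as "running". Rows with a nil pid are just
# unlocked records picked up by the join against pg_locks, so they're skipped.
GoodJob::ActiveJobJob
  .joins_advisory_locks
  .pluck(:active_job_id, :pid)
  .reject { |_active_job_id, pid| pid.nil? }
  .each { |active_job_id, pid| puts "#{active_job_id} is locked by Postgres backend #{pid}" }
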
That said, I understand there is a sequencing problem here. It sounds like what you want to do is:

  1. Unlock and discard the job
  2. At this point, the job is still executing in the thread. We can't change that from the outside, so we have to...
  3. Restart the process

To unlock and discard the job as an external actor to the executing thread, you can do this:

ACTIVE_JOB_ID = "29cea3a0-2573-4d51-b3eb-6232ea57fc48"

# Get the Postgres Connection ID that owns the advisory lock
connection_pid = GoodJob::ActiveJobJob.where(active_job_id: ACTIVE_JOB_ID).joins_advisory_locks.pluck(:pid).first

# Terminate that database connection, thus releasing the lock
GoodJob::ActiveJobJob.connection.exec_query("SELECT pg_terminate_backend(#{connection_pid}) AS terminated").first['terminated']

# Discard the job
GoodJob::ActiveJobJob.find(ACTIVE_JOB_ID).discard_job("Discarded running job")

...and now you must restart the processes; otherwise the jobs will continue running.

Please let me know if that works or if you have concerns.

Discussion: I could expose a dangerous_force_disconnect_advisory_unlock! method on the job that would make it simpler to do those first two steps of terminating the database connection. But I would really, heavily want to emphasize that unlocking the job does not stop its execution and violates a lot of core GoodJob contracts.

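Purely as an illustration of that idea (this is not GoodJob's API, just a hypothetical sketch reusing the same pg_terminate_backend approach as above), it might look something like:

# Hypothetical sketch only; dangerous_force_disconnect_advisory_unlock! does
# not exist in GoodJob. It terminates the Postgres backend holding this
# record's advisory lock. The job's thread keeps executing; only the lock
# (and that database connection) goes away.
module GoodJob
  class ActiveJobJob
    def dangerous_force_disconnect_advisory_unlock!
      pid = self.class.where(active_job_id: active_job_id).joins_advisory_locks.pluck(:pid).first
      return false if pid.nil?

      self.class.connection.exec_query(
        "SELECT pg_terminate_backend(#{self.class.connection.quote(pid)}) AS terminated"
      ).first["terminated"]
    end
  end
end
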
Question: Could you say more about what is causing these jobs to become unresponsive?

@bensheldon
Owner

Also, thinking about longer-term features (not to distract from your current incident), there is discussion in #509 about a feature to kill a process from the Dashboard.

@pbcoronel
Author

Thanks for that, @bensheldon; that would actually solve our issue in an emergency (which is what I'm most concerned about, looking forward). The job was getting stuck due to a bug causing a loop. We've fixed that already, but I want to be ready the next time something like this happens, so we can kill problematic jobs and keep everything else running smoothly.

With the understanding that the process must still be restarted, a console-only method such as you describe may not be a bad idea. Of course, if this could be wrapped into a first-class feature, that would be even better.

Thanks again for taking a look!

@edouard

edouard commented Dec 1, 2022

Ran into this issue a couple of times since I deployed GoodJob (I moved from DelayedJob a couple of weeks ago). Some jobs occasionally get stuck. It can be due to bad code that contains an infinite loop or bloat, or users generating jobs that are bigger than what we can handle. Workers spend hours working these jobs off, and other smaller jobs are not processed. When you try to discard these jobs from the web interface in order to unclog the queue, you hit that GoodJob::Lockable::RecordAlreadyAdvisoryLockedError exception.

Stopping good_job doesn't help: the jobs are still locked and impossible to discard.

It would be good to have a way to kill these bad jobs in case of emergency.

@jgrau
Contributor

jgrau commented Sep 7, 2023

Hiya. I don't want to bump an old issue, but I was wondering if there's been any development on stopping and discarding a running job? If you have a proposal for how you'd like it to work, I could be up for making a PR. Currently our "process" is to shut down the workers (scale our Kubernetes worker deployment to 0) and then use the UI to discard the problematic job, but I'd love it if there was a better way.

@bensheldon
Owner

@jgrau no worries, the issue is still open and it's good to know more about the systems experiencing it 😓

I think the fairly simple solution would be to have a method (and a button in the Dashboard), something like discard!, force_discard, or unsafe_discard, that differs from the current discard in one key aspect: it wouldn't do a lock check before marking the job as finished. That would solve the question of how to prevent a runaway job from continually being restarted.

def discard_job(message)
  with_advisory_lock do
    # ...

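Roughly, the lock-skipping variant might look something like the sketch below. Treat it as a sketch only: the finished_at and error column names are assumptions here, not GoodJob's actual implementation, and the real version would also need the usual callbacks and instrumentation.

# Sketch only: mark the record finished without taking the advisory lock
# first, so a runaway job can no longer be retried once its process is
# eventually restarted.
def force_discard_job(message)
  update(
    finished_at: Time.current,
    error: message
  )
end
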
I'd love it if someone wanted to work on that; otherwise I can put it near the top of the backlog. Leave a comment if you do.

An ignore-lock-and-discard would not solve the problem of how to stop the current/active execution of a runaway job, but that also seems like it would be much harder, as it would involve interprocess communication and aborting threads, and hopefully we can defer that scenario 🤞🏻

@jgrau
Contributor

jgrau commented Sep 8, 2023

> I'd love it if someone wanted to work on that; otherwise I can put it near the top of the backlog. Leave a comment if you do.

Let me give it a shot, but don't shoot me if it turns stale :)
