-
-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
What's the best way to stop and discard a running job? #625
Comments
The best best way to stop and discard is to restart your worker or web-async processes. Jobs marked "running" in the Dashboard means that the record has been Advisory Locked; that implies that they have an active database connection, which implies that the thread is still actively performing the job. And it's inherently unsafe to kill threads in Ruby. That said, I understand there is a sequencing problem here. It sounds like what you want to do is:
To unlock and discard the job as an external actor to the executing thread, you can do this: ACTIVE_JOB_ID = "29cea3a0-2573-4d51-b3eb-6232ea57fc48"
# Get the Postgres Connection ID that owns the advisory lock
connection_pid = GoodJob::ActiveJobJob.where(active_job_id: ACTIVE_JOB_ID).joins_advisory_locks.pluck(:pid).first
# Terminate that database connection, thus releasing the lock
GoodJob::ActiveJobJob.connection.exec_query("SELECT pg_terminate_backend(#{connection_pid}) AS terminated").first['terminated']
# Discard the job
GoodJob::ActiveJobJob.find(ACTIVE_JOB_ID).discard_job("Discarded running job") ...and now you must restart the processes, otherwise the jobs will still continue running. Please let me know if that works or if you have concerns. Discussion: I could expose a Question: Could you say more about what is causing these jobs to become unresponsive? |
Also, thinking about longer-term features (not to distract from your current incident), there is discussion in #509 about a feature to kill a process from the Dashboard. |
Thanks for that @bensheldon, that actually would solve our issue in the case of an emergency (which is what I'm most concerned about, looking forward). The job was getting stuck due to a bug causing a loop. We've fixed this already but I want to be ready the next time something like this happens so we can kill problematic jobs and keep everything else running smoothly. With the understanding that the process must still be restarted a console-only method such as you describe may not be a bad idea. Of course, if this could be wrapped into a first class feature that would be even better. Thanks again for taking a look! |
Ran into this issue a couple of times since I've deployed GoodJob (I've moved from DelayedJob a couple of weeks ago). Some jobs can get stuck sometimes. It can be due to bad code which contains an infinite loop or bloat, or users generating jobs that are bigger than what we can handle. Workers spend hours working these jobs off, and other smaller jobs are not processed. When you try to discard these jobs from the web interface in order to unclog the queue you hit that Stopping Would be good to have a way to kill these bad jobs in case of emergency. |
Hiya. I don't want to dump an old issue but I was wondering if there's been any development in stopping and discarding a running job? If you have a proposal for how you would like for it to work I could be up for making a PR. Currently our "process" is to shut down the workers (scale our kubernetes worker deployment to 0) and then use the ui to discard the problematic job but I'd love if there was a better way. |
@jgrau no worries, the issue is still open and it's good to know more about the systems experiencing it 😓 I think the fairly simple solution would be to have a method (and button in dashboard) that is like good_job/app/models/good_job/job.rb Lines 216 to 217 in c0e1408
I'd love if someone wanted to work on that, otherwise I can put it (near the top) of the backlog. Leave a comment if you do. A ignore-lock-and-discard would not solve the problem of "how to stop the current/active execution of a runaway job", but that also seems like it would be much harder as it would involve interprorcess communication and aborting threads, and hopefully we can defer that scenario 🤞🏻 |
Let me try to give it a shot but don't shoot me if it turns stale :) |
We recently had an issue where a couple of our jobs would get stuck in the
running
state due to an unrelated bug. These jobs would take up workers which would cause other jobs not to run.We needed a way to kill and discard these jobs so they wouldn't get retried but couldn't find it. The
discard
button is disabled when they are in therunning
and trying to discard them via the rails console usingdiscard_job
would raise aGoodJob::Lockable::RecordAlreadyAdvisoryLockedError
.As a workaround we ended up commenting the code in these jobs out and deploying that which would trigger a retry which would then no-op. However, we do need to find a reliable way to kill these jobs in the future in case something like this happens again. Do you have any suggestions on how to do that?
The text was updated successfully, but these errors were encountered: