Hard kill resilience with execution counts #922

Closed
julik opened this issue Apr 11, 2023 · 2 comments

@julik (Contributor) commented Apr 11, 2023

We have encountered a peculiar pattern with one of our heavy jobs. When executing, it would exhaust the RAM limit on the GCP instance. Since we run Docker on "naked" GCP, GCP would then reboot our instance with no warning (the OOM killer kills the process, the healthcheck stops responding, and GCP "auto heals" by rebooting). Despite configuring an attempts limit with retry_on, we haven't found a way to make sure those retries are honored in the case of such "hard kills".
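
For context, our setup is roughly like this (a simplified sketch, not our actual code; the job and error classes are placeholders):

```ruby
# Simplified sketch of the kind of configuration meant above; HeavyJob and
# the rescued error class are placeholders. retry_on only helps when a Ruby
# exception is raised inside the job -- an OOM kill never raises anything,
# so the attempts limit is never consulted.
class HeavyJob < ApplicationJob
  queue_as :heavy

  retry_on StandardError, wait: 5.minutes, attempts: 3

  def perform(record_id)
    # ...memory-hungry work that can exhaust the instance's RAM...
  end
end
```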

In our case this led to a "poison pill" job which would endlessly restart on the cluster, exhausting the memory of the instance and leading to a hard reboot. The advisory lock gets released properly of course.

The SRE book describes a nice way of measuring the number of failures: record "start" and "ok" events, but not "failure", because the system may fail during execution in such a way that there is no opportunity to record the failure at all.

Could we implement something similar (or change the semantics of executions so that the counter increments on checkout, for instance) so that there is some protection against those hard kills?

This could imply a change to the dashboard display as well: a job that "is executing" would really mean "executing, or the executing system has been killed or hung". Or it could imply only a change to where the executions get incremented, which would avoid the endless restarts.

Curious to know what the options would be.

@bensheldon (Owner) commented

We recently introduced (#830) an extension that will raise a rescuable exception if the job was interrupted/terminated during execution:

https://github.com/bensheldon/good_job#interrupts

I think that might address your need... unless the termination is happening when Active Job deserializes the arguments (I could imagine that hydrating a huge number of global-id objects could wreck it) before Active Job execution callbacks are invoked.
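
Roughly, handling it in a job could look like this (a sketch only; the job class is a placeholder, and the exception class is the one documented in that README section):

```ruby
# Sketch: HeavyJob is a placeholder. GoodJob::InterruptError is raised (per
# the README section linked above) when a previously started execution never
# recorded a finish, i.e. the process was hard-killed mid-run.
class HeavyJob < ApplicationJob
  # Retry interrupted runs a bounded number of times, then give up,
  # so a poison-pill job cannot restart forever.
  retry_on GoodJob::InterruptError, wait: 5.minutes, attempts: 3

  # Or, to stop immediately after the first interruption:
  # discard_on GoodJob::InterruptError

  def perform(record_id)
    # ...memory-hungry work...
  end
end
```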

I also like your suggestion of incrementing a value atomically when the job is first fetched. I am trying to defer that sort of thing until #831 happens, but if it's a dealbreaker I don't want to defer it for too long.

@julik (Contributor, Author) commented Apr 11, 2023

What is this black sorcery 🧙 OK, we'll try that new interrupt exception; I think it is sane enough to configure it for all jobs, actually (at least for us). Incrementing a "checkout counter" could be done as part of the SELECT in the checkout query, which would make it more reliable, but it would require more intervention.
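
Roughly something like this, as a sketch only (column names are illustrative and don't necessarily match GoodJob's actual schema):

```ruby
# Sketch of "increment on checkout": the attempt counter is bumped in the
# same statement that claims the row, so a subsequent hard kill cannot
# lose the increment. Column names are illustrative, not GoodJob's schema.
claimed = ActiveRecord::Base.connection.exec_query(<<~SQL)
  UPDATE good_jobs
  SET executions = COALESCE(executions, 0) + 1,
      performed_at = now()
  WHERE id = (
    SELECT id
    FROM good_jobs
    WHERE finished_at IS NULL AND performed_at IS NULL
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id, executions
SQL
# A row whose counter already exceeds the configured limit could then be
# marked as failed instead of being executed again.
```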

julik closed this as completed Jun 20, 2023