Hard kill resilience with execution counts #922

Closed
julik opened this issue Apr 11, 2023 · 2 comments

@julik (Contributor) commented Apr 11, 2023

We have encountered a peculiar pattern with one of our heavy jobs. When executing, it would exhaust the RAM limit on the GCP instance. Since we run Docker on "naked" GCP, GCP would then reboot our instance with no warning (the OOM killer kills the process, the healthcheck stops responding, and GCP "auto heals" by rebooting). Despite configuring an attempts limit with retry_on, we haven't found a way to make sure those retries are honored in the case of such "hard kills".
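
For context, our setup is roughly like this (a simplified sketch, not our actual code; the job and error classes are placeholders):

```ruby
# Simplified sketch of the kind of configuration meant above; HeavyJob and
# the rescued error class are placeholders. retry_on only helps when a Ruby
# exception is raised inside the job -- an OOM kill never raises anything,
# so the attempts limit is never consulted.
class HeavyJob < ApplicationJob
  queue_as :heavy

  retry_on StandardError, wait: 5.minutes, attempts: 3

  def perform(record_id)
    # ...memory-hungry work that can exhaust the instance's RAM...
  end
end
```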

In our case this led to a "poison pill" job which would endlessly restart on the cluster, exhausting the memory of the instance and leading to a hard reboot. The advisory lock gets released properly of course.

The SRE book describes a nice way of measuring the number of failures: record "start" and "ok" events, but not "failure", because the system may fail during execution in such a way that there is no opportunity to record the failure at all.

Could we implement something similar (or change the semantics of executions so that the counter increments on checkout, for instance) so that there is some protection against those hard kills?

This could imply a change to the dashboard display as well: a job that "is executing" would really mean "executing, or the executing system has been killed or hung". Or it could imply only a change to where the executions get incremented, which would avoid the endless restarts.

Curious to know what the options would be.

@bensheldon (Owner) commented

We recently introduced (#830) an extension that will raise a rescuable exception if the job was interrupted/terminated during execution:

https://github.com/bensheldon/good_job#interrupts

I think that might address your need... unless the termination is happening when Active Job deserializes the arguments (I could imagine that hydrating a huge number of global-id objects could wreck it) before Active Job execution callbacks are invoked.
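
Roughly, handling it in a job could look like this (a sketch only; the job class is a placeholder, and the exception class is the one documented in that README section):

```ruby
# Sketch: HeavyJob is a placeholder. GoodJob::InterruptError is raised (per
# the README section linked above) when a previously started execution never
# recorded a finish, i.e. the process was hard-killed mid-run.
class HeavyJob < ApplicationJob
  # Retry interrupted runs a bounded number of times, then give up,
  # so a poison-pill job cannot restart forever.
  retry_on GoodJob::InterruptError, wait: 5.minutes, attempts: 3

  # Or, to stop immediately after the first interruption:
  # discard_on GoodJob::InterruptError

  def perform(record_id)
    # ...memory-hungry work...
  end
end
```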

I also like your suggestion of incrementing a value atomically when the job is first fetched. I am trying to defer that sort of thing until #831 happens, but if it's a dealbreaker I don't want to defer it for too long.

@julik (Contributor, Author) commented Apr 11, 2023

What is this black sorcery 🧙 OK, we'll try that new interrupt exception; I think it is sane enough to configure it for all jobs, actually (at least for us). Incrementing a "checkout counter" could be done as part of the SELECT in the checkout query, which would make it more reliable, but it would require more intervention.
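
Roughly something like this, as a sketch only (column names are illustrative and don't necessarily match GoodJob's actual schema):

```ruby
# Sketch of "increment on checkout": the attempt counter is bumped in the
# same statement that claims the row, so a subsequent hard kill cannot
# lose the increment. Column names are illustrative, not GoodJob's schema.
claimed = ActiveRecord::Base.connection.exec_query(<<~SQL)
  UPDATE good_jobs
  SET executions = COALESCE(executions, 0) + 1,
      performed_at = now()
  WHERE id = (
    SELECT id
    FROM good_jobs
    WHERE finished_at IS NULL AND performed_at IS NULL
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id, executions
SQL
# A row whose counter already exceeds the configured limit could then be
# marked as failed instead of being executed again.
```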

julik closed this as completed Jun 20, 2023