We have encountered a peculiar pattern with one of our heavy jobs: when executing, it would exhaust the RAM limit on the GCP instance. Since we run Docker on "naked" GCP, GCP would reboot our instance with no warning (the OOM killer would kill the process, the healthcheck would stop responding, and GCP would "auto heal" by rebooting). Despite configuring an executions limit with `retry_on`, we haven't found a way to make sure these retries are honored in the case of such "hard kills".
In our case this led to a "poison pill" job that would endlessly restart on the cluster, exhausting the instance's memory and triggering a hard reboot each time. The advisory lock gets released properly, of course.
The SRE book describes a nice way of measuring the number of failures: record "start" and "ok" events, but not "failure", because the system may fail during execution in such a way that there is no opportunity to record the failure at all.
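A minimal sketch of that "record start and ok, infer failure" pattern (`JobStats` and its methods are illustrative names, not from any library):

```ruby
# Counts "start" and "ok" events only. A hard kill (e.g. OOM) leaves a
# "start" without a matching "ok", so failures can be inferred after the
# fact even though the dying process never recorded anything.
class JobStats
  def initialize
    @counts = Hash.new(0)
  end

  # Record immediately before execution begins.
  def record_start
    @counts[:start] += 1
  end

  # Record only after the job finishes successfully.
  def record_ok
    @counts[:ok] += 1
  end

  # Failures are derived, never directly recorded.
  def inferred_failures
    @counts[:start] - @counts[:ok]
  end
end

stats = JobStats.new
stats.record_start
stats.record_ok          # job 1: completed normally
stats.record_start       # job 2: OOM-killed mid-run, no "ok" recorded
stats.inferred_failures  # => 1
```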
Could we implement something similar (or change the semantics of `executions` to increment on checkout, for instance?) so that there would be some protection against those hard kills?
This could imply a change to the dashboard display: a job shown as "executing" might really mean "executing, or the executing system has been killed or hung". Alternatively, it could just change where `executions` gets incremented, avoiding the endless restarts.
Curious to know what the options would be.
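A hypothetical sketch of the increment-on-checkout semantics, under the assumption that the attempt counter moves when the job is fetched rather than when a failure is recorded (class, constant, and method names are illustrative, not any library's actual API):

```ruby
# Because `executions` is incremented at checkout, a hard kill still
# consumes an attempt, and the poison pill is discarded after the limit.
class JobRecord
  MAX_ATTEMPTS = 3

  class Discarded < StandardError; end

  attr_reader :executions

  def initialize
    @executions = 0
  end

  # Checkout increments `executions` *before* the job runs. If the
  # process is OOM-killed mid-run, the attempt was already counted.
  def checkout
    raise Discarded, "too many attempts" if @executions >= MAX_ATTEMPTS
    @executions += 1
    self
  end
end

job = JobRecord.new
3.times { job.checkout }   # three attempts, each hard-killed mid-run
begin
  job.checkout             # fourth fetch: the job is discarded
rescue JobRecord::Discarded
  # never runs again; no endless restart loop
end
```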
I think that might address your need... unless the termination happens while Active Job is deserializing the arguments (I could imagine that hydrating a huge number of GlobalID objects could wreck it), before Active Job's execution callbacks are invoked.
I also like your suggestion of atomically incrementing a value when the job is first fetched. I'm trying to defer that sort of thing until #831 happens, but if it's a dealbreaker I don't want to defer it too long.
What is this black sorcery 🧙 OK, we'll try that new interrupt exception; I think it's sane enough to configure it for all jobs, actually (at least for us). Incrementing a "checkout counter" could be done as part of the dequeue `SELECT` itself, which would make it more reliable, but would require more intervention.
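One way the dequeue-with-increment could look, sketched as a single Postgres statement held in a Ruby constant. The table and column names are illustrative assumptions, not the library's actual schema:

```ruby
# The UPDATE both claims a row and counts the attempt in one atomic
# statement, so even a hard kill cannot lose the increment, and the
# attempts limit is enforced at dequeue time rather than on failure.
CHECKOUT_SQL = <<~SQL
  UPDATE jobs
  SET executions = executions + 1,
      performed_at = now()
  WHERE id = (
    SELECT id
    FROM jobs
    WHERE finished_at IS NULL
      AND executions < 3          -- attempts limit checked at dequeue
    ORDER BY scheduled_at
    FOR UPDATE SKIP LOCKED        -- concurrent workers skip claimed rows
    LIMIT 1
  )
  RETURNING *;
SQL
```

`FOR UPDATE SKIP LOCKED` lets concurrent workers pass over rows another worker has already claimed, which is the standard Postgres pattern for queue dequeue.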