[help] `error = "continue"` does not apply if a job is out of memory; the whole pipeline crashes
#1214
-
In case it helps, I also get OOM errors using crew.cluster (or batchtools) on SLURM when storage is full. I'll see if I can brew up a reprex a little later today; the same reprex could also have a memory-bomb option. It happens when workers write to /tmp, which is a tmpfs on our cluster, so the OOM errors make sense. That was hard to troubleshoot. The failure should be isolated to the worker, but I have a feeling it takes down the whole pipeline, as you've seen - I'll check that with the reprex. It also happens when workers write to /home, which is a network share on our system and therefore shared with the main controller. No surprise that one takes down the whole pipeline, though in our cluster that version is a bit easier to troubleshoot, unless cleanup processes remove the evidence. I think NFS would 'clean up' any files in /home that were left unclosed when the worker went down, which would free up space on the network share and make it look like it never happened.
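One way to sidestep the tmpfs-backed /tmp issue described above is to point worker temporary files at node-local scratch. This is only a sketch: it assumes crew.cluster's `crew_controller_slurm()` and its `script_lines` argument (check the version you have installed), and the scratch path `/scratch/$USER/$SLURM_JOB_ID` is hypothetical - substitute whatever local disk your cluster provides.

```r
# Config sketch: redirect worker temp files away from tmpfs /tmp.
library(crew.cluster)

controller <- crew_controller_slurm(
  workers = 4,
  script_lines = c(
    # Hypothetical local-scratch path; adjust for your cluster.
    "export TMPDIR=/scratch/$USER/$SLURM_JOB_ID",
    "mkdir -p \"$TMPDIR\""
  )
)
```

With `TMPDIR` set in the job script, `tempdir()` and `tempfile()` in the worker R session write to local disk instead of tmpfs, so large temp files no longer count against the job's memory.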
-
Help
Description
There is a job for which it is hard to predict memory usage. That job is likely causing the `callr subprocess failed` error. I don't know for sure, because the offending job ID is not returned. However, my issue is that the whole pipeline crashes, even though I specified `error = "continue"` in `tar_option_set()`. I would prefer the whole pipeline (3-5 days of execution, with 15000+ targets) to continue, so as to save days in case I am far away from the computer. Error below.
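For reference, the setting in question looks like this. A caveat, hedged as my own reading rather than documented behavior: `error = "continue"` can only trap errors raised inside R, so a worker process killed outright by the scheduler (e.g. an OOM kill) may be outside its reach, which would be consistent with the crash described above.

```r
# Config sketch: record errored targets and keep building the rest.
library(targets)

tar_option_set(
  error = "continue"  # do not stop the pipeline when a target errors
)
```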