Retry slurm-commands on error #303

Open · mb706 opened this issue Oct 8, 2024 · 0 comments

mb706 commented Oct 8, 2024

I am working on a cluster where squeue and sbatch sometimes fail, for whatever reason.

Submitting 1080000 jobs in 1080 chunks using cluster functions 'Slurm' ...
Submitting [========>------------------------------------------]  17% eta: 11mError: Fatal error occurred: 101. Command 'sbatch' produced exit code 1. Output: 'sbatch: error: Invalid user for SlurmUser slurm, ignored
sbatch: fatal: Unable to process configuration file'

Submitting 15000000 jobs in 7500 chunks using cluster functions 'Slurm' ...
Submitting [===========================>-----------------------]  55% eta:  4hError: Listing of jobs failed (exit code 1);
cmd: 'squeue --user=$USER --states=R,S,CG,RS,SI,SO,ST --noheader --format=%i -r'
output:
squeue: error: Invalid user for SlurmUser slurm, ignored
squeue: fatal: Unable to process configuration file

These errors are transient; after resubmitting the jobs, everything continues as it should. It would be nice to have an option for this to happen automatically, so that one could let batchtools submit jobs overnight. My suggestion would be to add an option that retries Slurm commands X times with a pause of Y seconds in between (possibly with exponential backoff).
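For illustration only, a rough R sketch of the kind of wrapper I mean (the name retry_system, its defaults, and the direct use of system2 are made up here, not existing batchtools code):

```r
# Sketch: retry a system command a few times with exponential backoff.
# Hypothetical helper, not batchtools API.
retry_system <- function(command, args = character(), retries = 5L, pause = 5) {
  status <- NULL
  for (i in seq_len(retries)) {
    res <- suppressWarnings(system2(command, args, stdout = TRUE, stderr = TRUE))
    status <- attr(res, "status")
    if (is.null(status) || status == 0L)
      return(res)                    # success: return the captured output lines
    if (i < retries)
      Sys.sleep(pause * 2^(i - 1))   # backoff: pause, 2*pause, 4*pause, ...
  }
  stop(sprintf("'%s' still failing after %d attempts (last exit code %s)",
               command, retries, status))
}

# e.g. retry_system("squeue", c("--noheader", "--format=%i", "-r"))
```

The same pattern would apply to sbatch and squeue calls alike; the retry count and base pause could be exposed as options in the Slurm cluster functions.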
