Jobs stuck at "running" #86

Open
mpermana opened this issue Apr 24, 2020 · 2 comments
This can happen when a job is running and the ndscheduler process dies.

To reproduce: create a shell job that sleeps for a while, e.g.:
["bash","-c","sleep 3600"]

While the job is running, send a kill signal to the ndscheduler process; the next time ndscheduler starts, the job will be stuck at "running".
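
Rough illustration of why the execution row never leaves "running": the state is recorded before the subprocess runs and the final state is only written after it returns, so a kill in between leaves the row untouched. The dictionary and function below are made-up stand-ins for ndscheduler's actual datastore calls, not its real API.

```python
# Hypothetical sketch of the failure mode; the in-memory dict stands in for
# the scheduler database, and none of these names are ndscheduler's real API.
import subprocess

executions = {}  # execution_id -> state

def run_job(execution_id, cmd):
    executions[execution_id] = "running"    # state recorded before the job starts
    subprocess.check_call(cmd)              # if ndscheduler is killed here...
    executions[execution_id] = "succeeded"  # ...this update never happens

run_job("demo", ["bash", "-c", "sleep 3600"])
```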

palto42 commented Jun 30, 2020

I can confirm this behavior.
What would be needed is a database cleanup at the start of ndscheduler that changes the status of those jobs to "failed", since they most likely did not complete.
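
A minimal sketch of such a startup cleanup, assuming the default SQLite datastore; the database path, table name, column name, and state values below are assumptions and would need to be checked against ndscheduler's datastore code and constants.

```python
# Startup cleanup sketch: flip executions still flagged as running over to failed.
# Table/column names and state values are assumptions, not ndscheduler's verified schema.
import sqlite3

def mark_interrupted_executions_failed(db_path="datastore.db"):
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "UPDATE scheduler_execution SET state = ? WHERE state = ?",
            ("failed", "running"),
        )
    conn.close()

# Call once before the scheduler starts accepting new runs:
# mark_interrupted_executions_failed()
```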

palto42 commented Jul 4, 2020

I submitted PR #90, which cleans such interrupted executions from the database.

In my case the interruption was caused by running ndscheduler via a systemd unit, which sends SIGTERM on stop/restart rather than the SIGINT that ndscheduler expects. One option is to change the stop signal in the systemd unit to SIGINT to ensure a graceful stop of ndscheduler. Another is to handle SIGTERM in server.py alongside the existing SIGINT handler, as sketched below.
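
A minimal sketch of the second alternative, assuming the SIGINT handler in server.py can simply be registered for SIGTERM as well; the handler body here is illustrative, not ndscheduler's actual shutdown code.

```python
import signal

def shutdown_handler(signum, frame):
    # Whatever graceful-shutdown logic currently runs on SIGINT would go here
    # (e.g. stopping the IOLoop and recording final execution states).
    raise SystemExit(0)

signal.signal(signal.SIGINT, shutdown_handler)
signal.signal(signal.SIGTERM, shutdown_handler)  # so systemd's default stop signal also shuts down cleanly
```

For the systemd route, setting KillSignal=SIGINT in the unit's [Service] section makes systemd send SIGINT instead of SIGTERM on stop.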
