Suspending jobs with SGE will kill job #1656

Closed
grst opened this issue Jun 27, 2020 · 2 comments

@grst
Contributor

grst commented Jun 27, 2020

Bug report

Expected behavior and actual behavior

On our HPC, we use the SGE scheduler and have implemented the "long queue" as a subordinate queue. The jobs in a subordinate queue will get suspended (s state) when the "short queue" gets busy and will be resumed once it is no longer busy.

Due to the -notify option, SGE sends the SIGUSR1 signal to the processes before suspending them, and the task ends with exit status 138. Nextflow treats this as an error and kills the jobs.
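For reference, the 138 follows the usual shell convention that a process terminated by signal N is reported with exit status 128 + N. A minimal sketch, assuming bash on Linux (where SIGUSR1 is signal 10; the number differs on some other platforms), and not taken from Nextflow's own wrapper script:

```shell
#!/usr/bin/env bash
# Sketch: show how death by a signal becomes an exit status of 128 + N.

# The child shell sends SIGUSR1 to itself ($$ is expanded inside the child,
# because the command string is single-quoted).
bash -c 'kill -USR1 $$'
status=$?

# `kill -l <exit status>` decodes the status back to a signal name.
signal_name=$(kill -l "$status")

echo "exit status:    $status"
echo "decoded signal: $signal_name"
```

On a CentOS-like Linux system this should report 138 and USR1, which matches the exit status in the program output.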

Expected behaviour:
Keep the jobs running, as they did not fail - they will resume later.

Connection to previous issues:
It seems this was already discussed in #1001 some time ago and closed without a solution, because at that time Nextflow didn't have a concept of suspended jobs. As far as I can tell, the s status has been implemented for SGE/UGE in #1536; however, Nextflow 20.04.1 still kills the jobs.

Possible solution:
I know that 138 is also sent when the soft resource is reached and Nextflow considers this worth killing the jobs. Either reconsider this entirely or possibly make the behaviour configurable?

Steps to reproduce the problem

Minimal nextflow script:

#!/usr/bin/env nextflow

process test {
    input:
       val foo from Channel.from("bar")

    """
    sleep 600
    """
}

Run the script, then suspend the job:

./test.nf
# find the SGE job ID and suspend the job manually
qmod -s $JOB_ID

Program output

N E X T F L O W  ~  version 20.04.1
Launching `./test.nf` [elegant_bell] - revision: 9466ee37b4
executor >  sge (1)
[43/5e2050] process > test (1) [100%] 1 of 1, failed: 1 ✘
Error executing process > 'test (1)'

Caused by:
  Process `test (1)` terminated with an error exit status (138)

Command executed:

  sleep 600

Command exit status:
  138

Command output:
  (empty)

Work dir:
  /data/scratch/sturm/scratch/work/43/5e2050be59067acb9402bf97a4cc33

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

nextflow.log

Environment

  • Nextflow version: version 20.04.1 build 5335
  • Java version: 1.8.0_231
  • Operating system: CentOS 7
  • Bash version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)

CC @riederd

@grst
Contributor Author

grst commented Jun 29, 2020

I know that 138 is also sent when the soft resource is reached and nextflow considers this
worth killing the jobs.

UPDATE: this is not the case. SIGUSR1 (138) is only sent before SIGSTOP, i.e. before suspension. Before SIGKILL, the signal SIGUSR2 (140) is sent.

Therefore, I believe it's safe to ignore the 138 signal.

 -notify
          Available for qsub, qrsh (with command) and qalter only.

          This flag, when set causes Sun Grid Engine to send "warning"
          signals to a running job prior to sending the signals themselves.
          If a SIGSTOP is pending, the job will receive a SIGUSR1 several
          seconds before the SIGSTOP. If a SIGKILL is pending, the job will
          receive a SIGUSR2 several seconds before the SIGKILL. This option
          provides the running job, before receiving the SIGSTOP or SIGKILL,
          a configured time interval to do e.g. cleanup operations. The
          amount of time delay is controlled by the notify parameter in each
          queue configuration (see queue_conf(5)).

(from man qsub)
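To illustrate the distinction the man page draws, here is a minimal sketch of a job script that reacts differently to the two warning signals. It is a hypothetical example, not Nextflow's actual wrapper: the handler names and counters are made up for illustration, and the script signals itself to stand in for the scheduler.

```shell
#!/usr/bin/env bash
# Hypothetical job script (not Nextflow's wrapper): handle the two
# -notify warning signals differently.

suspend_warnings=0
kill_pending=0

# SIGUSR1: a SIGSTOP is coming. The job will resume later, so only note it.
on_suspend_warning() { suspend_warnings=$((suspend_warnings + 1)); }

# SIGUSR2: a SIGKILL is coming. A real script would do its cleanup here.
on_kill_warning() { kill_pending=1; }

trap on_suspend_warning USR1
trap on_kill_warning USR2

# Simulate the scheduler's suspension warning by signalling this shell.
kill -USR1 $$
echo "after SIGUSR1: suspend_warnings=$suspend_warnings kill_pending=$kill_pending"

# Simulate the scheduler's kill warning.
kill -USR2 $$
echo "after SIGUSR2: suspend_warnings=$suspend_warnings kill_pending=$kill_pending"
```

The point of the sketch is only that SIGUSR1 can be absorbed without ending the job, while SIGUSR2 is the one that announces termination.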

@pditommaso
Member

It looks like SIGUSR1 is used to notify suspension and SIGUSR2 to signal a kill. Therefore SIGUSR1 should be removed 👍
