
Allow setting rerunnable from metadata['options'] #4707

Merged

merged 5 commits into aiidateam:develop from allow_setting_rerunnable on Aug 15, 2021

Conversation

greschd
Member

@greschd greschd commented Feb 4, 2021

The rerunnable option is currently implemented in the scheduler plugins (LSF, PBS, SGE, SLURM), but there was no way to activate it for a particular CalcJob. This PR adds a metadata.options.rerunnable flag that is forwarded to the JobTemplate.

This is a follow-up on a discussion on Slack with @giovannipizzi and @espenfl.

Important points raised there:

  • it might depend on scheduler settings of the cluster whether re-queueing gives the same job id
  • (depending again on the scheduler config) it's possible that a code would execute more than once, so it should only be applied for idempotent codes

I've now been running this in production for a while, and did not notice any adverse effects. In our case, submitting jobs to SLURM can trigger the allocation of new nodes; if that fails for some reason, SLURM needs to re-queue the jobs such that it can assign different nodes to the job.
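As a rough illustration of the plumbing this PR adds, here is a minimal, self-contained sketch (not the actual aiida-core code; `JobTemplate` aside, all names are illustrative assumptions) of forwarding a `rerunnable` option into the scheduler-facing job template:

```python
from dataclasses import dataclass

@dataclass
class JobTemplate:
    """Scheduler-agnostic job description; ``rerunnable`` is the new field."""
    rerunnable: bool = False

def presubmit(options: dict) -> JobTemplate:
    # The engine forwards the option verbatim; each scheduler plugin then
    # decides what (if anything) to write into the submit-script header.
    return JobTemplate(rerunnable=options.get('rerunnable', False))

print(presubmit({'rerunnable': True}).rerunnable)  # True
print(presubmit({}).rerunnable)                    # False
```

The point of routing the flag through the template is that the engine stays scheduler-agnostic: only the individual plugins know how to translate it into their submit-script syntax.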

Some open questions:

  • is there a good place to test this? I looked at test_calc_job.py, but since the option doesn't affect the direct scheduler there's no way to check it was used. Do we have a calcjob test with different schedulers?
  • related to the first point, should the direct scheduler warn when the (unused) rerunnable flag is set?
  • looking at the SGE implementation, I'm not sure if that expects a bool (as do all the other schedulers), or a str.

@codecov

codecov bot commented Feb 4, 2021

Codecov Report

Merging #4707 (ba1fa63) into develop (3915f44) will increase coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head ba1fa63 differs from pull request most recent head b116c34. Consider uploading reports for the commit b116c34 to get more accurate results

@@             Coverage Diff             @@
##           develop    #4707      +/-   ##
===========================================
+ Coverage    80.42%   80.43%   +0.02%     
===========================================
  Files          531      531              
  Lines        36969    36973       +4     
===========================================
+ Hits         29730    29737       +7     
+ Misses        7239     7236       -3     
Flag Coverage Δ
django 74.92% <100.00%> (+0.02%) ⬆️
sqlalchemy 73.87% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
aiida/engine/processes/calcjobs/calcjob.py 88.71% <100.00%> (+0.04%) ⬆️
aiida/schedulers/plugins/direct.py 62.88% <100.00%> (+1.67%) ⬆️
aiida/schedulers/plugins/sge.py 70.84% <100.00%> (+0.55%) ⬆️
aiida/engine/daemon/client.py 75.26% <0.00%> (-1.01%) ⬇️
aiida/transports/plugins/local.py 81.41% <0.00%> (-0.25%) ⬇️
aiida/schedulers/plugins/pbsbaseclasses.py 57.19% <0.00%> (+0.33%) ⬆️
aiida/schedulers/plugins/slurm.py 70.14% <0.00%> (+0.35%) ⬆️
aiida/schedulers/plugins/lsf.py 70.48% <0.00%> (+0.40%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3915f44...b116c34.

@giovannipizzi
Member

Thanks @greschd !

A few comments:

  • it would be good to add documentation for this, and in particular also an explanation that this was tested with SLURM and it works fine, that it's not implemented in direct, and that for the others it's untested (in particular, it might generate a new job ID and so AiiDA would not be able to 'pick up' the rerun job), plus the other note on idempotent jobs. Maybe you could even open an issue, link to this PR, and point people to that issue from the docs, asking them to report if they use a different scheduler with rerun so we can consider this "tested" in the future

  • for testing, you could split the test in two units: one in the place you suggest, checking that the job template attribute is set, just calling presubmit; and (more than) one for each scheduler plugin, to check that they write the correct string in the header if the rerunnable flag is set to True

  • Regarding the direct scheduler - I think it's not crucial, but indeed since by default the flag is set to False, it's ok to issue a warning there if rerunnable is true, and continue

  • regarding SGE, here it says:

      -r y[es]|n[o]
        Available for qsub and qalter only.
    
        Identifies the ability of a job to be rerun or not.  If
        the  value of -r is 'yes', the job will be rerun if the
        job was  aborted  without  leaving  a  consistent  exit
        state.   (This  is  typically  the  case if the node on
        which the job is running crashes).  If -r is 'no',  the
        job will not be rerun under any circumstances.
        Interactive jobs submitted with qsh, qrsh or qlogin are
        not rerunnable.
    
        Qalter allows changing this option even while  the  job
        executes.
    
        The value specified with this option or the correspond-
        ing  value  specified  in  qmon  will only be passed to
        defined JSV instances if the value is yes.  The name of
        the  parameter will be r. The value will be y also when
        then long form yes  was  specified  during  submission.
        (see  -jsv  option  above or find more information con-
        cerning JSV in jsv(1))
    

    So my guess is that the current implementation might even be wrong (I think the combination SGE+rerunnable was never tested) - I suggest you just replace it with

         if job_tmpl.rerunnable:
             lines.append('#$ -r yes')
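The per-plugin half of the test split suggested above could look roughly like this (a hedged sketch with made-up names, not the real test suite; the toy `sge_submit_header` function stands in for the plugin's actual header generation):

```python
def sge_submit_header(rerunnable: bool) -> list:
    """Toy stand-in for an SGE-style plugin's submit-script header generation."""
    lines = ['#!/bin/bash']
    if rerunnable:
        lines.append('#$ -r yes')
    return lines

def test_rerunnable_sets_header():
    # Plugin-level check: the flag translates into the right header line.
    assert '#$ -r yes' in sge_submit_header(True)

def test_not_rerunnable_omits_header():
    assert '#$ -r yes' not in sge_submit_header(False)

test_rerunnable_sets_header()
test_not_rerunnable_omits_header()
```

The template-level half (checking after `presubmit` that the job template attribute is set) would then live next to the other calcjob tests, independent of any specific scheduler.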

@greschd
Member Author

greschd commented Feb 4, 2021

Thanks @giovannipizzi, great points. One quick question: what's the best place to put the documentation? From a quick glance, either in the options paragraph or a separate paragraph after dry run makes sense to me, but I'm not very familiar with the organization of the docs.

that this was tested with SLURM and it works fine

If I understood @espenfl's point correctly, we should say that it works fine with one specific SLURM setup. People should do their own testing because SLURM configurations can differ.

@giovannipizzi
Member

What's the best place to put the documentation? From a quick glance, either in the options paragraph or a separate paragraph after dry run makes sense to me, but I'm not very familiar with the organization of the docs.

Indeed, definitely in the options paragraph; if all the discussion fits in there, great. Otherwise a new paragraph with the more technical details (linked from the options paragraph) is also a good idea.

If I understood @espenfl's point correctly, we should say that it works fine with one specific SLURM setup. People should do their own testing because SLURM configurations can differ.

I see, good point - great to mention that, then

@espenfl
Contributor

espenfl commented Feb 4, 2021

Yes, SLURM setups are different (as are the other schedulers as well). Also, maybe this is a good time to put some life into my backburner scheduler container work. It was more or less done, just needed to integrate them with the scheduler code in AiiDA and introduce some tests based on those. I will try to do some of that next week.

Also, thanks a lot for adding this @greschd.

@greschd
Member Author

greschd commented Feb 4, 2021

Regarding SGE: Do you know if the default is no, or should we add '#$ -r no' if rerunnable is False? From that man page, it wasn't 100% clear to me.

@greschd
Member Author

greschd commented Feb 8, 2021

Regarding SGE: Do you know if the default is no, or should we add '#$ -r no' if rerunnable is False? From that man page, it wasn't 100% clear to me.

Note that I've now added the #$ -r no if rerunnable is False, but that can easily be reverted if desired. The benefit of adding it is that the rerunnable option is definitely off when set to False, but it may break a setup that inadvertently relied on it being turned on somewhere in the cluster's settings. I think that is ok, because the "correct" fix would then be to turn on rerunnable explicitly.

@greschd greschd force-pushed the allow_setting_rerunnable branch 3 times, most recently from e7e58f3 to 01bee2b Compare February 8, 2021 16:13
@ramirezfranciscof ramirezfranciscof self-assigned this Apr 7, 2021
@ramirezfranciscof
Member

Hey @greschd , thanks for the contribution! Sorry it got a bit buried 😅

I've taken a look and it all seems good; I don't know if you wanted a more specific greenlight for someone in particular, but feel free to tag them if that was the case, otherwise I can approve once the branch is updated. The conflicts in tests/schedulers/test_lsf.py are because I recently moved the tests out of the old format into normal pytest functions, I think you can just copy your added test method as it is outside of its current class and it should work (it isn't even using self anywhere else). If you don't currently have the time, let me know and I can do the update.

The only thing I'm a bit wary about is this:

Note that I've now added the #$ -r no if rerunnable is False, but that can easily be reverted if desired. The benefit of adding it is that the rerunnable option is definitely off when set to False, but it may break a setup that inadvertently relied on it being turned on somewhere in the cluster's settings. I think that is ok, because the "correct" fix would then be to turn on rerunnable explicitly.

This is true as long as the schedulers didn't have / will not have any sort of modification that will change how this option works or is provided (for example, perhaps previously they only admitted #$ --rerunable no/yes or #$ -r disable/enable, or maybe the option wasn't even there yet). If this was the case, a "no keyword" approach would still work with old versions but only allowing for explicit enabling/disabling may break without any possible solution for the user. I'm not sure how common this is with these schedulers and if this is a scenario worth "code-considering", but I thought at least it was worth a brief "design-considering" =P.

I guess the super-safe road would be to default these options to None, in which case they are ignored, and allow the user to set both False or True so they can make sure they are being applied in both ways.

@greschd
Member Author

greschd commented May 21, 2021

This is true as long as the schedulers didn't have / will not have any sort of modification that will change how this option works or is provided (for example, perhaps previously they only admitted #$ --rerunable no/yes or #$ -r disable/enable, or maybe the option wasn't even there yet). If this was the case, a "no keyword" approach would still work with old versions but only allowing for explicit enabling/disabling may break without any possible solution for the user. I'm not sure how common this is with these schedulers and if this is a scenario worth "code-considering", but I thought at least it was worth a brief "design-considering" =P.

Yeah, that is exactly the point where I am not 100% sure. I think it would be good to get buy-in from @giovannipizzi on this.

I guess the super-safe road would be to default these options to None, in which case they are ignored, and allow the user to set both False or True so they can make sure they are being applied in both ways.

The main difficulty here is that not all schedulers treated None the same so far -- PBS didn't emit anything, whereas SLURM and LSF added what is now the False option. So it's a trade-off between consistency and strict backwards-compatibility.

We could just say that None is "legacy behavior", and do some deprecation path. But then you'll end up emitting warnings to users for what may end up being no good reason.

@ramirezfranciscof
Member

ramirezfranciscof commented May 25, 2021

The main difficulty here is that not all schedulers treated None the same so far -- PBS didn't emit anything, whereas SLURM and LSF added what is now the False option. So it's a trade-off between consistency and strict backwards-compatibility.

We could just say that None is "legacy behavior", and do some deprecation path. But then you'll end up emitting warnings to users for what may end up being no good reason.

So to make sure I understood this correctly: you mean that some scheduler plugins received the option=None and wrote #$ option = False in the running script, whereas others didn't write anything with the same option=None input?

If this is the case, I would say there are two cases:

  1. The default behavior of the scheduler is the same when the option is not present or when it is set to False (or whatever the scheduler plugin defaults to). If this is the case I would argue there is no need to do a deprecation path for the change. I mean, it is true that we change what we write in the script, but if the resulting effect on the scheduler is the same then there is not really any observable change occurring from the point of view of the user. One could say that we need to deprecate because the result we have control over is the writing of the script, not what the scheduler does with it: I would counter that if the scheduler at some point changes what the default behavior is for the missing keyword, then it is the scheduler that is breaking backwards compatibility, not us. Not sure if one argument "objectively" wins over the other, but I like mine better 🤓 .
  2. The default behavior of the scheduler is NOT the same. Indeed here a warning should be issued and a deprecation path provided. This could generate some friction and is a little extra work, but I think consistency in internal behavior saves you many more headaches in the long run.

In any case, I now understand that perhaps this should be considered a separate discussion and we could open an issue to get other people's input on this. If you think it is a good idea, I may write that later and ping you to check if my description of the situation is correct (or if you think it would be easier to just write it yourself feel free he he 😅 ).

@greschd
Member Author

greschd commented May 25, 2021

So to make sure I understood this correctly: you mean that some scheduler plugins received the option=None and wrote #$ option = False in the running script, whereas others didn't write anything with the same option=None input?

Exactly. Also agree with your analysis of the two options. Since we're only changing behavior for the SGE plugin, we need to determine the defaults for the SGE scheduler to see if we're in scenario 1. or 2.

@greschd
Member Author

greschd commented May 25, 2021

According to its manual, at least the current default for SGE is -R n (case 1.):

     -R y[es]|n[o]
          Available for qsub, qrsh, qsh, qlogin and qalter.

          Indicates whether a reservation for this job should  be
          done.  Reservation  is  never  done for immediate jobs,
          i.e. jobs submitted using the -now yes option.   Please
          note  that  regardless  of the reservation request, job
          reservation might be disabled using max_reservation  in
          sched_conf(5)  and  might  be limited only to a certain
          number of high priority jobs.

          By default jobs are submitted with the -R n option.

          The value specified with this option or the correspond-
          ing  value  specified  in  qmon  will only be passed to
          defined JSV instances if the value is yes.  The name of
          the  parameter will be R. The value will be y also when
          then long form yes  was  specified  during  submission.
          (see  -jsv  option  above or find more information con-
          cerning JSV in jsv(1))

I'd wager to say that this means we can merge this unchanged, but it would be good to get an additional pair of eyes on this (e.g. @giovannipizzi, since he was involved in the initial discussion).

@chrisjsewell
Member

@sphuber mentioned that @atztogo may use the SGE scheduler. Perhaps if he has time, he could quickly comment if he sees any reason that this would break anything for him?
Otherwise, I think this looks good to me thanks 👍

@giovannipizzi
Member

I agree that, according to the docs we could find this looks good to me.
I'd wait a few days in case @atztogo can confirm that this (i.e. explicitly adding -R n in the SGE header) does not break anything in his workflows.

Otherwise OK to merge for me

@atztogo
Contributor

atztogo commented Jun 2, 2021

I don't fully understand the discussion on SGE here (a little bit too technical for me), but I think there is no impact for me. The suggestion in @giovannipizzi's first comment is probably the most natural. I leave my SGE setting as below.

I didn't know that this option exists... Currently I use SGE installed on ubuntu 18.04 (debian package of 8.1.9+dfsg-4+deb9u2). man qsub gives (not -R but -r if I understand correctly.)

       -r y[es]|n[o]
              Available for qsub and qalter only.

              Specifies whether a job can be rerun or not.  If the value of -r
              is 'yes', the job will be rerun if it gets aborted without leav‐
              ing a consistent exit state.  (This is typically the case if the
              node  on  which the job is running crashes).  If -r is 'no' (the
              default), the job will not be rerun  under  such  circumstances.
              It  will  still be rerun if it finishes with exit code 99 unless
              FORBID_RESCHEDULE is set in qmaster_params in sge_conf(5).

              Interactive jobs submitted with qsh, qrsh,  or  qlogin  are  not
              rerunnable.

              Qalter allows changing this option even while the job executes.

              The  value specified with this option or the corresponding value
              specified in qmon will only be passed to defined  JSV  instances
              if  the  value is yes.  The name of the parameter will be r. The
              value will be y also when the long form yes was specified during
              submission.   (See  -jsv  option  above or find more information
              concerning JSV in jsv(1).)

The default behaviour (i.e., without writing #$ -r [yes|no]) for each queue setting (a queue means a set of parameters such as resources, parallelization, etc.) seems configurable. Some parameters are overridden by the job script's header part. I haven't ever touched this rerun / -r option, so my current setting is FALSE as written in the man page, which can be confirmed by

% sudo qconf -sq queue-name
...
rerun                 FALSE
...

I tried running with #$ -r no and #$ -r yes to see how it affects a job, and looked for any SGE log on this, but could not find information on the status of this option (though there may be some way to find it).

@atztogo
Contributor

atztogo commented Jun 3, 2021

The suggestion in the first @giovannipizzi's comment is probably most natural. I leave my SGE setting as below.

This was not very correct. Rather, the below with rerunnable=None as the default for SGE sounds nice to me,


     if job_tmpl.rerunnable is True:
         lines.append('#$ -r yes')
     elif job_tmpl.rerunnable is False:
         lines.append('#$ -r no')
     elif job_tmpl.rerunnable is None:
         pass
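Packaged as a small self-contained helper, the tri-state handling sketched in that suggestion (with None meaning "emit nothing and defer to the queue's configured default") behaves like this; the function name is an illustrative assumption, not the actual plugin code:

```python
from typing import List, Optional

def sge_rerunnable_lines(rerunnable: Optional[bool]) -> List[str]:
    """Translate a tri-state rerunnable flag into SGE submit-script header lines."""
    if rerunnable is True:
        return ['#$ -r yes']
    if rerunnable is False:
        return ['#$ -r no']
    # None: emit nothing, so the queue's own configured default applies.
    return []

print(sge_rerunnable_lines(None))   # []
print(sge_rerunnable_lines(True))   # ['#$ -r yes']
```

The appeal of this variant is exactly the point debated above: True and False are both explicit, while None leaves the cluster configuration in charge.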

@giovannipizzi
Member

Thanks @atztogo - I realise now that @greschd's comment was misleading ;-D he reported the -R option (reservation) rather than -r (rerunnable). So from Togo's comment it seems indeed that there is no SGE default.

So, just so it's clear to me: the "problem" with Togo's last suggestion, i.e. allowing also a None value to mean "don't specify => use the scheduler's default", would be that the behaviour differs from the other schedulers (where instead the default is enforced to be no)?

Maybe it's better to do the opposite then, and do not specify at all the rerunnable option in other plugins, if the value is not specified or None? Or there is too much of a risk to create other subtle problems with the other schedulers?

@greschd
Member Author

greschd commented Jun 3, 2021

Thanks @atztogo - I realise now that @greschd 's commment was misleading ;-D he reported indeed the -R option (reservation) rather than -r (rerunnable). So from Togo's comment it seems indeed that there is no SGE default.

Yeah, copied the wrong part 😅

Maybe it's better to do the opposite then, and do not specify at all the rerunnable option in other plugins, if the value is not specified or None? Or there is too much of a risk to create other subtle problems with the other schedulers?

This just creates the opposite problem, since other scheduler plugins did set a default for None. It's an inherent tradeoff between being consistent across schedulers or keeping exact backwards compatibility.

@sphuber
Contributor

sphuber commented Aug 12, 2021

Since this has been open for a long time, I have given it a rough look. Please correct me if I'm wrong but I think that adding the new option does not change the current behavior, since the default is to have rerunnable = False. Since you also added documentation to explain what the pre-conditions are for this flag to work with AiiDA, I would be tempted to accept and merge this. @giovannipizzi and @ramirezfranciscof any objections?

giovannipizzi previously approved these changes Aug 13, 2021
Member

@giovannipizzi giovannipizzi left a comment

Thanks @greschd , and thanks @sphuber for pinging! After solving the conflicts, it's good to merge for me!

@espenfl
Contributor

espenfl commented Aug 13, 2021

Nice to see this merged. Thanks a lot for the work and discussions.

@greschd
Member Author

greschd commented Aug 15, 2021

Thanks @greschd , and thanks @sphuber for pinging! After solving the conflicts, it's good to merge for me!

Rebased, and good to merge if checks pass from my side.

Contributor

@sphuber sphuber left a comment

Thanks @greschd

@sphuber sphuber merged commit 74571a1 into aiidateam:develop Aug 15, 2021
@sphuber sphuber deleted the allow_setting_rerunnable branch August 15, 2021 12:23
7 participants