
Allow setting rerunnable from metadata['options'] #4707

Merged

merged 5 commits into aiidateam:develop from allow_setting_rerunnable on Aug 15, 2021

Conversation

greschd
Member

@greschd greschd commented Feb 4, 2021

The rerunnable option is currently implemented in the scheduler plugins (LSF, PBS, SGE, SLURM), but there was no way to activate it for a particular CalcJob. This PR adds a metadata.options.rerunnable flag that is forwarded to the JobTemplate.

This is a follow-up on a discussion on Slack with @giovannipizzi and @espenfl.

Important points raised there:

  • it might depend on scheduler settings of the cluster whether re-queueing gives the same job id
  • (depending again on the scheduler config) it's possible that a code would execute more than once, so it should only be applied for idempotent codes

I've now been running this in production for a while, and did not notice any adverse effects. In our case, submitting jobs to SLURM can trigger the allocation of new nodes; if that fails for some reason, SLURM needs to re-queue the jobs such that it can assign different nodes to the job.
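As a rough illustration of the plumbing this PR adds, here is a minimal, self-contained sketch (not the actual aiida-core code; `JobTemplate` aside, all names are illustrative assumptions) of forwarding a `rerunnable` option into the scheduler-facing job template:

```python
from dataclasses import dataclass

@dataclass
class JobTemplate:
    """Scheduler-agnostic job description; ``rerunnable`` is the new field."""
    rerunnable: bool = False

def presubmit(options: dict) -> JobTemplate:
    # The engine forwards the option verbatim; each scheduler plugin then
    # decides what (if anything) to write into the submit-script header.
    return JobTemplate(rerunnable=options.get('rerunnable', False))

print(presubmit({'rerunnable': True}).rerunnable)  # True
print(presubmit({}).rerunnable)                    # False
```

The point of routing the flag through the template is that the engine stays scheduler-agnostic: only the individual plugins know how to translate it into their submit-script syntax.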

Some open questions:

  • is there a good place to test this? I looked at test_calc_job.py, but since the option doesn't affect the direct scheduler there's no way to check it was used. Do we have a calcjob test with different schedulers?
  • related to the first point, should the direct scheduler warn when the (unused) rerunnable flag is set?
  • looking at the SGE implementation, I'm not sure if that expects a bool (as do all the other schedulers), or a str.

@codecov

codecov bot commented Feb 4, 2021

Codecov Report

Merging #4707 (ba1fa63) into develop (3915f44) will increase coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head ba1fa63 differs from pull request most recent head b116c34. Consider uploading reports for the commit b116c34 to get more accurate results

@@             Coverage Diff             @@
##           develop    #4707      +/-   ##
===========================================
+ Coverage    80.42%   80.43%   +0.02%     
===========================================
  Files          531      531              
  Lines        36969    36973       +4     
===========================================
+ Hits         29730    29737       +7     
+ Misses        7239     7236       -3     
Flag Coverage Δ
django 74.92% <100.00%> (+0.02%) ⬆️
sqlalchemy 73.87% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
aiida/engine/processes/calcjobs/calcjob.py 88.71% <100.00%> (+0.04%) ⬆️
aiida/schedulers/plugins/direct.py 62.88% <100.00%> (+1.67%) ⬆️
aiida/schedulers/plugins/sge.py 70.84% <100.00%> (+0.55%) ⬆️
aiida/engine/daemon/client.py 75.26% <0.00%> (-1.01%) ⬇️
aiida/transports/plugins/local.py 81.41% <0.00%> (-0.25%) ⬇️
aiida/schedulers/plugins/pbsbaseclasses.py 57.19% <0.00%> (+0.33%) ⬆️
aiida/schedulers/plugins/slurm.py 70.14% <0.00%> (+0.35%) ⬆️
aiida/schedulers/plugins/lsf.py 70.48% <0.00%> (+0.40%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3915f44...b116c34.

@giovannipizzi
Member

Thanks @greschd !

A few comments:

  • it would be good to add documentation for this, and in particular also an explanation that this was tested with SLURM and it works fine, that it's not implemented in direct, and that for the others it's untested (in particular, it might generate a new job ID and so AiiDA would not be able to 'pick up' the rerun job), plus the other note on idempotent jobs. Maybe you could even open an issue, link to this PR, and point people to that issue from the docs, asking them to report if they use a different scheduler with rerun so we can consider this "tested" in the future

  • for testing, you could split the test in two units: one in the place you suggest, checking that the job template attribute is set, just calling presubmit; and (more than) one for each scheduler plugin, to check that they write the correct string in the header if the rerunnable flag is set to True

  • Regarding the direct scheduler - I think it's not crucial, but indeed since by default the flag is set to False, it's ok to issue a warning there if rerunnable is true, and continue

  • regarding SGE, here it says:

      -r y[es]|n[o]
        Available for qsub and qalter only.
    
        Identifies the ability of a job to be rerun or not.  If
        the  value of -r is 'yes', the job will be rerun if the
        job was  aborted  without  leaving  a  consistent  exit
        state.   (This  is  typically  the  case if the node on
        which the job is running crashes).  If -r is 'no',  the
        job will not be rerun under any circumstances.
        Interactive jobs submitted with qsh, qrsh or qlogin are
        not rerunnable.
    
        Qalter allows changing this option even while  the  job
        executes.
    
        The value specified with this option or the correspond-
        ing  value  specified  in  qmon  will only be passed to
        defined JSV instances if the value is yes.  The name of
        the  parameter will be r. The value will be y also when
        then long form yes  was  specified  during  submission.
        (see  -jsv  option  above or find more information con-
        cerning JSV in jsv(1))
    

    So my guess is that the current implementation might even be wrong (I think the combination SGE+rerunnable was never tested) - I suggest you just replace it with

         if job_tmpl.rerunnable:
             lines.append('#$ -r yes')
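The per-plugin half of the test split suggested above could look roughly like this (a hedged sketch with made-up names, not the real test suite; the toy `sge_submit_header` function stands in for the plugin's actual header generation):

```python
def sge_submit_header(rerunnable: bool) -> list:
    """Toy stand-in for an SGE-style plugin's submit-script header generation."""
    lines = ['#!/bin/bash']
    if rerunnable:
        lines.append('#$ -r yes')
    return lines

def test_rerunnable_sets_header():
    # Plugin-level check: the flag translates into the right header line.
    assert '#$ -r yes' in sge_submit_header(True)

def test_not_rerunnable_omits_header():
    assert '#$ -r yes' not in sge_submit_header(False)

test_rerunnable_sets_header()
test_not_rerunnable_omits_header()
```

The template-level half (checking after `presubmit` that the job template attribute is set) would then live next to the other calcjob tests, independent of any specific scheduler.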

@greschd
Member Author

greschd commented Feb 4, 2021

Thanks @giovannipizzi, great points. One quick question: what's the best place to put the documentation? From a quick glance, either in the options paragraph or a separate paragraph after dry run makes sense to me, but I'm not very familiar with the organization of the docs.

that this was tested with SLURM and it works fine

If I understood @espenfl's point correctly, we should say that it works fine with one specific SLURM setup. People should do their own testing because SLURM configurations can differ.

@giovannipizzi
Member

What's the best place to put the documentation? From a quick glance, either in the options paragraph or a separate paragraph after dry run makes sense to me, but I'm not very familiar with the organization of the docs.

Indeed, definitely in the options paragraph; if all the discussion fits in there, great. Otherwise a new paragraph with the more technical details (linked from the options paragraph) is also a good idea.

If I understood @espenfl's point correctly, we should say that it works fine with one specific SLURM setup. People should do their own testing because SLURM configurations can differ.

I see, good point - great to mention that, then

@espenfl
Contributor

espenfl commented Feb 4, 2021

Yes, SLURM setups are different (as are the other schedulers as well). Also, maybe this is a good time to put some life into my backburner scheduler container work. It was more or less done, just needed to integrate them with the scheduler code in AiiDA and introduce some tests based on those. I will try to do some of that next week.

Also, thanks a lot for adding this @greschd.

@greschd
Member Author

greschd commented Feb 4, 2021

Regarding SGE: Do you know if the default is no, or should we add '#$ -r no' if rerunnable is False? From that man page, it wasn't 100% clear to me.

@greschd
Member Author

greschd commented Feb 8, 2021

Regarding SGE: Do you know if the default is no, or should we add '#$ -r no' if rerunnable is False? From that man page, it wasn't 100% clear to me.

Note that I've now added the #$ -r no if rerunnable is False, but that can easily be reverted if desired. The benefit of adding it is that the rerunnable option is definitely off when set to False, but it may break a setup that inadvertently relied on it being turned on somewhere in the cluster's settings. I think that is ok, because the "correct" fix would then be to turn on rerunnable explicitly.

@greschd greschd force-pushed the allow_setting_rerunnable branch 3 times, most recently from e7e58f3 to 01bee2b Compare February 8, 2021 16:13
@ramirezfranciscof ramirezfranciscof self-assigned this Apr 7, 2021
@ramirezfranciscof
Member

Hey @greschd , thanks for the contribution! Sorry it got a bit buried 😅

I've taken a look and it all seems good; I don't know if you wanted a more specific greenlight for someone in particular, but feel free to tag them if that was the case, otherwise I can approve once the branch is updated. The conflicts in tests/schedulers/test_lsf.py are because I recently moved the tests out of the old format into normal pytest functions, I think you can just copy your added test method as it is outside of its current class and it should work (it isn't even using self anywhere else). If you don't currently have the time, let me know and I can do the update.

The only thing I'm a bit wary about is this:

Note that I've now added the #$ -r no if rerunnable is False, but that can easily be reverted if desired. The benefit of adding it is that the rerunnable option is definitely off when set to False, but it may break a setup that inadvertently relied on it being turned on somewhere in the cluster's settings. I think that is ok, because the "correct" fix would then be to turn on rerunnable explicitly.

This is true as long as the schedulers didn't have / will not have any sort of modification that will change how this option works or is provided (for example, perhaps previously they only admitted #$ --rerunable no/yes or #$ -r disable/enable, or maybe the option wasn't even there yet). If this was the case, a "no keyword" approach would still work with old versions but only allowing for explicit enabling/disabling may break without any possible solution for the user. I'm not sure how common this is with these schedulers and if this is a scenario worth "code-considering", but I thought at least it was worth a brief "design-considering" =P.

I guess the super-safe road would be to default these options to None, in which case they are ignored, and allow the user to set both False or True so they can make sure they are being applied in both ways.

@greschd
Member Author

greschd commented May 21, 2021

This is true as long as the schedulers didn't have / will not have any sort of modification that will change how this option works or is provided (for example, perhaps previously they only admitted #$ --rerunable no/yes or #$ -r disable/enable, or maybe the option wasn't even there yet). If this was the case, a "no keyword" approach would still work with old versions but only allowing for explicit enabling/disabling may break without any possible solution for the user. I'm not sure how common this is with these schedulers and if this is a scenario worth "code-considering", but I thought at least it was worth a brief "design-considering" =P.

Yeah, that is exactly the point where I am not 100% sure. I think it would be good to get buy-in from @giovannipizzi on this.

I guess the super-safe road would be to default these options to None, in which case they are ignored, and allow the user to set both False or True so they can make sure they are being applied in both ways.

The main difficulty here is that not all schedulers treated None the same so far -- PBS didn't emit anything, whereas SLURM and LSF added what is now the False option. So it's a trade-off between consistency and strict backwards-compatibility.

We could just say that None is "legacy behavior", and do some deprecation path. But then you'll end up emitting warnings to users for what may end up being no good reason.

@ramirezfranciscof
Member

ramirezfranciscof commented May 25, 2021

The main difficulty here is that not all schedulers treated None the same so far -- PBS didn't emit anything, whereas SLURM and LSF added what is now the False option. So it's a trade-off between consistency and strict backwards-compatibility.

We could just say that None is "legacy behavior", and do some deprecation path. But then you'll end up emitting warnings to users for what may end up being no good reason.

So to make sure I understood this correctly: you mean that some scheduler plugins received the option=None and wrote #$ option = False in the running script, whereas others didn't write anything with the same option=None input?

If this is the case, I would say there are two cases:

  1. The default behavior of the scheduler is the same when the option is not present or when it is set to False (or whatever the scheduler plugin defaults to). If this is the case I would argue there is no need to do a deprecation path for the change. I mean, it is true that we change what we write in the script, but if the resulting effect on the scheduler is the same then there is not really any observable change occurring from the point of view of the user. One could say that we need to deprecate because the result we have control over is the writing of the script, not what the scheduler does with it: I would counter that if the scheduler at some point changes what the default behavior is for the missing keyword, then it is the scheduler that is breaking backwards compatibility, not us. Not sure if one argument "objectively" wins over the other, but I like mine better 🤓 .
  2. The default behavior of the scheduler is NOT the same. Indeed here a warning should be issued and a deprecation path provided. This could generate some friction and is a little extra work, but I think consistency in internal behavior saves you many more headaches in the long run.

In any case, I now understand that perhaps this should be considered a separate discussion and we could open an issue to get other people's input on this. If you think it is a good idea, I may write that later and ping you to check if my description of the situation is correct (or if you think it would be easier to just write it yourself feel free he he 😅 ).

@greschd
Member Author

greschd commented May 25, 2021

So to make sure I understood this correctly: you mean that some scheduler plugins received the option=None and wrote #$ option = False in the running script, whereas others didn't write anything with the same option=None input?

Exactly. Also agree with your analysis of the two options. Since we're only changing behavior for the SGE plugin, we need to determine the defaults for the SGE scheduler to see if we're in scenario 1. or 2.

@greschd
Member Author

greschd commented May 25, 2021

According to its manual, at least the current default for SGE is -R n (case 1.):

     -R y[es]|n[o]
          Available for qsub, qrsh, qsh, qlogin and qalter.

          Indicates whether a reservation for this job should  be
          done.  Reservation  is  never  done for immediate jobs,
          i.e. jobs submitted using the -now yes option.   Please
          note  that  regardless  of the reservation request, job
          reservation might be disabled using max_reservation  in
          sched_conf(5)  and  might  be limited only to a certain
          number of high priority jobs.

          By default jobs are submitted with the -R n option.

          The value specified with this option or the correspond-
          ing  value  specified  in  qmon  will only be passed to
          defined JSV instances if the value is yes.  The name of
          the  parameter will be R. The value will be y also when
          then long form yes  was  specified  during  submission.
          (see  -jsv  option  above or find more information con-
          cerning JSV in jsv(1))

I'd wager to say that this means we can merge this unchanged, but it would be good to get an additional pair of eyes on this (e.g. @giovannipizzi, since he was involved in the initial discussion).

@chrisjsewell
Member

@sphuber mentioned that @atztogo may use the SGE scheduler. Perhaps if he has time, he could quickly comment if he sees any reason that this would break anything for him?
Otherwise, I think this looks good to me thanks 👍

@giovannipizzi
Member

I agree that, according to the docs we could find this looks good to me.
I'd wait a few days in case @atztogo can confirm that this (i.e. explicitly adding -R n in the SGE header) does not break anything in his workflows.

Otherwise OK to merge for me

@atztogo
Contributor

atztogo commented Jun 2, 2021

I don't fully understand the discussion on SGE here (a little bit too technical for me), but I think there is no impact for me. The suggestion in @giovannipizzi's first comment is probably the most natural. I leave my SGE setting as below.

I didn't know that this option exists... Currently I use SGE installed on ubuntu 18.04 (debian package of 8.1.9+dfsg-4+deb9u2). man qsub gives (not -R but -r if I understand correctly.)

       -r y[es]|n[o]
              Available for qsub and qalter only.

              Specifies whether a job can be rerun or not.  If the value of -r
              is 'yes', the job will be rerun if it gets aborted without leav‐
              ing a consistent exit state.  (This is typically the case if the
              node  on  which the job is running crashes).  If -r is 'no' (the
              default), the job will not be rerun  under  such  circumstances.
              It  will  still be rerun if it finishes with exit code 99 unless
              FORBID_RESCHEDULE is set in qmaster_params in sge_conf(5).

              Interactive jobs submitted with qsh, qrsh,  or  qlogin  are  not
              rerunnable.

              Qalter allows changing this option even while the job executes.

              The  value specified with this option or the corresponding value
              specified in qmon will only be passed to defined  JSV  instances
              if  the  value is yes.  The name of the parameter will be r. The
              value will be y also when the long form yes was specified during
              submission.   (See  -jsv  option  above or find more information
              concerning JSV in jsv(1).)

The default behaviour (i.e., without writing #$ -r [yes|no]) for each queue setting (a queue means a set of parameters such as resources, parallelization, etc.) seems configurable. Some parameters are overridden by the job script's header part. I haven't ever touched this rerun / -r option, so my current setting is FALSE as written in the man page, which can be confirmed by

% sudo qconf -sq queue-name
...
rerun                 FALSE
...

I tried running with #$ -r no and #$ -r yes to see how it affects a job, and looked for any SGE log on this, but could not find information on the status of this option (though there may be some way to find it).

@atztogo
Contributor

atztogo commented Jun 3, 2021

The suggestion in the first @giovannipizzi's comment is probably most natural. I leave my SGE setting as below.

This was not very correct. Rather, the below with rerunnable=None as the default for SGE sounds nice to me,


     if job_tmpl.rerunnable is True:
         lines.append('#$ -r yes')
     elif job_tmpl.rerunnable is False:
         lines.append('#$ -r no')
     elif job_tmpl.rerunnable is None:
         pass
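Packaged as a small self-contained helper, the tri-state handling sketched in that suggestion (with None meaning "emit nothing and defer to the queue's configured default") behaves like this; the function name is an illustrative assumption, not the actual plugin code:

```python
from typing import List, Optional

def sge_rerunnable_lines(rerunnable: Optional[bool]) -> List[str]:
    """Translate a tri-state rerunnable flag into SGE submit-script header lines."""
    if rerunnable is True:
        return ['#$ -r yes']
    if rerunnable is False:
        return ['#$ -r no']
    # None: emit nothing, so the queue's own configured default applies.
    return []

print(sge_rerunnable_lines(None))   # []
print(sge_rerunnable_lines(True))   # ['#$ -r yes']
```

The appeal of this variant is exactly the point debated above: True and False are both explicit, while None leaves the cluster configuration in charge.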

@giovannipizzi
Member

Thanks @atztogo - I realise now that @greschd's comment was misleading ;-D he reported the -R option (reservation) rather than -r (rerunnable). So from Togo's comment it seems indeed that there is no SGE default.

So, just so it's clear to me: the "problem" with Togo's last suggestion, i.e. allowing also a None value to mean "don't specify => use the scheduler's default", would be that the behaviour differs from the other schedulers (where instead the default is enforced to be no)?

Maybe it's better to do the opposite then, and do not specify at all the rerunnable option in other plugins, if the value is not specified or None? Or there is too much of a risk to create other subtle problems with the other schedulers?

@greschd
Member Author

greschd commented Jun 3, 2021

Thanks @atztogo - I realise now that @greschd 's commment was misleading ;-D he reported indeed the -R option (reservation) rather than -r (rerunnable). So from Togo's comment it seems indeed that there is no SGE default.

Yeah, copied the wrong part 😅

Maybe it's better to do the opposite then, and do not specify at all the rerunnable option in other plugins, if the value is not specified or None? Or there is too much of a risk to create other subtle problems with the other schedulers?

This just creates the opposite problem, since other scheduler plugins did set a default for None. It's an inherent tradeoff between being consistent across schedulers or keeping exact backwards compatibility.

@sphuber
Contributor

sphuber commented Aug 12, 2021

Since this has been open for a long time, I have given it a rough look. Please correct me if I'm wrong but I think that adding the new option does not change the current behavior, since the default is to have rerunnable = False. Since you also added documentation to explain what the pre-conditions are for this flag to work with AiiDA, I would be tempted to accept and merge this. @giovannipizzi and @ramirezfranciscof any objections?

giovannipizzi previously approved these changes Aug 13, 2021
Member

@giovannipizzi giovannipizzi left a comment

Thanks @greschd , and thanks @sphuber for pinging! After solving the conflicts, it's good to merge for me!

@espenfl
Contributor

espenfl commented Aug 13, 2021

Nice to see this merged. Thanks a lot for the work and discussions.

@greschd
Member Author

greschd commented Aug 15, 2021

Thanks @greschd , and thanks @sphuber for pinging! After solving the conflicts, it's good to merge for me!

Rebased, and good to merge if checks pass from my side.

Contributor

@sphuber sphuber left a comment

Thanks @greschd

@sphuber sphuber merged commit 74571a1 into aiidateam:develop Aug 15, 2021
@sphuber sphuber deleted the allow_setting_rerunnable branch August 15, 2021 12:23
7 participants