Allow setting `rerunnable` from `metadata['options']` #4707

Conversation
Codecov Report
@@ Coverage Diff @@
## develop #4707 +/- ##
===========================================
+ Coverage 80.42% 80.43% +0.02%
===========================================
Files 531 531
Lines 36969 36973 +4
===========================================
+ Hits 29730 29737 +7
+ Misses 7239 7236 -3
Thanks @greschd! A few comments:
Thanks @giovannipizzi, great points. One quick question: what's the best place to put the documentation? From a quick glance, either in the options paragraph or a separate paragraph after dry run makes sense to me, but I'm not very familiar with the organization of the docs.
If I understood @espenfl's point correctly, we should say that it works fine with one specific SLURM setup. People should do their own testing because SLURM configurations can differ.
Indeed, definitely in the options paragraph; if all the discussion fits in there, great. Otherwise a new paragraph with the more technical details (linked from the options paragraph) is also a good idea.
I see, good point - great to mention that, then.
Yes, SLURM setups are different (as are the other schedulers as well). Also, maybe this is a good time to put some life into my backburner scheduler container work. It was more or less done; it just needed to be integrated with the scheduler code in AiiDA, with some tests introduced based on those containers. I will try to do some of that next week. Also, thanks a lot for adding this @greschd.
Regarding SGE: Do you know if the default is
Note that I've now added the
Hey @greschd, thanks for the contribution! Sorry it got a bit buried 😅 I've taken a look and it all seems good; I don't know if you wanted a more specific greenlight from someone in particular, but feel free to tag them if that was the case; otherwise I can approve once the branch is updated. The conflicts in
The only thing I'm a bit wary about is this:
This is true as long as the schedulers didn't have / won't have any sort of modification that changes how this option works or is provided (for example, perhaps previously they only admitted
I guess the super-safe road would be to default these options to
Yeah, that is exactly the point where I am not 100% sure. I think it would be good to get buy-in from @giovannipizzi on this.
The main difficulty here is that not all schedulers treated
We could just say that
So, to make sure I understood this correctly: you mean that some scheduler plugins received the
If this is the case, I would say there are two cases:
In any case, I now understand that perhaps this should be considered a separate discussion and we could open an issue to get other people's input on this. If you think it is a good idea, I may write that later and ping you to check if my description of the situation is correct (or if you think it would be easier to just write it yourself, feel free he he 😅).
Exactly. Also agree with your analysis of the two options. Since we're only changing behavior for the
According to its manual, at least the current default for SGE is
I'd wager that this means we can merge this unchanged, but it would be good to get an additional pair of eyes on this (e.g. @giovannipizzi, since he was involved in the initial discussion).
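Part of the ambiguity above is whether the SGE plugin should accept a `bool` (like the other schedulers) or the `y`/`n`-style strings that `qsub -r` itself uses. A tolerant coercion could look like the following sketch; the helper name is hypothetical and this is only meant to make the bool-vs-str question concrete, not to describe aiida-core's actual implementation:

```python
def coerce_rerunnable(value) -> bool:
    """Normalise a rerunnable flag that may arrive either as a bool or as
    the y/n-style strings used by SGE's `qsub -r` (hypothetical helper)."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("y", "yes", "true"):
            return True
        if lowered in ("n", "no", "false"):
            return False
        raise ValueError(f"invalid rerunnable string: {value!r}")
    raise TypeError(f"rerunnable must be bool or str, got {type(value).__name__}")
```

Accepting both spellings would keep existing submit scripts working while letting new code pass a plain boolean.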
I agree that, according to the docs we could find, this looks good to me. Otherwise, OK to merge for me.
I don't fully understand the discussion on SGE here (a little bit too technical for me), but I think there is no impact for me. The suggestion in @giovannipizzi's first comment is probably the most natural. I leave my SGE setting as below. I didn't know that this option exists... Currently I use SGE installed on Ubuntu 18.04 (Debian package of
The default behaviour (i.e., without writing
I tried running with
That was not quite correct. Rather, the below with
Thanks @atztogo - I realise now that @greschd's comment was misleading ;-D he indeed reported the
So, just so it's clear to me: the "problem" with Togo's last suggestion, i.e. also allowing a
Maybe it's better to do the opposite then, and not specify the rerunnable option at all in the other plugins, if the value is not specified or
Yeah, copied the wrong part 😅
This just creates the opposite problem, since other scheduler plugins did set a default for
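The trade-off being debated here can be made concrete with a tri-state sketch: a `None` default means "emit no directive at all, leave the cluster's own policy in force", while a boolean default always pins the behaviour one way or the other. The function below is only an illustration of the two designs (using SLURM's documented `--requeue`/`--no-requeue` flags), not what the PR or any plugin actually implements:

```python
from typing import List, Optional


def slurm_requeue_lines(rerunnable: Optional[bool]) -> List[str]:
    """Tri-state variant (illustrative only): None emits nothing, so the
    cluster's site-wide requeue policy applies; True/False always emit a
    directive that pins the behaviour explicitly."""
    if rerunnable is None:
        return []  # say nothing -> scheduler/site default wins
    return ["#SBATCH --requeue" if rerunnable else "#SBATCH --no-requeue"]
```

With a plain `bool` default instead, every submission would carry an explicit directive, which is exactly the "opposite problem": it silently overrides whatever default a plugin or site previously relied on.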
Since this has been open for a long time, I have given it a rough look. Please correct me if I'm wrong, but I think that adding the new option does not change the current behavior, since the default is to have
Nice to see this merged. Thanks a lot for the work and discussions.
Thanks @greschd
The `rerunnable` option is currently implemented in the scheduler plugins (LSF, PBS, SGE, SLURM), but there was no way to activate it for a particular `CalcJob`. This PR adds a `metadata.options.rerunnable` flag that is forwarded to the `JobTemplate`.

This is a follow-up on a discussion on Slack with @giovannipizzi and @espenfl.
Important points raised there:
I've now been running this in production for a while, and did not notice any adverse effects. In our case, submitting jobs to SLURM can trigger the allocation of new nodes; if that fails for some reason, SLURM needs to re-queue the jobs so that it can assign different nodes to them.
Some open questions:
- … the `rerunnable` flag is set?
- … a `bool` (as do all the other schedulers), or a `str`.
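To illustrate what the flag ends up doing downstream, here is a minimal, hypothetical sketch (not aiida-core code; the function name is invented) of how a boolean `rerunnable` could map onto the rerun/requeue directives that the schedulers' own manuals document: `--requeue`/`--no-requeue` for SLURM's `sbatch`, `-r y|n` for SGE's and PBS's `qsub`, and `-r`/`-rn` for LSF's `bsub`:

```python
# Hypothetical helper -- NOT the aiida-core implementation -- mapping a
# boolean rerunnable flag to each scheduler's documented submit directive.

def rerunnable_directive(scheduler: str, rerunnable: bool) -> str:
    """Return the submit-script line requesting (or refusing) a rerun."""
    if scheduler == "slurm":   # sbatch: --requeue / --no-requeue
        return "#SBATCH --requeue" if rerunnable else "#SBATCH --no-requeue"
    if scheduler == "sge":     # qsub: -r y|n
        return "#$ -r y" if rerunnable else "#$ -r n"
    if scheduler == "pbs":     # qsub: -r y|n
        return "#PBS -r y" if rerunnable else "#PBS -r n"
    if scheduler == "lsf":     # bsub: -r (rerunnable) / -rn (not rerunnable)
        return "#BSUB -r" if rerunnable else "#BSUB -rn"
    raise ValueError(f"unknown scheduler: {scheduler!r}")


print(rerunnable_directive("slurm", True))  # prints: #SBATCH --requeue
```

The discussion in this thread is essentially about the `False` branches: whether plugins should emit the "do not rerun" directive explicitly, or omit it and defer to the site default.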