
Implement Scheduler.parse_output for the SlurmScheduler plugin #3931

Merged

sphuber merged 1 commit into aiidateam:develop from feature/2431/slurm-parse-output on Sep 10, 2020

Conversation

sphuber (Contributor) commented Apr 14, 2020

This implements the Scheduler.parse_output method that allows parsing
the detailed job info that is retrieved from the scheduler when a job is
finished. For the time being, only an out-of-memory error is detected
and the corresponding exit code is returned.
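
A minimal sketch of what such a `parse_output` hook might look like, assuming the detailed job info is the parsable, pipe-separated output of SLURM's `sacct` (a header line of field names followed by one line per job step). The field layout and guard clauses here are illustrative, not a copy of the merged code, and in the actual plugin this is a method on the scheduler class:

```python
# Hedged sketch of an out-of-memory check on the detailed job info.
# Assumption: detailed_job_info['stdout'] holds pipe-separated sacct output.
from aiida.engine import CalcJob


def parse_output(detailed_job_info, stdout, stderr):
    """Detect scheduler-level errors from the detailed job info.

    :param detailed_job_info: dict with the return value, stdout and stderr
        of the detailed job info command (for SLURM, ``sacct``).
    :return: an ``ExitCode`` if a problem was detected, ``None`` otherwise.
    """
    lines = detailed_job_info['stdout'].splitlines()

    if len(lines) < 2:
        return None  # no job step line to parse

    # First line is the header of field names, second the master job step.
    data = dict(zip(lines[0].split('|'), lines[1].split('|')))

    # SLURM reports jobs killed by the kernel OOM killer with this state.
    if data.get('State') == 'OUT_OF_MEMORY':
        return CalcJob.exit_codes.ERROR_SCHEDULER_OUT_OF_MEMORY

    return None
```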

codecov bot commented Apr 14, 2020

Codecov Report

Merging #3931 into develop will increase coverage by 0.02%.
The diff coverage is 95.84%.


@@             Coverage Diff             @@
##           develop    #3931      +/-   ##
===========================================
+ Coverage    79.32%   79.33%   +0.02%     
===========================================
  Files          468      468              
  Lines        34713    34737      +24     
===========================================
+ Hits         27532    27555      +23     
- Misses        7181     7182       +1     
Flag          Coverage Δ
#django       72.95% <95.84%> (+0.02%) ⬆️
#sqlalchemy   72.16% <95.84%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                       Coverage Δ
aiida/schedulers/plugins/slurm.py    62.03% <95.84%> (+3.09%) ⬆️
aiida/engine/daemon/runner.py        79.32% <0.00%> (-3.44%) ⬇️
aiida/transports/plugins/local.py    81.54% <0.00%> (+0.26%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e974d3f...a1b5729.

sphuber force-pushed the feature/2431/slurm-parse-output branch 3 times, most recently from 959da78 to 05dd0ae on April 15, 2020
sphuber added the pr/blocked label (PR is blocked by another PR that should be merged first) on Jun 4, 2020
sphuber force-pushed the feature/2431/slurm-parse-output branch 2 times, most recently from b5c030e to 9a4d224 on June 7, 2020
mbercx (Member) commented Jun 17, 2020

@sphuber I've run some tests using the PwBaseWorkChain and was able to produce the expected behaviour for calculations that ran out of walltime. For the out-of-memory issues, I haven't been able to produce a run that finishes with State equal to OUT_OF_MEMORY; often the job will only print an indication of memory issues to stderr (e.g. a cudaErrorMemoryAllocation error). However, these can of course easily be added to the SlurmScheduler.parse_output method.

One question I had was whether we should not also allow the parse_output method to return parsed data. For example, to return the maximum memory used on the GPU (maxmem), which I'm now doing after the runs have completed by extracting the _scheduler-stdout.txt from the FolderData.

Regarding the discussion on whether to exit when the scheduler returns a non-zero exit code in #3906: I'd also vote to maintain backwards compatibility and only add a warning to the log for now.
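
For concreteness, the post-run extraction mentioned two paragraphs up might look something like the sketch below. The node PK and the `maxmem` line format are assumptions; `_scheduler-stdout.txt` is AiiDA's default filename for the retrieved scheduler stdout:

```python
# Hypothetical post-run extraction of the scheduler stdout from the
# retrieved FolderData of a finished calcjob.
from aiida.orm import load_node

calcjob = load_node(1234)              # a finished CalcJobNode (illustrative PK)
retrieved = calcjob.outputs.retrieved  # FolderData with the retrieved files

with retrieved.open('_scheduler-stdout.txt') as handle:
    for line in handle:
        if 'maxmem' in line.lower():   # look for the GPU memory report line
            print(line.strip())
```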

sphuber force-pushed the feature/2431/slurm-parse-output branch from 9a4d224 to fe31e42 on June 23, 2020
sphuber (Contributor, Author) commented Jun 23, 2020

> @sphuber I've run some tests using the PwBaseWorkChain and was able to produce the expected behaviour for calculations that ran out of walltime. For the out-of-memory issues, I haven't been able to produce a run that finishes with State equal to OUT_OF_MEMORY; often the job will only print an indication of memory issues to stderr (e.g. a cudaErrorMemoryAllocation error). However, these can of course easily be added to the SlurmScheduler.parse_output method.

The problem is that this error does not come from the SLURM scheduler, but from the GPU, specifically the Nvidia card used on the Piz Daint XC50 nodes. To me that means we should not be parsing for this in the SlurmScheduler plugin. It should only ever parse data that is specific to the SLURM scheduler, not data that depends on the code or platform on which the job was run. Because if we don't, where do we stop? Do we start adding parsing for all sorts of different architectures and codes in all the scheduler plugins?

> One question I had was whether we should not also allow the parse_output method to return parsed data. For example, to return the maximum memory used on the GPU (maxmem), which I'm now doing after the runs have completed by extracting the _scheduler-stdout.txt from the FolderData.

Same point as before: this information has nothing to do with the SLURM scheduler and so I don't think the logic to parse this data should be baked in there. Somebody else running a job on a different machine that also uses a SLURM scheduler might not get this information.

sphuber force-pushed the feature/2431/slurm-parse-output branch from fe31e42 to 57f0918 on August 27, 2020
sphuber added the pr/ready-for-review label and removed the pr/blocked label on Aug 27, 2020
mbercx self-requested a review on September 9, 2020
mbercx (Member) commented Sep 9, 2020

Sorry for taking so long to pick this up again!

> Same point as before: this information has nothing to do with the SLURM scheduler and so I don't think the logic to parse this data should be baked in there. Somebody else running a job on a different machine that also uses a SLURM scheduler might not get this information.

Fair enough! That all makes sense to me (now 😅). I suppose at some point we can introduce "architecture-related" parsing, but I agree that this should not be done here.

I've gone through the code again and have no further comments. I also have several calcjobs with 110 and 120 exit codes in my database right now, so it's been field-tested quite extensively. :) The ERROR_SCHEDULER_OUT_OF_MEMORY exit code is picked up pretty consistently when I run on the CPUs of Piz Daint's hybrid nodes.
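
For anyone wanting to reproduce this kind of check, a hedged sketch of counting calcjobs per exit status with the QueryBuilder; the mapping of 110 and 120 to the out-of-memory and out-of-walltime scheduler exit codes follows the names mentioned above:

```python
# Illustrative sanity check (not part of this PR): count finished calcjobs
# per scheduler-related exit status in the current profile's database.
from aiida.orm import CalcJobNode, QueryBuilder

# Assumed mapping, per the exit code names discussed in this thread:
# 110 = ERROR_SCHEDULER_OUT_OF_MEMORY, 120 = ERROR_SCHEDULER_OUT_OF_WALLTIME.
for exit_status in (110, 120):
    count = QueryBuilder().append(
        CalcJobNode,
        filters={'attributes.exit_status': exit_status},
    ).count()
    print(f'exit_status {exit_status}: {count} calcjobs')
```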

sphuber force-pushed the feature/2431/slurm-parse-output branch from 57f0918 to a1b5729 on September 10, 2020
sphuber merged commit ec32fbb into aiidateam:develop on Sep 10, 2020
sphuber deleted the feature/2431/slurm-parse-output branch on September 10, 2020