-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AEP: Allow CalcJob
s to be actively monitored and interrupted
#36
Conversation
ca315ba
to
43ca6be
Compare
43ca6be
to
9bbac77
Compare
This AEP is implemented and can be seen here: aiidateam/aiida-core#5659 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is all good in general, I just have 2 questions / comments
|
||
Since this would significantly complicate the implementation as well as the user-interface, and currently it is not clear whether this design goal addresses an actual use-case, it was decided to table this direction. | ||
|
||
### Alternatives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least as far as control flow is considered (e.g. shutting down / restarting a calculation based on something seen in the output), it may be worth mentioning here or in the introduction that there is a third alternative: implement the application-specific control flow in a python script and let AiiDA run that python script instead of the simulation code.
This is the path taken by @astamminger in aiida-cusp with custodian.
This approach has several important advantages:
- no additional load on aiida (even allows to do remote post-processing that scales with the number of jobs)
- maximum performance (no need to go through slow network connections)
- portability: the same custodian script can be used by aiida, fireworks or other workflow managers
- simplicity: no extra features needed in aiida
Of course, it also has some downsides:
- no way to adjust computational resources based on output (note: it would be important for an AiiDA-centric implementation to support this; this can be a selling point)
- python + dependencies required on the compute resource
The above only applies to the question of how to do control flow.
Making it more convenient for AiiDA users to monitor a calculation (e.g. a shortcut to stream the output file of a calculation) can be useful even in this scenario. But from the current implementation I understand that control flow is the main objective of the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the "Alternatives" section with a reference to aiida-cusp
. Unfortunately I am not super familiar with the implementation and intricate details of how Custodian and aiida-cusp
work. If there is anything that could be changed in aiida-core
that would make the concept of aiida-cusp
more generically applicable or easier to implement, then it would be great if you or @astamminger could comment on this. Then I could include a more in-depth discussion on aiida-cusp
in this AEP.
The reason I didn't discuss it in more depth for now is because:
- This solution is already possible without an AEP, as evidenced by
aiida-cusp
working. - This solution is not generic. Even though Custodian could be used for various codes and is (at least not as I understood) not specific to VASP (although it is mostly used for this since it comes with predefined handlers), it was still necessary to write a separate AiiDA plugin for it to make it work.
This last point is a significant disadvantage in my opinion. Not only does it make the barrier to start using the functionality high by having to write an entire package of plugins (CalcJob
, Parser
and maybe more?) it also means that is not compatible with the existing aiida-vasp
plugin. Unless I misunderstood, it is not possible to simply have custodian as a drop-in replacement for the aiida-vasp
plugin, and so all its workflows can also not be used with custodian. As a consequence, all the work of the aiida-common-workflows
is also not compatible with aiida-cusp
. Contrast this with this AEP, where a user can simply take any of the existing workflows (directly from aiida-vasp
or the common workflows) and simply add monitors in the steps where they would like. Due to the design of port exposing, it is possible to add monitors without the common workflows or aiida-vasp
having to change any of their workflows. I think this is pretty powerful.
Making it more convenient for AiiDA users to monitor a calculation (e.g. a shortcut to stream the output file of a calculation) can be useful even in this scenario. But from the current implementation I understand that control flow is the main objective of the implementation.
Do you mean that a user could launch a calcjob and then AiiDA provide some API to retrieve the output from one of the output files in the working directory of the job? If that is the case, then that is already possible:
node = load_node() # Load the calcjob node
with node.computer.get_transport() as transport:
workdir = node.get_remote_workdir()
transport.chdir(workdir)
with tempfile.NamedTemporaryFile() as tmpfile:
transport.get('stdout.txt', tmpfile.name)
print(tmpfile.read())
Admittedly this is very verbose, but it could easily be wrapped in a utility function. It would also already be made a lot simpler if the transport interface would supporting streaming, which I intend to add at some point.
Anyway, the point is, if that is the use case you had in mind, then no change in aiida-core
is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This solution is not generic. Even though Custodian could be used for various codes and is (at least not as I understood) not specific to VASP (although it is mostly used for this since it comes with predefined handlers), it was still necessary to write a separate AiiDA plugin for it to make it work.
This last point is a significant disadvantage in my opinion. Not only does it make the barrier to start using the functionality high by having to write an entire package of plugins (
CalcJob
,Parser
and maybe more?) it also means that is not compatible with the existingaiida-vasp
plugin
I think in the end it boils down to the question: which logic should be implemented at which level?
Let's call implementing everything in AiiDA route A.
Route B is to use a "smaller" (intentionally limited) workflow language like custodian (or maybe jobflow/...) to do basic job monitoring/management directly inside the compute job that is running the calculation, and to integrate that with AiiDA.
I would argue that, from an ecosystem perspective, route B is objectively more generic: the logic that can live directly inside the compute job is dealt with by the local workflow manager, and the higher-level workflow manager (e.g. AiiDA) integrates with it directly. Different high-level workflow managers will be able to reuse and benefit from this locally implemented logic (users can even run it manually stand-alone if they have no need for more fancy job management). This is what has enabled the aiida-cusp
plugin.
Route A (implementing everything in AiiDA) does not allow this type of reuse: e.g. there isn't a straightforward way to reuse the logic encoded in aiida-vasp
inside, say, fireworks or some other job orchestrator.
This is not to say that we should not offer the capability outlined in this AEP in AiiDA; also having everything in one software package has its own advantages.
In my view, however, we should give serious consideration to route B going forward, if we want to maximize the impact of our work ("coaxing" simulation codes to do what we want) on accelerating computational materials science for everyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that a user could launch a calcjob and then AiiDA provide some API to retrieve the output from one of the output files in the working directory of the job? If that is the case, then that is already possible:
Cheers, yes - it's just what I intuitively thought of when I read "job monitoring".
Indeed it's possible; and while it might be useful to have a shortcut for this I understand that this is not the purpose of this AEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair points, and I am not saying that route B shouldn't be considered, but my point is that this is already possible, without any changes to AiiDA. Of course, if there are changes in AiiDA that would make it easier to implement that route, then that would be interesting for an AEP, but I am not familiar enough with those methods to see what might be missing. Feedback from @astamminger may be very valuable here.
I just want to restate though that if you go down route B, although you might develop workflows that can be run with multiple workflow managers, you will be abandoning the rest of the AiiDA ecosystem, as they will need their own plugins and won't be compatible with the plugins of the codes that they are running, e.g., aiida-quantumespresso
, aiida-vasp
etc. This is a choice of course, but we have already invested a lot of time to develop high level workflows with advanced functionality in those packages and you would be abandoning this. The same goes for the common workflow project.
Again, in the end it is a choice, but I think for the current state, it makes a lot of sense to have an integrated approach that works with all the code that is already out there in the current ecosystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good. I'll leave this thread open for others to see, but I consider my comment addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comments have been dealt with and now this is ok to merge for me (I'll leave @ltalirz to give feedback on the corrections for his comments)
draft
README.md