Add AEP: Allow `CalcJob`s to be actively monitored and interrupted #36

sphuber · 2022-09-20T21:47:56Z

sphuber · 2022-09-26T11:05:32Z

This AEP is implemented and can be seen here: aiidateam/aiida-core#5659

ramirezfranciscof

I think it is all good in general; I just have two questions/comments.

008_calcjob_monitors/readme.md

ltalirz · 2022-09-27T13:38:21Z

008_calcjob_monitors/readme.md

+
+Since this would significantly complicate the implementation as well as the user-interface, and currently it is not clear whether this design goal addresses an actual use-case, it was decided to table this direction.
+
+### Alternatives


At least as far as control flow is concerned (e.g. shutting down or restarting a calculation based on something seen in the output), it may be worth mentioning here or in the introduction that there is a third alternative: implement the application-specific control flow in a Python script and let AiiDA run that script instead of the simulation code.

This is the path taken by @astamminger in aiida-cusp with custodian.

This approach has several important advantages:

- no additional load on AiiDA (it even allows remote post-processing that scales with the number of jobs)
- maximum performance (no need to go through slow network connections)
- portability: the same custodian script can be used by AiiDA, fireworks or other workflow managers
- simplicity: no extra features needed in AiiDA

Of course, it also has some downsides:

- no way to adjust computational resources based on output (note: it would be important for an AiiDA-centric implementation to support this; this can be a selling point)
- Python + dependencies required on the compute resource

The above only applies to the question of how to do control flow.
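
To make this third alternative concrete, here is a minimal sketch of such a wrapper script. It is not how Custodian or aiida-cusp work internally. The function name `run_with_watchdog`, the `should_stop` callback and the `max_restarts` parameter are all hypothetical and chosen only for illustration. The script launches the simulation command, watches its output line by line, terminates the run when the callback flags a line, and then starts it again up to a fixed number of times.

```python
import subprocess
import sys


def run_with_watchdog(command, should_stop, max_restarts=1):
    """Run `command` and restart it whenever `should_stop(line)` is true.

    Returns `(attempts, returncode)`. `returncode` is the exit code of the
    first run that was not stopped, or None if every attempt was stopped.
    """
    attempts = 0
    while attempts <= max_restarts:
        attempts += 1
        stopped = False
        # Leaving the `with` block closes the pipe and waits for the process.
        with subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        ) as process:
            for line in process.stdout:
                if should_stop(line):
                    process.terminate()
                    stopped = True
                    break
        if not stopped:
            return attempts, process.returncode
        # A real script would adjust the inputs here before restarting.
    return attempts, None


if __name__ == "__main__":
    # Hypothetical stand-in for the simulation code.
    fake_code = [sys.executable, "-c", "print('step 1'); print('DIVERGED')"]
    print(run_with_watchdog(fake_code, lambda line: "DIVERGED" in line))
```

AiiDA would then submit this script in place of the simulation executable, which is also why Python has to be available on the compute resource.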

Making it more convenient for AiiDA users to monitor a calculation (e.g. a shortcut to stream the output file of a calculation) can be useful even in this scenario. But from the current implementation I understand that control flow is the main objective of the implementation.

sphuber · 2022-09-27

I have updated the "Alternatives" section with a reference to aiida-cusp. Unfortunately I am not super familiar with the implementation and intricate details of how Custodian and aiida-cusp work. If there is anything that could be changed in aiida-core that would make the concept of aiida-cusp more generically applicable or easier to implement, then it would be great if you or @astamminger could comment on this. Then I could include a more in-depth discussion on aiida-cusp in this AEP.

The reasons I didn't discuss it in more depth for now are:

- This solution is already possible without an AEP, as evidenced by aiida-cusp working.
- This solution is not generic. Even though Custodian could be used for various codes and is (at least as I understood it) not specific to VASP (although it is mostly used for VASP since it comes with predefined handlers), it was still necessary to write a separate AiiDA plugin to make it work.

This last point is a significant disadvantage in my opinion. Not only does it raise the barrier to using the functionality by requiring an entire package of plugins (CalcJob, Parser and maybe more?), it also means that it is not compatible with the existing aiida-vasp plugin. Unless I misunderstood, it is not possible to simply have custodian as a drop-in replacement for the aiida-vasp plugin, and so none of its workflows can be used with custodian. As a consequence, all the work of the aiida-common-workflows is also not compatible with aiida-cusp. Contrast this with this AEP, where a user can take any of the existing workflows (directly from aiida-vasp or the common workflows) and simply add monitors in the steps where they would like. Due to the design of port exposing, it is possible to add monitors without the common workflows or aiida-vasp having to change any of their workflows. I think this is pretty powerful.
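
For illustration, a monitor could look roughly like the sketch below. The interface is an assumption, since this thread does not spell it out: the sketch treats a monitor as a plain function that receives the calcjob node and an open transport and returns a message when the job should be stopped, or `None` to let it continue. The name `detect_error_monitor` and the `filename` and `marker` defaults are hypothetical. Only `get_remote_workdir`, `chdir` and `get` are taken from the snippet further down in this thread.

```python
import os
import tempfile


def detect_error_monitor(node, transport, filename="stdout.txt", marker="ERROR"):
    """Return a message if `marker` appears in the remote file, else None.

    Assumed interface: `node` provides `get_remote_workdir()` and
    `transport` provides `chdir()` and `get(remote, local)`.
    """
    transport.chdir(node.get_remote_workdir())
    with tempfile.TemporaryDirectory() as tmpdir:
        local_path = os.path.join(tmpdir, filename)
        try:
            transport.get(filename, local_path)
        except OSError:
            # The code may not have written the file yet: nothing to report.
            return None
        with open(local_path, encoding="utf-8", errors="replace") as handle:
            content = handle.read()
    if marker in content:
        return f"Found {marker!r} in {filename}; requesting that the job be stopped."
    return None
```

How such a function is registered and attached to a workflow step is defined by the AEP and its implementation in aiidateam/aiida-core#5659, not by this sketch.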

> Making it more convenient for AiiDA users to monitor a calculation (e.g. a shortcut to stream the output file of a calculation) can be useful even in this scenario. But from the current implementation I understand that control flow is the main objective of the implementation.

Do you mean that a user could launch a calcjob and AiiDA would then provide some API to retrieve the content of one of the output files in the working directory of the job? If that is the case, then that is already possible:

    import tempfile

    from aiida.orm import load_node

    node = load_node()  # Load the calcjob node (pass its PK or UUID)
    with node.computer.get_transport() as transport:
        workdir = node.get_remote_workdir()
        transport.chdir(workdir)
        with tempfile.NamedTemporaryFile() as tmpfile:
            transport.get('stdout.txt', tmpfile.name)
            print(tmpfile.read())

Admittedly this is very verbose, but it could easily be wrapped in a utility function. It would also become a lot simpler if the transport interface supported streaming, which I intend to add at some point.
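
A possible shape for such a utility function is sketched below. The name `read_remote_file` and its signature are hypothetical and not part of aiida-core. The function only relies on the calls already used in the snippet above (`computer.get_transport`, `get_remote_workdir`, `chdir`, `get`) and returns the file content as text instead of printing it.

```python
import os
import tempfile


def read_remote_file(node, filename="stdout.txt"):
    """Return the current content of `filename` in the job's working directory.

    `node` is expected to behave like a calcjob node: it must provide
    `computer.get_transport()` and `get_remote_workdir()`.
    """
    with node.computer.get_transport() as transport:
        transport.chdir(node.get_remote_workdir())
        with tempfile.TemporaryDirectory() as tmpdir:
            local_path = os.path.join(tmpdir, filename)
            # Copy the remote file to a local scratch location, then read it.
            transport.get(filename, local_path)
            with open(local_path, encoding="utf-8", errors="replace") as handle:
                return handle.read()
```

With a loaded calcjob node, `print(read_remote_file(node))` would then replace the verbose version above.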

Anyway, the point is, if that is the use case you had in mind, then no change in aiida-core is necessary.

ltalirz · 2022-09-28

> This solution is not generic. Even though Custodian could be used for various codes and is (at least as I understood it) not specific to VASP (although it is mostly used for VASP since it comes with predefined handlers), it was still necessary to write a separate AiiDA plugin to make it work.
>
> This last point is a significant disadvantage in my opinion. Not only does it raise the barrier to using the functionality by requiring an entire package of plugins (CalcJob, Parser and maybe more?), it also means that it is not compatible with the existing aiida-vasp plugin

I think in the end it boils down to the question: which logic should be implemented at which level?
Let's call implementing everything in AiiDA route A.
Route B is to use a "smaller" (intentionally limited) workflow language like custodian (or maybe jobflow/...) to do basic job monitoring/management directly inside the compute job that is running the calculation, and to integrate that with AiiDA.

I would argue that, from an ecosystem perspective, route B is objectively more generic: the logic that can live directly inside the compute job is dealt with by the local workflow manager, and the higher-level workflow manager (e.g. AiiDA) integrates with it directly. Different high-level workflow managers will be able to reuse and benefit from this locally implemented logic (users can even run it manually stand-alone if they have no need for more fancy job management). This is what has enabled the aiida-cusp plugin.

Route A (implementing everything in AiiDA) does not allow this type of reuse: e.g. there isn't a straightforward way to reuse the logic encoded in aiida-vasp inside, say, fireworks or some other job orchestrator.

This is not to say that we should not offer the capability outlined in this AEP in AiiDA; also having everything in one software package has its own advantages.
In my view, however, we should give serious consideration to route B going forward, if we want to maximize the impact of our work ("coaxing" simulation codes to do what we want) on accelerating computational materials science for everyone.

ltalirz · 2022-09-28

> Do you mean that a user could launch a calcjob and AiiDA would then provide some API to retrieve the content of one of the output files in the working directory of the job? If that is the case, then that is already possible:

Cheers, yes - it's just what I intuitively thought of when I read "job monitoring".
Indeed it's possible; and while it might be useful to have a shortcut for this I understand that this is not the purpose of this AEP.

sphuber · 2022-09-28

Fair points, and I am not saying that route B shouldn't be considered, but my point is that this is already possible, without any changes to AiiDA. Of course, if there are changes in AiiDA that would make it easier to implement that route, then that would be interesting for an AEP, but I am not familiar enough with those methods to see what might be missing. Feedback from @astamminger may be very valuable here.

I just want to restate though that if you go down route B, although you might develop workflows that can be run with multiple workflow managers, you will be abandoning the rest of the AiiDA ecosystem: those workflows will need their own plugins and won't be compatible with the existing plugins for the codes that they run, e.g. aiida-quantumespresso, aiida-vasp etc. This is a choice of course, but we have already invested a lot of time in developing high-level workflows with advanced functionality in those packages, and you would be abandoning this. The same goes for the common workflow project.

Again, in the end it is a choice, but I think for the current state, it makes a lot of sense to have an integrated approach that works with all the code that is already out there in the current ecosystem.

ltalirz · 2022-09-29

All good. I'll leave this thread open for others to see, but I consider my comment addressed.

ramirezfranciscof

My comments have been dealt with and now this is ok to merge for me (I'll leave @ltalirz to give feedback on the corrections for his comments)

sphuber mentioned this pull request Sep 24, 2022

AEP: CalcJob live monitoring aiidateam/aiida-core#5659

Merged

sphuber force-pushed the aep/calcjob-monitors branch from ca315ba to 43ca6be on September 25, 2022 21:12

Add AEP: Allow CalcJobs to be actively monitored and interrupted

9bbac77

sphuber force-pushed the aep/calcjob-monitors branch from 43ca6be to 9bbac77 on September 26, 2022 11:03

sphuber requested review from ramirezfranciscof and giovannipizzi September 26, 2022 11:05

ramirezfranciscof reviewed Sep 27, 2022


ltalirz reviewed Sep 27, 2022


Address review comments

22d1781

ramirezfranciscof approved these changes Sep 28, 2022


sphuber merged commit 19a7e4d into master Oct 13, 2022

sphuber deleted the aep/calcjob-monitors branch October 13, 2022 07:34



