
Fix the unhandled failure mechanism of the BaseRestartWorkChain #4155

Merged

Conversation

@sphuber (Contributor) commented Jun 6, 2020

Fixes #4153

Any BaseRestartWorkChain will fail if the sub process fails twice in a
row without being handled by a registered process handler. To monitor
this, the unhandled_failure context variable is used, which is set to
True once a failed process goes unhandled. It should be unset as soon
as the next process finishes successfully or the failed process is
handled. The reset was there, but it was incorrectly setting the context
variable to True instead of False, which went unnoticed due to a lack of
tests. Tests are now added, including for the cases where the sub
process excepted or was killed.
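The mechanism described above can be sketched in plain Python. This is a toy reproduction of the flag logic, not actual aiida-core code; the function name and 'exit labels' are illustrative only.

```python
# Toy reproduction (plain Python, not aiida-core) of the unhandled_failure
# flag logic: abort only on the second consecutive unhandled failure.

def inspect_process(ctx, child_failed, handled):
    """Return a toy exit label, updating the unhandled_failure flag in ctx."""
    if not child_failed or handled:
        # The bug fixed in this PR: this reset mistakenly assigned True
        # instead of False, so any single later unhandled failure would
        # immediately abort the workchain.
        ctx['unhandled_failure'] = False
        return 'ok'
    if ctx.get('unhandled_failure'):
        return 'ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE'
    ctx['unhandled_failure'] = True
    return 'retry'
```

Two unhandled failures in a row abort the loop, while a success (or a handled failure) in between resets the flag.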

@codecov bot commented Jun 6, 2020

Codecov Report

Merging #4155 into develop will increase coverage by 0.04%.
The diff coverage is 100.00%.


@@             Coverage Diff             @@
##           develop    #4155      +/-   ##
===========================================
+ Coverage    78.86%   78.89%   +0.04%     
===========================================
  Files          467      467              
  Lines        34481    34481              
===========================================
+ Hits         27189    27201      +12     
+ Misses        7292     7280      -12     
Flag Coverage Δ
#django 70.82% <100.00%> (+0.04%) ⬆️
#sqlalchemy 71.70% <100.00%> (+0.04%) ⬆️
Impacted Files Coverage Δ
aiida/engine/processes/workchains/restart.py 54.73% <100.00%> (+7.44%) ⬆️
aiida/transports/plugins/local.py 80.47% <0.00%> (+0.26%) ⬆️


@ramirezfranciscof (Member) left a comment

So, I think the only required change is the docstring on generate_calculation_node; the rest are just questions to learn better how workchain processes work. I can approve it so you can merge once you check that docstring, provided nothing in the questions triggers any lateral idea on how to improve or clarify the code (hey, sometimes it happens).


@pytest.fixture
def generate_calculation_node():
"""Generate an instance of a `WorkChain`."""
ramirezfranciscof (Member):

This docstring is not correct, no?

sphuber (Contributor, Author):

Yes, will adapt it. Thanks

Comment on lines +113 to +114
runner = get_manager().get_runner()
process = instantiate_process(runner, process_class, **inputs)
ramirezfranciscof (Member):

Sorry to be a nuisance, but could I get some context on this couple of lines? I have only read about workchains, mostly the basic idea, but never really run one. I tried going to the source in manager.py and utils.py, but I feel I need some extra general information on how the engine works to understand these functions/classes.

sphuber (Contributor, Author):

This is how a Process instance is created. Normally this is done automatically by the engine when you call run or submit, but in this case I need more control: instead of having the engine actually run it, for these tests I just want the process instance so I can "run" parts of it myself.
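The idea can be illustrated with a toy sketch in plain Python (not the aiida engine API): build the instance yourself so you can call individual steps directly, instead of handing the class to an engine that executes every step for you.

```python
# Toy illustration (plain Python, not aiida) of instantiating a process
# without running it, so individual steps can be exercised in a unit test.

class ToyProcess:
    def __init__(self, **inputs):
        self.inputs = inputs
        self.ctx = {}          # mutable state, analogous to WorkChain.ctx

    def setup(self):
        self.ctx['iteration'] = 0

    def inspect_process(self):
        return self.ctx['iteration']

def toy_engine_run(process_class, **inputs):
    """What run/submit do conceptually: instantiate, then execute all steps."""
    process = process_class(**inputs)
    process.setup()
    return process.inspect_process()

# In a test we skip the engine, keep the instance, and drive it ourselves:
process = ToyProcess(x=1)
process.setup()
process.ctx['iteration'] = 2   # mock the state a real run would have produced
result = process.inspect_process()
```

The instance-level control is the point: the test can mutate `ctx` between steps, which an engine-driven run would never allow.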


from aiida.engine import CalcJob, BaseRestartWorkChain, process_handler, ProcessState, ProcessHandlerReport, ExitCode
ramirezfranciscof (Member):

So, this is a more general question.

I understand that from an end-user perspective, having the important methods available through the higher-level modules instead of the specific ones (i.e. using from aiida.engine import process_handler instead of the complete from aiida.engine.processes.workchains.utils import process_handler) is helpful, because users don't need to know the internal organization to have access to them.

However, wouldn't it still be better for us to internally keep using the most specific path possible when importing inside aiida-core? If we need to debug or the like, it would be easier to know the exact location of the functions and classes being imported. I know most IDEs have some sort of automatic search for this, but perhaps sometimes you need to use a simplified interface, or maybe you are just grepping, and it helps to have the information right there.

sphuber (Contributor, Author):

As you said, most proper IDEs can automatically jump to the source, and the increased brevity is a big plus in this case. The exception, in my opinion, is when you are importing something from a submodule; in that case the relative from .module import something is better than the absolute form.

Member:

Additional point, it's important for us to use the "public" interface as much as possible in the tests - if only to make sure that we don't accidentally break it.

ramirezfranciscof (Member) commented Jun 9, 2020:

Ah, ok. Simultaneously checking in the tests that you are not breaking the public interface is a good point; I didn't think about that. I still think there are some advantages to knowing the exact location of the code you are using, and (at least for the non-test internals) it is not clear to me that the brevity in the "header" of the file (where there is not much logic to obfuscate) outweighs this.

sphuber (Contributor, Author):

I really don't think that using from aiida.orm.nodes.data.int import Int every time, instead of from aiida.orm import Int, does the understanding of the code any favors.

Comment on lines 61 to 76
@pytest.mark.usefixtures('clear_database_before_test')
def test_unhandled_failure(generate_work_chain, generate_calculation_node):
"""Test the unhandled failure mechanism.

The workchain should be aborted if there are two consecutive failed sub processes that went unhandled. We simulate
it by setting `ctx.unhandled_failure` to `True` and appending two failed process nodes to `ctx.children`.
"""
process = generate_work_chain(SomeWorkChain, {})
process.setup()
process.ctx.unhandled_failure = True
process.ctx.iteration = 2
process.ctx.children = [generate_calculation_node(exit_status=100) for _ in range(2)]
assert process.inspect_process() == BaseRestartWorkChain.exit_codes.ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE # pylint: disable=no-member

ramirezfranciscof (Member):

Could I ask you how these tests work? It seems like you are getting the workchain process and then manually modifying its context from outside of it. As a side note, I don't know why, but I was assuming the context was something only modifiable from inside the workchain methods; anyway, this is just for testing, and that should be the intended way to use it, right?

On the other hand, if you are just modifying things manually, I'm not sure how the workchain error code gets set, because I don't see any run or submit calls; I only see you setting up the workchain. Is it sent to run when you instantiate_process with the runner in generate_work_chain? If that is the case, how are you sure that the workchain has not already finished by the time you modify it? I am not familiar with this way of running stuff...

sphuber (Contributor, Author):

The whole point of these tests is exactly that really "running" them is very difficult, because it often requires a lot of prerequisites. In the case of the BaseRestartWorkChain, which I am testing here, it needs to run some CalcJob, which in turn needs a Code, which then requires the corresponding binary to be installed. Since I don't want to test all of this in a big integration test, but rather write smaller unit tests of the individual methods of the BaseRestartWorkChain, I have to mock certain things.

Since the methods I am testing are methods of the workchain class, I first need to have an instance. This is what generate_work_chain is doing: it just gets me a WorkChain instance on which I can then call the methods I want to test. In this case, I want to test the inspect_process method. This is normally not the first method that is called, so I have to correctly "mock" the state that the workchain instance would normally have at that point, when run fully automatically by the engine. This includes defining the ctx correctly, which is an internal dictionary of the WorkChain where it stores temporary data. I populate it such that it should trigger a certain code path in inspect_process and check that it behaves correctly.

ramirezfranciscof (Member):

Uhm, I see. I think my main confusion was that I thought inspect_process just got you a result that had been set by the engine when running the process, but now I understand that is not so: inspect_process itself looks at the status of the calculations in the context to decide its output. Correct?

sphuber (Contributor, Author):

The inspect_process is just a custom WorkChain method, one of the steps that is called in its outline:

def inspect_process(self): # pylint: disable=inconsistent-return-statements,too-many-branches

So I am just testing the code that is in there.
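The outline idea can be sketched as a toy in plain Python (this is not the aiida spec API, just an illustration): an outline is an ordered list of step methods that the engine calls one after another on the same instance, which is exactly why a unit test can also call any single step directly.

```python
# Toy sketch (not the aiida spec API): an outline is an ordered sequence of
# step methods; the engine walks it, calling each step on the same instance.

class ToyWorkChain:
    def setup(self):
        self.log = ['setup']

    def run_process(self):
        self.log.append('run_process')

    def inspect_process(self):
        self.log.append('inspect_process')

    # analogous to spec.outline(cls.setup, cls.run_process, cls.inspect_process)
    outline = (setup, run_process, inspect_process)

wc = ToyWorkChain()
for step in ToyWorkChain.outline:
    step(wc)   # a unit test can instead call a single step, e.g. wc.inspect_process()
```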


class SomeWorkChain(BaseRestartWorkChain):
"""Dummy class."""
_process_class = CalcJob
ramirezfranciscof (Member):

What is this? I don't remember needing to set a _process_class property inside workchains... is this new?

sphuber (Contributor, Author):

This is just a concept of the BaseRestartWorkChain. Note that it is an implementation of a WorkChain but not a concrete one, i.e. you cannot run a BaseRestartWorkChain itself. It is merely a base class for a very common type of WorkChain: one that runs a CalcJob and deals with errors. To run an actual base restart workchain, one should subclass it and define this _process_class, which will be the CalcJob that the workchain will launch. Look at the BaseRestartWorkChain docstring for more information.
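That division of labor can be sketched with stand-in classes (plain Python with hypothetical names, not real aiida classes): the base class owns the restart machinery and only asks subclasses to declare which process class they launch.

```python
# Toy analogue of the BaseRestartWorkChain pattern: the abstract base class
# owns the restart loop, a concrete subclass only declares the process class.

class ToyBaseRestartWorkChain:
    _process_class = None  # concrete subclasses must point this at a "CalcJob"

    def run_process(self):
        if self._process_class is None:
            raise NotImplementedError('subclass and set _process_class')
        return self._process_class()  # the base class launches it for you

class ToyCalcJob:
    """Stand-in for the CalcJob the workchain should run and babysit."""

class ToyRestartWorkChain(ToyBaseRestartWorkChain):
    _process_class = ToyCalcJob
```

The base class alone is unusable by design; only the subclass, having filled in `_process_class`, can actually launch anything.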

process.setup()
process.ctx.unhandled_failure = True
process.ctx.iteration = 2
process.ctx.children = [generate_calculation_node(exit_status=0), generate_calculation_node(exit_status=400)]
ramirezfranciscof (Member):

If you are just testing that the handler catches the exit_status=400 here, do you need the previous calculation with exit_status=0?

sphuber (Contributor, Author):

Technically no, since the code currently won't check the first one, and I don't run inspect_process manually twice after appending a calculation to the ctx each time. I will maybe change the test to do this and be more explicit, instead of setting the iteration manually.

@sphuber sphuber force-pushed the fix/4153/base-restart-unhandled-failure branch from a67a625 to d807b67 Compare June 9, 2020 09:38
@sphuber sphuber merged commit 6f029df into aiidateam:develop Jun 9, 2020
@sphuber sphuber deleted the fix/4153/base-restart-unhandled-failure branch June 9, 2020 10:01
Successfully merging this pull request may close these issues.

BaseRestartWorkChain: unhandled_failure flag reset