Fix the unhandled failure mechanism of the BaseRestartWorkChain
#4155
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           develop    #4155      +/-  ##
===========================================
+ Coverage    78.86%   78.89%   +0.04%
===========================================
  Files          467      467
  Lines        34481    34481
===========================================
+ Hits         27189    27201      +12
+ Misses        7292     7280      -12
===========================================
```
So, I think the only required change is the docstring on `generate_calculation_node`; the rest are just questions to learn better how workchain processes work. So I think I can just approve it and you can merge it in once you check that docstring, and if nothing in the questions triggers any lateral idea on how to improve or clarify the code (hey, sometimes it happens).
tests/conftest.py (outdated)

```python
@pytest.fixture
def generate_calculation_node():
    """Generate an instance of a `WorkChain`."""
```
This docstring is not correct, no?
Yes, will adapt it. Thanks
```python
runner = get_manager().get_runner()
process = instantiate_process(runner, process_class, **inputs)
```
Sorry to be a nuisance, but could I get some context on this couple of lines? I have only read about workchains, and mostly the basic idea, but never really ran one. I tried going to the source in `manager.py` and `utils.py`, but I feel like I need some extra general information on how the engine works to understand these functions/classes.
This is how a `Process` instance is created. Normally this is done automatically by the engine when you call `run` or `submit`, but in this case I need more control: instead of the engine actually running it, for these tests I just want the process instance so I can "run" parts of it myself.
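The idea can be sketched in plain Python (the `Process` class and its methods below are hypothetical stand-ins for illustration, not the aiida-core API): instead of handing the object to a runner, the test instantiates it directly and drives individual lifecycle steps by hand.

```python
class Process:
    """Hypothetical stand-in for an engine-managed process."""

    def __init__(self, inputs):
        self.inputs = inputs
        self.ctx = {}

    def setup(self):
        """First step of the outline: initialize the context."""
        self.ctx['iteration'] = 0

    def run_step(self):
        """A later outline step, normally scheduled by the engine."""
        self.ctx['iteration'] += 1
        return self.ctx['iteration']


# No runner or daemon involved: create the instance directly and
# call the steps by hand, which is what the fixtures make possible.
process = Process(inputs={})
process.setup()
print(process.run_step())  # 1
```

This is the same pattern the tests below use: get an instance, set up its state, then invoke exactly the method under test.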
```python
from aiida.engine import CalcJob, BaseRestartWorkChain, process_handler, ProcessState, ProcessHandlerReport, ExitCode
```
So, this is a more general question. I understand that from a core-user perspective, having the important methods available through the higher-level modules instead of the specific ones (i.e. using `from aiida.engine import process_handler` instead of the complete `from aiida.engine.processes.workchains.utils import process_handler`) is helpful, because users don't need to know the internal organization to get access to them.

However, wouldn't it still be better for us to internally keep using the most specific path possible when importing inside aiida-core? If we need to debug something, it would be easier to know the exact location of the functions and classes being imported. I know most IDEs have some sort of automatic search for this, but sometimes you have to work with a simplified interface, or you are just grepping, and it helps to have that information right there.
As you said, most proper IDEs can automatically jump to the source, and the increased brevity is a big plus in this case. The exception, in my opinion, is when you are importing something from a submodule: in that case the relative `from .module import something` is better than the absolute form.
Additional point, it's important for us to use the "public" interface as much as possible in the tests - if only to make sure that we don't accidentally break it.
Ah, ok. Simultaneously checking in the tests that you are not breaking the public interface is a good point, I didn't think about that. I still think there are some advantages to knowing the exact location of the code you are using, and (at least for the non-test internals) it is not clear to me that the brevity in the "header" of the file (where there is not much logic to obfuscate) outweighs this.
I really don't think using `from aiida.orm.nodes.data.int import Int` every time instead of `from aiida.orm import Int` does the understanding of the code any favors.
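The mechanics behind this can be shown with a throwaway package built at runtime (every name below is invented for illustration): the top-level `__init__.py` re-exports a deeply nested object, so the short public path and the full internal path resolve to the very same thing, mimicking how `aiida.orm` re-exports `Int` from `aiida.orm.nodes.data.int`.

```python
import os
import sys
import tempfile

# Build a tiny package on disk whose top-level __init__ re-exports
# a deeply nested name. All names here are made up.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'mypkg', 'internals'))
with open(os.path.join(root, 'mypkg', 'internals', '__init__.py'), 'w') as handle:
    handle.write('')
with open(os.path.join(root, 'mypkg', 'internals', 'deep.py'), 'w') as handle:
    handle.write('def handler():\n    return "handled"\n')
with open(os.path.join(root, 'mypkg', '__init__.py'), 'w') as handle:
    # The public interface: re-export the deep name at the top level.
    handle.write('from mypkg.internals.deep import handler\n')

sys.path.insert(0, root)
import mypkg
from mypkg.internals.deep import handler as deep_handler

# Both import paths yield the exact same function object, so the
# shorter public path loses no information at runtime.
print(mypkg.handler is deep_handler)  # True
```

Since both paths name one object, the debate above is purely about readability and grep-ability of the source, not about behavior.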
```python
@pytest.mark.usefixtures('clear_database_before_test')
def test_unhandled_failure(generate_work_chain, generate_calculation_node):
    """Test the unhandled failure mechanism.

    The workchain should be aborted if there are two consecutive failed sub processes that went unhandled. We simulate
    it by setting `ctx.unhandled_failure` to True and append two failed process nodes in `ctx.children`.
    """
    process = generate_work_chain(SomeWorkChain, {})
    process.setup()
    process.ctx.unhandled_failure = True
    process.ctx.iteration = 2
    process.ctx.children = [generate_calculation_node(exit_status=100) for _ in range(2)]
    assert process.inspect_process() == BaseRestartWorkChain.exit_codes.ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE  # pylint: disable=no-member
```
Could I ask you how these tests work? It seems like you are getting the workchain process and then manually modifying its context from the outside. As a side note, I don't know why, but I was assuming the context was something that was only modifiable from inside the workchain methods; anyway, this is just for testing, and that should be the intended way to use it, right?

On the other hand, if you are just modifying things manually, I'm not sure how the workchain error code gets set, because I don't see any `run` or `submit` calls; I only see you setting up the workchain. Is it sent to run when you `instantiate_process` with the `runner` in `generate_work_chain`? If that is the case, how are you sure that the workchain has not already finished by the time you modify it? I am not familiar with this way of running stuff...
The whole point of these tests is exactly that really "running" them is very difficult, because it often requires a lot of prerequisites. In the case of the `BaseRestartWorkChain`, which I am testing here, it needs to run some `CalcJob`, which in turn needs a `Code`, which then requires the corresponding binary to be installed. Since I don't want to test all of this in a big integration test, but want to write smaller unit tests of the individual methods of the `BaseRestartWorkChain`, I have to mock certain things. Since the methods I am testing are instance methods, I first need to have an instance; this is what `generate_work_chain` is doing. It just gets me a `WorkChain` instance on which I can then call the methods I want to test. In this case, I want to test the `inspect_process` method. This is normally not the first method that is called, so I have to make sure to correctly "mock" the state the workchain instance would normally have when run fully automatically by the engine. This includes defining the `ctx` correctly, which is an internal dictionary of the `WorkChain` where it stores temporary data. I populate it such that it should trigger a certain code path in `inspect_process` and check that it behaves correctly.
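The testing pattern itself needs nothing from aiida (the class, the numeric exit code, and the attribute-bag context below are all made up for illustration): mock the context the engine would have produced, call the method under test directly, and assert on its return value.

```python
from types import SimpleNamespace

# Hypothetical numeric stand-in for the real ExitCode object.
ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE = 402


class RestartLogic:
    """Reduced sketch of the two-strikes check in `inspect_process`."""

    def __init__(self):
        self.ctx = SimpleNamespace()  # attribute bag, like `WorkChain.ctx`

    def inspect_process(self):
        node = self.ctx.children[-1]
        if node['exit_status'] != 0:
            if self.ctx.unhandled_failure:
                # Second consecutive unhandled failure: abort.
                return ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE
            self.ctx.unhandled_failure = True
        else:
            self.ctx.unhandled_failure = False
        return None


# Populate the context as the engine would have left it after a first
# unhandled failure, then call the method under test directly.
logic = RestartLogic()
logic.ctx.unhandled_failure = True
logic.ctx.children = [{'exit_status': 100}, {'exit_status': 100}]
print(logic.inspect_process())  # 402
```

No runner is ever involved: the "run" is just a direct method call on a hand-prepared instance, which is what makes these unit tests cheap.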
Uhm, I see. I think that my main confusion was that I thought `inspect_process` just got you a result that was set by the engine when running the process, but now I understand this is not so: `inspect_process` itself looks at the status of the calculations in the context to decide its output. Correct?
The `inspect_process` is just a custom `WorkChain` method, one of the steps that is called in its outline:

```python
def inspect_process(self):  # pylint: disable=inconsistent-return-statements,too-many-branches
```
So I am just testing the code that is in there.
```python
class SomeWorkChain(BaseRestartWorkChain):
    """Dummy class."""

    _process_class = CalcJob
```
What is this? I don't remember needing to set a `_process_class` attribute inside workchains... is this new?
This is just a concept of the `BaseRestartWorkChain`. Note that this is an implementation of a `WorkChain`, but not a concrete one, i.e., you cannot run a `BaseRestartWorkChain`. It is merely a base class for a very common type of `WorkChain`: one that runs a `CalcJob` and deals with errors. To run an actual base restart workchain, one should subclass it and define this `_process_class`, which will be the `CalcJob` that the workchain will launch. Look at the `BaseRestartWorkChain` docstring for more information.
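The underlying pattern is plain Python: the base class declares a class attribute that concrete subclasses must fill in before the class becomes usable. A stripped-down sketch (none of these names are the real aiida classes):

```python
class BaseRestart:
    """Abstract-style base: unusable until `_process_class` is set."""

    _process_class = None

    def run_process(self):
        if self._process_class is None:
            raise NotImplementedError('subclass must set `_process_class`')
        # Launch (here: simply instantiate) the wrapped sub process.
        return self._process_class()


class EchoJob:
    """Stand-in for a concrete CalcJob."""


class EchoRestart(BaseRestart):
    _process_class = EchoJob  # the sub process this workchain launches


print(isinstance(EchoRestart().run_process(), EchoJob))  # True
```

This is why `SomeWorkChain` in the tests above only needs a docstring and `_process_class = CalcJob` to be a valid (if dummy) subclass.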
```python
process.setup()
process.ctx.unhandled_failure = True
process.ctx.iteration = 2
process.ctx.children = [generate_calculation_node(exit_status=0), generate_calculation_node(exit_status=400)]
```
If you are just testing that the handler catches the `exit_status=400` here, do you need the previous calculation with `exit_status=0`?
Technically no, since the code currently won't check the first one, and I don't run `inspect_process` manually twice after having appended a calculation to the `ctx` each time. I will maybe change the test to do exactly that and be more explicit, instead of setting the `iteration` manually.
Any `BaseRestartWorkChain` will fail if the sub process fails twice in a row without being handled by a registered process handler. To monitor this the `unhandled_failure` context variable is used, which is set to `True` once a failed process was not handled. It should be unset as soon as the next process finishes successfully, or the failed process is handled. The reset was there, but was incorrectly setting the context var to `True` instead of `False`, which went unnoticed due to a lack of tests. Tests are now added, as well as for the case where the sub process excepted or was killed.
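The effect of that one-word bug can be shown in isolation (the function below is an illustrative reduction of the mechanism described above, not the real implementation): with the buggy reset, a success between two failures never clears the flag, so non-consecutive failures still count as "two in a row".

```python
def inspect(ctx, exit_status, buggy=False):
    """One inspection step of the two-strikes logic (reduced sketch)."""
    if exit_status != 0:
        if ctx['unhandled_failure']:
            return 'ABORT'
        ctx['unhandled_failure'] = True
    else:
        # The bug: the reset wrote True instead of False, so a
        # successful sub process never cleared the flag.
        ctx['unhandled_failure'] = True if buggy else False
    return None


def final_outcome(statuses, buggy):
    """Run a sequence of sub-process exit statuses through the check."""
    ctx = {'unhandled_failure': False}
    outcome = None
    for status in statuses:
        outcome = inspect(ctx, status, buggy=buggy)
    return outcome


# fail, succeed, fail: the failures are NOT consecutive, yet the buggy
# version aborts while the fixed version keeps going.
print(final_outcome([100, 0, 100], buggy=True))   # ABORT
print(final_outcome([100, 0, 100], buggy=False))  # None
```

Two genuinely consecutive failures still abort in the fixed version, which is exactly what the new tests assert.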
Force-pushed from a67a625 to d807b67
Fixes #4153