
Fix the unhandled failure mechanism of the BaseRestartWorkChain #4155

Merged

Conversation

@sphuber (Contributor) commented Jun 6, 2020

Fixes #4153

Any BaseRestartWorkChain will fail if the sub process fails twice in a
row without being handled by a registered process handler. To monitor
this, the unhandled_failure context variable is used, which is set to
True once a failed process goes unhandled. It should be unset as soon
as the next process finishes successfully or the failed process is
handled. The reset was there, but it was incorrectly setting the context
variable to True instead of False, which went unnoticed due to a lack of
tests. Tests are now added, including for the cases where the sub
process excepted or was killed.
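The mechanism described above can be sketched in plain Python. This is a toy reproduction of the flag logic, not actual aiida-core code; the function name and 'exit labels' are illustrative only.

```python
# Toy reproduction (plain Python, not aiida-core) of the unhandled_failure
# flag logic: abort only on the second consecutive unhandled failure.

def inspect_process(ctx, child_failed, handled):
    """Return a toy exit label, updating the unhandled_failure flag in ctx."""
    if not child_failed or handled:
        # The bug fixed in this PR: this reset mistakenly assigned True
        # instead of False, so any single later unhandled failure would
        # immediately abort the workchain.
        ctx['unhandled_failure'] = False
        return 'ok'
    if ctx.get('unhandled_failure'):
        return 'ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE'
    ctx['unhandled_failure'] = True
    return 'retry'
```

Two unhandled failures in a row abort the loop, while a success (or a handled failure) in between resets the flag.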

@codecov bot commented Jun 6, 2020

Codecov Report

Merging #4155 into develop will increase coverage by 0.04%.
The diff coverage is 100.00%.


@@             Coverage Diff             @@
##           develop    #4155      +/-   ##
===========================================
+ Coverage    78.86%   78.89%   +0.04%     
===========================================
  Files          467      467              
  Lines        34481    34481              
===========================================
+ Hits         27189    27201      +12     
+ Misses        7292     7280      -12     
Flag Coverage Δ
#django 70.82% <100.00%> (+0.04%) ⬆️
#sqlalchemy 71.70% <100.00%> (+0.04%) ⬆️
Impacted Files Coverage Δ
aiida/engine/processes/workchains/restart.py 54.73% <100.00%> (+7.44%) ⬆️
aiida/transports/plugins/local.py 80.47% <0.00%> (+0.26%) ⬆️


@ramirezfranciscof (Member) left a comment

So, I think the only required change is the docstring on generate_calculation_node; the rest are just questions to learn better how workchain processes work. I can approve it so you can merge once you check that docstring, provided nothing in the questions triggers any lateral idea on how to improve or clarify the code (hey, sometimes it happens).


@pytest.fixture
def generate_calculation_node():
"""Generate an instance of a `WorkChain`."""
ramirezfranciscof (Member):

This docstring is not correct, no?

sphuber (Contributor, Author):

Yes, will adapt it. Thanks

Comment on lines +113 to +114
runner = get_manager().get_runner()
process = instantiate_process(runner, process_class, **inputs)
ramirezfranciscof (Member):

Sorry to be a nuisance, but could I get some context on this couple of lines? I have only read about workchains, mostly the basic idea, but never really run one. I tried going to the source in manager.py and utils.py, but I feel I need some extra general information on how the engine works to understand these functions/classes.

sphuber (Contributor, Author):

This is how a Process instance is created. Normally this is done automatically by the engine when you call run or submit, but in this case I need more control: instead of having the engine actually run it, for these tests I just want the process instance so I can "run" parts of it myself.
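The idea can be illustrated with a toy sketch in plain Python (not the aiida engine API): build the instance yourself so you can call individual steps directly, instead of handing the class to an engine that executes every step for you.

```python
# Toy illustration (plain Python, not aiida) of instantiating a process
# without running it, so individual steps can be exercised in a unit test.

class ToyProcess:
    def __init__(self, **inputs):
        self.inputs = inputs
        self.ctx = {}          # mutable state, analogous to WorkChain.ctx

    def setup(self):
        self.ctx['iteration'] = 0

    def inspect_process(self):
        return self.ctx['iteration']

def toy_engine_run(process_class, **inputs):
    """What run/submit do conceptually: instantiate, then execute all steps."""
    process = process_class(**inputs)
    process.setup()
    return process.inspect_process()

# In a test we skip the engine, keep the instance, and drive it ourselves:
process = ToyProcess(x=1)
process.setup()
process.ctx['iteration'] = 2   # mock the state a real run would have produced
result = process.inspect_process()
```

The instance-level control is the point: the test can mutate `ctx` between steps, which an engine-driven run would never allow.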


from aiida.engine import CalcJob, BaseRestartWorkChain, process_handler, ProcessState, ProcessHandlerReport, ExitCode
ramirezfranciscof (Member):

So, this is a more general question.

I understand that from an end-user perspective, having the important methods available through the higher-level modules instead of the specific ones (i.e. using from aiida.engine import process_handler instead of the complete from aiida.engine.processes.workchains.utils import process_handler) is helpful, because users don't need to know the internal organization to have access to them.

However, wouldn't it still be better for us to internally keep using the most specific path possible when importing inside aiida-core? If we need to debug or the like, it would be easier to know the exact location of the functions and classes being imported. I know most IDEs have some sort of automatic search for this, but perhaps sometimes you need to use a simplified interface, or maybe you are just grepping, and it helps to have the information right there.

sphuber (Contributor, Author):

As you said, most proper IDEs can automatically jump to the source, and the increased brevity is a big plus in this case. The exception, in my opinion, is when you are importing something from a submodule; in that case the relative from .module import something is better than the absolute form.

Member:

Additional point, it's important for us to use the "public" interface as much as possible in the tests - if only to make sure that we don't accidentally break it.

ramirezfranciscof (Member) commented Jun 9, 2020:

Ah, ok. Simultaneously checking in the tests that you are not breaking the public interface is a good point; I didn't think about that. I still think there are some advantages to knowing the exact location of the code you are using, and (at least for the non-test internals) it is not clear to me that the brevity in the "header" of the file (where there is not much logic to obfuscate) outweighs this.

sphuber (Contributor, Author):

I really don't think that using from aiida.orm.nodes.data.int import Int every time, instead of from aiida.orm import Int, does the understanding of the code any favors.

Comment on lines 61 to 76
@pytest.mark.usefixtures('clear_database_before_test')
def test_unhandled_failure(generate_work_chain, generate_calculation_node):
"""Test the unhandled failure mechanism.

The workchain should be aborted if there are two consecutive failed sub processes that went unhandled. We simulate
it by setting `ctx.unhandled_failure` to `True` and appending two failed process nodes to `ctx.children`.
"""
process = generate_work_chain(SomeWorkChain, {})
process.setup()
process.ctx.unhandled_failure = True
process.ctx.iteration = 2
process.ctx.children = [generate_calculation_node(exit_status=100) for _ in range(2)]
assert process.inspect_process() == BaseRestartWorkChain.exit_codes.ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE # pylint: disable=no-member

ramirezfranciscof (Member):

Could I ask you how these tests work? It seems like you are getting the workchain process and then manually modifying its context from outside of it. As a side note, I don't know why, but I was assuming the context was something only modifiable from inside the workchain methods; anyway, this is just for testing, and that should be the intended way to use it, right?

On the other hand, if you are just modifying things manually, I'm not sure how the workchain error code gets set, because I don't see any run or submit calls; I only see you setting up the workchain. Is it sent to run when you instantiate_process with the runner in generate_work_chain? If that is the case, how are you sure that the workchain has not already finished by the time you modify it? I am not familiar with this way of running stuff...

sphuber (Contributor, Author):

The whole point of these tests is exactly that really "running" them is very difficult, because it often requires a lot of prerequisites. In the case of the BaseRestartWorkChain, which I am testing here, it needs to run some CalcJob, which in turn needs a Code, which then requires the corresponding binary to be installed. Since I don't want to test all of this in a big integration test, but rather write smaller unit tests of the individual methods of the BaseRestartWorkChain, I have to mock certain things.

Since the methods I am testing are methods of the workchain class, I first need to have an instance. This is what generate_work_chain is doing: it just gets me a WorkChain instance on which I can then call the methods I want to test. In this case, I want to test the inspect_process method. This is normally not the first method that is called, so I have to correctly "mock" the state that the workchain instance would normally have at that point, when run fully automatically by the engine. This includes defining the ctx correctly, which is an internal dictionary of the WorkChain where it stores temporary data. I populate it such that it should trigger a certain code path in inspect_process and check that it behaves correctly.

ramirezfranciscof (Member):

Uhm, I see. I think my main confusion was that I thought inspect_process just got you a result that had been set by the engine when running the process, but now I understand that is not so: inspect_process itself looks at the status of the calculations in the context to decide its output. Correct?

sphuber (Contributor, Author):

The inspect_process is just a custom WorkChain method, one of the steps that is called in its outline:

def inspect_process(self): # pylint: disable=inconsistent-return-statements,too-many-branches

So I am just testing the code that is in there.
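The outline idea can be sketched as a toy in plain Python (this is not the aiida spec API, just an illustration): an outline is an ordered list of step methods that the engine calls one after another on the same instance, which is exactly why a unit test can also call any single step directly.

```python
# Toy sketch (not the aiida spec API): an outline is an ordered sequence of
# step methods; the engine walks it, calling each step on the same instance.

class ToyWorkChain:
    def setup(self):
        self.log = ['setup']

    def run_process(self):
        self.log.append('run_process')

    def inspect_process(self):
        self.log.append('inspect_process')

    # analogous to spec.outline(cls.setup, cls.run_process, cls.inspect_process)
    outline = (setup, run_process, inspect_process)

wc = ToyWorkChain()
for step in ToyWorkChain.outline:
    step(wc)   # a unit test can instead call a single step, e.g. wc.inspect_process()
```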


class SomeWorkChain(BaseRestartWorkChain):
"""Dummy class."""
_process_class = CalcJob
ramirezfranciscof (Member):

What is this? I don't remember needing to set a _process_class property inside workchains... is this new?

sphuber (Contributor, Author):

This is just a concept of the BaseRestartWorkChain. Note that it is an implementation of a WorkChain but not a concrete one, i.e. you cannot run a BaseRestartWorkChain itself. It is merely a base class for a very common type of WorkChain: one that runs a CalcJob and deals with errors. To run an actual base restart workchain, one should subclass it and define this _process_class, which will be the CalcJob that the workchain will launch. Look at the BaseRestartWorkChain docstring for more information.
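That division of labor can be sketched with stand-in classes (plain Python with hypothetical names, not real aiida classes): the base class owns the restart machinery and only asks subclasses to declare which process class they launch.

```python
# Toy analogue of the BaseRestartWorkChain pattern: the abstract base class
# owns the restart loop, a concrete subclass only declares the process class.

class ToyBaseRestartWorkChain:
    _process_class = None  # concrete subclasses must point this at a "CalcJob"

    def run_process(self):
        if self._process_class is None:
            raise NotImplementedError('subclass and set _process_class')
        return self._process_class()  # the base class launches it for you

class ToyCalcJob:
    """Stand-in for the CalcJob the workchain should run and babysit."""

class ToyRestartWorkChain(ToyBaseRestartWorkChain):
    _process_class = ToyCalcJob
```

The base class alone is unusable by design; only the subclass, having filled in `_process_class`, can actually launch anything.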

process.setup()
process.ctx.unhandled_failure = True
process.ctx.iteration = 2
process.ctx.children = [generate_calculation_node(exit_status=0), generate_calculation_node(exit_status=400)]
ramirezfranciscof (Member):

If you are just testing that the handler catches the exit_status=400 here, do you need the previous calculation with exit_status=0?

sphuber (Contributor, Author):

Technically no, since the code currently won't check the first one, and I don't run inspect_process manually twice after appending a calculation to the ctx each time. I will maybe change the test to do this and be more explicit, instead of setting the iteration manually.

@sphuber sphuber force-pushed the fix/4153/base-restart-unhandled-failure branch from a67a625 to d807b67 Compare June 9, 2020 09:38
@sphuber sphuber merged commit 6f029df into aiidateam:develop Jun 9, 2020
@sphuber sphuber deleted the fix/4153/base-restart-unhandled-failure branch June 9, 2020 10:01
Successfully merging this pull request may close these issues.

BaseRestartWorkChain: unhandled_failure flag reset