Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code: add validate_remote_exec_path method to check executable #5184

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Oct 19, 2021

Fixes #5179
Fixes #868

A common problem is that the filepath of the executable for remote codes
is mistyped by accident. The user often doesn't realize until they
launch a calculation and it mysteriously fails with a non-descript
error. They have to look into the output files to find that the
executable could not be found.

At that point, it is not trivial to correct the mistake because the
Code cannot be edited nor can it be deleted, without first deleting
the calculation that was just run first. Therefore, it would be nice to
warn the user at the time of the code creation or storing.

However, the check requires opening a connection to the associated
computer which carries both significant overhead, and it may not always
be available at time of the code creation. Setup scripts for automated
environments may want to configure the computers and codes at a time
when they cannot be necessarily reached. Therefore, preventing codes
from being created in this case is not acceptable.

The compromise is to implement the check in validate_remote_exec_path
which can then freely be called by a user to check if the executable of
the remote code is usable. The method is added to the CLI through the
addition of the command verdi code test. Also here, we decide to not
add the check by default to verdi code setup as that should be able to
function without internet connection and with minimal overhead. The docs
are updated to encourage the user to run verdi code test before using
it in any calculations if they want to make sure it is functioning. In
the future, additional checks can be added to this command.

A common problem is that the filepath of the executable for remote codes
is mistyped by accident. The user often doesn't realize until they
launch a calculation and it mysteriously fails with a non-descript
error. They have to look into the output files to find that the
executable could not be found.

At that point, it is not trivial to correct the mistake because the
Code cannot be edited nor can it be deleted, without first deleting
the calculation that was just run first.

It would be nice to warn the user at the time of the code creation.
Therefore a check is added to Code._validate, which is called
automatically when the code is stored, which checks whether the
specified executable exists on the remote computer. However, we need to
account for the fact that for a code on a remote computer, the computer
may not actually be reachable at the time of the code setup. When this
is the case and opening the transport fails, instead of failing the
command, a warning is logged instead, stating that the presence of the
code could not be verified. This is preferred to making the command fail
entirely as this would make it impossible to setup codes for computers
that are not currently reachable, which is undesirable.

This is a first quick implementation and would need tests before merging but there are first some open questions to be answered:

  • Should this be done in verdi code setup or in Code constructor?
    I would definitely say in verdi code setup because it goes beyond the scope of the constructor. Downside is that it puts the burden of adding the check in other interfaces, but this is to be expected and the correct way IMO.
    This is now added in the Code.validate_remote_exec_path method which is automatically called by store().
  • Should the implementation catch just Exception instead of TransportOpenException to account for custom transport plugins that don't raise the "right" exception? It's not nice to catch a bare Exception but maybe in this case it is necessary. Reverted this change and just catch broad-except
  • Should we check the permissions of the file to make sure it is executable? This is something we could potentially check, but maybe not all transport types properly implement it. Maybe this is going to far, because there are plenty of other things to still check that would prevent the binary from running. Won't implement this for now

Another note: this fix still doesn't address a calculation failing if it is run with a binary that doesn't exist and having a uniform exit code. We could try this, but it would depend on parsing the _scheduler_stderr.txt for the error message of the missing binary and I am not sure how deterministic we can make this. At the very least this should depend on the used shell, which I believe we require to be bash, so at least that would be something. But does this also depend on anything else of the environment, for example with what MPI library it is called, e.g. mpirun vs srun etc.?

@sphuber
Copy link
Contributor Author

sphuber commented Oct 19, 2021

Pinging @giovannipizzi @chrisjsewell and @louisponet for comments

@chrisjsewell
Copy link
Member

thanks @sphuber

Should this be done in verdi code setup or in Code constructor?

Why does it need to be in the constructor? it just needs to be run before storing the Code, not constructing it.
I would add this as a method on the Code, like Code.check_executable. Then you can use it as part of the CLI and API; definitely good to check on setup, but you may also want to check before running calcs (obviously just because it is present at setup, doesn't mean it will be when you come to running it)

@sphuber
Copy link
Contributor Author

sphuber commented Oct 20, 2021

Why does it need to be in the constructor? it just needs to be run before storing the Code, not constructing it. I would add this as a method on the Code, like Code.check_executable. Then you can use it as part of the CLI and API;

Fair, will move it there.

but you may also want to check before running calcs (obviously just because it is present at setup, doesn't mean it will be when you come to running it)

Here I am not so sure. The downside here is that checking adds overhead. Ideally you do this at submission time, as to prevent the creation of the node and task if it is going to crash anyway, but this would require opening a transport which is a significant overhead and especially in high-throughput mode this slowdown is unacceptable. We could move it to a transport task where the transport will be opened anyway (and properly pooled) but than the calculation will already exist and can be properly excepted at best. If part of a workflow, it will still topple the entire workflow.

Edit: Maybe we could add verdi code test (like verdi computer setup) that runs this check and so people can run this before submitting a or many calculations, but I wonder how often this would actually be used.

@chrisjsewell
Copy link
Member

Here I am not so sure. The downside here is that checking adds overhead.

Oh indeed; I'm not suggesting adding to the core code, just that people could use it if they wish for this purpose.

Maybe we could add verdi code test

Yep thats exactly what I was going to suggest 👍

@sphuber sphuber force-pushed the fix/5179/code-setup-check-binary-existance branch 2 times, most recently from 1215fdf to 8762b37 Compare October 20, 2021 07:37
@sphuber sphuber changed the title verdi code setup: check existence of executable for remote code Code: check existence of executable for remote code when storing Oct 20, 2021
@sphuber
Copy link
Contributor Author

sphuber commented Oct 20, 2021

@chrisjsewell Have moved the check to a validation method on Code. This now also guarantees it is checked in normal API use as well as verdi code duplicate. Plus it is now easy to reuse in the verdi code test that I added. If the solution is approved I will add the tests.

On a side note, since your recent PR that you merged to add the pylint_aiida plugin, my pre-commit is failing with:


Traceback (most recent call last):
  File "/home/sph/.virtualenvs/aiida_dev/lib/python3.9/site-packages/pylint/lint/pylinter.py", line 577, in load_plugin_configuration
    module = astroid.modutils.load_module_from_name(modname)
  File "/home/sph/.virtualenvs/aiida_dev/lib/python3.9/site-packages/astroid/modutils.py", line 218, in load_module_from_name
    return importlib.import_module(dotted_name)
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'utils.pylint_aiida'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sph/.virtualenvs/aiida_dev/bin/pylint", line 8, in <module>
    sys.exit(run_pylint())
  File "/home/sph/.virtualenvs/aiida_dev/lib/python3.9/site-packages/pylint/__init__.py", line 24, in run_pylint
    PylintRun(sys.argv[1:])
  File "/home/sph/.virtualenvs/aiida_dev/lib/python3.9/site-packages/pylint/lint/run.py", line 359, in __init__
    linter.load_plugin_configuration()
  File "/home/sph/.virtualenvs/aiida_dev/lib/python3.9/site-packages/pylint/lint/pylinter.py", line 581, in load_plugin_configuration
    self.add_message("bad-plugin-value", args=(modname, e), line=0)
  File "/home/sph/.virtualenvs/aiida_dev/lib/python3.9/site-packages/pylint/message/message_handler_mix_in.py", line 244, in add_message
    self.add_one_message(
  File "/home/sph/.virtualenvs/aiida_dev/lib/python3.9/site-packages/pylint/message/message_handler_mix_in.py", line 306, in add_one_message
    self.stats[msg_cat] += 1
KeyError: 'error'

I reinstalled with pip install -e .[pre-commit] in case this was necessary to install the plugin from utils/pylint_aiida.py but to no avail. What needs to be done for this to work?

@sphuber sphuber force-pushed the fix/5179/code-setup-check-binary-existance branch from 8762b37 to 0d2578f Compare October 20, 2021 07:48
@chrisjsewell
Copy link
Member

How are you calling pre-commit? At present, you have to call it from the repo folder, so it can find the module file to load. I was thinking though, either to add it internal to aiida, e.g. something like aiida.tools.aiida_pylint, or as a separate package (and add it to the pre-commit extra)

@sphuber
Copy link
Contributor Author

sphuber commented Oct 20, 2021

How are you calling pre-commit? At present, you have to call it from the repo folder, so it can find the module file to load. I was thinking though, either to add it internal to aiida, e.g. something like aiida.tools.aiida_pylint, or as a separate package (and add it to the pre-commit extra)

It is installed as a pre-commit hook through pre-commit install, so it gets called whenever I git commit.

@@ -293,6 +294,34 @@ def _validate(self):
if not self.get_remote_exec_path():
raise exceptions.ValidationError('You did not specify a remote executable')

if not self.is_local():
self.validate_remote_exec_path()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This blocks the node from being stored on failure yeh?
Do we really want to be this strict, i.e. you cannot create a Code unless you are connected to the Computer.
Perhaps we just run validate_remote_exec_path in verdi code setup with catch/except, and have a user prompt if they definitely still want to store the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if you cannot connect to the computer, it simply logs a warning. However, if you can connect to the machine and the binary doesn't exist, then it will raise an exception and prevent storing. Should we allow this to be possible though? Is there going to be a use case to define a code whose binary doesn't exist and is going to be added later? I guess it is possible. We can remove the call from _validate and handle it in verdi code setup but then we are back to the original problem where this check is not performed when creating directly through ORM or any other means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As agreed upon in discussions, this is now removed entirely from the _validate method and thus from storing. It has to be called manually and is plumbed into the CLI through verdi code test.

@louisponet
Copy link

I think this is a great solution, I think it will catch most of the suspected cases when this may pop up (i.e. users manually setting things up).

As a suggestion, usually the module load commands that support a given executable will be specified in the Prepend Text right?
An additional check that could be done is actually run that Prepend text followed by an ldd on the executable, and then parse whether there are certain libraries that couldn't be found. I think this is another very commonly occurring failure case and when I implemented it before it seemed to work quite well.

I can ref some code if you'd like but it's not very hard to implement anyway.

@ltalirz
Copy link
Member

ltalirz commented Oct 21, 2021

I haven't read the discussion; just wanted to mention that there has been an open issue for this since quite a while (actually opened by @sphuber ;-) ) #868

Great that you're now going to fix it!

@giovannipizzi
Copy link
Member

Thanks @sphuber for the implementation, and thanks all for the feedback, and @ltalirz for finding the older issue!

I think I confirm what I mentioned there: I would not run any check on verdi code setup (no connection opening, no way for it to fail because of the state of the code on the remote); and instead in the docs suggest to run verdi code test after having setup the code, for a few reasons:

  • this mirrors what is done for computers, so the user is used to it (and I don't see a major issue, except that users need to run one step more if they want to check - it's maybe annoying if they did a mistake in the code path, but then we can just suggest in the error message of verdi code test: "if the path is wrong, use verdi code duplicate to crete a new one with the correct name, and delete the old one"
  • the AiiDA code is just a representation of what executable you wanted to use - it's actually a perfectly valid situation that the code does not exist (anymore) from a provenance point of view, unless you want to run with it. Actually, if you e.g. import data from a colleague, you could be pointing to a code that does not exist anymore, or it's only accessible to them (e.g. in their home, even if you have access to the computer), or it's even in a non-accessible or not-more-existing computer.
  • Another example (just to show a usecase): I could decide to create codes in a single AiiDA instance using some yaml files, then export them to and archive file, and then distribute this file to be imported by colleagues e.g. in a collaboration (so all use the same UUID for the codes). E.g. the yaml files are provided by each person, but a single person takes care of generating the archive file. But e.g. the person creating the codes has access to the computer but not to some executables (e.g. one is in a user's home folder, and only a subgroup has access to it; or they are preparing this first AiiDA archive file in a machine that has no direct access to the computer because it's behind a VPN; ... - just to give some examples). Decoupling creation (that can be quick and does not open connections, similarly to verdi computer setup/configure) and testing (verdi computer/code test, that can have many more features in the future like optionally the one that @louisponet mentions) is, for me, a better strategy.

BTW, an added benefit of having a new verdi code setup is the it will realise if the computer is setup but not configured (another very common error :-) ). (We should actually check that the error in this case is clear and points to `verdi code configure - probably it does, but mentioning just in case).

@sphuber
Copy link
Contributor Author

sphuber commented Oct 21, 2021

I haven't read the discussion; just wanted to mention that there has been an open issue for this since quite a while (actually opened by @sphuber ;-) ) #868

I remembered it, but thought we had already closed it and so didn't bother to look again. Thanks for fishing it up. It will be closed by in whatever form we will decide to merge this PR.

BTW, an added benefit of having a new verdi code setup is the it will realise if the computer is setup but not configured (another very common error :-) )

Do you mean "a new verdi code test" here?

We should actually check that the error in this case is clear and points to `verdi code configure - probably it does, but mentioning just in case).

And here you mean "verdi computer configure"?

Another example (just to show a usecase):

I can see the point, but an archive is maybe not the best example to use for this story as in that case I don't think it is guaranteed they go through the front-end layer and get constructed directly. This would really be more of a flaw in the import code.

In the end, I agree though that we shouldn't be performing this test in the code creation or storing and I tried to explain the cases you mention to @louisponet as a justification but he thought they were hypothetical. I see the point that for certain users, having the check built in is nicer and easier as it prevents additional work from a silly mistake. That is the difficulty of designing software with a variety of use-cases; there will have to be compromises and nothing is ever as simple and clear cut as it seems.

So @louisponet , is it ok for you to remove the check and have a standalone check that is exposed through verdi code test and we add this to the docs?

@louisponet
Copy link

this mirrors what is done for computers

I didn't get why we couldn't do a similar check for testing computers.

I also don't really understand what issues such trial tests lead to in the usecases/situations mentioned above.

Is the overhead of checking whether a given ssh is reachable that serious??

@chrisjsewell
Copy link
Member

I also don't really understand what issues such trial tests lead to in the usecases/situations mentioned above.
Is the overhead of checking whether a given ssh is reachable that serious??

Its not the overhead, its the fact that we should not prohibit the creation of computers/codes if you don't happen to have a connection to the HPC; you might want to create it offline, or you are creating a test/immigrant computation.
We can certainly warn people of possible issues, but not be "stubborn" about disallowing them

@sphuber
Copy link
Contributor Author

sphuber commented Oct 22, 2021

Its not the overhead, its the fact that we should not prohibit the creation of computers/codes if you don't happen to have a connection to the HPC; you might want to create it offline, or you are creating a test/immigrant computation.
We can certainly warn people of possible issues, but not be "stubborn" about disallowing them

☝️ what he said

@louisponet
Copy link

louisponet commented Oct 22, 2021 via email

@giovannipizzi
Copy link
Member

@sphuber yes, I had mistyped some of the commands.

Regarding the points above, I think there is a technical aspect: I think it wasn't obvious how to trigger, from verdi computer setup, directly also the configure (and, possibly, the test). If we find a way to do this automatically (as soon as the setup is over, the user continues with the configure, and then with the test) then I'm ok in making this a single process. Otherwise, while I see Louis's comments on making it easier for his usecase, I would vote for now to already add it as a different code test command (it's already much better than not having it). We can always improve it in the future and automate it a bit more.

@sphuber
Copy link
Contributor Author

sphuber commented Oct 25, 2021

Regarding the points above, I think there is a technical aspect: I think it wasn't obvious how to trigger, from verdi computer setup, directly also the configure (and, possibly, the test). If we find a way to do this automatically (as soon as the setup is over, the user continues with the configure, and then with the test) then I'm ok in making this a single process.

The problem was and still is (I believe) with click. The way to call a command from another is to use Context.invoke, however, that doesn't go through the normal pipeline as a normal CLI invocation and prompting and argument validation/conversion, is not triggered the same way. Only real alternative would maybe to literally call the next command as a subprocess call, but that is very ugly and may have other problems.

Otherwise, while I see Louis's comments on making it easier for his usecase, I would vote for now to already add it as a different code test command (it's already much better than not having it). We can always improve it in the future and automate it a bit more.

I think this is the consensus now. Once I find the time I will update the PR accordingly.

@louisponet
Copy link

Strange consensus. Anyhow, another note is that computer test doesn't actually test the computer since the mpirun command can still fail silently.

On the click front, I don't get why it's so ugly since it would be a single command that would try calling another function (code test) which if it fails the first command would print a warning.

@sphuber
Copy link
Contributor Author

sphuber commented Oct 29, 2021

Anyhow, another note is that computer test doesn't actually test the computer since the mpirun command can still fail silently.

It does test the computer, it just doesn't test everything. Feel free to open a PR if you think this can and should be added to verdi computer test.

On the click front, I don't get why it's so ugly since it would be a single command that would try calling another function (code test) which if it fails the first command would print a warning.

The problem is exactly that: in click you can call one command from another, however, that won't trigger the prompting nor default/callbacks of the parameters, which we rely on. So that is not an option. This is a design decision from click and there is not much that we can do.

@louisponet
Copy link

Yea I will open an issue about the computer, I don't believe it makes a lot of sense to have a fixed mpirun command to a whole computer to start.

But I assume you can take out part of the functionality, put it in a function and call that function from click, no?

@sphuber sphuber force-pushed the fix/5179/code-setup-check-binary-existance branch from 0d2578f to 1c6978e Compare November 3, 2021 08:57
@sphuber sphuber changed the title Code: check existence of executable for remote code when storing Code: add validate_remote_exec_path method to check executable Nov 3, 2021
@sphuber sphuber force-pushed the fix/5179/code-setup-check-binary-existance branch from 1c6978e to 8992de2 Compare November 3, 2021 15:00
@sphuber sphuber force-pushed the fix/5179/code-setup-check-binary-existance branch from 8992de2 to c1e86d5 Compare November 16, 2021 13:41
@codecov
Copy link

codecov bot commented Nov 16, 2021

Codecov Report

Merging #5184 (37b79ad) into develop (632ab72) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #5184      +/-   ##
===========================================
+ Coverage    81.23%   81.24%   +0.01%     
===========================================
  Files          533      533              
  Lines        37356    37377      +21     
===========================================
+ Hits         30344    30364      +20     
- Misses        7012     7013       +1     
Flag Coverage Δ
django 76.12% <100.00%> (+0.01%) ⬆️
sqlalchemy 75.18% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/cmdline/commands/cmd_code.py 90.13% <100.00%> (+0.45%) ⬆️
aiida/orm/nodes/data/code.py 91.43% <100.00%> (+0.48%) ⬆️
aiida/transports/plugins/local.py 81.41% <0.00%> (-0.25%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 632ab72...37b79ad. Read the comment docs.

@sphuber sphuber force-pushed the fix/5179/code-setup-check-binary-existance branch 2 times, most recently from e991418 to a77777b Compare November 17, 2021 11:40
@sphuber
Copy link
Contributor Author

sphuber commented Nov 17, 2021

@ramirezfranciscof @giovannipizzi any of you want to give a final review? This is ready to be merged.

Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me! I think we can extend it (e.g. check if the file is also executable), but I think we can already merge now and we can extend/improve later!

A common problem is that the filepath of the executable for remote codes
is mistyped by accident. The user often doesn't realize until they
launch a calculation and it mysteriously fails with a non-descript
error. They have to look into the output files to find that the
executable could not be found.

At that point, it is not trivial to correct the mistake because the
`Code` cannot be edited nor can it be deleted, without first deleting
the calculation that was just run first. Therefore, it would be nice to
warn the user at the time of the code creation or storing.

However, the check requires opening a connection to the associated
computer which carries both significant overhead, and it may not always
be available at time of the code creation. Setup scripts for automated
environments may want to configure the computers and codes at a time
when they cannot be necessarily reached. Therefore, preventing codes
from being created in this case is not acceptable.

The compromise is to implement the check in `validate_remote_exec_path`
which can then freely be called by a user to check if the executable of
the remote code is usable. The method is added to the CLI through the
addition of the command `verdi code test`. Also here, we decide to not
add the check by default to `verdi code setup` as that should be able to
function without internet connection and with minimal overhead. The docs
are updated to encourage the user to run `verdi code test` before using
it in any calculations if they want to make sure it is functioning. In
the future, additional checks can be added to this command.
@sphuber sphuber force-pushed the fix/5179/code-setup-check-binary-existance branch from a77777b to 37b79ad Compare November 18, 2021 08:58
@sphuber sphuber merged commit 2f5170d into aiidateam:develop Nov 18, 2021
@sphuber sphuber deleted the fix/5179/code-setup-check-binary-existance branch November 18, 2021 09:24
@ltalirz ltalirz mentioned this pull request Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants