`Code`: add `validate_remote_exec_path` method to check executable #5184
Pinging @giovannipizzi @chrisjsewell and @louisponet for comments |
thanks @sphuber
Why does it need to be in the constructor? It just needs to be run before storing the `Code`, not when constructing it. |
Fair, will move it there.
Here I am not so sure. The downside here is that checking adds overhead. Ideally you do this at submission time, so as to prevent the creation of the node and task if it is going to crash anyway, but this would require opening a transport, which is a significant overhead, and especially in high-throughput mode this slowdown is unacceptable. We could move it to a transport task where the transport will be opened anyway (and properly pooled), but then the calculation will already exist and can at best be properly excepted. If part of a workflow, it will still topple the entire workflow. Edit: Maybe we could add |
Oh indeed; I'm not suggesting adding to the core code, just that people could use it if they wish for this purpose.
Yep that's exactly what I was going to suggest 👍 |
Force-pushed from 1215fdf to 8762b37.
Title changed from "`verdi code setup`: check existence of executable for remote code" to "`Code`: check existence of executable for remote code when storing".
@chrisjsewell Have moved the check to a validation method on On a side note, since your recent PR that you merged to add the
I reinstalled with |
Force-pushed from 8762b37 to 0d2578f.
How are you calling pre-commit? At present, you have to call it from the repo folder, so it can find the module file to load. I was thinking though, either to add it internal to aiida, e.g. something like |
It is installed as a pre-commit hook through |
aiida/orm/nodes/data/code.py (Outdated)
@@ -293,6 +294,34 @@ def _validate(self):
            if not self.get_remote_exec_path():
                raise exceptions.ValidationError('You did not specify a remote executable')

        if not self.is_local():
            self.validate_remote_exec_path()
This blocks the node from being stored on failure, yeah? Do we really want to be this strict, i.e. you cannot create a `Code` unless you are connected to the `Computer`? Perhaps we just run `validate_remote_exec_path` in `verdi code setup` inside a try/except, and prompt the user to confirm whether they definitely still want to store the code?
No, if you cannot connect to the computer, it simply logs a warning. However, if you can connect to the machine and the binary doesn't exist, then it will raise an exception and prevent storing. Should we allow this to be possible though? Is there going to be a use case to define a code whose binary doesn't exist and is going to be added later? I guess it is possible. We can remove the call from `_validate` and handle it in `verdi code setup`, but then we are back to the original problem where this check is not performed when creating directly through ORM or any other means.
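The distinction drawn in this comment (unreachable computer gives a warning, reachable computer with a missing binary gives an exception) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `FakeTransport`, `TransportUnreachable`, and `check_remote_executable` are hypothetical names, and the `ValidationError` class here is a local stand-in rather than AiiDA's.

```python
import logging

logger = logging.getLogger(__name__)


class ValidationError(Exception):
    """Local stand-in for the validation exception seen in the diff."""


class TransportUnreachable(Exception):
    """Hypothetical error raised when the computer cannot be reached."""


class FakeTransport:
    """Hypothetical transport: a context manager exposing ``isfile``."""

    def __init__(self, files, reachable=True):
        self.files = set(files)
        self.reachable = reachable

    def __enter__(self):
        if not self.reachable:
            raise TransportUnreachable('could not open connection')
        return self

    def __exit__(self, *exc_info):
        return False

    def isfile(self, path):
        return path in self.files


def check_remote_executable(transport, path):
    """Return True if verified, False if the check could not be performed.

    Raises ``ValidationError`` only when the computer is reachable and the
    file is definitely missing.
    """
    try:
        with transport as handle:
            exists = handle.isfile(path)
    except TransportUnreachable as exc:
        # Not being able to connect is not treated as a failure of the code.
        logger.warning('could not verify `%s`: %s', path, exc)
        return False

    if not exists:
        raise ValidationError(f'the executable `{path}` does not exist on the remote computer')
    return True
```

Whether the second case should block storing at all is exactly what the rest of the thread debates; the sketch only makes the two outcomes explicit.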
As agreed upon in discussions, this is now removed entirely from the `_validate` method and thus from storing. It has to be called manually and is plumbed into the CLI through `verdi code test`.
I think this is a great solution, I think it will catch most of the suspected cases when this may pop up (i.e. users manually setting things up). As a suggestion, usually the module load commands that support a given executable will be specified in the Prepend Text right? I can ref some code if you'd like but it's not very hard to implement anyway. |
Thanks @sphuber for the implementation, and thanks all for the feedback, and @ltalirz for finding the older issue! I think I confirm what I mentioned there: I would not run any check on
BTW, an added benefit of having a new |
I remembered it, but thought we had already closed it and so didn't bother to look again. Thanks for fishing it up. It will be closed by this PR, in whatever form we decide to merge it.
Do you mean "a new
And here you mean "
I can see the point, but an archive is maybe not the best example to use for this story as in that case I don't think it is guaranteed they go through the front-end layer and get constructed directly. This would really be more of a flaw in the import code. In the end, I agree though that we shouldn't be performing this test in the code creation or storing and I tried to explain the cases you mention to @louisponet as a justification but he thought they were hypothetical. I see the point that for certain users, having the check built in is nicer and easier as it prevents additional work from a silly mistake. That is the difficulty of designing software with a variety of use-cases; there will have to be compromises and nothing is ever as simple and clear cut as it seems. So @louisponet , is it ok for you to remove the check and have a standalone check that is exposed through |
I didn't get why we couldn't do a similar check for testing computers. I also don't really understand what issues such trial tests lead to in the usecases/situations mentioned above. Is the overhead of checking whether a given ssh is reachable that serious?? |
It's not the overhead, it's the fact that we should not prohibit the creation of computers/codes if you don't happen to have a connection to the HPC; you might want to create it offline, or you are creating a test/immigrant computation. |
☝️ what he said |
Oh yea sure, I didn't mean that throwing an error was the way to go, but at least try to do the check and throw a warning if it doesn't go through. It's just about trying to minimise unexpected results.
On Fri, 22 Oct 2021, 15:24 Sebastiaan Huber wrote:
> Its not the overhead, its the fact that we should not prohibit the creation of computers/codes if you don't happen to have a connection to the HPC; you might want to create it offline, or you are creating a test/immigrant computation.
> We can certainly warn people of possible issues, but not be "stubborn" about disallowing them
> ☝️ what he said
@sphuber yes, I had mistyped some of the commands. Regarding the points above, I think there is a technical aspect: I think it wasn't obvious how to trigger, from |
The problem was and still is (I believe) with
I think this is the consensus now. Once I find the time I will update the PR accordingly. |
Strange consensus. Anyhow, another note is that On the click front, I don't get why it's so ugly since it would be a single command that would try calling another function (code test) which if it fails the first command would print a warning. |
It does test the computer, it just doesn't test everything. Feel free to open a PR if you think this can and should be added to
The problem is exactly that: in |
Yea I will open an issue about the computer, I don't believe it makes a lot of sense to have a fixed mpirun command to a whole computer to start. But I assume you can take out part of the functionality, put it in a function and call that function from click, no? |
Force-pushed from 0d2578f to 1c6978e.
Title changed from "`Code`: check existence of executable for remote code when storing" to "`Code`: add `validate_remote_exec_path` method to check executable".
Force-pushed from 1c6978e to 8992de2.
Force-pushed from 8992de2 to c1e86d5.
Codecov Report
@@            Coverage Diff             @@
##           develop    #5184      +/-   ##
===========================================
+ Coverage    81.23%   81.24%   +0.01%
===========================================
  Files          533      533
  Lines        37356    37377      +21
===========================================
+ Hits         30344    30364      +20
- Misses        7012     7013       +1
Force-pushed from e991418 to a77777b.
@ramirezfranciscof @giovannipizzi any of you want to give a final review? This is ready to be merged. |
Seems fine to me! I think we can extend it (e.g. check if the file is also executable), but I think we can already merge now and we can extend/improve later!
A common problem is that the filepath of the executable for remote codes is mistyped by accident. The user often doesn't realize until they launch a calculation and it mysteriously fails with a nondescript error. They have to look into the output files to find that the executable could not be found.
At that point, it is not trivial to correct the mistake because the `Code` cannot be edited, nor can it be deleted without first deleting the calculation that was just run. Therefore, it would be nice to warn the user at the time of the code creation or storing.
However, the check requires opening a connection to the associated computer, which carries significant overhead, and the computer may not always be reachable at the time of the code creation. Setup scripts for automated environments may want to configure the computers and codes at a time when the computers cannot necessarily be reached. Therefore, preventing codes from being created in this case is not acceptable.
The compromise is to implement the check in `validate_remote_exec_path`, which can then freely be called by a user to check if the executable of the remote code is usable. The method is exposed on the CLI through the new command `verdi code test`. Here too, we decide not to add the check by default to `verdi code setup`, as that should be able to function without an internet connection and with minimal overhead. The docs are updated to encourage the user to run `verdi code test` on a code before using it in any calculations if they want to make sure it is functioning. In the future, additional checks can be added to this command.
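The split described above (setup never contacts the computer, the check is opt-in) can be illustrated with a small self-contained sketch. Everything here is hypothetical: `SketchCode`, `code_setup`, and `code_test` only mimic the roles of `Code`, `verdi code setup`, and `verdi code test`, and the "remote filesystem" is faked with a set of paths instead of a real transport.

```python
from dataclasses import dataclass, field


@dataclass
class SketchCode:
    """Hypothetical stand-in for a remote code; not the AiiDA ``Code`` class."""

    label: str
    remote_exec_path: str
    remote_files: frozenset = field(default_factory=frozenset)  # fakes the remote filesystem
    stored: bool = False

    def store(self):
        # Storing never contacts the computer, so it also works offline.
        self.stored = True
        return self

    def validate_remote_exec_path(self):
        # Explicit, opt-in check; a real implementation would open a transport here.
        if self.remote_exec_path not in self.remote_files:
            raise ValueError(f'executable `{self.remote_exec_path}` not found on the remote computer')


def code_setup(label, path, remote_files=()):
    """Mimics the setup step: create and store, with no remote check."""
    return SketchCode(label, path, frozenset(remote_files)).store()


def code_test(code):
    """Mimics the test step: run the opt-in check and report the outcome."""
    try:
        code.validate_remote_exec_path()
    except ValueError as exc:
        return f'failed: {exc}'
    return 'all tests succeeded'
```

With this shape, a mistyped path is still accepted by `code_setup`, and only surfaces when `code_test` is run, which mirrors the trade-off accepted in the commit message.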
Force-pushed from a77777b to 37b79ad.
Fixes #5179
Fixes #868
A common problem is that the filepath of the executable for remote codes is mistyped by accident. The user often doesn't realize until they launch a calculation and it mysteriously fails with a non-descript error. They have to look into the output files to find that the executable could not be found.
At that point, it is not trivial to correct the mistake because the `Code` cannot be edited nor can it be deleted, without first deleting the calculation that was just run first.
It would be nice to warn the user at the time of the code creation. Therefore a check is added to `Code._validate`, which is called automatically when the code is stored, which checks whether the specified executable exists on the remote computer. However, we need to account for the fact that for a code on a remote computer, the computer may not actually be reachable at the time of the code setup. When this is the case and opening the transport fails, instead of failing the command, a warning is logged instead, stating that the presence of the code could not be verified. This is preferred to making the command fail entirely as this would make it impossible to setup codes for computers that are not currently reachable, which is undesirable.
This is a first quick implementation and would need tests before merging, but there are first some open questions to be answered:
- Should this be done in `verdi code setup` or in the `Code` constructor?
  - I would definitely say in `verdi code setup` because it goes beyond the scope of the constructor. Downside is that it puts the burden of adding the check in other interfaces, but this is to be expected and the correct way IMO.
  - This is now added in the `Code.validate_remote_exec_path` method which is automatically called by `store()`.
- Should the implementation catch just `Exception` instead of `TransportOpenException` to account for custom transport plugins that don't raise the "right" exception? It's not nice to catch a bare `Exception` but maybe in this case it is necessary.
  - Reverted this change and just catch broad-except.
- Should we check the permissions of the file to make sure it is executable? This is something we could potentially check, but maybe not all transport types properly implement it. Maybe this is going too far, because there are plenty of other things to still check that would prevent the binary from running.
  - Won't implement this for now.
Another note: this fix still doesn't address a calculation failing if it is run with a binary that doesn't exist and having a uniform exit code. We could try this, but it would depend on parsing the `_scheduler_stderr.txt` for the error message of the missing binary and I am not sure how deterministic we can make this. At the very least this should depend on the used shell, which I believe we require to be bash, so at least that would be something. But does this also depend on anything else of the environment, for example with what MPI library it is called, e.g. `mpirun` vs `srun` etc.?
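To make the last note concrete, here is a rough sketch of what scanning the scheduler stderr for a missing binary could look like. It is only an illustration of the idea: the patterns are assumptions based on messages a shell commonly prints, the function name `looks_like_missing_executable` is hypothetical, and launcher-specific wording (the `mpirun` vs `srun` question) is deliberately not covered.

```python
import re

# Illustrative patterns only: the exact wording depends on the shell and the
# MPI launcher, which is precisely the open question raised above.
MISSING_BINARY_PATTERNS = (
    re.compile(r'command not found'),
    re.compile(r'No such file or directory'),
)


def looks_like_missing_executable(stderr_text, executable):
    """Return True if a line mentions the executable and a known 'missing' message."""
    for line in stderr_text.splitlines():
        if executable in line and any(p.search(line) for p in MISSING_BINARY_PATTERNS):
            return True
    return False
```

Requiring both the executable path and a known message on the same line keeps false positives down, but it also shows how fragile the approach is: any launcher that words the error differently would go undetected.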