Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid deadlocks when retrieving stdout/stderr via SSH #3787

Merged
merged 5 commits into from
Jun 9, 2021

Conversation

giovannipizzi
Copy link
Member

Fixes #1525

The simple shift of the line retval = channel.recv_exit_status()
below stdout.read() and stderr.read(), while partially improving
the situation, is not enough (see discussion in #1525).
This instead reads in chunks both stdout and stderr, alternating between them.
Tests (added) show that this does not hang anymore for large files.

@ltalirz
Copy link
Member

ltalirz commented Feb 21, 2020

Just one minor comment: since this is touching the ssh connection, it might be good to test this in actual production runs before merging.

@giovannipizzi
Copy link
Member Author

Agreed - this was also my comment I gave in person to @sphuber - it this breaks, this break most of the AiiDA functionality :-D

But also if someone has the willingness to think if the logic is correct, this would be already great to at least have the feeling that we know why it is supposed to work (I am quite convinced that it's ok, but I might be missing something)

ConradJohnston
ConradJohnston previously approved these changes Feb 24, 2020
Copy link
Contributor

@ConradJohnston ConradJohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic seems to check out, assuming that the Paramiko methods are honest and we find out the answer to @greschd's question.

Also, did you benchmark the speed of this approach for small stdout/stderr?

@giovannipizzi
Copy link
Member Author

Also, did you benchmark the speed of this approach for small stdout/stderr?

From a quick benchmark on small files, the time seemed to be the same (see #3787: 10kB file transfer: 0.53s [new] vs 0.53s [old])

@sphuber
Copy link
Contributor

sphuber commented Feb 28, 2020

Also, did you benchmark the speed of this approach for small stdout/stderr?

From a quick benchmark on small files, the time seemed to be the same (see #3787: 10kB file transfer: 0.53s [new] vs 0.53s [old])

What happened to the big difference we were seeing originally, where the new buffer switching method was even a lot faster than the old naive implementation. Was that a measuring error?

@giovannipizzi
Copy link
Member Author

This is still there, again, see issue (e.g. 16MB file transfer: 1.2s vs 16s)

@giovannipizzi
Copy link
Member Author

giovannipizzi commented Feb 28, 2020

Here is a code to test out if it is working for you (remember of course to check out this branch):
[EDIT - updated using the exec_command_wait_bytes new function in the most recent commits]

def get_file_md5(computer_name, remote_file_name):
    """
    Given an AiiDA computer name and a remote file name, returns the MD5 of its content as
    a string, returned
    """
    import hashlib
    from aiida import orm
    from aiida.common.escaping import escape_for_bash

    computer = orm.Computer.get(label=computer_name)
    transport = computer.get_transport()

    # this opens the SSH connection, for SSH transports
    with transport:
        retcode, stdout, stderr = transport.exec_command_wait_bytes("cat {}".format(escape_for_bash(remote_file_name)))
    assert retcode == 0, f'wrong retcode {retcode}; stderr={stderr}, stdout={stdout}'
    assert stderr == b'', f'non-zero stderr: stderr={stderr}, stdout={stdout}'
    return str(hashlib.md5(stdout).hexdigest())

if __name__ == "__main__":
    import sys

    print(get_file_md5(sys.argv[1], sys.argv[2]))

Save it into a file e.g. check_ssh.py, then here is an example result in my case:

$ verdi run check_ssh.py daint rur.out
5c17d1a5ef891343517c19fd5841efe1

$ ssh daint.cscs.ch md5sum rur.out
5c17d1a5ef891343517c19fd5841efe1  rur.out

(note that in the first case you need to pass the AiiDA computer name).

Great if you could perform some tests on some SSH remotes you have. , on large text (non-binary, see below) files.

IMPORTANT (see also comment in the function above). Currently we have the exec_command_wait return a string rather than bytes. I think this is a problem, e.g. if you cat a PDF file. See discussion in #3814 (so we don't clutter this PR with the discussion on bytes vs. strings). However note that if you test the script above, you can test on big text files, but it will probably crash on big binary files.

@sphuber
Copy link
Contributor

sphuber commented Mar 2, 2020

I have run one of my benchmark suites (558 PwBaseWorkChains) on this branch and I haven't noticed any problems out of the ordinary. There were some failed calculations, but those seem to have just been failures on the remote. Everything seems retrieved just fine on AiiDA's side as far as I have been able to check. The retrieved files may not have been particularly big so it may never have triggered the problem that this PR is supposed to fix, but in any case, this would be a relatively common use-case of high-throughput'ish use of AiiDA.

@giovannipizzi
Copy link
Member Author

ToDo: update this PR according to the comments in #3814

Also, as a comment to @sphuber test - we should stress test in particular the part of reading from stdout/stderr; most of the AiiDA submissions will actually transfer files and this part is not touched by this PR.

@giovannipizzi
Copy link
Member Author

I added you as a collaborator on my fork - once you accept it, you can just push a new commit in my branch

@ramirezfranciscof
Copy link
Member

@giovannipizzi uhm, so, who do you mean by "you"?

@giovannipizzi
Copy link
Member Author

@ramirezfranciscof ahah I had multiple tabs open and I put the comment in wrong tab/issue :-) I had been wondering where this comment had gone, I knew I had written it but couldn't find it anymore

@codecov
Copy link

codecov bot commented Feb 8, 2021

Codecov Report

Merging #3787 (406a3bc) into develop (124fece) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3787      +/-   ##
===========================================
+ Coverage    80.10%   80.11%   +0.02%     
===========================================
  Files          515      515              
  Lines        36626    36665      +39     
===========================================
+ Hits         29335    29370      +35     
- Misses        7291     7295       +4     
Flag Coverage Δ
django 74.57% <100.00%> (+0.02%) ⬆️
sqlalchemy 73.50% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/transports/plugins/local.py 81.41% <100.00%> (-0.13%) ⬇️
aiida/transports/plugins/ssh.py 80.14% <100.00%> (+1.00%) ⬆️
aiida/transports/transport.py 66.22% <100.00%> (+0.47%) ⬆️
aiida/engine/daemon/client.py 75.13% <0.00%> (-1.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 124fece...406a3bc. Read the comment docs.

@dev-zero
Copy link
Contributor

dev-zero commented Feb 8, 2021

maybe as a note for the future: for Python 3.6+ there's an alternative to Paramiko: AsyncSSH which may tie in better to the current architecture of AiiDA, and avoid some of problems encountered with Paramiko (while probably causing others, as usual).

@giovannipizzi
Copy link
Member Author

Ok, I optimised the parameters to have a reasonably good throughput (I can go to 100MB/s over SSH on the local host, a bit slower than pure SSH+md5 but still reasonably good) - see also comments and commit message in my last commit.
I also checked (by modifying a bit the script provided in the comment above) that a similar speed is obtained (and correct results are obtained) if the data is sent over the stderr rather than stdout.

Therefore, as soon as I get some feedback from @pzarabadip and @adegomme, I can convert this draft into a PR

@giovannipizzi
Copy link
Member Author

As a further piece of info, I tried transferring actual data over ssh between my workstation in the office and my laptop at home.
I turned off the wifi and connected by cable. I started the connection, then physically disconnected the cable for ~6-7s, then reconnected it.
I tried this a few times and it worked fine, at the end I could get the correct MD5 of a ~330MB file.

This below is the measured network speed; the 'flat' parts are my combined network speed of ~12MB/s; the "hole" in the middle is the moment where I physically disconnected the cable (every data point is 1s):
data

@ezpzbz
Copy link
Member

ezpzbz commented Feb 8, 2021

Hi @giovannipizzi thanks for this. It seems quite okay with me. Unfortunately, my access to metavo machines has finished and cannot test it. I'm not aware of any other user to be honest to probably face with backward compatibility. I might get access again in future and then, I'll modify the plugin to reflect these changes (of course, if it is not to be done now).

@giovannipizzi
Copy link
Member Author

giovannipizzi commented Feb 8, 2021

Thanks @pzarabadip! I suggest that in that case either you apply the fix anyway (and open an issue to say to remove the copy-pasted function at some point), or just open an issue pointing here, but add a limit on aiida-core < 1.6aiida-core < 2.0 already (with a comment on why, pointing to your issue). This to avoid that in the future, when someone will get back to the code, it will take quite some time to figure out this subtle change.

@ezpzbz
Copy link
Member

ezpzbz commented Feb 8, 2021

Thanks @giovannipizzi . I just did limit the version and opened an issue for future traceback. Cheers.

Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting work! I can probably not provide as deep technical reviews as you have already received, but I detected a couple of typos and had a couple of comments that might be of some use.

aiida/transports/plugins/local.py Outdated Show resolved Hide resolved
aiida/transports/plugins/local.py Outdated Show resolved Hide resolved
aiida/transports/plugins/ssh.py Outdated Show resolved Hide resolved
docs/source/topics/transport_template.py Show resolved Hide resolved
tests/transports/test_all_plugins.py Outdated Show resolved Hide resolved
tests/transports/test_all_plugins.py Outdated Show resolved Hide resolved
tests/transports/test_all_plugins.py Outdated Show resolved Hide resolved
tests/transports/test_all_plugins.py Outdated Show resolved Hide resolved
aiida/transports/plugins/ssh.py Outdated Show resolved Hide resolved
aiida/transports/plugins/ssh.py Outdated Show resolved Hide resolved
@adegomme
Copy link
Contributor

Hi,
Sorry about the long delay in my answer. The reason we had our own version was actually to integrate this fix early, as it was needed. As #4363 was already merged, this override should now become redundant and be removed. So I think all is needed is to make the next version of the plugin depend on aiida>1.6 and remove it altogether.
I will try to run this this afternoon and confirm it works on this branch.

@adegomme
Copy link
Contributor

I can confirm that it seems to be working fine. To avoid breaking support, I will keep the ssh.py file for now, but add a version check at import, to use the aiida_core's one on for aiida > 1.6. Thanks a lot.

adegomme added a commit to adegomme/aiida-sshonly that referenced this pull request Feb 24, 2021
@giovannipizzi giovannipizzi added this to the v2.0.0 milestone Mar 24, 2021
giovannipizzi and others added 4 commits June 4, 2021 12:32
Fixes aiidateam#1525

The simple shift of the line `retval = channel.recv_exit_status()`
below `stdout.read()` and `stderr.read()`, while partially improving
the situation, is not enough (see discussion in aiidateam#1525).
This instead reads in chunks both stdout and stderr, alternating.
Tests show that this does not hang anymore for large files.
Now, I add a exec_commmand_wait_bytes that actually does the job, and this
needs to be implemented by plugins. The two plugins implemented by AiiDA
already define instead the new function.
Also, I improved in both the API for the stdin, so that these commands
can accept also bytes ot BytesIO objects, that makes more sense in general.

I tried to keep the API (when calling them) backward-compatible, so using e.g.
a str/StringIO as stdin still works, and similarly `exec_command_wait` works
as before, but adds an optional `encoding` parameter (the default is `utf-8` to
give the same old behavior, but give more flexibility now in using different
encodings).
Tests have been added for these usecases.

However, note that the interface of plugins have changed (plugins now have to define
instead the `exec_command_wait_bytes` function instead). The change is minimal
(change the function name and avoid decoding the bytes into strings) but plugins
extending the transport functionality should take care of this (and, if they drop
the `exec_command_wait`, they should be depending on AiiDA >=1.6; otherwise they
should copy the implementation from the general Transport class if they want to remain
backward-compatible).

Also, I changed a couple of places to call directly the _bytes version, to avoid
possible issues with strange encodings on the remote computer.
Others are left in to avoid too many changes in the code, since until now noone complained
because they had a weird encoding on the supercomputer.
When such issue will arise we'll fix it, and thanks to this PR now the code is all ready
to move to treat bytes directly (or use a custom encoding, that e.g. might need to be
defined in the `verdi computer setup` or `verdi computer configure` steps, in the future).
I could get ~100MB/s when not using compression, ~33MB/s when using it
(the second case is capped by the compression speed; the time is only
marginally larger of the one I get if I run, on the command line,
`time ssh -C localhost cat /path/to/remote/file | md5sum`;
the first one is not as good as running md5 (via SSH) on the command line
(which can go ~3x faster of a 4GB file)
but it's still acceptable for most usecases.

(Note that here we're speaking of differences measurable only when
sending hundreds of MB sent over stdout, which should not be anyway
the default way of transferring a lot of data).
Co-authored-by: Francisco Ramirez <[email protected]>
@giovannipizzi
Copy link
Member Author

Thanks @ramirezfranciscof and sorry for the long time it took! If tests pass, I think this is ready to be reviewed!

@ramirezfranciscof
Copy link
Member

Great, thanks!

Thanks @ramirezfranciscof and sorry for the long time it took! If tests pass, I think this is ready to be reviewed!

@giovannipizzi you mean merged? Bah, unless you wanted someone else in particular to take a look at it.

Also there is the fact that the branch is a but outdated now. I'll try to merge develop into this, hopefully it will manage to do it automatically, but if it breaks then I think you might have to rebase or merge locally and manually fix any problem.

@giovannipizzi
Copy link
Member Author

Yeah, I meant just a final check + merge. I had rebased it, so the changes are minimal (and the merge seems to have worked).
I'm converting to a proper PR, feel free to merge!

@giovannipizzi giovannipizzi marked this pull request as ready for review June 5, 2021 21:00
Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good to go then @giovannipizzi ! I don't know if you can/want to write a more descriptive commit message for this, otherwise I can merge with the message on the OP.

@giovannipizzi giovannipizzi merged commit 68df47e into aiidateam:develop Jun 9, 2021
@giovannipizzi giovannipizzi deleted the fix_1525_hanging_ssh branch June 9, 2021 09:05
@giovannipizzi
Copy link
Member Author

giovannipizzi commented Jun 9, 2021

Commit message clarified, and added migration notes to the wiki

@pzarabadip @adegomme as discussed above, this will affect your plugins so you might want to take note of this and decide if/when to take action.
(Note that v1.6 still has the "old" syntax and this PR will be included in the upcoming AiiDA version, probably 2.0)

@ezpzbz
Copy link
Member

ezpzbz commented Jun 9, 2021

Thanks @giovannipizzi. I will adopt the plugins to reflect these changes. Cheers.

giovannipizzi added a commit to giovannipizzi/aiida-sshonly that referenced this pull request Jun 9, 2021
Now that the [corresponding PR in aiida-core](aiidateam/aiida-core#3787) has been merged, we know that 1.6 still has the old syntax, and the new changes will appear in 2.x only so I suggest to change this check
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SSH transport hangs for big stdout content
9 participants