SSH launch fails when host file has more than 64 hosts #6198

Open
bwbarrett opened this issue Dec 14, 2018 · 17 comments

@bwbarrett
Member

We're seeing launch failures when the host file has more than 64 hosts, which are resolved by the --mca routed direct MCA parameter. Platform was x86_64 Linux in EC2. Each instance has 2 cores (4 hyperthreads). The hostfile looked like the list below; an example launch command is sketched after it:

172.31.16.122
172.31.16.222
172.31.16.67
172.31.16.80
172.31.17.114
172.31.17.135
172.31.17.173
172.31.17.178
172.31.17.181
172.31.17.235
172.31.17.244
172.31.17.254
172.31.17.26
172.31.17.7
172.31.18.106
172.31.18.143
172.31.18.187
172.31.18.28
172.31.18.36
172.31.18.82
172.31.19.153
172.31.19.31
172.31.19.64
172.31.19.99
172.31.20.109
172.31.20.139
172.31.20.45
172.31.20.48
172.31.20.54
172.31.20.92
172.31.21.198
172.31.21.247
172.31.21.35
172.31.21.49
172.31.22.105
172.31.22.187
172.31.22.233
172.31.22.96
172.31.22.97
172.31.23.139
172.31.23.15
172.31.23.17
172.31.23.176
172.31.23.18
172.31.23.197
172.31.23.226
172.31.23.240
172.31.23.46
172.31.23.59
172.31.24.106
172.31.24.125
172.31.24.134
172.31.24.153
172.31.24.159
172.31.24.190
172.31.24.59
172.31.24.64
172.31.25.105
172.31.25.147
172.31.25.204
172.31.25.205
172.31.25.74
172.31.26.126
172.31.26.146
172.31.26.232
172.31.26.254
172.31.26.65
172.31.26.69
172.31.27.129
172.31.27.148
172.31.27.184
172.31.27.198
172.31.27.234
172.31.27.28
172.31.27.35
172.31.28.13
172.31.28.22
172.31.28.221
172.31.28.30
172.31.28.38
172.31.28.75
172.31.29.20
172.31.29.232
172.31.29.40
172.31.29.42
172.31.29.46
172.31.29.63
172.31.29.78
172.31.30.21
172.31.30.245
172.31.30.31
172.31.30.48
172.31.30.82
172.31.31.126
172.31.31.159
172.31.31.82
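
A minimal sketch of the launch and the workaround described above (the hostfile path, process count, and binary name are placeholders, not taken from this report):

    # Fails once the hostfile grows past 64 hosts
    mpirun --hostfile ./hosts.txt -np 96 ./a.out

    # Workaround: send daemon traffic directly instead of through the routing tree
    mpirun --mca routed direct --hostfile ./hosts.txt -np 96 ./a.out
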
@rhc54
Contributor

rhc54 commented Dec 14, 2018

Something has borked the routed setup as the default radix is 64. Either we aren't computing the routes or the table is wrong.
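
For context, one way to confirm the default radix on a given install (assuming ompi_info is on the PATH; output wording varies by release):

    # List the parameters of the routed/radix component, including the routed_radix default of 64
    ompi_info --param routed radix --level 9
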

@rhc54
Contributor

rhc54 commented Dec 14, 2018

BTW: easiest way to test with only a couple of nodes is to add --mca routed_radix 1 to your cmd line - this basically creates a linear "tree".
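
A concrete sketch of that test (hostfile path, process count, and binary name are made up):

    # Force a radix of 1 so even a two- or three-node run exercises the multi-level routing path
    mpirun --mca routed_radix 1 --hostfile ./hosts.txt -np 8 ./a.out
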

@bwbarrett
Member Author

Yeah :(. "Humorously", I still have your email from 12/19/17 with instructions on configuring routed so we catch these issues in MTT / CI. Guess I should have acted on that.

@rhc54
Contributor

rhc54 commented Jan 15, 2019

Any progress on this? I should think it a blocker for the branches.

@gpaulsen
Member

@rhc54 mentioned on the call today that he may have a fix for this.

@mkre

mkre commented Jul 3, 2019

We're also running into this problem on AWS, but luckily there is an easy workaround (-mca routed direct). I've got two questions, though:

  1. We don't see this issue on an Infiniband cluster. Is it possible that this only affects Ethernet clusters?
  2. Are there any negative consequences (such as higher startup times) to be expected from the workaround? We are trying to find out under which circumstances we should employ it.

@jsquyres
Member

jsquyres commented Jul 3, 2019

@rhc54 is pretty sure that he fixed this on master.

@mkre @bwbarrett @dfaraj can you try a nightly snapshot from master and see if the problem is resolved? See https://www.open-mpi.org/nightly/master/

According to #6786 (comment), it looks like it is still broken on the v4.0 branch as of 29 June 2019. If it is, indeed, fixed on master, @rhc54 graciously said he'd try to track down a list of commits that fixed the issue for us so that we can port them to the v4.0.x branch.

@mkre

mkre commented Jul 4, 2019

@jsquyres, we'll test this and report back, but it may take us a couple of days.

@mkre

mkre commented Jul 8, 2019

Does anyone have an idea under which circumstances this issue appears? As I said, so far we haven't seen this issue on one of our InfiniBand clusters, only on AWS. Could it be the case that Open MPI takes a different code path on those systems, or are we just lucky with the IB system?

@mkre

mkre commented Jul 9, 2019

@mkre @bwbarrett @dfaraj can you try a nightly snapshot from master and see if the problem is resolved? See https://www.open-mpi.org/nightly/master/

@jsquyres, we have tested this and I can confirm that the hang is resolved with the nightly snapshot.

@jjhursey
Member

I posted a fix to the plm/rsh component that resolves a mismatch between the tree spawn and the remote routed component (see Issue #6618 for details). PR #6944 fixes the issue for the v4.0.x branch. Can you give that a try to see if it resolves this issue? I think it might.
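
As a related diagnostic (an assumption on my part, not part of the fix itself): if the mismatch is between tree spawn and the remote routed component, disabling tree spawn via the plm_rsh_no_tree_spawn MCA parameter should sidestep it and can help confirm the diagnosis:

    # Launch every daemon directly from mpirun instead of via a tree of ssh hops
    mpirun --mca plm_rsh_no_tree_spawn 1 --hostfile ./hosts.txt -np 96 ./a.out
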

@mkre

mkre commented Dec 16, 2019

@jjhursey, sorry for the late answer. I can confirm that the issue is fixed in Open MPI 4.0.2, but still persists on 3.1.5.

@jjhursey
Member

Looks like the 3.1.5 issue is reported in Issue #7087 as well.

@gpaulsen
Member

Removing Target: Master and Target: v4.0.x labels, as this issue is now fixed in those branches.

@gpaulsen
Member

FYI @mwheinz may also be interested in this fix on v3.1.x

@mwheinz

mwheinz commented Feb 28, 2020

Do we know what change fixed this in the 4.0.x branch? If we knew that I could try to back-port it myself...

@rhc54
Contributor

rhc54 commented Feb 28, 2020

PR #6944 fixes the issue for the v4.0.x branch.

as stated above.
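
For the back-port question above, a rough sketch of pulling that PR onto a v3.1.x checkout (assumes the remote is named origin and points at the GitHub repo; the exact commit list has to come from PR #6944 itself):

    # Fetch the PR head via GitHub's pull refs and inspect its commits
    git fetch origin pull/6944/head:pr-6944
    git log --oneline origin/v3.1.x..pr-6944

    # Cherry-pick the relevant commits onto a local v3.1.x branch
    git checkout -b backport-6944 origin/v3.1.x
    git cherry-pick <commit>
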
