ORTE has lost communication with a remote daemon. #6618

Closed

tingweiwu opened this issue Apr 26, 2019 · 13 comments

@tingweiwu commented Apr 26, 2019

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

 mpirun --version
mpirun.real (OpenRTE) 3.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Install Open MPI

 mkdir /tmp/openmpi && \
    cd /tmp/openmpi && \
    wget https://www.open-mpi.org/software/ompi/v3.1/downloads/openmpi-3.1.2.tar.gz && \
    tar zxf openmpi-3.1.2.tar.gz && \
    cd openmpi-3.1.2 && \
    ./configure --enable-orterun-prefix-by-default && \
    make -j $(nproc) all && \
    make install && \
    ldconfig && \
    rm -rf /tmp/openmpi

Please describe the system on which you are running

  • Operating system/version: Ubuntu 16.04
  • Computer hardware: V100 GPUs + InfiniBand
  • Network type: Docker CNI network

Details of the problem

I get this error frequently, though not every time. It occurs both while the processes are starting and while they are running.

I have checked that the network between i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd and i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0 is fine, and no OOM has occurred.

Do you have any suggestions for finding the cause?

+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 1 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 3 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 2 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 4 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 5 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 6 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 7 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 8 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
command terminated with exit code 137
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[11377,0],0] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd
Remote daemon: [[11377,0],1] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[11377,0],0] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd
Remote daemon: [[11377,0],1] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
@tingweiwu (Author)

How can I get more information about this error?

@zrss commented Apr 29, 2019

I hit a similar case, but I found that my problem is related to the -np value. When I run

mpirun -np 192 hostname

it works fine, but once I increase -np to 200, the problem appears:

ORTE has lost communication with a remote daemon.

  HNP daemon   : [[2448,0],0] on node job1b-train-pub-v100nv2-ueaxr
  Remote daemon: [[2448,0],6] on node jobbf317b09-worker-13-6pg-862-gdwfm

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

I can confirm that these nodes can reach each other, since I switched the hostfile to select which nodes mpirun uses.

A sample of the MPI hostfile:

jobbf317b09-worker-31-q8d-188-d5d9p slots=8
jobbf317b09-worker-4-ob8-102-9srjx slots=8
jobbf317b09-worker-5-c9p-866-nc6pb slots=8
jobbf317b09-worker-6-aqj-472-mdglq slots=8
jobbf317b09-worker-7-95f-872-667dz slots=8
jobbf317b09-worker-8-vci-856-7lsgq slots=8
jobbf317b09-worker-9-0hj-7-m9vgz slots=8

Hope someone knows how to debug this 👽

@tingweiwu (Author)

Is it possible to configure Open MPI so that it does not terminate the job when ORTE loses communication with a remote daemon?

@tingweiwu (Author)

io time cost (batch 2640):  252.29486799240112
epoch 4/100, batch: 2620/3461, lr: 0.00048600003, loss: 7.65098391976442, duration per batch: 25062.119s, memory: 6.0GB
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[37833,0],0] on node orte-mpijob-0-launcher-6cskl
  Remote daemon: [[37833,0],2] on node orte-mpijob-0-worker-1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[orte-mpijob-0-launcher-6cskl:00009] Job UNKNOWN has launched
[orte-mpijob-0-launcher-6cskl:00009] [[37833,0],0] Releasing job data for [37833,1]
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_finalize: proc session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_finalize: job session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_finalize: jobfam session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_finalize: jobfam session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_finalize: top session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_cleanup: job session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_cleanup: top session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] [[37833,0],0] Releasing job data for [37833,0]
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_cleanup: job session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_cleanup: top session dir does not exist
exiting with status -51

I added --debug-devel to mpirun, but no more useful information is shown when it fails.
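
For reference, one way to pull more ORTE-level detail out of a failing run than --debug-devel alone is to raise the verbosity of the launch, out-of-band, and routing frameworks. A sketch only: the MCA verbosity parameters below are standard Open MPI framework knobs, the hostfile path comes from the launch commands above, and the application name and -np value are placeholders.

    mpirun --debug-daemons \
        --mca plm_base_verbose 10 \
        --mca oob_base_verbose 10 \
        --mca routed_base_verbose 10 \
        --hostfile /etc/mpi/hostfile -np 8 ./my_app   # ./my_app and -np 8 are placeholders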

@jsquyres (Member)

If Open MPI hasn't even finished launching yet and something goes wrong, that's a fatal error -- we don't have any other option except to abort.

Your job is relatively small -- 8 nodes. You shouldn't be bumping up against any TCP socket or file descriptor limits.

I'm assuming you're launching in Kubernetes -- is each one of these /opt/kube/kubectl launching into a unique container? If so, are these containers fully distinct from each other -- e.g., different IP address, different PID space, ...?
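
As an illustration of the kind of check being asked about, standard kubectl commands (not from this thread) can show whether each worker runs in its own pod with a distinct IP; the second command assumes the ip tool is installed inside the container image.

    # Show each pod's IP address and the node it is scheduled on
    kubectl get pods -o wide
    # Inspect the network interfaces inside one worker pod
    kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0 -- ip addr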

@zrss commented May 27, 2019

@tingweiwu @jsquyres hi, regarding my problem (mpirun hostname failing with -np 256), I found a workaround: run mpirun with --mca plm_rsh_no_tree_spawn true, or explicitly specify the routed module (radix, binomial, debruijn, or direct).

#6691
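
For illustration, the workaround amounts to invocations roughly like these (a sketch: the -np value matches the failing case above, the hostfile path is a placeholder, and the MCA options are the ones named in the comment):

    # Disable the tree-based daemon launch
    mpirun --mca plm_rsh_no_tree_spawn true -np 256 --hostfile hostfile hostname
    # Or pin a single routed module, e.g. radix
    mpirun --mca routed radix -np 256 --hostfile hostfile hostname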

@jjhursey jjhursey self-assigned this Aug 28, 2019
@jjhursey (Member)

I'm going to pick up this issue (instead of opening a new one) as I think it is the same issue I've been chasing today.

I can reproduce this with a recent HEAD of v4.0.x (390e0bc) on a bare-metal system of 4 nodes if I force the nodemap to be communicated instead of placed on the command line.

shell$ mpirun  -npernode 2 --host c712f6n01:2,c712f6n02:2,c712f6n03:2,c712f6n05:2 \
   --mca plm_base_node_regex_threshold 1  hostname
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[15408,0],0] on node c712f6n01
  Remote daemon: [[15408,0],3] on node c712f6n05

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

The problem goes away if I do either of the following (see the example invocations after this list):
A. --mca routed binomial or --mca routed radix, but not --mca routed radix,binomial. So there is a problem with how these two components interact.
B. -mca plm_rsh_no_tree_spawn true, which points to a problem with how the daemons are launched versus routed.
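
For concreteness, applying those two mitigations to the reproducer above would look roughly like this (a sketch, not separately verified; the host list and threshold setting are the same as in the command above):

    # A. Pin a single routed component (radix shown; binomial also works per the note above)
    mpirun -npernode 2 --host c712f6n01:2,c712f6n02:2,c712f6n03:2,c712f6n05:2 \
        --mca plm_base_node_regex_threshold 1 --mca routed radix hostname

    # B. Disable tree spawn entirely
    mpirun -npernode 2 --host c712f6n01:2,c712f6n02:2,c712f6n03:2,c712f6n05:2 \
        --mca plm_base_node_regex_threshold 1 --mca plm_rsh_no_tree_spawn true hostname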

There are a few problems here to work through on the v4.0.x branch (note that this works correctly on master for reasons I'll try to highlight below).

  1. If you have odd hostnames (ones that involve hashes) you might be hitting a problem with the regx framework in the v4.0.x series. See PR #6915 (v4.0.x: regx/naive: add regx/naive component) for a backup solution.
    • On master we moved to zlib compression instead of a custom algorithm. So this problem should not occur on master.
  2. If you have long hostnames you have probably exceeded the default plm_base_node_regex_threshold (1024 characters) and so have fallen into the directly-communicated mode (instead of having the node regex placed on the command line). In my example above, I force that value to 1 to activate this code path for easier debugging (see the ompi_info sketch after this list).
    • On master we always communicate the list. So this MCA parameter doesn't exist there since there is no longer a threshold.
  3. In the v4.0.x branch the routed framework is multi-select. So both radix and binomial are active at the same time.
    • In the HNP the plm/rsh tree launch mechanism uses the routed component to structure the tree launch. It uses the highest priority component here which is radix. For the 4 node example, the tree is flat so all remote daemons are parented by the root (0).
    • The remote orteds will update their routing tree here (which is how this is related to the communication of the node regex). In doing so, the radix component first sets ORTE_PROC_MY_PARENT->vpid = 0, which is the correct value. Since the routed framework is multi-select, the binomial component is called next, setting ORTE_PROC_MY_PARENT->vpid = 1 on the vpid=3 child (since in a binomial tree 3 is a child of 1, not 0). When this child tries to send to the root in report_orted, it uses the wrong vpid for its parent and does not have contact information for orted vpid=1 (it was launched with contact information for the HNP at index 0).
    • PR #6714 (Fix tree spawn at scale) fixed part of the problem by removing the debruijn component, but it sounds like the interaction between these two components is another dimension of the same symptom.
    • On master the routed component became single select in this change. So there is no chance for a lower priority routed component (binomial = 30) to stomp over the value set by a higher priority component (radix = 70).
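
As a quick way to inspect the pieces named in items 1 and 2 on a v4.0.x install, something like the following could help (a sketch; ompi_info is not used in this thread, the threshold parameter and the regx framework exist only on the v4.0.x branch, and the exact output format may vary by version):

    # Default node-regex threshold (1024 characters)
    ompi_info --all | grep plm_base_node_regex_threshold
    # Which regx components are built
    ompi_info | grep -i regx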

Now we have a root cause:

  • The discontinuity between the routed component used for tree launch and the multi-select of the routed framework overwriting the parent vpid on the remote orted.

Repair options

  1. Make the routed components not adjust global state
    • Not really viable since breaking this assumption probably touches a bunch of code, and would be a different direction than what master took.
  2. Make the routed framework single select.
    • This more matches master
    • However, this is likely too big of a change for a release branch and might break other things.
  3. When using tree spawn, force the routed component used on the remote orteds to match the one used by the HNP.
    • I have coded this up and can confirm that it fixes the problem. I can PR tomorrow (it's 4 lines of code in plm/rsh).

@rhc54 You probably know this code the best. Do you foresee any issue with going with option (3)?

@rhc54 (Contributor) commented Aug 28, 2019

Either (2) or (3) is fine - no need for multi-select anyway.

jjhursey added a commit to jjhursey/ompi that referenced this issue Aug 29, 2019
 * Fix open-mpi#6618
   - See comments on Issue open-mpi#6618 for finer details.
 * The `plm/rsh` component uses the highest priority `routed` component
   to construct the launch tree. The remote orted's will activate all
   available `routed` components when updating routes. This allows the
   opportunity for the parent vpid on the remote `orted` to not match
   that which was expected in the tree launch. The result is that the
   remote orted tries to contact their parent with the wrong contact
   information and orted wireup will fail.
 * This fix forces the orteds to use the same `routed` component as
   the HNP used when contructing the tree, if tree launch is enabled.

Signed-off-by: Joshua Hursey <[email protected]>
@jjhursey (Member) commented Aug 29, 2019

@zrss I believe that PR #6944 (maybe with PR #6915 for your environment) should fix this issue. If you have an opportunity, can you try that patch to see whether it resolves the issue you were seeing with tree spawn?

@jjhursey (Member)

Note: I checked the v3.0.x and v3.1.x branches and they do not show this specific problem. It must be unique to the v4.0.x branch.

@rhc54 (Contributor) commented Aug 30, 2019

Yes, it is - the reason is that the v4.0 branch was cut at a point in time when we were trying to use libfabric for ORTE collectives during launch. This meant that the mgmt Ethernet OOB connections required a different routing pattern from the libfabric ones - and hence we made the routed framework multi-select. We decided not to pursue that path after the v4.0 branch and offered to remove that code from the release branch, but it was deemed too big a change.

If it were me, I'd just make routed single-select and completely resolve the problem. Nothing will break because you cannot use the RML/OFI component unless you explicitly request it. However, there is nothing wrong with this approach as it effectively makes the routed framework single-select on the remote orteds by setting the MCA param to a single component.

@gpaulsen (Member) commented Sep 9, 2019

@tingweiwu This fix was just merged to v4.0.x for eventual inclusion in v4.0.2. Could you please verify that this is fixed on the v4.0.x branch?
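
For anyone wanting to verify, a rough sketch of building the v4.0.x branch from git follows (the clone command is standard, not from this thread; the configure flag matches the original report; a git build also needs the GNU autotools, flex, etc. installed):

    git clone -b v4.0.x https://github.com/open-mpi/ompi.git
    cd ompi
    ./autogen.pl
    ./configure --enable-orterun-prefix-by-default
    make -j $(nproc) all && make install && ldconfig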

@gpaulsen (Member)

mwheinz verified this appears to be fixed in master and v4.0.x ( #7087 ).
Since this issue specifically refers to v4.0.x, I'm closing this as fixed.

markalle pushed a commit to markalle/ompi that referenced this issue Sep 12, 2020
 * Fix open-mpi#6618
   - See comments on Issue open-mpi#6618 for finer details.
 * The `plm/rsh` component uses the highest priority `routed` component
   to construct the launch tree. The remote orted's will activate all
   available `routed` components when updating routes. This allows the
   opportunity for the parent vpid on the remote `orted` to not match
   that which was expected in the tree launch. The result is that the
   remote orted tries to contact their parent with the wrong contact
   information and orted wireup will fail.
 * This fix forces the orteds to use the same `routed` component as
   the HNP used when contructing the tree, if tree launch is enabled.

Signed-off-by: Joshua Hursey <[email protected]>
(cherry picked from commit ca0f4d4d32bff55e04841dea5055147661866b83)