ORTE has lost communication with a remote daemon. #6618

Closed

tingweiwu opened this issue Apr 26, 2019 · 13 comments

@tingweiwu commented Apr 26, 2019

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

 mpirun --version
mpirun.real (OpenRTE) 3.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Install Open MPI

 mkdir /tmp/openmpi && \
    cd /tmp/openmpi && \
    wget https://www.open-mpi.org/software/ompi/v3.1/downloads/openmpi-3.1.2.tar.gz && \
    tar zxf openmpi-3.1.2.tar.gz && \
    cd openmpi-3.1.2 && \
    ./configure --enable-orterun-prefix-by-default && \
    make -j $(nproc) all && \
    make install && \
    ldconfig && \
    rm -rf /tmp/openmpi

Please describe the system on which you are running

  • Operating system/version: Ubuntu 16.04
  • Computer hardware: V100 GPUs + InfiniBand
  • Network type: Docker CNI network

Details of the problem

I get this error frequently, though not every time. It occurs both while the processes are starting and while they are running.

I have checked that the network between i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd and i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0 is fine, and no OOM has occurred.

Do you have any suggestions for finding the cause?

+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 1 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 3 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 2 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 4 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 5 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 6 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 7 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7
+ shift
+ /opt/kube/kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "745603072" -mca ess_base_vpid 8 -mca ess_base_num_procs "9" -mca orte_node_regex "i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-1,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-2,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-3,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-4,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-5,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-6,i[5:39030]a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "745603072.0;tcp://192.168.237.70:59121" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
command terminated with exit code 137
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[11377,0],0] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd
Remote daemon: [[11377,0],1] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[11377,0],0] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-launcher-ksczd
Remote daemon: [[11377,0],1] on node i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
@tingweiwu (Author)

How can I get more information about this error?

@zrss commented Apr 29, 2019

I hit a similar case, but I found that my problem is related to the -np value. When I run

mpirun -np 192 hostname

it works fine, but once I increase -np to 200, the problem appears:

ORTE has lost communication with a remote daemon.

  HNP daemon   : [[2448,0],0] on node job1b-train-pub-v100nv2-ueaxr
  Remote daemon: [[2448,0],6] on node jobbf317b09-worker-13-6pg-862-gdwfm

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

I can confirm that these nodes can reach each other, since I switched the hostfile to select which nodes mpirun uses.

A sample of the MPI hostfile:

jobbf317b09-worker-31-q8d-188-d5d9p slots=8
jobbf317b09-worker-4-ob8-102-9srjx slots=8
jobbf317b09-worker-5-c9p-866-nc6pb slots=8
jobbf317b09-worker-6-aqj-472-mdglq slots=8
jobbf317b09-worker-7-95f-872-667dz slots=8
jobbf317b09-worker-8-vci-856-7lsgq slots=8
jobbf317b09-worker-9-0hj-7-m9vgz slots=8

Hope someone knows how to debug this 👽

@tingweiwu (Author)

Is it possible to configure Open MPI so that it does not terminate the job when ORTE loses communication with a remote daemon?

@tingweiwu (Author)

io time cost (batch 2640):  252.29486799240112
epoch 4/100, batch: 2620/3461, lr: 0.00048600003, loss: 7.65098391976442, duration per batch: 25062.119s, memory: 6.0GB
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[37833,0],0] on node orte-mpijob-0-launcher-6cskl
  Remote daemon: [[37833,0],2] on node orte-mpijob-0-worker-1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[orte-mpijob-0-launcher-6cskl:00009] Job UNKNOWN has launched
[orte-mpijob-0-launcher-6cskl:00009] [[37833,0],0] Releasing job data for [37833,1]
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_finalize: proc session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_finalize: job session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_finalize: jobfam session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_finalize: jobfam session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_finalize: top session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_cleanup: job session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_cleanup: top session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] [[37833,0],0] Releasing job data for [37833,0]
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_cleanup: job session dir does not exist
[orte-mpijob-0-launcher-6cskl:00009] sess_dir_cleanup: top session dir does not exist
exiting with status -51

I added --debug-devel to mpirun, but no more useful information is shown when it fails.
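
For reference, one way to pull more ORTE-level detail out of a failing run than --debug-devel alone is to raise the verbosity of the launch, out-of-band, and routing frameworks. A sketch only: the MCA verbosity parameters below are standard Open MPI framework knobs, the hostfile path comes from the launch commands above, and the application name and -np value are placeholders.

    mpirun --debug-daemons \
        --mca plm_base_verbose 10 \
        --mca oob_base_verbose 10 \
        --mca routed_base_verbose 10 \
        --hostfile /etc/mpi/hostfile -np 8 ./my_app   # ./my_app and -np 8 are placeholders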

@jsquyres (Member)

If Open MPI hasn't even finished launching yet and something goes wrong, that's a fatal error -- we don't have any other option except to abort.

Your job is relatively small -- 8 nodes. You shouldn't be bumping up against any TCP socket or file descriptor limits.

I'm assuming you're launching in Kubernetes -- is each one of these /opt/kube/kubectl launching into a unique container? If so, are these containers fully distinct from each other -- e.g., different IP address, different PID space, ...?
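
As an illustration of the kind of check being asked about, standard kubectl commands (not from this thread) can show whether each worker runs in its own pod with a distinct IP; the second command assumes the ip tool is installed inside the container image.

    # Show each pod's IP address and the node it is scheduled on
    kubectl get pods -o wide
    # Inspect the network interfaces inside one worker pod
    kubectl exec i39030a4a38d4a3abcb17e488a3141ef-mpijob-0-worker-0 -- ip addr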

@zrss commented May 27, 2019

@tingweiwu @jsquyres hi, regarding my problem (mpirun hostname failing with -np 256), I found a workaround: run mpirun with --mca plm_rsh_no_tree_spawn true, or explicitly specify the routed module (radix, binomial, debruijn, or direct).

#6691
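
For illustration, the workaround amounts to invocations roughly like these (a sketch: the -np value matches the failing case above, the hostfile path is a placeholder, and the MCA options are the ones named in the comment):

    # Disable the tree-based daemon launch
    mpirun --mca plm_rsh_no_tree_spawn true -np 256 --hostfile hostfile hostname
    # Or pin a single routed module, e.g. radix
    mpirun --mca routed radix -np 256 --hostfile hostfile hostname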

@jjhursey jjhursey self-assigned this Aug 28, 2019
@jjhursey (Member)

I'm going to pick up this issue (instead of opening a new one) as I think it is the same issue I've been chasing today.

I can reproduce this with a recent HEAD of v4.0.x (390e0bc) on a bare-metal system of 4 nodes if I force the nodemap to be communicated instead of placed on the command line.

shell$ mpirun  -npernode 2 --host c712f6n01:2,c712f6n02:2,c712f6n03:2,c712f6n05:2 \
   --mca plm_base_node_regex_threshold 1  hostname
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[15408,0],0] on node c712f6n01
  Remote daemon: [[15408,0],3] on node c712f6n05

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

The problem goes away if I do either of the following (see the example invocations after this list):
A. --mca routed binomial or --mca routed radix, but not --mca routed radix,binomial. So there is a problem with how these two components interact.
B. -mca plm_rsh_no_tree_spawn true, which points to a problem with how the daemons are launched versus routed.
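
For concreteness, applying those two mitigations to the reproducer above would look roughly like this (a sketch, not separately verified; the host list and threshold setting are the same as in the command above):

    # A. Pin a single routed component (radix shown; binomial also works per the note above)
    mpirun -npernode 2 --host c712f6n01:2,c712f6n02:2,c712f6n03:2,c712f6n05:2 \
        --mca plm_base_node_regex_threshold 1 --mca routed radix hostname

    # B. Disable tree spawn entirely
    mpirun -npernode 2 --host c712f6n01:2,c712f6n02:2,c712f6n03:2,c712f6n05:2 \
        --mca plm_base_node_regex_threshold 1 --mca plm_rsh_no_tree_spawn true hostname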

There are a few problems here to work through on the v4.0.x branch (note that this works correctly on master for reasons I'll try to highlight below).

  1. If you have odd hostnames (ones that involve hashes) you might be hitting a problem with the regx framework in the v4.0.x series. See PR #6915 (v4.0.x: regx/naive: add regx/naive component) for a backup solution.
    • On master we moved to zlib compression instead of a custom algorithm. So this problem should not occur on master.
  2. If you have long hostnames you have probably exceeded the default plm_base_node_regex_threshold (1024 characters) and so have fallen into the directly-communicated mode (instead of having the node regex placed on the command line). In my example above, I force that value to 1 to activate this code path for easier debugging (see the ompi_info sketch after this list).
    • On master we always communicate the list. So this MCA parameter doesn't exist there since there is no longer a threshold.
  3. In the v4.0.x branch the routed framework is multi-select. So both radix and binomial are active at the same time.
    • In the HNP the plm/rsh tree launch mechanism uses the routed component to structure the tree launch. It uses the highest priority component here which is radix. For the 4 node example, the tree is flat so all remote daemons are parented by the root (0).
    • The remote orteds will update their routing tree here (which is how this is related to the communication of the node regex). In doing so, the radix component first sets ORTE_PROC_MY_PARENT->vpid = 0, which is the correct value. Since the routed framework is multi-select, the binomial component is called next, setting ORTE_PROC_MY_PARENT->vpid = 1 on the vpid=3 child (since in a binomial tree 3 is a child of 1, not 0). When this child tries to send to the root in report_orted, it uses the wrong vpid for its parent and does not have contact information for orted vpid=1 (it was launched with contact information for the HNP at index 0).
    • PR #6714 (Fix tree spawn at scale) fixed part of the problem by removing the debruijn component, but it sounds like the interaction between these two components is another dimension of the same symptom.
    • On master the routed component became single select in this change. So there is no chance for a lower priority routed component (binomial = 30) to stomp over the value set by a higher priority component (radix = 70).
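
As a quick way to inspect the pieces named in items 1 and 2 on a v4.0.x install, something like the following could help (a sketch; ompi_info is not used in this thread, the threshold parameter and the regx framework exist only on the v4.0.x branch, and the exact output format may vary by version):

    # Default node-regex threshold (1024 characters)
    ompi_info --all | grep plm_base_node_regex_threshold
    # Which regx components are built
    ompi_info | grep -i regx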

Now we have a root cause:

  • The discontinuity between the routed component used for tree launch and the multi-select of the routed framework overwriting the parent vpid on the remote orted.

Repair options

  1. Make the routed components not adjust global state
    • Not really viable since breaking this assumption probably touches a bunch of code, and would be a different direction than what master took.
  2. Make the routed framework single select.
    • This more matches master
    • However, this is likely too big of a change for a release branch and might break other things.
  3. When using tree spawn, force the routed component used on the remote orteds to match the one used by the HNP.
    • I have coded this up and can confirm that it fixes the problem. I can PR tomorrow (it's 4 lines of code in plm/rsh).

@rhc54 You probably know this code the best. Do you foresee any issue with going with option (3)?

@rhc54 (Contributor) commented Aug 28, 2019

Either (2) or (3) is fine - no need for multi-select anyway.

jjhursey added a commit to jjhursey/ompi that referenced this issue Aug 29, 2019
 * Fix open-mpi#6618
   - See comments on Issue open-mpi#6618 for finer details.
 * The `plm/rsh` component uses the highest priority `routed` component
   to construct the launch tree. The remote orted's will activate all
   available `routed` components when updating routes. This allows the
   opportunity for the parent vpid on the remote `orted` to not match
   that which was expected in the tree launch. The result is that the
   remote orted tries to contact their parent with the wrong contact
   information and orted wireup will fail.
 * This fix forces the orteds to use the same `routed` component as
   the HNP used when contructing the tree, if tree launch is enabled.

Signed-off-by: Joshua Hursey <[email protected]>
@jjhursey (Member) commented Aug 29, 2019

@zrss I believe that PR #6944 (maybe with PR #6915 for your environment) should fix this issue. If you have an opportunity, can you try that patch to see whether it resolves the issue you were seeing with tree spawn?

@jjhursey (Member)

Note: I checked the v3.0.x and v3.1.x branches and they do not show this specific problem. It must be unique to the v4.0.x branch.

@rhc54 (Contributor) commented Aug 30, 2019

Yes, it is - the reason is that the v4.0 branch was cut at a point in time when we were trying to use libfabric for ORTE collectives during launch. This meant that the mgmt Ethernet OOB connections required a different routing pattern from the libfabric ones - and hence we made the routed framework multi-select. We decided not to pursue that path after the v4.0 branch and offered to remove that code from the release branch, but it was deemed too big a change.

If it were me, I'd just make routed single-select and completely resolve the problem. Nothing will break because you cannot use the RML/OFI component unless you explicitly request it. However, there is nothing wrong with this approach as it effectively makes the routed framework single-select on the remote orteds by setting the MCA param to a single component.

@gpaulsen (Member) commented Sep 9, 2019

@tingweiwu This fix was just merged to v4.0.x for eventual inclusion in v4.0.2. Could you please verify that this is fixed on the v4.0.x branch?
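
For anyone wanting to verify, a rough sketch of building the v4.0.x branch from git follows (the clone command is standard, not from this thread; the configure flag matches the original report; a git build also needs the GNU autotools, flex, etc. installed):

    git clone -b v4.0.x https://github.com/open-mpi/ompi.git
    cd ompi
    ./autogen.pl
    ./configure --enable-orterun-prefix-by-default
    make -j $(nproc) all && make install && ldconfig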

@gpaulsen (Member)

mwheinz verified this appears to be fixed in master and v4.0.x ( #7087 ).
Since this issue specifically refers to v4.0.x, I'm closing this as fixed.

markalle pushed a commit to markalle/ompi that referenced this issue Sep 12, 2020
 * Fix open-mpi#6618
   - See comments on Issue open-mpi#6618 for finer details.
 * The `plm/rsh` component uses the highest priority `routed` component
   to construct the launch tree. The remote orted's will activate all
   available `routed` components when updating routes. This allows the
   opportunity for the parent vpid on the remote `orted` to not match
   that which was expected in the tree launch. The result is that the
   remote orted tries to contact their parent with the wrong contact
   information and orted wireup will fail.
 * This fix forces the orteds to use the same `routed` component as
   the HNP used when contructing the tree, if tree launch is enabled.

Signed-off-by: Joshua Hursey <[email protected]>
(cherry picked from commit ca0f4d4d32bff55e04841dea5055147661866b83)