Encountering issues while using the UCX plugin #1261

Open
clearsky07 opened this issue Jul 17, 2024 · 0 comments

When using the RCCL RDMA SHARP plugin (rccl-rdma-sharp-plugins) with UCX, the program crashes with the following log:

`[root@node01 ~]# mpirun \
>     -np 2\
>     --oversubscribe \
>     --allow-run-as-root\
>     -H node01,node02 \
>     -x NCCL_DEBUG=INFO \
>     -x UCX_PROTO_ENABLE=n\
>     -x NCCL_P2P_LEVEL=5 \
>     -x NCCL_NET_GDR_LEVEL=5 \
>     -x HSA_FORCE_FINE_GRAIN_PCIE=1\
>     -x NCCL_PLUGIN_P2P=UCX\
>     -x LD_LIBRARY_PATH=/root/hsm/rccl-rdma-sharp-plugins-master/install/lib:$LD_LIBRARY_PATH\
>     /root/hsm/rccl-tests-develop/build/reduce_perf -g 1 -n 20 -b 1024 -e 512M -f 2
# nThreads: 1 nGpus: 1 nRanks: 1 minBytes: 1024 maxBytes: 536870912 step: 2(factor) warmupIters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
node01:5646:5646 [0] NCCL INFO Bootstrap : Using ens52np0:192.168.2.11<0>
node01:5646:5646 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
node01:5646:5646 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
node01:5646:5646 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
node01:5646:5646 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).

node01:5646:5646 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:115 NCCL WARN NUMA auto balancing enabled which can lead to variability in the RCCL performance! Disable by "sudo sysctl kernel.numa_balancing=0"
node01:5646:5646 [0] NCCL INFO Kernel version: 4.18.0-305.3.1.el8.x86_64

node01:5646:5646 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:136 NCCL WARN Missing "iommu=pt" from kernel command line which can lead to system instablity or hang!
node01:5646:5646 [0] NCCL INFO ROCr version 1.1
node01:5646:5646 [0] NCCL INFO Dmabuf feature disabled without NCCL_ENABLE_DMABUF_SUPPORT=1
RCCL version 2.18.3+hip5.7 HEAD:b502725
node01:5646:5657 [0] NCCL INFO Plugin Path : /root/hsm/rccl-rdma-sharp-plugins-master/install/lib/librccl-net.so
node01:5646:5657 [0] NCCL INFO P2P plugin UCX
node01:5646:5657 [0] NCCL INFO NET/IB : Using [0]bnxt_re0:1/RoCE ; OOB ens52np0:192.168.2.11<0>
node01:5646:5657 [0] NCCL INFO Using network UCX
node02:10366:10366 [0] NCCL INFO ROCr version 1.1
node02:10366:10366 [0] NCCL INFO Dmabuf feature disabled without NCCL_ENABLE_DMABUF_SUPPORT=1
node02:10366:10366 [0] NCCL INFO Bootstrap : Using ens7np0:192.168.2.12<0>
node02:10366:10366 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
node02:10366:10366 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
node02:10366:10366 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
node02:10366:10366 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
node02:10366:10366 [0] NCCL INFO Kernel version: 4.18.0-305.3.1.el8.x86_64

node02:10366:10366 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:136 NCCL WARN Missing "iommu=pt" from kernel command line which can lead to system instablity or hang!
node02:10366:10376 [0] NCCL INFO Plugin Path : /root/hsm/rccl-rdma-sharp-plugins-master/install/lib/librccl-net.so
node02:10366:10376 [0] NCCL INFO P2P plugin UCX
node02:10366:10376 [0] NCCL INFO NET/IB : Using [0]bnxt_re0:1/RoCE ; OOB ens7np0:192.168.2.12<0>
node02:10366:10376 [0] NCCL INFO Using network UCX
node01:5646:5657 [0] NCCL INFO comm 0x22fcfb0 rank 0 nranks 2 cudaDev 0 busId b7000 commId 0xf503b714f0d84435 - Init START
node02:10366:10376 [0] NCCL INFO comm 0x2197130 rank 1 nranks 2 cudaDev 0 busId 7000 commId 0xf503b714f0d84435 - Init START
node02:10366:10376 [0] NCCL INFO rocm_smi_lib: version 2.8.0.0
node01:5646:5657 [0] NCCL INFO rocm_smi_lib: version 2.8.0.0
node02:10366:10376 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to SYS
node02:10366:10376 [0] NCCL INFO PXN Disabled as plugin is v4
node02:10366:10376 [0] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS
node01:5646:5657 [0] NCCL INFO NCCL_TOPO_DUMP_FILE set by environment to /root/hsm/topo/topo2.xml
node02:10366:10376 [0] NCCL INFO Setting affinity for GPU 0 to 01,00000000,00000001
node01:5646:5657 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to SYS
node01:5646:5657 [0] NCCL INFO PXN Disabled as plugin is v4
node01:5646:5657 [0] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS
node01:5646:5657 [0] NCCL INFO Channel 00/04 :    0   1
node01:5646:5657 [0] NCCL INFO Channel 01/04 :    0   1
node01:5646:5657 [0] NCCL INFO Channel 02/04 :    0   1
node01:5646:5657 [0] NCCL INFO Channel 03/04 :    0   1
node01:5646:5657 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1 comm 0x22fcfb0 nRanks 02 busId b7000
node01:5646:5657 [0] NCCL INFO P2P Chunksize set to 131072
node02:10366:10376 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1 comm 0x2197130 nRanks 02 busId 7000
node02:10366:10376 [0] NCCL INFO P2P Chunksize set to 131072
node01:5646:5657 [0] NCCL INFO Channel 00/0 : 1[7000] -> 0[b7000] [receive] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 01/0 : 1[7000] -> 0[b7000] [receive] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 02/0 : 1[7000] -> 0[b7000] [receive] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 03/0 : 1[7000] -> 0[b7000] [receive] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 00/0 : 0[b7000] -> 1[7000] [send] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 01/0 : 0[b7000] -> 1[7000] [send] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 02/0 : 0[b7000] -> 1[7000] [send] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 03/0 : 0[b7000] -> 1[7000] [send] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 00/0 : 0[b7000] -> 1[7000] [receive] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 01/0 : 0[b7000] -> 1[7000] [receive] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 02/0 : 0[b7000] -> 1[7000] [receive] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 03/0 : 0[b7000] -> 1[7000] [receive] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 00/0 : 1[7000] -> 0[b7000] [send] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 01/0 : 1[7000] -> 0[b7000] [send] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 02/0 : 1[7000] -> 0[b7000] [send] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 03/0 : 1[7000] -> 0[b7000] [send] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
[1721186746.658016] [node02:10366:1]          rcache.c:985  UCX  ERROR failed to insert region 0x1479900ba210 [0x0..0x0]: Invalid parameter

node02:10366:10377 [0] ucx_plugin.c:498 NCCL WARN Failed: UCX error ucx_plugin.c:498 '-5' Invalid parameter

node02:10366:10377 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport/net.cc:858 -> 3
node02:10366:10377 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1311 -> 3
node02:10366:10377 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1382 -> 3

node02:10366:10377 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1524 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3

node02:10366:10376 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/misc/socket.cc:57 NCCL WARN socketProgress: Connection closed by remote peer node02<35187>
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/misc/socket.cc:791 -> 6

node02:10366:10376 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1148 NCCL WARN Socket recv failed while polling for opId=0x147999b41d80
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport/net.cc:311 -> 3
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport.cc:164 -> 3
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:1448 -> 3
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:1758 -> 3
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:69 -> 3 [Async thread]
node02:10366:10366 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:431 -> 3
node02:10366:10366 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:116 -> 3
node02: Test NCCL failure common.cu:1158 'internal error - please report this issue to the NCCL developers / '
 .. node02 pid 10366: Test failure common.cu:1000
[1721186744.314330] [node01:5646 :0]          rcache.c:985  UCX  ERROR failed to insert region 0x15045c0b4ad0 [0x0..0x0]: Invalid parameter

node01:5646:5659 [0] ucx_plugin.c:498 NCCL WARN Failed: UCX error ucx_plugin.c:498 '-5' Invalid parameter

node01:5646:5659 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport/net.cc:858 -> 3
node01:5646:5659 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1311 -> 3
node01:5646:5659 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1382 -> 3

node01:5646:5659 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1524 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

node01:5646:5657 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/misc/socket.cc:57 NCCL WARN socketProgress: Connection closed by remote peer node01<51339>
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/misc/socket.cc:791 -> 6

node01:5646:5657 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1148 NCCL WARN Socket recv failed while polling for opId=0x1503f5b41a28
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport/net.cc:385 -> 3
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport.cc:184 -> 3
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:1448 -> 3
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:1758 -> 3
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:69 -> 3 [Async thread]
node01:5646:5646 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:431 -> 3
node01:5646:5646 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:116 -> 3
node01: Test NCCL failure common.cu:1158 'internal error - please report this issue to the NCCL developers / '
 .. node01 pid 5646: Test failure common.cu:1000
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[2403,1],1]
  Exit code:    3
`
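
For anyone skimming the log above, the root failures can be separated from the follow-on socket and proxy errors with a small filter. This is only a sketch: `run.log` is a hypothetical file holding the saved mpirun output, and the pattern is an assumption based on the two message shapes seen in this log (`UCX  ERROR` from rcache.c and `NCCL WARN Failed` from ucx_plugin.c).

```shell
#!/usr/bin/env bash
# Sketch: print only the UCX-level errors and the plugin warning that follows them.
# The pattern is derived from the messages in this report, not from any documented format.
first_errors() {
  # $1: path to a saved copy of the mpirun output (hypothetical name, e.g. run.log)
  grep -E 'UCX +ERROR|NCCL WARN Failed' "$1"
}

# usage: first_errors run.log
```

On the log in this report, the filter should leave the `failed to insert region ... [0x0..0x0]: Invalid parameter` lines and the `ucx_plugin.c:498` warnings for each node.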

It seems that the problem is with enabling GDR when the UCX plugin is used with HSA_FORCE_FINE_GRAIN_PCIE=1. With HSA_FORCE_FINE_GRAIN_PCIE=0 the test runs, but performance may drop because GDR cannot be used. Without the UCX plugin (using IB instead), the problem does not occur. Is there a better solution?
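
The three configurations compared above can be written out as a dry-run sketch that prints the mpirun command for each case without executing anything. The paths, hosts and flags are copied from the command in this report. `build_cmd` is a hypothetical helper, and setting `NCCL_PLUGIN_P2P=IB` for the non-UCX case is an assumption about how the IB path was selected.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints (does not execute) one mpirun command per configuration.
# Paths are the ones from this report; adjust them for another system.
PLUGIN_LIB=/root/hsm/rccl-rdma-sharp-plugins-master/install/lib
TEST_BIN=/root/hsm/rccl-tests-develop/build/reduce_perf

# build_cmd <NCCL_PLUGIN_P2P value> <HSA_FORCE_FINE_GRAIN_PCIE value>
build_cmd() {
  local plugin=$1 fine_grain=$2
  # printf repeats the '%s ' format for every argument, giving one space-separated line.
  printf '%s ' mpirun -np 2 --oversubscribe --allow-run-as-root \
    -H node01,node02 \
    -x NCCL_DEBUG=INFO \
    -x UCX_PROTO_ENABLE=n \
    -x NCCL_P2P_LEVEL=5 \
    -x NCCL_NET_GDR_LEVEL=5 \
    -x "HSA_FORCE_FINE_GRAIN_PCIE=${fine_grain}" \
    -x "NCCL_PLUGIN_P2P=${plugin}" \
    -x "LD_LIBRARY_PATH=${PLUGIN_LIB}:\$LD_LIBRARY_PATH" \
    "${TEST_BIN}" -g 1 -n 20 -b 1024 -e 512M -f 2
  printf '\n'
}

build_cmd UCX 1   # the failing case in this report
build_cmd UCX 0   # runs, but reportedly without GDR
build_cmd IB 1    # assumption: IB selected via NCCL_PLUGIN_P2P; reported to work
```

Comparing the `NET/...` channel lines in the NCCL_DEBUG=INFO output of each run shows which transport was used and whether GDRDMA was active.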
