When using the UCX plugin from rccl-rdma-sharp-plugins, the program crashes with the following log:
`[root@node01 ~]# mpirun \
> -np 2\
> --oversubscribe \
> --allow-run-as-root\
> -H node01,node02 \
> -x NCCL_DEBUG=INFO \
> -x UCX_PROTO_ENABLE=n\
> -x NCCL_P2P_LEVEL=5 \
> -x NCCL_NET_GDR_LEVEL=5 \
> -x HSA_FORCE_FINE_GRAIN_PCIE=1\
> -x NCCL_PLUGIN_P2P=UCX\
> -x LD_LIBRARY_PATH=/root/hsm/rccl-rdma-sharp-plugins-master/install/lib:$LD_LIBRARY_PATH\
> /root/hsm/rccl-tests-develop/build/reduce_perf -g 1 -n 20 -b 1024 -e 512M -f 2
# nThreads: 1 nGpus: 1 nRanks: 1 minBytes: 1024 maxBytes: 536870912 step: 2(factor) warmupIters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
node01:5646:5646 [0] NCCL INFO Bootstrap : Using ens52np0:192.168.2.11<0>
node01:5646:5646 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
node01:5646:5646 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
node01:5646:5646 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
node01:5646:5646 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
node01:5646:5646 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:115 NCCL WARN NUMA auto balancing enabled which can lead to variability in the RCCL performance! Disable by "sudo sysctl kernel.numa_balancing=0"
node01:5646:5646 [0] NCCL INFO Kernel version: 4.18.0-305.3.1.el8.x86_64
node01:5646:5646 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:136 NCCL WARN Missing "iommu=pt" from kernel command line which can lead to system instablity or hang!
node01:5646:5646 [0] NCCL INFO ROCr version 1.1
node01:5646:5646 [0] NCCL INFO Dmabuf feature disabled without NCCL_ENABLE_DMABUF_SUPPORT=1
RCCL version 2.18.3+hip5.7 HEAD:b502725
node01:5646:5657 [0] NCCL INFO Plugin Path : /root/hsm/rccl-rdma-sharp-plugins-master/install/lib/librccl-net.so
node01:5646:5657 [0] NCCL INFO P2P plugin UCX
node01:5646:5657 [0] NCCL INFO NET/IB : Using [0]bnxt_re0:1/RoCE ; OOB ens52np0:192.168.2.11<0>
node01:5646:5657 [0] NCCL INFO Using network UCX
node02:10366:10366 [0] NCCL INFO ROCr version 1.1
node02:10366:10366 [0] NCCL INFO Dmabuf feature disabled without NCCL_ENABLE_DMABUF_SUPPORT=1
node02:10366:10366 [0] NCCL INFO Bootstrap : Using ens7np0:192.168.2.12<0>
node02:10366:10366 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
node02:10366:10366 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
node02:10366:10366 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
node02:10366:10366 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
node02:10366:10366 [0] NCCL INFO Kernel version: 4.18.0-305.3.1.el8.x86_64
node02:10366:10366 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:136 NCCL WARN Missing "iommu=pt" from kernel command line which can lead to system instablity or hang!
node02:10366:10376 [0] NCCL INFO Plugin Path : /root/hsm/rccl-rdma-sharp-plugins-master/install/lib/librccl-net.so
node02:10366:10376 [0] NCCL INFO P2P plugin UCX
node02:10366:10376 [0] NCCL INFO NET/IB : Using [0]bnxt_re0:1/RoCE ; OOB ens7np0:192.168.2.12<0>
node02:10366:10376 [0] NCCL INFO Using network UCX
node01:5646:5657 [0] NCCL INFO comm 0x22fcfb0 rank 0 nranks 2 cudaDev 0 busId b7000 commId 0xf503b714f0d84435 - Init START
node02:10366:10376 [0] NCCL INFO comm 0x2197130 rank 1 nranks 2 cudaDev 0 busId 7000 commId 0xf503b714f0d84435 - Init START
node02:10366:10376 [0] NCCL INFO rocm_smi_lib: version 2.8.0.0
node01:5646:5657 [0] NCCL INFO rocm_smi_lib: version 2.8.0.0
node02:10366:10376 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to SYS
node02:10366:10376 [0] NCCL INFO PXN Disabled as plugin is v4
node02:10366:10376 [0] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS
node01:5646:5657 [0] NCCL INFO NCCL_TOPO_DUMP_FILE set by environment to /root/hsm/topo/topo2.xml
node02:10366:10376 [0] NCCL INFO Setting affinity for GPU 0 to 01,00000000,00000001
node01:5646:5657 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to SYS
node01:5646:5657 [0] NCCL INFO PXN Disabled as plugin is v4
node01:5646:5657 [0] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS
node01:5646:5657 [0] NCCL INFO Channel 00/04 : 0 1
node01:5646:5657 [0] NCCL INFO Channel 01/04 : 0 1
node01:5646:5657 [0] NCCL INFO Channel 02/04 : 0 1
node01:5646:5657 [0] NCCL INFO Channel 03/04 : 0 1
node01:5646:5657 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1 comm 0x22fcfb0 nRanks 02 busId b7000
node01:5646:5657 [0] NCCL INFO P2P Chunksize set to 131072
node02:10366:10376 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1 comm 0x2197130 nRanks 02 busId 7000
node02:10366:10376 [0] NCCL INFO P2P Chunksize set to 131072
node01:5646:5657 [0] NCCL INFO Channel 00/0 : 1[7000] -> 0[b7000] [receive] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 01/0 : 1[7000] -> 0[b7000] [receive] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 02/0 : 1[7000] -> 0[b7000] [receive] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 03/0 : 1[7000] -> 0[b7000] [receive] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 00/0 : 0[b7000] -> 1[7000] [send] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 01/0 : 0[b7000] -> 1[7000] [send] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 02/0 : 0[b7000] -> 1[7000] [send] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node01:5646:5657 [0] NCCL INFO Channel 03/0 : 0[b7000] -> 1[7000] [send] via NET/UCX/0/GDRDMA comm 0x22fcfb0 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 00/0 : 0[b7000] -> 1[7000] [receive] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 01/0 : 0[b7000] -> 1[7000] [receive] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 02/0 : 0[b7000] -> 1[7000] [receive] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 03/0 : 0[b7000] -> 1[7000] [receive] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 00/0 : 1[7000] -> 0[b7000] [send] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 01/0 : 1[7000] -> 0[b7000] [send] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 02/0 : 1[7000] -> 0[b7000] [send] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
node02:10366:10376 [0] NCCL INFO Channel 03/0 : 1[7000] -> 0[b7000] [send] via NET/UCX/0/GDRDMA comm 0x2197130 nRanks 02
[1721186746.658016] [node02:10366:1] rcache.c:985 UCX ERROR failed to insert region 0x1479900ba210 [0x0..0x0]: Invalid parameter
node02:10366:10377 [0] ucx_plugin.c:498 NCCL WARN Failed: UCX error ucx_plugin.c:498 '-5' Invalid parameter
node02:10366:10377 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport/net.cc:858 -> 3
node02:10366:10377 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1311 -> 3
node02:10366:10377 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1382 -> 3
node02:10366:10377 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1524 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3
node02:10366:10376 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/misc/socket.cc:57 NCCL WARN socketProgress: Connection closed by remote peer node02<35187>
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/misc/socket.cc:791 -> 6
node02:10366:10376 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1148 NCCL WARN Socket recv failed while polling for opId=0x147999b41d80
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport/net.cc:311 -> 3
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport.cc:164 -> 3
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:1448 -> 3
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:1758 -> 3
node02:10366:10376 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:69 -> 3 [Async thread]
node02:10366:10366 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:431 -> 3
node02:10366:10366 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:116 -> 3
node02: Test NCCL failure common.cu:1158 'internal error - please report this issue to the NCCL developers / '
.. node02 pid 10366: Test failure common.cu:1000
[1721186744.314330] [node01:5646 :0] rcache.c:985 UCX ERROR failed to insert region 0x15045c0b4ad0 [0x0..0x0]: Invalid parameter
node01:5646:5659 [0] ucx_plugin.c:498 NCCL WARN Failed: UCX error ucx_plugin.c:498 '-5' Invalid parameter
node01:5646:5659 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport/net.cc:858 -> 3
node01:5646:5659 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1311 -> 3
node01:5646:5659 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1382 -> 3
node01:5646:5659 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1524 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
node01:5646:5657 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/misc/socket.cc:57 NCCL WARN socketProgress: Connection closed by remote peer node01<51339>
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/misc/socket.cc:791 -> 6
node01:5646:5657 [0] /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/proxy.cc:1148 NCCL WARN Socket recv failed while polling for opId=0x1503f5b41a28
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport/net.cc:385 -> 3
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/transport.cc:184 -> 3
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:1448 -> 3
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/init.cc:1758 -> 3
node01:5646:5657 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:69 -> 3 [Async thread]
node01:5646:5646 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:431 -> 3
node01:5646:5646 [0] NCCL INFO /data/jenkins_workspace/workspace/rccl_release/build/hipify/src/group.cc:116 -> 3
node01: Test NCCL failure common.cu:1158 'internal error - please report this issue to the NCCL developers / '
.. node01 pid 5646: Test failure common.cu:1000
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[2403,1],1]
Exit code: 3
`
It seems that there is an issue with enabling GDR when using the UCX plugin with HSA_FORCE_FINE_GRAIN_PCIE=1. With HSA_FORCE_FINE_GRAIN_PCIE=0 the test runs, but performance may decrease because GDR cannot be used. Without the UCX plugin (using IB directly), the issue does not occur. Is there a better solution?
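For anyone trying to reproduce this, the three cases described above can be laid out side by side with a small dry-run helper. This is only a sketch: `print_cmd` is a hypothetical name, it prints the commands rather than running them, and the third case assumes that "without the UCX plugin" means leaving NCCL_PLUGIN_P2P unset with HSA_FORCE_FINE_GRAIN_PCIE=1, which the report does not state explicitly.

```shell
#!/bin/sh
# Dry run: print the mpirun command for one configuration without executing it.
#   $1: value for HSA_FORCE_FINE_GRAIN_PCIE (0 or 1)
#   $2: value for NCCL_PLUGIN_P2P; an empty string means the variable is not passed
print_cmd() {
    fine_grain=$1
    p2p_plugin=$2
    # Flags copied from the command in the report. Single quotes keep
    # $LD_LIBRARY_PATH literal so it is expanded only when the command is run.
    cmd='mpirun -np 2 --oversubscribe --allow-run-as-root -H node01,node02'
    cmd="$cmd -x NCCL_DEBUG=INFO -x UCX_PROTO_ENABLE=n"
    cmd="$cmd -x NCCL_P2P_LEVEL=5 -x NCCL_NET_GDR_LEVEL=5"
    cmd="$cmd -x HSA_FORCE_FINE_GRAIN_PCIE=$fine_grain"
    if [ -n "$p2p_plugin" ]; then
        cmd="$cmd -x NCCL_PLUGIN_P2P=$p2p_plugin"
    fi
    cmd="$cmd "'-x LD_LIBRARY_PATH=/root/hsm/rccl-rdma-sharp-plugins-master/install/lib:$LD_LIBRARY_PATH'
    cmd="$cmd /root/hsm/rccl-tests-develop/build/reduce_perf -g 1 -n 20 -b 1024 -e 512M -f 2"
    printf '%s\n' "$cmd"
}

print_cmd 1 UCX   # case 1: crashes with the rcache.c "failed to insert region" error
print_cmd 0 UCX   # case 2: runs, but reportedly without GDR
print_cmd 1 ""    # case 3 (assumed): no UCX plugin, IB path, reportedly no crash
```

Running each printed command with NCCL_DEBUG=INFO and comparing the `via NET/...` lines should show which transport and GDR mode each case actually selects.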