Hello, I have 4 AI servers, each equipped with 8 MI300X GPUs and 8 ConnectX-6 NICs (RoCEv2).
System configuration:
All CX6 NICs are connected through a single L2 Ethernet switch.
GPUDirect RDMA is being used.
ROCm 6.2.0 is installed.
I have assigned a different subnet to each NIC within a node to avoid ARP flux issues.
According to this issue (link), RCCL has PXN disabled by default. However, when I run AI training workloads with PXN disabled, RCCL initialization fails with the error: NCCL WARN Call to ibv_modify_qp failed with error. With PXN enabled, the workload runs fine.
I confirmed that when PXN is disabled, RCCL attempts to create a queue pair (QP) between two NICs that are in different subnets, for example between NIC3 of node1 and NIC7 of node2. Because the subnets of these NICs differ, the QP connection does not function properly, and RCCL initialization appears to fail as a result. I also tried configuring all NICs in the system to share one subnet, but when multiple NICs within the same node are in the same subnet, ARP flux still prevents the system from working properly.
Therefore, my question is whether this error is expected behavior and whether I must always run my workload with PXN enabled. Based on the figure below describing PXN, it seems that such cross-rail communication is indeed expected. If that is the case, is there any way to avoid the ibv_modify_qp failure when PXN is disabled?
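To make the cross-rail case above concrete, here is a minimal sketch of a rail-aligned addressing plan and a same-subnet check. The 192.168.<nic>.<node>/24 scheme and the helper names are assumptions for illustration only; the post does not give the actual addresses.

```python
import ipaddress

NICS_PER_NODE = 8  # from the post: 8 ConnectX-6 NICs per server


def nic_address(node: int, nic: int) -> ipaddress.IPv4Interface:
    """Hypothetical plan: rail `nic` uses 192.168.<nic>.0/24, host part = node."""
    return ipaddress.IPv4Interface(f"192.168.{nic}.{node}/24")


def same_subnet(a: ipaddress.IPv4Interface, b: ipaddress.IPv4Interface) -> bool:
    """True when both interfaces sit in the same IP network."""
    return a.network == b.network


# Same rail on two nodes: NIC3 of node1 and NIC3 of node2.
same_rail = same_subnet(nic_address(1, 3), nic_address(2, 3))

# Cross rail, as in the failing case: NIC3 of node1 and NIC7 of node2.
cross_rail = same_subnet(nic_address(1, 3), nic_address(2, 7))

print(same_rail, cross_rail)
```

Under this assumed plan, same-rail pairs share a subnet while the NIC3/NIC7 pair does not, so a QP between them would need a route between the two subnets to come up.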
cold2stone changed the title from "RoCE routing problem when PXN is disabled (ibv_modify_qp failed with error)" to "RCCL initialization fails when PXN is disabled (RoCEv2, ibv_modify_qp failed with error)" on Aug 19, 2024.
In NCCL environments we have to set NCCL_IB_GID_INDEX=3 for communications to work between subnets. I haven't tried with RCCL yet, but I suspect it is the same.
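As a rough sketch of the suggestion above, the variable can be set in the environment handed to the training process. The helper name `rccl_env` is hypothetical, and whether RCCL behaves the same as NCCL here is untested in this thread. The value 3 is the one quoted above; the correct index depends on each host's GID table and should be checked there.

```python
import os


def rccl_env(gid_index: int = 3, base=None) -> dict:
    """Return a copy of the environment with NCCL_IB_GID_INDEX set.

    `gid_index=3` mirrors the value from the comment above; it is an
    assumption that the same index is right for any given host.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_IB_GID_INDEX"] = str(gid_index)
    return env


# Small demo with an explicit base environment instead of os.environ.
env = rccl_env(3, base={"PATH": "/usr/bin"})
print(env["NCCL_IB_GID_INDEX"])
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the workload, so the setting applies only to that process.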