RCCL initialization fails when PXN is disabled (RoCEv2, ibv_modify_qp failed with error) #1303

Open
cold2stone opened this issue Aug 19, 2024 · 1 comment

cold2stone commented Aug 19, 2024

Hello, I have 4 AI servers, each equipped with 8 MI300X GPUs and 8 ConnectX-6 NICs (RoCEv2).

System configuration

  • All CX6 NICs are connected through a single L2 Ethernet switch.
  • GPUDirect RDMA is being used.
  • ROCm 6.2.0 is installed.
  • I have assigned different subnets to different NICs within a node to avoid ARP flux issues.

image

According to this issue (link), RCCL has PXN disabled by default. However, when I run AI training workloads with PXN disabled, RCCL initialization fails with the error: NCCL WARN Call to ibv_modify_qp failed with error. When I enable PXN, the workload runs fine.

I confirmed that when PXN is disabled, RCCL attempts to create a queue pair (QP) between two NICs on different subnets, for example between NIC3 of node1 and NIC7 of node2. Because these NICs are on different subnets, the QP connection does not work, and RCCL initialization appears to fail as a result. I also tried putting all NICs in the system on the same subnet, but when multiple NICs within one node share a subnet, ARP flux still prevents the system from working properly.
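To make the cross-rail case concrete, here is a minimal sketch of the subnet mismatch described above. The addressing plan (192.168.<nic>.<node>/24) and the helper names are assumptions for illustration only; the actual addresses used on these servers are not given in this issue.

```python
import ipaddress

def nic_interface(node: int, nic: int):
    """Hypothetical plan: NIC index `nic` on every node lives in 192.168.<nic>.0/24."""
    return ipaddress.ip_interface(f"192.168.{nic}.{node}/24")

def same_subnet(a, b) -> bool:
    """True if both interfaces belong to the same IP network."""
    return a.network == b.network

n1_nic3 = nic_interface(node=1, nic=3)
n2_nic3 = nic_interface(node=2, nic=3)
n2_nic7 = nic_interface(node=2, nic=7)

# Same rail (NIC3 <-> NIC3): both ends share a subnet.
print("same rail: ", same_subnet(n1_nic3, n2_nic3))
# Cross rail (NIC3 <-> NIC7): different subnets, so the peers are not
# directly reachable over L2 and traffic would have to be routed.
print("cross rail:", same_subnet(n1_nic3, n2_nic7))
```

Under this assumed plan, only same-rail pairs share a subnet, which matches the observation that a QP between NIC3 of node1 and NIC7 of node2 cannot be brought up.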

Therefore, my question is whether this error is expected behavior and whether I must always run my workload with PXN enabled. Based on the figure below describing PXN, it seems that such cross-rail communication is indeed expected. If that's the case, is there any way to avoid the ibv_modify_qp error when PXN is disabled?

image

cold2stone changed the title from "RoCE routing problem when PXN is disabled (ibv_modify_qp failed with error)" to "RCCL initialization fails when PXN is disabled (RoCEv2, ibv_modify_qp failed with error)" on Aug 19, 2024
@jlochhead

In NCCL environments we have to set NCCL_IB_GID_INDEX=3 for communications to work between subnets. I haven't tried with RCCL yet, but I suspect it is the same.
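As a rough sketch of how that suggestion could be tried, the snippet below builds a process environment with NCCL_IB_GID_INDEX set before launching a job. The helper name `rccl_env` and the launch command are hypothetical, and the value 3 is only the one quoted above: which GID index corresponds to the RoCEv2 IPv4 address can differ between systems, and whether RCCL behaves the same as NCCL here is untested in this thread.

```python
import os

def rccl_env(gid_index: int = 3, base=None) -> dict:
    """Return a copy of the environment with NCCL_IB_GID_INDEX set.

    `gid_index=3` follows the comment above; it is an assumption that the
    same index is correct on any particular system.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_IB_GID_INDEX"] = str(gid_index)
    return env

env = rccl_env()
print(env["NCCL_IB_GID_INDEX"])

# Hypothetical launch; replace with the real training command:
# import subprocess
# subprocess.run(["./run_training.sh"], env=env, check=True)
```

Passing the environment explicitly to the child process keeps the setting scoped to one run, which makes it easier to compare behavior with and without the variable.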
