UCX 1.16 defaults to protov2 (UCX_PROTO_ENABLE=y). We recently found an issue with TCP wireup on systems with multiple NICs, for example a DGX-1, which is observable when InfiniBand is not available or is disabled (UCX_TLS=^rc). This has been fixed for UCX 1.17 in openucx/ucx#9424, but the fix cannot be backported to 1.16. The issue presents itself with errors such as the ones below:
...
[1712469215.430150] [dgx13:59368:0] wireup.c:407 UCX ERROR ep 0x7f6c716e7080: no remote ep address for lane[3]->remote_lane[3]
[1712469215.438999] [dgx13:59363:0] wireup.c:407 UCX ERROR ep 0x7f45d42b00c0: no remote ep address for lane[3]->remote_lane[3]
...
To mitigate this issue there are a couple of options:
Switch back to protov1 (preferred) by setting UCX_PROTO_ENABLE=n; or
Downgrade to UCX 1.15.
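As a minimal sketch of the first option, the variable can be placed in the environment of the process that will load UCX. The helper name `with_protov1` and the child command below are hypothetical and only for illustration. The sketch assumes that UCX picks up UCX_PROTO_ENABLE from the process environment when it is initialized, so the variable has to be set before that point.

```python
import os
import subprocess
import sys


def with_protov1(env=None):
    """Return a copy of `env` (default: the current environment) with
    UCX_PROTO_ENABLE=n set, i.e. the protov1 workaround from this issue.

    `with_protov1` is a hypothetical helper name, not part of UCX.
    """
    new_env = dict(os.environ if env is None else env)
    new_env["UCX_PROTO_ENABLE"] = "n"
    return new_env


if __name__ == "__main__":
    # Stand-in for the real UCX-based worker: the child only echoes the
    # variable back, to show that it is visible before any UCX code runs.
    child = [sys.executable, "-c",
             "import os; print(os.environ['UCX_PROTO_ENABLE'])"]
    result = subprocess.run(child, env=with_protov1(),
                            capture_output=True, text=True, check=True)
    print(result.stdout.strip())
```

Exporting the variable in the launching shell or job script achieves the same thing. The copy-then-set approach is used here so that the parent process environment is left unchanged.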
pentschev changed the title from "UCX 1.16 issue with TCP in nodes with multiple NICs" to "no remote ep address for lane[3]->remote_lane[3] error with UCX 1.16" on Apr 8, 2024