no remote ep address for lane[3]->remote_lane[3] error with UCX 1.16 #1037

Open
pentschev opened this issue Apr 8, 2024 · 0 comments

UCX 1.16 defaults to protov2 (UCX_PROTO_ENABLE=y). We recently found an issue with TCP wireup on systems with multiple NICs, for example a DGX-1. It is observable when InfiniBand is unavailable or disabled (UCX_TLS=^rc). This has been fixed in UCX 1.17 in openucx/ucx#9424, but the fix cannot be backported to 1.16. The issue presents itself with errors such as the following:

...
[1712469215.430150] [dgx13:59368:0]          wireup.c:407  UCX  ERROR   ep 0x7f6c716e7080: no remote ep address for lane[3]->remote_lane[3]
[1712469215.438999] [dgx13:59363:0]          wireup.c:407  UCX  ERROR   ep 0x7f45d42b00c0: no remote ep address for lane[3]->remote_lane[3]
...
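
To check whether a job log contains this failure, something like the following could help. This is only a sketch: the pattern is derived from the two sample lines above, `find_wireup_errors` is a hypothetical helper name, and other UCX versions may format the message differently.

```python
import re

# Pattern inferred from the sample log lines in this issue (an assumption,
# not an official UCX log format specification).
WIREUP_ERROR = re.compile(
    r"UCX\s+ERROR\s+ep (0x[0-9a-fA-F]+): "
    r"no remote ep address for lane\[(\d+)\]->remote_lane\[(\d+)\]"
)


def find_wireup_errors(log_text):
    """Return (endpoint, lane, remote_lane) for each matching log line."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in WIREUP_ERROR.finditer(log_text)
    ]
```

An empty result does not rule out the issue, since the errors may be written to a different stream or log file.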

To mitigate this issue there are a couple of options:

  1. Switch back to protov1 (preferred) by setting UCX_PROTO_ENABLE=n; or
  2. Downgrade to UCX 1.15.
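
For the first option, a minimal sketch of setting the variable from Python is shown below. It assumes UCX reads `UCX_PROTO_ENABLE` from the process environment when it is initialized, so the call has to happen before any UCX-based library is imported or initialized. `apply_protov1_workaround` is a hypothetical helper name, not part of any UCX API.

```python
import os


def apply_protov1_workaround(env=None):
    """Set UCX_PROTO_ENABLE=n unless a value is already set.

    Returns the value in effect afterwards, so callers can see whether
    an existing setting was left in place.
    """
    env = os.environ if env is None else env
    # setdefault leaves an explicit user choice untouched.
    env.setdefault("UCX_PROTO_ENABLE", "n")
    return env["UCX_PROTO_ENABLE"]


# Assumption: this runs before UCX is initialized in this process.
apply_protov1_workaround()
```

Exporting `UCX_PROTO_ENABLE=n` in the shell before launching the process should have the same effect and also covers worker processes that inherit the environment.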

pentschev changed the title from "UCX 1.16 issue with TCP in nodes with multiple NICs" to "no remote ep address for lane[3]->remote_lane[3] error with UCX 1.16" on Apr 8, 2024