[Train] Change default NCCL_SOCKET_IFNAME to blacklist veth #31824
Signed-off-by: amogkam <[email protected]>
Passing distributed GPU training test on |
nice! From the logs, looks like NCCL is still seeing veth in some of the nodes. Any idea why? I see 12 worker logs with
|
discussed offline. we're not sure if that line is printed before or after blacklist. we should catch it in nightly if the order is nondeterministic.
can we file an issue against product to remove the virtual ethernet interface? then Ray Train can get out of the way of users completely here (default to NCCL behavior).
python/ray/train/constants.py
Outdated
# "en".
DEFAULT_NCCL_SOCKET_IFNAME = "en,eth,bond"
# Blacklist virtualized networking.
DEFAULT_NCCL_SOCKET_IFNAME = "^lo,docker,vethc"
oh actually the reason is that it should be veth, not vethc. c happened to be the first hex character of the ID I sent you.
that also explains why it was present in only 12 of the 16 -- the other 4 must have had c as the first character of the id.
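To make the prefix issue above concrete, here is a minimal sketch of how an NCCL_SOCKET_IFNAME value selects interfaces, based on the syntax described in the NCCL documentation (comma-separated prefixes, a leading `^` to exclude, `=` for exact names). The function name `nccl_ifname_filter` and the sample interface names are hypothetical; this is a simplified model, not NCCL's actual implementation, and it ignores NCCL's built-in fallback behavior.

```python
def nccl_ifname_filter(spec, interfaces):
    """Simplified model of NCCL_SOCKET_IFNAME matching (illustrative only)."""
    # A leading "^" turns the whole list into an exclusion list.
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    # A leading "=" requests exact-name matching instead of prefix matching.
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    patterns = [p for p in spec.split(",") if p]

    def matches(name):
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # Keep a name if it matches an include list, or fails to match an exclude list.
    return [name for name in interfaces if matches(name) != exclude]


# Hypothetical interface names; veth suffixes stand in for hex IDs.
interfaces = ["lo", "docker0", "eth0", "vethc1a2b3", "veth9f8e7d"]
print(nccl_ifname_filter("^lo,docker,vethc", interfaces))
print(nccl_ifname_filter("^lo,docker,veth", interfaces))
```

Under this model, the `vethc` prefix only excludes veth interfaces whose ID happens to start with `c`, while `veth` excludes all of them.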
Ah good catch...then now I'm wondering why this passed. Maybe this is not needed at all 🤔. Let me run the test with this default removed.
Can we create an issue against product to remove the virtual network interface?
Quick question, the title says we blacklist vethc - instead we whitelist veth. Is that what we want? Edit: Sorry misread the line
Title changed from "[Train] Change default NCCL_SOCKET_IFNAME to blacklist vethc" to "[Train] Change default NCCL_SOCKET_IFNAME to blacklist veth"
Updated the title-- it should be |
Thanks!
When running distributed PyTorch without GPUs, PyTorch selects a localhost interface for Gloo (i.e. 127.0.0.1:XXX), breaking distributed training. This method in PyTorch can yield the incorrect interface when a) the hostname resolves locally to the loopback address or b) hostname lookups fail. This is scoped to DBR specifically because eth0 is guaranteed to exist there. PyTorch+Gloo does not support deny-listing like NCCL (as we do in #31824) because PyTorch directly uses the environment variable GLOO_SOCKET_IFNAME as the interface to use (see https://github.com/pytorch/pytorch/blob/7956ca16e649d86cbf11b6e122090fa05678fac3/torch/csrc/distributed/c10d/init.cpp#L2243). Signed-off-by: Ian Rodney <[email protected]>
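The Gloo situation described in that commit message can be sketched as follows. This is a rough illustration, not the referenced change itself: the helper names `resolves_to_loopback` and `configure_gloo_interface` are hypothetical, and `eth0` is only a sensible default in environments where that interface is known to exist. GLOO_SOCKET_IFNAME is the only real name here, and it must be set before the process group is initialized.

```python
import ipaddress
import os
import socket


def resolves_to_loopback(hostname, resolver=socket.gethostbyname):
    """Return True if the hostname maps to a loopback address or cannot be resolved."""
    try:
        address = resolver(hostname)
    except OSError:
        # socket.gaierror and socket.herror are OSError subclasses;
        # a failed lookup is treated as the unsafe case.
        return True
    return ipaddress.ip_address(address).is_loopback


def configure_gloo_interface(ifname="eth0", env=None):
    """Pin Gloo to one named interface unless the user already chose one."""
    env = os.environ if env is None else env
    # Gloo takes an interface name to use, not an exclusion list.
    env.setdefault("GLOO_SOCKET_IFNAME", ifname)
    return env["GLOO_SOCKET_IFNAME"]


if resolves_to_loopback(socket.gethostname()):
    # Assumption: eth0 exists on this machine.
    configure_gloo_interface("eth0")
```

Using `setdefault` keeps any value the user exported themselves, which matters because an explicit interface name cannot be overridden by pattern the way an NCCL exclusion list can.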
Signed-off-by: amogkam [email protected]
Closes #30333.
Previously, we would set a default NCCL interface whitelist in Ray Train to prioritize ethernet. This is to avoid this issue: https://github.com/anyscale/product/issues/8310.
However, this default whitelist is not fully exhaustive, and prevents users from doing distributed GPU training over wireless: #30333.
Instead, we change to a blacklist so that NCCL does not use the veth interface, which resolves both issues (thanks @cadedaniel for identifying this!).
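As a rough sketch of how a default like this might be applied without getting in the user's way, the snippet below sets NCCL_SOCKET_IFNAME only when it is not already defined. The helper name `apply_default_nccl_ifname` is hypothetical and the constant value is assumed from the review discussion above (with `veth` rather than `vethc`); the actual Ray Train code may differ.

```python
import os

# Assumed value, following the review discussion: exclude loopback,
# docker, and virtual ethernet interfaces by prefix.
DEFAULT_NCCL_SOCKET_IFNAME = "^lo,docker,veth"


def apply_default_nccl_ifname(env=None):
    """Set the exclusion-list default only if the user has not set a value."""
    env = os.environ if env is None else env
    if "NCCL_SOCKET_IFNAME" not in env:
        env["NCCL_SOCKET_IFNAME"] = DEFAULT_NCCL_SOCKET_IFNAME
    return env["NCCL_SOCKET_IFNAME"]


# A user who exports their own value (for example a wireless interface) keeps it.
print(apply_default_nccl_ifname({"NCCL_SOCKET_IFNAME": "wlan0"}))
print(apply_default_nccl_ifname({}))
```

Because the default is an exclusion list, interfaces such as wireless ones remain eligible unless they match one of the excluded prefixes, which is the behavior the description asks for.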