I searched the issues and found no similar issues.
Ray Component
Ray Train
What happened + What you expected to happen
If the user does not manually specify the NICs in the HorovodConfig when running in a multi-node cluster, Horovod will attempt to discover the compatible NICs automatically. However, this process is currently broken in Ray Train, as the HorovodConfig object does not conform to the expected Settings interface.
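To make the mismatch concrete, here is a minimal, self-contained sketch of the failure mode described above. It does not use Ray or Horovod. `SketchConfig`, `EXPECTED_SETTINGS_ATTRS`, and both helper functions are hypothetical names made up for illustration, and the attribute names are assumptions, not the actual fields of Horovod's Settings object.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Assumed attribute names, for illustration only. The real Settings
# interface that Horovod's NIC discovery reads from may differ.
EXPECTED_SETTINGS_ATTRS: Tuple[str, ...] = ("nics", "verbose", "key", "ssh_port")


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a backend config with only a few fields."""
    nics: Optional[Set[str]] = None
    verbose: int = 1


def missing_settings_attrs(
    config: object, expected: Tuple[str, ...] = EXPECTED_SETTINGS_ATTRS
) -> List[str]:
    """Return the expected attribute names that the config does not define."""
    return [name for name in expected if not hasattr(config, name)]


def needs_nic_discovery(config: object) -> bool:
    """Discovery is only needed when the user did not list NICs explicitly."""
    return not getattr(config, "nics", None)


if __name__ == "__main__":
    auto = SketchConfig()
    manual = SketchConfig(nics={"eth0"})
    print("missing attributes:", missing_settings_attrs(auto))
    print("auto config needs discovery:", needs_nic_discovery(auto))
    print("manual config needs discovery:", needs_nic_discovery(manual))
```

Under these assumptions, a config that lists its NICs explicitly never reaches the discovery path, which would explain why the problem only shows up when the NICs are left unset on a multi-node cluster.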
Versions / Dependencies
Ray 1.7.0
Reproduction script
Run any Ray Train script with Horovod and multiple nodes.
I have a fix for this in Ludwig, and I'll put together a fix in Ray Train based on those changes.
I also want to make some changes to the Horovod on Ray code in Horovod itself, as some aspects of these changes are pretty hacky. Ideally, we should move as much of this as possible into Ray Train, so that there is no circular dependency between Ray Train and Horovod. Horovod should be strictly a dependency of Ray Train, similar to how other backends are managed.
Are you willing to submit a PR?
Yes I am willing to submit a PR!
tgaddair added the bug and triage labels on Oct 19, 2021.
Anything else
I have a fix for this in Ludwig here:
ludwig-ai/ludwig@6cd6a87