-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Train] Fixed HorovodBackend to automatically detect network interfaces #19533
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this fix!
Would you be able to share a little more details around the expected/observed behavior for the nics detection? It looks like this might have been caused by #18728 but we didn't catch it in the tests, so I'd love to be able to add a simple test to make sure we're able to catch any potential regressions in the future!
@dataclass | ||
class HorovodWorkerWrapper: | ||
w: Worker | ||
|
||
@property | ||
def execute(self): | ||
w = self.w | ||
|
||
class ExecuteHandle: | ||
def remote(self, func, *args, **kwargs): | ||
_ = None | ||
return w.actor._BaseWorkerMixin__execute.remote( | ||
func, _, *args, **kwargs) | ||
|
||
return ExecuteHandle() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, wondering if we could do something simpler like:
@dataclass
class HorovodWorkerWrapper:
w: Worker
@property
execute = self.w.actor._BaseWorkerMixin__execute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, not quite because Horovod's current API assumes that an additional parameter will be passed in to the remote function containing the class instance. This is why we have the _ = None
line. I plan to make some changes to Horovod and potentially move more of this code into Ray Train to avoid this hack, but for now this is what we need to do to make it work.
Co-authored-by: matthewdeng <[email protected]>
Hey @matthewdeng, thanks for the review. The issue here was the the the Hope that makes sense, happy to provide more context as needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failing tests are unrelated, going to merge. |
…es (#19533) * Moved Horovod into package * Move in Ludwig fix * Undo git mv * Cleanup * Cleanup * flake8 * Update python/ray/train/backends/horovod.py Co-authored-by: matthewdeng <[email protected]> * Whitespace Co-authored-by: matthewdeng <[email protected]>
Why are these changes needed?
Fixes API incompatibilities in HorovodBackend to properly detect NICs when not provided by the user. Without these changes, Horovod in Ray Train will not work unless the user knows in advance which NICs they wish to use for training.
Related issue number
Fixes #19516.
Checks
scripts/format.sh
to lint the changes in this PR.