-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[train] Sort workers by node ID rather than by node IP #46163
Conversation
Thank you @KyleKoon, in general it looks good. Can we add some tests to reproduce the failure case we want to resolve (multiple workers nodes shares the same IP)? You can refer to this example to construct virtual clusters: ray/python/ray/train/tests/conftest.py Line 45 in 450df60
|
def test_local_world_size_with_same_ip_nodes(ray_2_node_2_cpu): | ||
config = TestConfig() | ||
with patch.object( | ||
WorkerGroup, "add_workers", mock_add_workers_to_nodes_with_same_ip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 2 virtual nodes will have the same node_ip. Can we also have a test without patching the add_worker
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me. Left one comment
Signed-off-by: kkoon <[email protected]>
Signed-off-by: kkoon <[email protected]>
58c92f7
to
b88f7ed
Compare
Signed-off-by: kkoon <[email protected]>
Signed-off-by: kkoon <[email protected]>
Signed-off-by: kkoon <[email protected]>
b88f7ed
to
27d8773
Compare
Signed-off-by: kkoon <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Why are these changes needed?
We should not sort workers by IP, since it may be the case that multiple nodes are placed on the same IP host. If multiple nodes are collocated, the local rank and node rank of the workers will be incorrect. We should instead use a node's unique ID to map to a set of workers.
Example:
Current Behavior:
Expected Behavior:
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.