[train] Sort workers by node ID rather than by node IP #46163

Merged 7 commits into ray-project:master on Jul 24, 2024

Conversation

KyleKoon (Contributor) commented Jun 20, 2024

Why are these changes needed?

We should not sort workers by IP, because multiple nodes may be placed on a host with the same IP. When nodes are colocated like this, sorting by IP treats all of their workers as belonging to a single node, so the workers' local ranks and node ranks are incorrect. We should instead use each node's unique ID to map a node to its set of workers.

Example:

Worker 0: node 0, ip 0
Worker 1: node 0, ip 0
Worker 2: node 1, ip 0
Worker 3: node 1, ip 0

Current Behavior:

worker 0: local_rank 0, node_rank 0
worker 1: local_rank 1, node_rank 0
worker 2: local_rank 2, node_rank 0
worker 3: local_rank 3, node_rank 0

Expected Behavior:

worker 0: local_rank 0, node_rank 0
worker 1: local_rank 1, node_rank 0
worker 2: local_rank 0, node_rank 1
worker 3: local_rank 1, node_rank 1
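
The difference between the two behaviors above can be reproduced with a minimal sketch. This is not the code changed in this PR: `WorkerInfo`, `assign_ranks`, the node IDs, and the IP address are hypothetical names and values chosen for illustration, and the sketch assumes node ranks are handed out in order of first appearance.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple


@dataclass(frozen=True)
class WorkerInfo:
    """Hypothetical stand-in for a worker's placement metadata."""

    node_id: str
    node_ip: str


def assign_ranks(
    workers: List[WorkerInfo],
    key: Callable[[WorkerInfo], Hashable],
) -> List[Tuple[int, int]]:
    """Return (local_rank, node_rank) per worker, grouping workers by `key`."""
    node_ranks: Dict[Hashable, int] = {}
    next_local_rank: Dict[Hashable, int] = defaultdict(int)
    ranks = []
    for worker in workers:
        group = key(worker)
        # The first time a group is seen, it receives the next node rank.
        if group not in node_ranks:
            node_ranks[group] = len(node_ranks)
        ranks.append((next_local_rank[group], node_ranks[group]))
        next_local_rank[group] += 1
    return ranks


# Two nodes colocated on one host: distinct node IDs, identical IP.
workers = [
    WorkerInfo(node_id="node-0", node_ip="10.0.0.1"),
    WorkerInfo(node_id="node-0", node_ip="10.0.0.1"),
    WorkerInfo(node_id="node-1", node_ip="10.0.0.1"),
    WorkerInfo(node_id="node-1", node_ip="10.0.0.1"),
]

by_ip = assign_ranks(workers, key=lambda w: w.node_ip)
by_node_id = assign_ranks(workers, key=lambda w: w.node_id)
print(by_ip)       # [(0, 0), (1, 0), (2, 0), (3, 0)]
print(by_node_id)  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Grouping by IP collapses both nodes into one group, which matches the "Current Behavior" listing; grouping by node ID matches the "Expected Behavior" listing.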

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

woshiyyya added the "go" label (add ONLY when ready to merge, run all tests) on Jun 20, 2024
woshiyyya (Member) commented Jul 11, 2024

Thank you @KyleKoon, in general it looks good. Can we add some tests that reproduce the failure case we want to resolve (multiple worker nodes sharing the same IP)?

You can refer to this example to construct virtual clusters:

def ray_2_node_2_gpu():

def test_local_world_size_with_same_ip_nodes(ray_2_node_2_cpu):
    config = TestConfig()
    with patch.object(
        WorkerGroup, "add_workers", mock_add_workers_to_nodes_with_same_ip

Member

The 2 virtual nodes will have the same node_ip. Can we also have a test that does not patch the add_workers method?

Contributor Author

Done

woshiyyya (Member) left a comment

It looks good to me. Left one comment

@aslonnie aslonnie removed the request for review from a team July 22, 2024 19:59
matthewdeng (Contributor) left a comment

Thanks!

@matthewdeng matthewdeng merged commit 42eb499 into ray-project:master Jul 24, 2024
5 checks passed