
[Train] Colocate Trainer and rank 0 worker #43115

Merged

Conversation

@woshiyyya (Member) commented on Feb 12, 2024:

Why are these changes needed?

This PR automatically merges the trainer bundle with the rank 0 worker bundle, so that the trainer and the rank 0 worker always colocate on the same node.

Benefits:

  • Enables users to specify additional resources for the rank 0 worker.
  • Always colocates the trainer and the rank 0 worker, which makes the scheduling behavior deterministic.

Major changes:

1. Merge the trainer bundle and the first worker bundle.

Specifically, we build a placement group with bundles [{}, {trainer+worker}, {worker}, ..., {worker}] and schedule the TrainTrainable with the first non-empty bundle. When assigning worker ranks, we designate the worker with the smallest GPU ID on the same node as the trainer to be rank 0.
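
As a rough illustration of the bundle layout and rank-0 selection described above (the resource values and worker metadata below are hypothetical, not Ray Train's actual internals):

```python
# Hypothetical sketch only; Ray Train constructs these bundles internally.
num_workers = 4
trainer_resources = {"CPU": 1}           # extra resources reserved for the trainer
worker_resources = {"CPU": 2, "GPU": 1}  # per-worker resources

# Merge the trainer bundle into the first worker bundle:
# [{}, {trainer + worker}, {worker}, ..., {worker}]
rank0_bundle = {
    key: trainer_resources.get(key, 0) + worker_resources.get(key, 0)
    for key in set(trainer_resources) | set(worker_resources)
}
bundles = [{}, rank0_bundle] + [worker_resources] * (num_workers - 1)

# Rank assignment: prefer workers on the trainer's node, then the smallest
# GPU ID; the first worker after sorting becomes rank 0.
workers = [
    {"node_id": "node_a", "gpu_ids": [1]},
    {"node_id": "node_a", "gpu_ids": [0]},  # -> rank 0
    {"node_id": "node_b", "gpu_ids": [0]},
    {"node_id": "node_b", "gpu_ids": [1]},
]
trainer_node_id = "node_a"
sorted_workers = sorted(
    workers, key=lambda w: (w["node_id"] != trainer_node_id, min(w["gpu_ids"]))
)
```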

2. Set num_workers=1 by default in ScalingConfig.

Previously, setting num_workers to None resulted in launching a single TrainTrainable with zero workers. This no longer applies to the current Ray Train, as all Trainers now require at least one worker to execute the train_func.

Additionally, this approach led to undefined behavior during the merging and separation of the first bundle. To ensure consistent behavior, we now set the default value of num_workers to 1.
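
For illustration, assuming the new default described above and Ray Train's public ScalingConfig API:

```python
from ray.train import ScalingConfig

# With this change, an empty ScalingConfig means a single training worker.
assert ScalingConfig().num_workers == 1

# Being explicit is still the pattern shown in the docs and examples.
scaling_config = ScalingConfig(num_workers=4, use_gpu=True)
```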

3. Forbid using ScalingConfig with tune.with_resources.

ScalingConfig should be a Ray Train-only utility and should not be used for Tune Trainables. For example, it doesn't make sense to provide a ScalingConfig for a function trainable, since the trainer and worker concepts don't apply to it.
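
As a sketch of the pattern this PR forbids, with a plain-resource alternative for Tune (the objective function here is hypothetical):

```python
from ray import tune


def objective(config):
    return {"loss": config["x"]}


# No longer allowed: passing a Ray Train ScalingConfig to tune.with_resources.
# trainable = tune.with_resources(objective, ScalingConfig(num_workers=2))

# For Tune function trainables, request resources directly instead:
trainable = tune.with_resources(objective, {"cpu": 2, "gpu": 1})
```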

Passed release test: https://buildkite.com/ray-project/release/builds/9650#018dee6e-e3ce-4376-9f3d-5ad7e250e513

Related PRs:

The two PRs below enabled actors with empty resources to be launched on the node of a specific bundle in a placement group.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: woshiyyya <[email protected]>
Signed-off-by: Yunxuan Xiao <[email protected]>
@woshiyyya force-pushed the train/colocate_trainer_rank0_worker branch from f6a6659 to ef02a3e on February 14, 2024
@woshiyyya changed the title from [WIP] Colocate Trainer and rank 0 worker to [Train] Colocate Trainer and rank 0 worker on Feb 14, 2024
@woshiyyya marked this pull request as ready for review on February 23, 2024
train.report({"loss": config["x"]})

# Should be able to create a DataParallelTrainer w/o scaling_config,
# but it should fail on fit

woshiyyya (Member Author):
We are able to launch training w/o scaling_config since num_workers defaults to 1 now.

Contributor:
We used to force users to think about num_workers, but now we'll default to 1 silently -- think this is a UX problem?

We do show scaling config in our docs a lot so I think it should be fine.

woshiyyya (Member Author) replied on Feb 28, 2024:
Yeah, some of our internal tests use an empty ScalingConfig.
For users I think it's fine; our docstrings and examples explicitly set num_workers.

@woshiyyya marked this pull request as ready for review on February 28, 2024
@justinvyu self-assigned this on Feb 28, 2024

justinvyu (Contributor) left a comment:
Thanks, this looks a lot better!

A note about the naming:

  • We were considering renaming trainer_resources to rank_0_resources, but that's a bit confusing since the rank 0 worker doesn't actually have access to the trainer_resources.
  • trainer_resources is confusing because users can't access the trainer at all.
  • trainer_resources should never be used to ask for more CPUs/GPUs -- it only really makes sense for memory (or custom resources), in order to guarantee that all workers on the rank 0 node have access to that amount of memory (see the sketch below).
  • Let's add a comment about that -- I gave a draft below.
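
As an illustration of the memory-only guidance above (a hypothetical configuration, not part of this PR's diff):

```python
from ray.train import ScalingConfig

# Hypothetical: reserve extra memory alongside the rank 0 worker via
# trainer_resources instead of asking for more CPUs/GPUs.
scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=True,
    trainer_resources={"memory": 10 * 1024**3},  # ~10 GiB on the rank 0 node
)
```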

@@ -732,7 +732,9 @@ def setup(self, config, **kwargs):
         run_config = base_config.pop("run_config", None)
         self._merged_config = merge_dicts(base_config, self.config)
         self._merged_config["run_config"] = run_config
-        merged_scaling_config = self._merged_config.get("scaling_config")
+        merged_scaling_config = self._merged_config.get(
+            "scaling_config", ScalingConfig()
+        )

Contributor:
Why do we need to add a default ScalingConfig() now?

woshiyyya (Member Author):
Because when users don't provide scaling_config in the Trainer init arguments, self._merged_config has no "scaling_config" key and .get() returns None. However, we actually set self.scaling_config = ScalingConfig() in __init__, so returning a default ScalingConfig here aligns with that and skips the reconciliation logic.
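
A minimal illustration of that .get() default (the dict contents here are hypothetical):

```python
from ray.train import ScalingConfig

merged_config = {"run_config": None}  # user passed no scaling_config

merged_config.get("scaling_config")                   # -> None (old behavior)
merged_config.get("scaling_config", ScalingConfig())  # -> ScalingConfig(), matching __init__
```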

justinvyu (Contributor) left a comment:
LGTM!

@can-anyscale can-anyscale merged commit 4a73957 into ray-project:master Feb 28, 2024
8 of 9 checks passed