
[Train] Colocate Trainer and rank 0 worker #43115

Merged

Conversation

@woshiyyya (Member) commented on Feb 12, 2024:

Why are these changes needed?

This PR automatically merges the trainer bundle with the rank 0 worker bundle, so that the trainer and the rank 0 worker always colocate on the same node.

Benefits:

  • Enables users to specify additional resources for the rank 0 worker.
  • Always colocates the trainer and the rank 0 worker, which makes the scheduling behavior deterministic.

Major changes:

1. Merge the trainer bundle and the first worker bundle.

Specifically, we build a placement group with bundles [{}, {trainer+worker}, {worker}, ..., {worker}] and schedule the TrainTrainable with the first non-empty bundle. When assigning worker ranks, we designate the worker with the smallest GPU ID on the same node as the trainer to be rank 0.
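
As a rough illustration of the bundle layout and rank-0 selection described above (the resource values and worker metadata below are hypothetical, not Ray Train's actual internals):

```python
# Hypothetical sketch only; Ray Train constructs these bundles internally.
num_workers = 4
trainer_resources = {"CPU": 1}           # extra resources reserved for the trainer
worker_resources = {"CPU": 2, "GPU": 1}  # per-worker resources

# Merge the trainer bundle into the first worker bundle:
# [{}, {trainer + worker}, {worker}, ..., {worker}]
rank0_bundle = {
    key: trainer_resources.get(key, 0) + worker_resources.get(key, 0)
    for key in set(trainer_resources) | set(worker_resources)
}
bundles = [{}, rank0_bundle] + [worker_resources] * (num_workers - 1)

# Rank assignment: prefer workers on the trainer's node, then the smallest
# GPU ID; the first worker after sorting becomes rank 0.
workers = [
    {"node_id": "node_a", "gpu_ids": [1]},
    {"node_id": "node_a", "gpu_ids": [0]},  # -> rank 0
    {"node_id": "node_b", "gpu_ids": [0]},
    {"node_id": "node_b", "gpu_ids": [1]},
]
trainer_node_id = "node_a"
sorted_workers = sorted(
    workers, key=lambda w: (w["node_id"] != trainer_node_id, min(w["gpu_ids"]))
)
```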

2. Set num_workers=1 by default in ScalingConfig.

Previously, setting num_workers to None resulted in launching a single TrainTrainable with zero workers. This no longer applies to the current Ray Train, as all Trainers now require at least one worker to execute the train_func.

Additionally, this approach led to undefined behavior during the merging and separation of the first bundle. To ensure consistent behavior, we now set the default value of num_workers to 1.
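
For illustration, assuming the new default described above and Ray Train's public ScalingConfig API:

```python
from ray.train import ScalingConfig

# With this change, an empty ScalingConfig means a single training worker.
assert ScalingConfig().num_workers == 1

# Being explicit is still the pattern shown in the docs and examples.
scaling_config = ScalingConfig(num_workers=4, use_gpu=True)
```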

3. Forbid using ScalingConfig with tune.with_resources.

ScalingConfig should be a Ray Train-only utility and should not be used for Tune Trainables. For example, it doesn't make sense to provide a ScalingConfig for a function trainable, since the trainer and worker concepts don't apply to it.
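
As a sketch of the pattern this PR forbids, with a plain-resource alternative for Tune (the objective function here is hypothetical):

```python
from ray import tune


def objective(config):
    return {"loss": config["x"]}


# No longer allowed: passing a Ray Train ScalingConfig to tune.with_resources.
# trainable = tune.with_resources(objective, ScalingConfig(num_workers=2))

# For Tune function trainables, request resources directly instead:
trainable = tune.with_resources(objective, {"cpu": 2, "gpu": 1})
```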

Passed release test: https://buildkite.com/ray-project/release/builds/9650#018dee6e-e3ce-4376-9f3d-5ad7e250e513

Related PRs:

The two PRs below enabled actors with empty resources to be launched on the node of a specific bundle in a placement group.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: woshiyyya <[email protected]>
Signed-off-by: Yunxuan Xiao <[email protected]>
@woshiyyya force-pushed the train/colocate_trainer_rank0_worker branch from f6a6659 to ef02a3e on February 14, 2024
@woshiyyya changed the title from [WIP] Colocate Trainer and rank 0 worker to [Train] Colocate Trainer and rank 0 worker on Feb 14, 2024
@woshiyyya marked this pull request as ready for review on February 23, 2024
train.report({"loss": config["x"]})

# Should be able to create a DataParallelTrainer w/o scaling_config,
# but it should fail on fit

woshiyyya (Member Author):
We are able to launch training w/o scaling_config since num_workers defaults to 1 now.

Contributor:
We used to force users to think about num_workers, but now we'll default to 1 silently -- think this is a UX problem?

We do show scaling config in our docs a lot so I think it should be fine.

woshiyyya (Member Author) replied on Feb 28, 2024:
Yeah, some of our internal tests use an empty ScalingConfig.
For users I think it's fine; our docstrings and examples explicitly set num_workers.

@woshiyyya marked this pull request as ready for review on February 28, 2024
@justinvyu self-assigned this on Feb 28, 2024

justinvyu (Contributor) left a comment:
Thanks, this looks a lot better!

A note about the naming:

  • We were considering renaming trainer_resources to rank_0_resources, but that's a bit confusing since the rank 0 worker doesn't actually have access to the trainer_resources.
  • trainer_resources is confusing because users can't access the trainer at all.
  • trainer_resources should never be used to ask for more CPUs/GPUs -- it only really makes sense for memory (or custom resources), in order to guarantee that all workers on the rank 0 node have access to that amount of memory (see the sketch below).
  • Let's add a comment about that -- I gave a draft below.
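
As an illustration of the memory-only guidance above (a hypothetical configuration, not part of this PR's diff):

```python
from ray.train import ScalingConfig

# Hypothetical: reserve extra memory alongside the rank 0 worker via
# trainer_resources instead of asking for more CPUs/GPUs.
scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=True,
    trainer_resources={"memory": 10 * 1024**3},  # ~10 GiB on the rank 0 node
)
```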

@@ -732,7 +732,9 @@ def setup(self, config, **kwargs):
         run_config = base_config.pop("run_config", None)
         self._merged_config = merge_dicts(base_config, self.config)
         self._merged_config["run_config"] = run_config
-        merged_scaling_config = self._merged_config.get("scaling_config")
+        merged_scaling_config = self._merged_config.get(
+            "scaling_config", ScalingConfig()
+        )

Contributor:
Why do we need to add a default ScalingConfig() now?

woshiyyya (Member Author):
Because when users don't provide scaling_config in the Trainer init arguments, self._merged_config has no "scaling_config" key and .get() returns None. However, we actually set self.scaling_config = ScalingConfig() in __init__, so returning a default ScalingConfig here aligns with that and skips the reconciliation logic.
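
A minimal illustration of that .get() default (the dict contents here are hypothetical):

```python
from ray.train import ScalingConfig

merged_config = {"run_config": None}  # user passed no scaling_config

merged_config.get("scaling_config")                   # -> None (old behavior)
merged_config.get("scaling_config", ScalingConfig())  # -> ScalingConfig(), matching __init__
```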

justinvyu (Contributor) left a comment:
LGTM!

@can-anyscale can-anyscale merged commit 4a73957 into ray-project:master Feb 28, 2024
8 of 9 checks passed