[Train] Colocate Trainer and rank 0 worker #43115
Conversation
train.report({"loss": config["x"]})

# Should be able to create a DataParallelTrainer w/o scaling_config,
# but it should fail on fit
We are able to launch training w/o scaling_config since num_workers defaults to 1 now.
We used to force users to think about num_workers, but now we'll default to 1 silently -- think this is a UX problem?
We do show scaling config in our docs a lot, so I think it should be fine.
Yeah, some of our internal tests use an empty ScalingConfig. For users I think it's fine; our docstrings and examples explicitly set num_workers.
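A minimal sketch of the behavior discussed above (hedged: DataParallelTrainer is used only for illustration, and the equivalence shown is what this thread describes, not a documented guarantee):

```python
# Sketch only: assumes this PR's behavior, where omitting scaling_config
# falls back to a default ScalingConfig with num_workers=1 instead of
# failing at construction time.
from ray.train import ScalingConfig
from ray.train.data_parallel_trainer import DataParallelTrainer


def train_func():
    # placeholder training loop
    pass


# No scaling_config passed: a single-worker run is launched.
trainer = DataParallelTrainer(train_func)

# Per this thread, that is equivalent to being explicit:
trainer = DataParallelTrainer(
    train_func, scaling_config=ScalingConfig(num_workers=1)
)
```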
Thanks, this looks a lot better!
Note about the naming:
- We were considering changing trainer_resources to rank_0_resources, but that's a bit confusing since the rank 0 worker doesn't actually have access to the trainer_resources.
- trainer_resources is confusing because users can't access the trainer at all.
- trainer_resources should never be used to ask for more CPUs/GPUs -- it only really makes sense to ask for memory (or custom resources), in order to guarantee that all workers on the rank 0 node have access to that amount of memory.
- Let's add a comment about that -- I gave a draft below.
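A hedged illustration of that guidance (an editor's sketch, not the draft comment mentioned above): trainer_resources is used only to reserve memory on the rank 0 node, never to request extra CPUs or GPUs. The resource amounts are made up.

```python
from ray.train import ScalingConfig

# Sketch only: reserve ~10 GiB of memory in the trainer bundle so the node
# hosting the trainer + rank 0 worker is guaranteed to have it available.
# Do not use trainer_resources to ask for additional CPUs/GPUs.
scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=True,
    trainer_resources={"memory": 10 * 1024**3},
)
```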
@@ -732,7 +732,9 @@ def setup(self, config, **kwargs):
     run_config = base_config.pop("run_config", None)
     self._merged_config = merge_dicts(base_config, self.config)
     self._merged_config["run_config"] = run_config
-    merged_scaling_config = self._merged_config.get("scaling_config")
+    merged_scaling_config = self._merged_config.get(
+        "scaling_config", ScalingConfig()
Why do we need to add a default ScalingConfig() now?
Because when users don't provide scaling_config in the Trainer init argument list, self._merged_config has no "scaling_config" key and .get() returns None. However, we already set self.scaling_config = ScalingConfig() in the __init__ function, so returning a default ScalingConfig here aligns with that and lets us skip the reconcile logic.
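A small sketch of the fallback being described (variable names assumed from the diff above): with no user-provided scaling config, the merged value now matches the default assigned in __init__, so the reconciliation step can be skipped.

```python
from ray.train import ScalingConfig

merged_config = {}  # user passed no scaling_config to the Trainer

# Old behavior: .get() returns None, which disagrees with the
# ScalingConfig() that __init__ assigns to self.scaling_config.
assert merged_config.get("scaling_config") is None

# New behavior: fall back to the same default, so the two stay aligned
# and the reconcile logic can be skipped.
merged_scaling_config = merged_config.get("scaling_config", ScalingConfig())
assert merged_scaling_config == ScalingConfig()
```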
LGTM!
Why are these changes needed?
This PR automatically merges the trainer bundle with the rank 0 worker bundle, so that the trainer and the rank 0 worker are always colocated on the same node.
Benefits:
Major changes:
1. Merge the trainer bundle and the first worker bundle.
Specifically, we build a placement group with bundles [{}, {trainer+worker}, {worker}, ..., {worker}] and schedule the TrainTrainable with the first non-empty bundle. When assigning worker ranks, we designate the worker with the smallest GPU ID on the same node as the trainer to be rank 0. (A simplified sketch of this bundle layout follows the list.)
2. Set num_workers=1 by default in ScalingConfig.
Previously, setting num_workers to None resulted in launching a single TrainTrainable with zero workers. That behavior no longer applies to the current Ray Train, since all Trainers now require at least one worker to execute the train_func. It also led to undefined behavior during the merging and separation of the first bundle. To ensure consistent behavior, the default value of num_workers is now 1.
3. Forbid using ScalingConfig with tune.with_resources.
ScalingConfig should be a Ray Train only utility and should not be used for Tune Trainables. For example, it doesn't make sense to provide a ScalingConfig for a function trainable, since there are no trainer and worker concepts for it.

Passed Release Test: https://buildkite.com/ray-project/release/builds/9650#018dee6e-e3ce-4376-9f3d-5ad7e250e513
Related PRs:
The two PRs below enabled actors with empty resources to be launched on the node of a specific bundle in a placement group.
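A hedged sketch of the capability those PRs provide, as described above (not the Train internals): an actor that requests zero resources can still be pinned to the node hosting a particular placement group bundle.

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

pg = placement_group([{"CPU": 1}, {"CPU": 1}])
ray.get(pg.ready())


@ray.remote(num_cpus=0)  # empty resource request
class TrainerActor:
    def node_id(self) -> str:
        return ray.get_runtime_context().get_node_id()


# Pin the zero-resource actor to the node hosting bundle 0.
actor = TrainerActor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
        placement_group_bundle_index=0,
    )
).remote()
print(ray.get(actor.node_id.remote()))
```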
Related issue number

Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.