[train] add placement group support #20091

Merged: 5 commits merged into ray-project:master on Nov 10, 2021

Conversation

matthewdeng (Contributor)

Why are these changes needed?

Placement groups are used as an implementation detail for scheduling Ray Train workers.

Scheduling behavior is as follows:

  1. If the Trainer is already running within a placement group and should_capture_child_tasks_in_placement_group is True, the workers will inherit this placement group. This is currently used for the Tune integration, where Tune manages the placement groups.
  2. Otherwise, by default a placement group is created with the PACK strategy. This is done primarily to improve GPU training performance by reducing network communication overhead between workers.
  3. The strategy can be overridden to SPREAD by setting the TRAIN_ENABLE_WORKER_SPREAD environment variable to 1. Use cases that may require this strategy include CPU-only training and elastic training. In the future this can be promoted to the Trainer API once the need becomes more established. A sketch of this selection logic follows this list.
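
To make the three cases concrete, here is a minimal sketch of the selection logic (illustrative only; the helper name create_or_inherit_placement_group and its parameters are assumptions, not the PR's actual implementation):

import os

import ray
from ray.util.placement_group import get_current_placement_group, placement_group


def create_or_inherit_placement_group(num_workers, resources_per_worker,
                                      capture_child_tasks=True):
    # Case 1: the Trainer is already inside a placement group (e.g. one
    # created by Tune) and child tasks are captured, so the workers inherit it.
    current_pg = get_current_placement_group()
    if current_pg is not None and capture_child_tasks:
        return current_pg

    # Cases 2 and 3: create a new placement group, PACK by default or
    # SPREAD when TRAIN_ENABLE_WORKER_SPREAD=1.
    spread = os.environ.get("TRAIN_ENABLE_WORKER_SPREAD", "0") == "1"
    strategy = "SPREAD" if spread else "PACK"
    bundles = [dict(resources_per_worker) for _ in range(num_workers)]
    pg = placement_group(bundles, strategy=strategy)
    ray.get(pg.ready())  # block until all bundles are reserved
    return pg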

Related issue number

Closes #19610

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

self._placement_group.
"""
current_placement_group = get_current_placement_group()
should_capture_child_tasks_in_placement_group = \
Contributor

Man I wish there was a better way to handle this. Do we have a shared doc with Tune on asks for placement group?

Contributor Author

Created #20196 to track this!

@@ -279,6 +286,9 @@ def execute_single(self, worker_index: int, func: Callable[..., T], *args,
def remove_workers(self, worker_indexes: List[int]):
Contributor

Now with placement groups the semantics here get a little tricky, since we can't add workers without first removing them when placement groups are enabled.

Do you think we should do either of the following?

  1. Make remove_workers and add_workers private, and only expose a single replace_workers API. This ensures that remove and add are called atomically (except for the initial creation) and your placement group will always be “full”. A rough sketch of this option follows this list.
  2. Keep track of the free bundles in the placement group and if add_workers is called without free resources in the placement group, then raise an error.
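
For reference, a rough sketch of option 1 (illustrative only; WorkerGroupSketch and its internals are assumptions, not the existing WorkerGroup API):

from typing import Callable, List


class WorkerGroupSketch:
    """Models option 1: remove/add are private, and callers can only swap
    workers atomically via replace_workers, so the placement group stays
    "full" outside of the initial creation."""

    def __init__(self, num_workers: int, worker_factory: Callable[[], object]):
        self._worker_factory = worker_factory
        self._workers = [worker_factory() for _ in range(num_workers)]

    def _remove_workers(self, worker_indexes: List[int]) -> None:
        # Drop the workers at the given indexes, freeing their bundles.
        to_remove = set(worker_indexes)
        self._workers = [w for i, w in enumerate(self._workers)
                         if i not in to_remove]

    def _add_workers(self, num_workers: int) -> None:
        # Start replacement workers in the freed bundles.
        self._workers.extend(self._worker_factory()
                             for _ in range(num_workers))

    def replace_workers(self, worker_indexes: List[int]) -> None:
        # Remove and re-add in one call so callers cannot leave the
        # placement group partially empty.
        self._remove_workers(worker_indexes)
        self._add_workers(len(worker_indexes))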

Contributor Author

Great point, my initial thoughts are:

  1. This is reasonable for the current system, but I don't think it should be a requirement of the WorkerGroup/ActorGroup. With the presented API we don't really make any assumptions about the size of the placement group.
  2. This could get tricky when keeping track of what a "free" bundle is (if worker:bundle is 1:1 this is easier, but I'm not sure that makes sense). I would prefer to leave this responsibility to placement groups and have the error raised when it happens ([placement groups/autoscaler] unfulfillable requests should raise an error #18018).

I do actually think #18524 is the preferred solution for Ray Train and placement groups are just a workaround to achieve this functionality, so it doesn't seem right to me to build WorkerGroup around placement group semantics.

Contributor Author

I added a comment to add_workers to warn any users about this.

  • Even prior to this change, the same undefined behavior would be present if add_workers is called within a parent placement group or if the cluster is full.
  • The only use case in TensorflowBackend actually has some custom logic in between remove_workers and add_workers.

@amogkam merged commit 33af739 into ray-project:master on Nov 10, 2021

# Integer value which if set will change the placement group strategy from
# PACK to SPREAD. 1 for True, 0 for False.
TRAIN_ENABLE_WORKER_SPREAD_ENV =\
Contributor

Looks like I have to set this on all the machines (including the head and non-head nodes). Why is that? Isn't the scheduling part handled by the head node (and partially on the client side)? @matthewdeng

Contributor Author

Hey @HuangLED, what version of Ray are you using? Also are you using Ray Client?

This should be addressed on master (via this commit), so the environment variable only needs to be set on the driver. Even without that change, the environment variable should only be needed on the host where the BackendExecutor is running. See the example below for setting it on the driver.
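
For example, something like the following on the driver (a sketch, assuming a Ray 1.9-era ray.train.Trainer API; the address placeholder is from the discussion above):

import os

# Must be set in the driver process before the Trainer is constructed.
os.environ["TRAIN_ENABLE_WORKER_SPREAD"] = "1"

import ray
from ray.train import Trainer

ray.init("ray://M1:port")  # Ray Client connection to the head node
trainer = Trainer(backend="tensorflow", num_workers=2)
trainer.start()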

Contributor

We are using v1.9.0, I think.

Yes, I am using client mode, e.g. two machines M1 and M2, with M1 being the head, and connecting from a third machine using ray.init("ray://M1:port").

Setting the environment variable only on M1 does not work.

Contributor Author

Can you verify that the BackendExecutor process is running on M1?

Merging this pull request closed: [Train] Support for spreading workers (#19610)