[train] add placement group support #20091

Merged: 5 commits merged into ray-project:master on Nov 10, 2021

Conversation

matthewdeng (Contributor)

Why are these changes needed?

Placement groups are used as an implementation detail for scheduling Ray Train workers.

Scheduling behavior is as follows:

  1. If the Trainer is already running within a placement group and should_capture_child_tasks_in_placement_group is True, the workers will inherit this placement group. This is currently used for the Tune integration, where Tune manages the placement groups.
  2. Otherwise, by default a placement group is created with the PACK strategy. This is done primarily to improve GPU training performance by reducing network communication overhead between workers.
  3. The strategy can be overridden to SPREAD by setting the TRAIN_ENABLE_WORKER_SPREAD environment variable to 1. Use cases that may require this strategy include CPU-only training and elastic training. In the future this can be promoted to the Trainer API once the need becomes more established. A sketch of this selection logic follows this list.
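
To make the three cases concrete, here is a minimal sketch of the selection logic (illustrative only; the helper name create_or_inherit_placement_group and its parameters are assumptions, not the PR's actual implementation):

import os

import ray
from ray.util.placement_group import get_current_placement_group, placement_group


def create_or_inherit_placement_group(num_workers, resources_per_worker,
                                      capture_child_tasks=True):
    # Case 1: the Trainer is already inside a placement group (e.g. one
    # created by Tune) and child tasks are captured, so the workers inherit it.
    current_pg = get_current_placement_group()
    if current_pg is not None and capture_child_tasks:
        return current_pg

    # Cases 2 and 3: create a new placement group, PACK by default or
    # SPREAD when TRAIN_ENABLE_WORKER_SPREAD=1.
    spread = os.environ.get("TRAIN_ENABLE_WORKER_SPREAD", "0") == "1"
    strategy = "SPREAD" if spread else "PACK"
    bundles = [dict(resources_per_worker) for _ in range(num_workers)]
    pg = placement_group(bundles, strategy=strategy)
    ray.get(pg.ready())  # block until all bundles are reserved
    return pg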

Related issue number

Closes #19610

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

self._placement_group.
"""
current_placement_group = get_current_placement_group()
should_capture_child_tasks_in_placement_group = \
Contributor

Man I wish there was a better way to handle this. Do we have a shared doc with Tune on asks for placement group?

Contributor Author

Created #20196 to track this!

@@ -279,6 +286,9 @@ def execute_single(self, worker_index: int, func: Callable[..., T], *args,
def remove_workers(self, worker_indexes: List[int]):
Contributor

Now with placement groups the semantics here get a little tricky, since we can't add workers without first removing them when placement groups are enabled.

Do you think we should do either of the following?

  1. Make remove_workers and add_workers private, and only expose a single replace_workers API. This ensures that remove and add are called atomically (except for the initial creation) and your placement group will always be “full”. A rough sketch of this option follows this list.
  2. Keep track of the free bundles in the placement group and if add_workers is called without free resources in the placement group, then raise an error.
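
For reference, a rough sketch of option 1 (illustrative only; WorkerGroupSketch and its internals are assumptions, not the existing WorkerGroup API):

from typing import Callable, List


class WorkerGroupSketch:
    """Models option 1: remove/add are private, and callers can only swap
    workers atomically via replace_workers, so the placement group stays
    "full" outside of the initial creation."""

    def __init__(self, num_workers: int, worker_factory: Callable[[], object]):
        self._worker_factory = worker_factory
        self._workers = [worker_factory() for _ in range(num_workers)]

    def _remove_workers(self, worker_indexes: List[int]) -> None:
        # Drop the workers at the given indexes, freeing their bundles.
        to_remove = set(worker_indexes)
        self._workers = [w for i, w in enumerate(self._workers)
                         if i not in to_remove]

    def _add_workers(self, num_workers: int) -> None:
        # Start replacement workers in the freed bundles.
        self._workers.extend(self._worker_factory()
                             for _ in range(num_workers))

    def replace_workers(self, worker_indexes: List[int]) -> None:
        # Remove and re-add in one call so callers cannot leave the
        # placement group partially empty.
        self._remove_workers(worker_indexes)
        self._add_workers(len(worker_indexes))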

Contributor Author

Great point, my initial thoughts are:

  1. This is reasonable for the current system, but I don't think it should be a requirement of the WorkerGroup/ActorGroup. With the presented API we don't really make any assumptions about the size of the placement group.
  2. This could get tricky when keeping track of what a "free" bundle is (if worker:bundle is 1:1 this is easier, but I'm not sure that makes sense). I would prefer to leave this responsibility to placement groups and have the error raised when it happens ([placement groups/autoscaler] unfulfillable requests should raise an error #18018).

I do actually think #18524 is the preferred solution for Ray Train and placement groups are just a workaround to achieve this functionality, so it doesn't seem right to me to build WorkerGroup around placement group semantics.

Contributor Author

I added a comment to add_workers to warn any users about this.

  • Even prior to this change, the same undefined behavior would be present if add_workers is called within a parent placement group or if the cluster is full.
  • The only use case in TensorflowBackend actually has some custom logic in between remove_workers and add_workers.

@amogkam merged commit 33af739 into ray-project:master on Nov 10, 2021

# Integer value which if set will change the placement group strategy from
# PACK to SPREAD. 1 for True, 0 for False.
TRAIN_ENABLE_WORKER_SPREAD_ENV =\
Contributor

Looks like I have to set this on all the machines (including the head and non-head nodes). Why is that? Isn't the scheduling part handled by the head node (and partially on the client side)? @matthewdeng

Contributor Author

Hey @HuangLED, what version of Ray are you using? Also are you using Ray Client?

This should be addressed on master (via this commit), so the environment variable only needs to be set on the driver. Even without that change, the environment variable should only be needed on the host where the BackendExecutor is running. See the example below for setting it on the driver.
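
For example, something like the following on the driver (a sketch, assuming a Ray 1.9-era ray.train.Trainer API; the address placeholder is from the discussion above):

import os

# Must be set in the driver process before the Trainer is constructed.
os.environ["TRAIN_ENABLE_WORKER_SPREAD"] = "1"

import ray
from ray.train import Trainer

ray.init("ray://M1:port")  # Ray Client connection to the head node
trainer = Trainer(backend="tensorflow", num_workers=2)
trainer.start()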

Contributor

We are using v1.9.0, I think.

Yes, I am using client mode, e.g. two machines M1 and M2, with M1 being the head, and connecting from a third machine using ray.init("ray://M1:port").

Setting the environment variable only on M1 does not work.

Contributor Author

Can you verify that the BackendExecutor process is running on M1?

Merging this pull request closed: [Train] Support for spreading workers (#19610)