
[sgd] add placement group support #17037

Merged · 7 commits · Jul 20, 2021
Conversation

@matthewdeng (Contributor)

Why are these changes needed?

Placement groups support many use cases, including improved load balancing, fault tolerance, and data colocation.

This change adds a default placement group behavior for a RaySGD worker group: a single placement group is created with one bundle per worker, and the SPREAD strategy is used to distribute workers uniformly across the cluster.

In the future, we may expand the external API to allow the user to define their own placement group strategy for more specific use cases.

Note: In this implementation, if the workers are already inside a placement group, that placement group is reused. This supports the existing Tune-SGD integration, in which the Tune Trainable creates a placement group to colocate the workers for each Tune trial.
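The default behavior described above can be sketched roughly as follows. This is an illustrative sketch, not the actual RaySGD implementation: `build_worker_bundles` is a hypothetical helper name, and the real Ray calls (`ray.util.placement_group`, `ray.util.get_current_placement_group`) are shown only in comments so the snippet stays self-contained.

```python
from typing import Dict, List


def build_worker_bundles(
    num_workers: int,
    num_cpus_per_worker: int = 1,
    num_gpus_per_worker: int = 0,
) -> List[Dict[str, float]]:
    """Build one resource bundle per worker, as described in this PR."""
    bundle: Dict[str, float] = {"CPU": num_cpus_per_worker}
    if num_gpus_per_worker > 0:
        bundle["GPU"] = num_gpus_per_worker
    # One copy of the bundle for each worker in the group.
    return [dict(bundle) for _ in range(num_workers)]


bundles = build_worker_bundles(num_workers=4)
print(bundles)

# With Ray installed, the worker group would then create the group with
# the SPREAD strategy, e.g.:
#
#   from ray.util.placement_group import placement_group
#   pg = placement_group(bundles, strategy="SPREAD")
#   ray.get(pg.ready())
#
# And per the note above, an existing placement group (e.g. one created
# by a Tune Trainable) could be detected and reused via
# ray.util.get_current_placement_group() instead of creating a new one.
```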

Related issue number

Closes #16682

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam (Contributor) left a comment:
Looks good to me overall; just left a couple of comments. Also, don't forget to add your tests to the BUILD file.

Review threads on python/ray/util/sgd/torch/worker_group.py (outdated, resolved)
@amogkam (Contributor) left a comment:
LGTM, thanks! Can you ping me again once CI passes?

@amogkam amogkam merged commit fef74aa into ray-project:master Jul 20, 2021
@richardliaw richardliaw added this to the SGD v2 milestone Aug 3, 2021
@matthewdeng matthewdeng deleted the sgd/pg branch August 13, 2021 17:52
@HuangLED (Contributor) commented on Dec 8, 2021:

@matthewdeng @amogkam

Sorry for pinging in this 5-month old PR.

A question: in SGD v2, is it still the default behavior that trainer instances use the SPREAD strategy? (I saw something strange and would like to double-check first.)

@matthewdeng (Contributor, Author) commented:
Hey @HuangLED, apologies for not sharing this earlier. For Ray Train, we've actually decided to go with the alternative of defaulting to PACK. At the moment we expose a TRAIN_ENABLE_WORKER_SPREAD environment variable, which can be set to 1 to use SPREAD. See #20091 for the explanation in the description, and the related code if you're interested.
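The default described here can be sketched as a simple environment-variable check. This is an assumption-laden sketch of the behavior as stated in the comment (PACK by default, SPREAD when TRAIN_ENABLE_WORKER_SPREAD=1); `resolve_strategy` is a hypothetical name, and the exact parsing in Ray Train may differ — see #20091 for the real code.

```python
import os


def resolve_strategy() -> str:
    """Pick the placement group strategy for the worker group.

    Sketch of the described Ray Train default: PACK, unless the
    TRAIN_ENABLE_WORKER_SPREAD environment variable is set to "1".
    """
    if os.environ.get("TRAIN_ENABLE_WORKER_SPREAD", "0") == "1":
        return "SPREAD"
    return "PACK"


print(resolve_strategy())
```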

@HuangLED (Contributor) commented on Dec 8, 2021:

Thank you! When you say "Ray Train", that refers to SGD v2, correct?

@matthewdeng (Contributor, Author) commented:
@HuangLED yep, we've rebranded Ray SGD v2 as Ray Train!

Some useful references:

Development

Successfully merging this pull request may close these issues.

[raysgd] Placement group support in RaySGD
4 participants