
[sgd] add placement group support #17037

Merged · 7 commits · Jul 20, 2021
Conversation

@matthewdeng (Contributor)

Why are these changes needed?

Placement groups support many use cases, including improved load balancing, fault tolerance, and data colocation.

This change adds a default placement group behavior for a RaySGD worker group: a single placement group is created with one bundle per worker, and the SPREAD strategy is used to distribute workers uniformly across the cluster.

In the future, we may expand the external API to allow the user to define their own placement group strategy for more specific use cases.

Note: In this implementation, if the workers are already inside a placement group, that placement group is reused. This supports the existing Tune-SGD integration, in which the Tune Trainable creates a placement group to colocate the workers for each Tune trial.
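The default behavior described above can be sketched roughly as follows. This is an illustrative sketch, not the actual RaySGD implementation: `build_worker_bundles` is a hypothetical helper name, and the real Ray calls (`ray.util.placement_group`, `ray.util.get_current_placement_group`) are shown only in comments so the snippet stays self-contained.

```python
from typing import Dict, List


def build_worker_bundles(
    num_workers: int,
    num_cpus_per_worker: int = 1,
    num_gpus_per_worker: int = 0,
) -> List[Dict[str, float]]:
    """Build one resource bundle per worker, as described in this PR."""
    bundle: Dict[str, float] = {"CPU": num_cpus_per_worker}
    if num_gpus_per_worker > 0:
        bundle["GPU"] = num_gpus_per_worker
    # One copy of the bundle for each worker in the group.
    return [dict(bundle) for _ in range(num_workers)]


bundles = build_worker_bundles(num_workers=4)
print(bundles)

# With Ray installed, the worker group would then create the group with
# the SPREAD strategy, e.g.:
#
#   from ray.util.placement_group import placement_group
#   pg = placement_group(bundles, strategy="SPREAD")
#   ray.get(pg.ready())
#
# And per the note above, an existing placement group (e.g. one created
# by a Tune Trainable) could be detected and reused via
# ray.util.get_current_placement_group() instead of creating a new one.
```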

Related issue number

Closes #16682

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam (Contributor) left a comment:
Looks good to me overall; just left a couple of comments. Also, don't forget to add your tests to the BUILD file.

Review threads on python/ray/util/sgd/torch/worker_group.py (outdated, resolved)
@amogkam (Contributor) left a comment:
LGTM, thanks! Can you ping me again once CI passes?

@amogkam amogkam merged commit fef74aa into ray-project:master Jul 20, 2021
@richardliaw richardliaw added this to the SGD v2 milestone Aug 3, 2021
@matthewdeng matthewdeng deleted the sgd/pg branch August 13, 2021 17:52
@HuangLED (Contributor) commented on Dec 8, 2021:

@matthewdeng @amogkam

Sorry for pinging in this 5-month old PR.

A question: in SGD v2, is it still the default behavior that trainer instances use the SPREAD strategy? (I saw something strange and would like to double-check first.)

@matthewdeng (Contributor, Author) commented:
Hey @HuangLED, apologies for not sharing this earlier. For Ray Train, we've actually decided to go with the alternative of defaulting to PACK. At the moment we expose a TRAIN_ENABLE_WORKER_SPREAD environment variable, which can be set to 1 to use SPREAD. See #20091 for the explanation in the description, and the related code if you're interested.
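The default described here can be sketched as a simple environment-variable check. This is an assumption-laden sketch of the behavior as stated in the comment (PACK by default, SPREAD when TRAIN_ENABLE_WORKER_SPREAD=1); `resolve_strategy` is a hypothetical name, and the exact parsing in Ray Train may differ — see #20091 for the real code.

```python
import os


def resolve_strategy() -> str:
    """Pick the placement group strategy for the worker group.

    Sketch of the described Ray Train default: PACK, unless the
    TRAIN_ENABLE_WORKER_SPREAD environment variable is set to "1".
    """
    if os.environ.get("TRAIN_ENABLE_WORKER_SPREAD", "0") == "1":
        return "SPREAD"
    return "PACK"


print(resolve_strategy())
```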

@HuangLED (Contributor) commented on Dec 8, 2021:

Thank you! When you say "Ray Train", that refers to SGD v2, correct?

@matthewdeng (Contributor, Author) commented:
@HuangLED yep, we've rebranded Ray SGD v2 as Ray Train!

Some useful references:

Development

Successfully merging this pull request may close these issues.

[raysgd] Placement group support in RaySGD
4 participants