[sgd] add placement group support #17037
looks good to me overall, just left a couple comments. Also don't forget to add your tests to the BUILD file
LGTM, thanks! Can you ping again once CI passes?
Sorry for pinging on this 5-month-old PR. I have a question: in SGD v2, is it still the default behavior that trainer instances use the SPREAD strategy? (I saw something strange and would like to double-check first.)
Hey @HuangLED, apologies for not sharing this knowledge with you earlier. For Ray Train, we've actually decided to go with the alternative of defaulting to
Thank you! By "Ray Train", you mean SGD v2, correct?
@HuangLED yep, we've rebranded Ray SGD v2 as Ray Train! Some useful references:
Why are these changes needed?
Placement groups can be used for many use-cases including improved load balancing, fault tolerance, and data colocation.
This change adds a default placement group behavior for a RaySGD worker group in which a single placement group is created with a bundle for each worker, and the SPREAD strategy is used for uniform distribution.
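The default described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `build_worker_bundles` and `create_spread_placement_group` and their parameters are hypothetical, while `placement_group(bundles, strategy="SPREAD")` and `pg.ready()` are the public Ray placement group API.

```python
def build_worker_bundles(num_workers, num_cpus_per_worker=1, num_gpus_per_worker=0):
    """Return one resource bundle per worker (hypothetical helper)."""
    bundle = {"CPU": num_cpus_per_worker}
    if num_gpus_per_worker > 0:
        bundle["GPU"] = num_gpus_per_worker
    # Copy the dict so each worker gets its own bundle object.
    return [dict(bundle) for _ in range(num_workers)]


def create_spread_placement_group(num_workers, num_cpus_per_worker=1,
                                  num_gpus_per_worker=0):
    """Create a single SPREAD placement group with one bundle per worker.

    Hypothetical helper; requires a running Ray cluster (ray.init()).
    """
    # Imported lazily so the bundle helper above works without Ray installed.
    import ray
    from ray.util.placement_group import placement_group

    bundles = build_worker_bundles(
        num_workers, num_cpus_per_worker, num_gpus_per_worker)
    # SPREAD asks Ray to place bundles on distinct nodes on a best-effort basis.
    pg = placement_group(bundles, strategy="SPREAD")
    # Block until all bundles have been reserved.
    ray.get(pg.ready())
    return pg
```

With SPREAD, bundles that cannot be placed on separate nodes may still share a node, so the distribution is uniform only when the cluster has enough nodes.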
In the future, we may expand the external API to allow the user to define their own placement group strategy for more specific use-cases.
Note: In this implementation, if the worker is already in a placement group, that placement group is reused instead of creating a new one. This is done for the existing Tune-SGD integration, in which the Tune Trainable creates a placement group to colocate workers for the Tune trials.
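The reuse rule in the note could look something like the sketch below. The function `get_or_create_placement_group` and its arguments are hypothetical names used for illustration, and this is not necessarily how the PR implements the check; `ray.util.get_current_placement_group()` is the Ray call that returns the placement group of the current task or actor, or `None` if there is none.

```python
def get_or_create_placement_group(create_new, get_current=None):
    """Reuse the current placement group if there is one, else create one.

    Hypothetical helper. `create_new` is a zero-argument callable that builds
    a new placement group. `get_current` defaults to Ray's own lookup and is
    injectable so the logic can be exercised without a cluster.

    Returns a (placement_group, created) tuple.
    """
    if get_current is None:
        # Requires Ray to be installed and the process connected to a cluster.
        from ray.util import get_current_placement_group
        get_current = get_current_placement_group

    current = get_current()
    if current is not None:
        # Already inside a placement group (e.g. one created by a Tune
        # Trainable): reuse it and do not take ownership of it.
        return current, False
    # No enclosing placement group: create one and report ownership.
    return create_new(), True
```

Returning the `created` flag lets the caller remove only the placement groups it created itself, leaving one owned by Tune untouched.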
Related issue number
Closes #16682
Checks
I've run scripts/format.sh to lint the changes in this PR.