[train] Fix ScalingConfig(accelerator_type) to request a small fraction of the accelerator label #44225

Merged: 2 commits, Mar 22, 2024

justinvyu
Contributor

Why are these changes needed?

accelerator_type is currently implemented as a custom resource with a quantity of 1 if an instance has an accelerator of that type. For example, both a machine with 1 A10G GPU and a machine with 4 A10G GPUs will have {"accelerator_type:A10G": 1.0}. This label is just an indicator of whether the machine contains the accelerator, rather than a count of the number of accelerators of that type.

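This PR makes Ray Train's accelerator type resource request match Ray Core by setting it to a small fractional value (0.001) per worker. Because the label has a quantity of 1 regardless of how many GPUs the machine has, requesting a full 1 per worker means only one worker can fit on each node, even on a multi-GPU machine. The fractional request is needed so that autoscaling requests the correct number of GPUs.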
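The arithmetic behind this can be sketched with a toy capacity check. This is only an illustration and not Ray's scheduler or autoscaler: `workers_that_fit` is a hypothetical helper, and the node and request dictionaries are made up to match the A10G example above.

```python
from math import floor


def workers_that_fit(node_resources: dict, per_worker: dict) -> int:
    """Toy model: how many workers one node can host, limited by its scarcest resource."""
    # The small epsilon guards against float rounding when dividing by 0.001.
    return min(
        floor(node_resources.get(name, 0.0) / amount + 1e-9)
        for name, amount in per_worker.items()
    )


# A machine with 4 A10G GPUs still advertises the label with quantity 1.0.
node = {"GPU": 4.0, "accelerator_type:A10G": 1.0}

old_request = {"GPU": 1.0, "accelerator_type:A10G": 1.0}    # before this PR
new_request = {"GPU": 1.0, "accelerator_type:A10G": 0.001}  # after this PR

print(workers_that_fit(node, old_request))  # 1: the label runs out after one worker
print(workers_that_fit(node, new_request))  # 4: the GPU count is the limit again
```

In this model, the old request makes a 4-GPU node look full after a single worker, which is why a cluster would appear to need one node per worker.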

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <[email protected]>
@@ -206,7 +206,7 @@ def _resources_per_worker_not_none(self):

         if self.accelerator_type:
             accelerator = f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}"
-            resources_per_worker.setdefault(accelerator, 1)
+            resources_per_worker.setdefault(accelerator, 0.001)
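A minimal standalone sketch of what the changed line does, assuming the prefix is "accelerator_type:" (inferred from the resource name in the description). `with_accelerator_label` is a hypothetical helper for illustration, not a function in Ray. Because the code uses `setdefault`, the 0.001 default only applies when the caller has not already set a value for that key.

```python
# Stand-in for the constant in the diff; defined locally so the sketch runs without Ray.
RESOURCE_CONSTRAINT_PREFIX = "accelerator_type:"


def with_accelerator_label(resources_per_worker: dict, accelerator_type) -> dict:
    """Hypothetical helper that mirrors the patched lines on a copy of the input."""
    resources = dict(resources_per_worker)
    if accelerator_type:
        accelerator = f"{RESOURCE_CONSTRAINT_PREFIX}{accelerator_type}"
        # setdefault keeps a value the caller already provided for this key.
        resources.setdefault(accelerator, 0.001)
    return resources


print(with_accelerator_label({"GPU": 1}, "A10G"))
print(with_accelerator_label({"GPU": 1, "accelerator_type:A10G": 0.5}, "A10G"))
print(with_accelerator_label({"GPU": 1}, None))
```

The first call adds the label at 0.001, the second leaves the explicit 0.5 untouched, and the third adds nothing because no accelerator type was given.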
Contributor

Can we make this a constant (or use an existing one if it already exists)?

Member

if accelerator_type is not None:
    resources[
        f"{ray_constants.RESOURCE_CONSTRAINT_PREFIX}{accelerator_type}"
    ] = 0.001

It seems the core team uses 0.001 directly here.

Contributor

@jjyao cool if we extract this into a constant? Gives it some concrete meaning 🙂
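For illustration, the suggestion could look roughly like the sketch below. The constant name `ACCELERATOR_TYPE_LABEL_FRACTION` and the helper `accelerator_label_request` are hypothetical; the thread does not show whether a constant was actually added or what it would be called.

```python
# Hypothetical names throughout; this only illustrates the reviewer's suggestion.
RESOURCE_CONSTRAINT_PREFIX = "accelerator_type:"  # local stand-in, assumed value

# The label is a presence marker with quantity 1 per node, so each worker
# requests only a tiny slice of it and many workers can share one node.
ACCELERATOR_TYPE_LABEL_FRACTION = 0.001


def accelerator_label_request(accelerator_type: str) -> dict:
    """Build the per-worker request for the accelerator label."""
    key = f"{RESOURCE_CONSTRAINT_PREFIX}{accelerator_type}"
    return {key: ACCELERATOR_TYPE_LABEL_FRACTION}


print(accelerator_label_request("A10G"))
```

A shared name would let the Train and Core call sites reference one documented value instead of repeating the literal.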

Member

@woshiyyya woshiyyya left a comment

Thanks for the fix!

@justinvyu justinvyu merged commit 5923cb9 into ray-project:master Mar 22, 2024
5 checks passed
@justinvyu justinvyu deleted the fix_accelerator_type_amt2 branch March 22, 2024 17:37
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
…tion of the accelerator label (ray-project#44225)

Make Ray Train's accelerator type resource request match Ray Core by setting it to a fractional value (0.001). This is needed to fix autoscaling behavior to request the correct number of GPUs.

Signed-off-by: Justin Yu <[email protected]>
3 participants