[train] Fix `ScalingConfig(accelerator_type)` to request a small fraction of the accelerator label #44225

justinvyu · 2024-03-21T20:19:14Z

Why are these changes needed?

accelerator_type is currently implemented as a custom resource with a quantity of 1 if an instance has an accelerator of that type. For example, both a machine with 1 A10G GPU and a machine with 4 A10G GPUs will have {"accelerator_type:A10G": 1.0}. This label is just an indicator of whether the machine contains the accelerator, rather than a count of the number of accelerators of that type.

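This PR makes Ray Train's accelerator type resource request match Ray Core by setting it to a fractional value (0.001) instead of 1. Because the label has a quantity of 1 per node regardless of how many GPUs the node has, requesting the full quantity per worker does not reflect the actual GPU demand. The fractional request is needed to fix autoscaling behavior so that the correct number of GPUs is requested.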
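The effect of the label quantity can be illustrated with a toy model. The sketch below is not Ray's scheduler or autoscaler; `workers_that_fit` is a hypothetical helper, the node is made up, and the `accelerator_type:` prefix is inferred from the label shown above. It only counts how many identical workers fit on one node under each request size.

```python
# Toy model of resource-based placement; this is NOT Ray's scheduler.
# Prefix inferred from the {"accelerator_type:A10G": 1.0} label in the description.
RESOURCE_CONSTRAINT_PREFIX = "accelerator_type:"


def workers_that_fit(node_resources, per_worker):
    """Return how many workers with `per_worker` demands fit on one node."""
    counts = []
    for name, amount in per_worker.items():
        available = node_resources.get(name, 0.0)
        # Small epsilon guards against floating-point rounding in the division.
        counts.append(int(available / amount + 1e-9))
    return min(counts)


# A hypothetical node with 4 A10G GPUs: the label quantity is still 1.0.
node = {"GPU": 4.0, f"{RESOURCE_CONSTRAINT_PREFIX}A10G": 1.0}

# Request before this PR: each worker claims the whole label.
old_request = {"GPU": 1, f"{RESOURCE_CONSTRAINT_PREFIX}A10G": 1}
# Request after this PR: each worker claims a small fraction of the label.
new_request = {"GPU": 1, f"{RESOURCE_CONSTRAINT_PREFIX}A10G": 0.001}

old_fit = workers_that_fit(node, old_request)
new_fit = workers_that_fit(node, new_request)
print(old_fit, new_fit)
```

In this model the old request fits one worker on the 4-GPU node, because the label is exhausted by the first worker, while the new request is limited only by the GPU count.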

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Justin Yu <[email protected]>

matthewdeng · 2024-03-21T20:20:39Z

python/ray/air/config.py

@@ -206,7 +206,7 @@ def _resources_per_worker_not_none(self):

        if self.accelerator_type:
            accelerator = f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}"
-            resources_per_worker.setdefault(accelerator, 1)
+            resources_per_worker.setdefault(accelerator, 0.001)


Can we make this a constant (or use an existing one if it already exists)?

ray/python/ray/_private/utils.py

Lines 389 to 392 in 2747c80

    if accelerator_type is not None:
        resources[
            f"{ray_constants.RESOURCE_CONSTRAINT_PREFIX}{accelerator_type}"
        ] = 0.001

It seems the core team uses 0.001 directly here.

@jjyao cool if we extract this into a constant? Gives it some concrete meaning 🙂
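A minimal sketch of what extracting the value into a constant could look like. The constant name `ACCELERATOR_TYPE_REQUEST_AMOUNT` and the helper `with_accelerator_type` are hypothetical and do not come from the Ray codebase; the prefix value is inferred from the label shown in the PR description.

```python
# Illustrative only: the names below are assumptions, not Ray's actual constants.
RESOURCE_CONSTRAINT_PREFIX = "accelerator_type:"

# The label exists once per node, so each worker requests only a tiny
# fraction of it. Naming the value documents that intent in one place.
ACCELERATOR_TYPE_REQUEST_AMOUNT = 0.001


def with_accelerator_type(resources_per_worker, accelerator_type):
    """Return a copy of the per-worker resources with the accelerator label added."""
    resources = dict(resources_per_worker)
    if accelerator_type:
        key = f"{RESOURCE_CONSTRAINT_PREFIX}{accelerator_type}"
        # setdefault keeps any value the user already set for this label.
        resources.setdefault(key, ACCELERATOR_TYPE_REQUEST_AMOUNT)
    return resources


example = with_accelerator_type({"GPU": 1}, "A10G")
print(example)
```

Both the Train config and the core utility could then reference the same name instead of repeating the literal.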

woshiyyya

Thanks for the fix!

…tion of the accelerator label (ray-project#44225) Make Ray Train's accelerator type resource request match Ray Core by setting it to a fractional value (0.001). This is needed to fix autoscaling behavior to request the correct number of GPUs. Signed-off-by: Justin Yu <[email protected]>

justinvyu added 2 commits March 21, 2024 13:13

change to 0.001

3e96f90

Signed-off-by: Justin Yu <[email protected]>

fix tests

59907ea

Signed-off-by: Justin Yu <[email protected]>

justinvyu assigned matthewdeng and woshiyyya Mar 21, 2024

justinvyu requested review from matthewdeng and woshiyyya as code owners March 21, 2024 20:19

matthewdeng reviewed Mar 21, 2024

woshiyyya approved these changes Mar 21, 2024

justinvyu merged commit 5923cb9 into ray-project:master Mar 22, 2024
5 checks passed

justinvyu deleted the fix_accelerator_type_amt2 branch March 22, 2024 17:37

justinvyu mentioned this pull request Mar 22, 2024

[OA][template] Finetune stable-diffusion (dreambooth) anyscale/templates#148

Merged
