
[tune] ResourceChangingScheduler dynamic resource allocation during tuning #16787

Merged
27 commits merged on Jul 14, 2021

Conversation

Yard1
Member

@Yard1 Yard1 commented Jun 30, 2021

Why are these changes needed?

This PR adds a scheduler and API needed for dynamic resource allocation in Tune.

Includes #16844

One idea for improvement is prioritisation of placement groups, which would need to be implemented on the Ray Core side. Regardless, the current setup seems to be working well.
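
For orientation, here is a minimal usage sketch of how the new scheduler is meant to be plugged into tune.run: it wraps an existing trial scheduler and takes a user-defined resources_allocation_function. This is illustrative only; my_trainable is a placeholder, the ASHAScheduler settings are arbitrary, and passing None for the allocation function is assumed to fall back to the bundled example function.

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.schedulers.resource_changing_scheduler import (
    ResourceChangingScheduler)

# Wrap a base scheduler; on each reported result the allocation function
# can decide to change the trial's resource request.
scheduler = ResourceChangingScheduler(
    base_scheduler=ASHAScheduler(metric="loss", mode="min"),
    resources_allocation_function=None,  # assumed to default to the example fn
)

tune.run(
    my_trainable,  # placeholder trainable that reports a "loss" metric
    scheduler=scheduler,
    num_samples=8,
    resources_per_trial={"cpu": 1},
)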

Known issues:

  • It is possible for the following situation to happen:
    • The user-defined resource allocation function allocates n resources to trial A and m resources to trial B, such that n+m > available_resources and n > m
    • The placement group manager sets the placement group with m resources as ready
    • The scheduler chooses trial A to run
    • Because trial A's placement group is not ready, and setting it as ready would require more resources than are available, an endless loop occurs
    • This is not trivial to mitigate, or even to detect. The best mitigation for now is to make sure that the user-defined resource allocation function never allows this situation to arise; the bundled example function does just that (a rough sketch illustrating the idea is given after this list). In the future, detection and mitigation should be built in. A warning has been added to the docs.
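
To illustrate the mitigation, below is a rough sketch of a user-defined allocation function that splits the cluster's CPUs evenly across live trials, so the sum of all requests can never exceed what is available. The signature and return type follow the API described in this PR, but the function name, import path, and helper logic here are assumptions for illustration, not the shipped example function (which also handles GPUs and base trial resources).

from ray.tune.utils.placement_groups import PlacementGroupFactory

def even_cpu_allocation(trial_runner, trial, result, scheduler):
    # Total CPUs the trial executor currently sees in the cluster.
    total_cpus = trial_runner.trial_executor._avail_resources.cpu
    # Trials that still need resources (finished trials are excluded).
    live_trials = [
        t for t in trial_runner.get_trials()
        if t.status in ("RUNNING", "PENDING", "PAUSED")
    ]
    if not live_trials or total_cpus <= 0:
        return None  # keep the trial's current resources
    cpus_per_trial = int(total_cpus) // len(live_trials)
    if cpus_per_trial < 1:
        return None  # not enough CPUs to redistribute; keep current allocation
    # Summed over all live trials this never exceeds total_cpus,
    # which avoids the deadlock described above.
    return PlacementGroupFactory([{"CPU": cpus_per_trial}])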

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Yard1
Member Author

Yard1 commented Jul 1, 2021

@richardliaw @krfricke Could you take a quick look over and let me know if I am on the right track? Thanks!

self, trial_runner: "trial_runner.TrialRunner", trial: Trial,
resources: Union[Dict, Callable, PlacementGroupFactory]):
print(f"setting trial {trial} resource to {resources}")
trial_runner.update_trial_resources(trial, resources)
Contributor

@richardliaw richardliaw Jul 1, 2021


Can we avoid doing this in the scheduler and instead rely on the event loop to update the resources?

Specifically, something like:

action = on_result(runner)
# pause trial and update resources
... 
# when cluster has available resources, they should start trial from checkpoint. 
trial = choose_trial_to_run(...)

Member Author


@richardliaw changed - let me know if it's better now

Contributor


This looks quite nice!

@Yard1
Member Author

Yard1 commented Jul 2, 2021

@richardliaw This is almost done implementation-wise. Needs docs & tests. Updated the description.

@Yard1 Yard1 changed the title from "[WIP][tune] Initial dynamic resources prototype" to "[tune] ResourceChangingScheduler dynamic resource allocation during tuning" on Jul 5, 2021
@Yard1 Yard1 marked this pull request as ready for review July 6, 2021 00:00
@Yard1
Member Author

Yard1 commented Jul 6, 2021

@richardliaw Docs have been added and code cleaned up. Ready for review! 🚀

@krfricke krfricke self-assigned this Jul 6, 2021
Comment on lines 139 to 142
total_available_cpus = (
trial_runner.trial_executor._avail_resources.cpu)
total_available_gpus = (
trial_runner.trial_executor._avail_resources.gpu)
Contributor


hmm, maybe we want to expose this a bit better

Contributor

@richardliaw richardliaw left a comment


I did a brief skim and it looks reasonable. Kai should be the one to approve/merge, and we can iterate on this once it's in master!

Contributor

@krfricke krfricke left a comment


Generally looks good, just a couple of minor questions and nits.

Contributor

@krfricke krfricke left a comment


Just some quick clarification needed, but this is almost ready to go.

@krfricke
Contributor

Looks good, ping when tests pass?

Contributor

@krfricke krfricke left a comment


Thanks!
