
[tune] ResourceChangingScheduler dynamic resource allocation during tuning #16787

Merged
27 commits merged on Jul 14, 2021

Conversation

Yard1
Member

@Yard1 Yard1 commented Jun 30, 2021

Why are these changes needed?

This PR adds a scheduler and API needed for dynamic resource allocation in Tune.

Includes #16844

One idea for improvement is prioritisation of placement groups, which would need to be implemented on the Ray Core side. Regardless, the current setup seems to be working well.
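
For orientation, here is a minimal usage sketch of how the new scheduler is meant to be plugged into tune.run: it wraps an existing trial scheduler and takes a user-defined resources_allocation_function. This is illustrative only; my_trainable is a placeholder, the ASHAScheduler settings are arbitrary, and passing None for the allocation function is assumed to fall back to the bundled example function.

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.schedulers.resource_changing_scheduler import (
    ResourceChangingScheduler)

# Wrap a base scheduler; on each reported result the allocation function
# can decide to change the trial's resource request.
scheduler = ResourceChangingScheduler(
    base_scheduler=ASHAScheduler(metric="loss", mode="min"),
    resources_allocation_function=None,  # assumed to default to the example fn
)

tune.run(
    my_trainable,  # placeholder trainable that reports a "loss" metric
    scheduler=scheduler,
    num_samples=8,
    resources_per_trial={"cpu": 1},
)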

Known issues:

  • It is possible for the following situation to happen:
    • The user-defined resource allocation function allocates n resources to trial A and m resources to trial B, such that n+m > available_resources and n > m
    • The placement group manager sets the placement group with m resources as ready
    • The scheduler chooses trial A to run
    • Because trial A's placement group is not ready, and setting it as ready would require more resources than are available, an endless loop occurs
    • This is not trivial to mitigate, or even to detect. The best mitigation for now is to make sure that the user-defined resource allocation function never allows this situation to arise; the bundled example function does just that (a rough sketch illustrating the idea is given after this list). In the future, detection and mitigation should be built in. A warning has been added to the docs.
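
To illustrate the mitigation, below is a rough sketch of a user-defined allocation function that splits the cluster's CPUs evenly across live trials, so the sum of all requests can never exceed what is available. The signature and return type follow the API described in this PR, but the function name, import path, and helper logic here are assumptions for illustration, not the shipped example function (which also handles GPUs and base trial resources).

from ray.tune.utils.placement_groups import PlacementGroupFactory

def even_cpu_allocation(trial_runner, trial, result, scheduler):
    # Total CPUs the trial executor currently sees in the cluster.
    total_cpus = trial_runner.trial_executor._avail_resources.cpu
    # Trials that still need resources (finished trials are excluded).
    live_trials = [
        t for t in trial_runner.get_trials()
        if t.status in ("RUNNING", "PENDING", "PAUSED")
    ]
    if not live_trials or total_cpus <= 0:
        return None  # keep the trial's current resources
    cpus_per_trial = int(total_cpus) // len(live_trials)
    if cpus_per_trial < 1:
        return None  # not enough CPUs to redistribute; keep current allocation
    # Summed over all live trials this never exceeds total_cpus,
    # which avoids the deadlock described above.
    return PlacementGroupFactory([{"CPU": cpus_per_trial}])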

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Yard1
Member Author

Yard1 commented Jul 1, 2021

@richardliaw @krfricke Could you take a quick look over and let me know if I am on the right track? Thanks!

self, trial_runner: "trial_runner.TrialRunner", trial: Trial,
resources: Union[Dict, Callable, PlacementGroupFactory]):
print(f"setting trial {trial} resource to {resources}")
trial_runner.update_trial_resources(trial, resources)
Contributor

@richardliaw richardliaw Jul 1, 2021


Can we avoid doing this in the scheduler and instead rely on the event loop to update the resources?

Specifically, something like:

action = on_result(runner)
# pause trial and update resources
... 
# when cluster has available resources, they should start trial from checkpoint. 
trial = choose_trial_to_run(...)

Member Author


@richardliaw changed - let me know if it's better now

Contributor


This looks quite nice!

@Yard1
Member Author

Yard1 commented Jul 2, 2021

@richardliaw This is almost done implementation-wise. Needs docs & tests. Updated the description.

@Yard1 Yard1 changed the title from "[WIP][tune] Initial dynamic resources prototype" to "[tune] ResourceChangingScheduler dynamic resource allocation during tuning" on Jul 5, 2021
@Yard1 Yard1 marked this pull request as ready for review July 6, 2021 00:00
@Yard1
Member Author

Yard1 commented Jul 6, 2021

@richardliaw Docs have been added and code cleaned up. Ready for review! 🚀

@krfricke krfricke self-assigned this Jul 6, 2021
Comment on lines 139 to 142
total_available_cpus = (
trial_runner.trial_executor._avail_resources.cpu)
total_available_gpus = (
trial_runner.trial_executor._avail_resources.gpu)
Contributor


hmm, maybe we want to expose this a bit better

Contributor

@richardliaw richardliaw left a comment


I did a brief skim and it looks reasonable. Kai should be the one to approve/merge, and we can iterate on this once it's in master!

Contributor

@krfricke krfricke left a comment


Generally looks good, just a couple of minor questions and nits.

Contributor

@krfricke krfricke left a comment


Just some quick clarification needed, but this is almost ready to go.

@krfricke
Contributor

Looks good, ping when tests pass?

Contributor

@krfricke krfricke left a comment


Thanks!
