[tune] ResourceChangingScheduler - dynamic resource allocation during tuning #16787
@richardliaw @krfricke Could you take a quick look and let me know if I am on the right track? Thanks!
        self, trial_runner: "trial_runner.TrialRunner", trial: Trial,
        resources: Union[Dict, Callable, PlacementGroupFactory]):
    print(f"setting trial {trial} resource to {resources}")
    trial_runner.update_trial_resources(trial, resources)
Can we avoid doing this in the scheduler and instead rely on the event loop to update the resources?
Specifically, something like:
action = on_result(runner)
# pause trial and update resources
...
# when the cluster has available resources, the trial should restart from its checkpoint.
trial = choose_trial_to_run(...)
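A minimal sketch of the flow suggested above, using toy stand-ins rather than Tune's real classes. `ToyTrial`, `ToyScheduler`, `event_loop_step`, and the `pending` mapping are all hypothetical names made up for illustration; the point is only that the scheduler records the requested resources and returns an action, while the loop does the pausing, updating, and restarting.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

PAUSE, CONTINUE = "PAUSE", "CONTINUE"


@dataclass
class ToyTrial:
    name: str
    resources: Dict[str, int]
    status: str = "RUNNING"
    iteration: int = 0
    checkpoint: Optional[int] = None


@dataclass
class ToyScheduler:
    # Resources the scheduler wants each trial to have on its next start.
    pending: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def on_result(self, trial: ToyTrial, free_cpus: int) -> str:
        # Only record the request and return an action; do not touch the trial.
        if free_cpus > 0:
            new_cpus = trial.resources["cpu"] + free_cpus
            self.pending[trial.name] = {"cpu": new_cpus}
            return PAUSE
        return CONTINUE

    def choose_trial_to_run(self, trials: List[ToyTrial]) -> Optional[ToyTrial]:
        return next((t for t in trials if t.status == "PAUSED"), None)


def event_loop_step(scheduler: ToyScheduler, trials: List[ToyTrial],
                    trial: ToyTrial, total_cpus: int) -> str:
    used = sum(t.resources["cpu"] for t in trials if t.status == "RUNNING")
    trial.iteration += 1
    action = scheduler.on_result(trial, total_cpus - used)
    if action == PAUSE:
        # Pause the trial and update its resources (the loop's job).
        trial.checkpoint = trial.iteration
        trial.status = "PAUSED"
        trial.resources = scheduler.pending.pop(trial.name)
    # When the cluster has room, restart a paused trial from its checkpoint.
    nxt = scheduler.choose_trial_to_run(trials)
    if nxt is not None:
        running = sum(t.resources["cpu"] for t in trials if t.status == "RUNNING")
        if running + nxt.resources["cpu"] <= total_cpus:
            nxt.iteration = nxt.checkpoint
            nxt.status = "RUNNING"
    return action
```

With a single 1-CPU trial on a 4-CPU cluster, the first step pauses the trial, bumps it to 4 CPUs, and restarts it from its checkpoint; the next step finds no free CPUs and continues.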
@richardliaw changed - let me know if it's better now
This looks quite nice!
Co-authored-by: Kai Fricke <[email protected]>
@richardliaw This is almost done implementation-wise. Needs docs & tests. Updated the description.
@richardliaw Docs have been added and code cleaned up. Ready for review! 🚀
total_available_cpus = (
    trial_runner.trial_executor._avail_resources.cpu)
total_available_gpus = (
    trial_runner.trial_executor._avail_resources.gpu)
hmm, maybe we want to expose this a bit better
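One way to read "expose this a bit better" is a small public accessor, so callers do not reach into the private `_avail_resources` attribute. The sketch below uses toy classes only; `ToyResources`, `ToyExecutor`, and `get_available_resources` are hypothetical names, not part of Tune's API as shown in this PR.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyResources:
    cpu: float = 0.0
    gpu: float = 0.0


class ToyExecutor:
    """Stand-in for a trial executor that tracks free cluster resources."""

    def __init__(self, cpu: float, gpu: float) -> None:
        self._avail_resources = ToyResources(cpu=cpu, gpu=gpu)

    def get_available_resources(self) -> ToyResources:
        # Hypothetical public accessor: returns an immutable snapshot so
        # callers never depend on the private attribute directly.
        return self._avail_resources


def total_available(executor: ToyExecutor) -> Tuple[float, float]:
    avail = executor.get_available_resources()
    return avail.cpu, avail.gpu
```

The allocation code would then call `total_available(...)` (or the accessor itself) instead of reading two private fields.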
I did a brief skim and it looks reasonable. Kai should be the one to approve/merge, and we can iterate on this once it's in master!
Generally looks good, just a couple of minor questions and nits
Just some quick clarification needed, but almost ready to go
Looks good, ping when tests pass?
Thanks!
Why are these changes needed?
This PR adds a scheduler and API needed for dynamic resource allocation in Tune.
Includes #16844
One idea for improvement is prioritisation of placement groups. This would need to be implemented on the core's side. Regardless, the current setup seems to be working well.
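To make "dynamic resource allocation" concrete, here is a rough, library-free sketch of one possible allocation policy: give each live trial a base number of CPUs and split whatever is left evenly. This is an illustrative assumption, not the allocation function shipped in this PR; `distribute_cpus_evenly` and its parameters are hypothetical.

```python
from typing import Dict, List


def distribute_cpus_evenly(
    trial_names: List[str], total_cpus: int, base_cpus: int = 1
) -> Dict[str, int]:
    """Give every live trial base_cpus, then split the remainder evenly."""
    if not trial_names:
        return {}
    if base_cpus * len(trial_names) > total_cpus:
        raise ValueError("not enough CPUs for the base allocation")
    spare = total_cpus - base_cpus * len(trial_names)
    share, leftover = divmod(spare, len(trial_names))
    # Earlier trials absorb the indivisible leftover, one CPU each.
    return {
        name: base_cpus + share + (1 if i < leftover else 0)
        for i, name in enumerate(trial_names)
    }
```

Re-running the function whenever a trial finishes hands the freed CPUs to the survivors: three trials on 8 CPUs get 3, 3, and 2; once one finishes, the remaining two get 4 each.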
Known issues:
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.