[air/tune] Internal resource management 1 - Ray AIR resource manager implementation #30777
Conversation
Thanks for taking the time to explain to me all the additional context around this, I feel like I learned a lot 😄
In this review, I focused primarily on making sure the APIs/documentation for the existing Placement Group flow are really clear and representative of the resource management lifecycle.
Consequently, I still think the FixedResourceManager flow doesn't adhere to the APIs very well, but I'm okay with not overindexing on this in this PR since it is not being exposed yet.
Thanks for the review @matthewdeng. I've updated the docstrings throughout. ptal
# first PG bundle with num_cpus=0. The second object will also be scheduled in
# the first bundle, but will occupy the respective resources. Thus, we start
# counting from i = -1, and bundle = max(i, 0) which will be [0, 0, 1, 2, ...]
if self.resource_request.head_bundle_is_empty:
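For reference, a small self-contained sketch of the indexing described in that comment; the object names are illustrative only:

# Illustration of the bundle index mapping with an empty head bundle: the first
# remote object shares placement group bundle 0 with the second one.
objects = ["trainable", "worker_1", "worker_2", "worker_3"]
for i, obj in enumerate(objects, start=-1):  # start counting from i = -1
    bundle_index = max(i, 0)
    print(obj, "-> bundle", bundle_index)  # trainable -> 0, worker_1 -> 0, worker_2 -> 1, worker_3 -> 2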
Can I give a suggestion? Things like head_bundle_is_empty should not live on the request class. The ResourceRequest class will be purer if it doesn't care about how applications plan to use the resources. Instead, we can make it a parameter to this annotate_remote_objects() function:
def annotate_remote_objects(
    self, objects: List[RemoteRayObject], empty_head_bundle: bool = False,
) -> List[RemoteRayObject]:
    annotated = []

    def _annotate_obj(obj, bundle, bundle_idx):
        # Copy the bundle so the stored resource request is not mutated.
        bundle = dict(bundle)
        num_cpus = bundle.pop("CPU", 0)
        num_gpus = bundle.pop("GPU", 0)
        memory = bundle.pop("memory", 0.0)
        annotated.append(
            obj.options(
                scheduling_strategy=PlacementGroupSchedulingStrategy(
                    placement_group=self.placement_group,
                    placement_group_bundle_index=bundle_idx,
                    placement_group_capture_child_tasks=True,
                ),
                num_cpus=num_cpus,
                num_gpus=num_gpus,
                memory=memory,
                resources=bundle,
            )
        )

    first_obj = 0
    if empty_head_bundle:
        # Place the head object in the first bundle without reserving resources.
        _annotate_obj(objects[0], {}, 0)
        first_obj = 1

    for i, (obj, bundle) in enumerate(
        zip(objects[first_obj:], self.resource_request.bundles)
    ):
        _annotate_obj(obj, bundle, i)

    return annotated
Hm, _head_bundle_is_empty is automatically derived from the bundles in the __init__ of the resource request, i.e. it's not something the user passes. When a user passes bundles=[{}, {"CPU": 1}], the property is set to True. So this is just an internal representation, not an API.
I would not want to introduce this to the annotation API, as then the caller would have to analyze the bundles to decide what to pass to the API - this logic should remain in the acquired resource class, which has access to all this information.
I'd be open to leaving the empty bundle in self._bundles, but this would just be a refactor of the internal representation without any major effect. So for now I think it's best to keep it. (Note that this is also just copied over from PlacementGroupFactory; there is no new logic here.)
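For illustration, a minimal sketch of how that derivation could look in the request's constructor; this is a simplified stand-in, not the actual ResourceRequest implementation:

# Simplified sketch of deriving the flag in __init__ (not the real class).
class ResourceRequest:
    def __init__(self, bundles):
        # An empty first bundle means the head actor should not reserve resources.
        self._head_bundle_is_empty = bool(bundles) and not bundles[0]
        # Placement groups reject empty bundles, so the empty head bundle is
        # stripped from the internal bundle list and only the flag is kept.
        self._bundles = bundles[1:] if self._head_bundle_is_empty else bundles

    @property
    def head_bundle_is_empty(self):
        return self._head_bundle_is_empty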
Ah, sorry, I looked at ResourceRequest more closely and understand what you are trying to do now.
Is there a background story about why we give the head node, and the head node only, such special treatment?
I would imagine an otherwise more generic framework for remote actors that don't require resources, regardless of where they are in the list.
There is a background story!
This is mostly about distributed training. Let's say we have a Ray Train trainer that starts 2 distributed workers with 4 CPUs each, so 8 CPUs in total. Ever since we've been using Ray Tune as the execution backend, this requires us to start a remote "head" worker (comparable to the RLlib learner) that starts these 2 distributed workers. But because the head worker is also a remote actor, it requires resources.
So now we're in a situation where we would actually need 9 CPUs (1 for the trainable actor, 2*4 for the distributed workers). That's undesirable, as it means we need bigger clusters (especially when tuning). Also (unlike the RLlib trainer), the trainable worker is not doing much at all; it just does a bit of coordination between the distributed workers.
So the solution is to pass num_cpus=0 to this actor.
So now we need to somehow specify resources (a PlacementGroupFactory) that schedule the first actor with 0 CPUs. So what we want is [{"CPU": 0}, {"CPU": 1}, {"CPU": 1}].
At first, we wanted to make this work without any special casing. Placement groups, however, don't accept empty bundles. The first workaround was to pass [{"CPU": 0.01}, {"CPU": 0.99}, {"CPU": 1}] or so, which is obviously not great from a user perspective. The second workaround is what we're currently doing: we don't schedule the first "bundle" but just schedule the first actor onto the first bundle with empty resources.
I agree this is not ideal - e.g. what happens if we pass another empty bundle in one of the workers? At the moment this will just raise an error.
The current solution works well from a user perspective, though. So I'd say let's keep it for now until we feel the need to come up with something better.
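To make this concrete, a resource request for the trainer in the example above could look roughly like the following; this is a sketch, with bundle sizes taken from the 2x4 CPU example:

from ray.tune import PlacementGroupFactory

# The empty first bundle is the "head" bundle: the trainable actor is placed
# in the first real bundle but does not reserve any resources itself.
# The two distributed workers request 4 CPUs each, 8 CPUs in total.
trainer_resources = PlacementGroupFactory([{}, {"CPU": 4}, {"CPU": 4}])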
Thanks for the many iterations on this! A few more naming/documentation clarifications but otherwise looks good to me.
thanks for your patience, 1 minor suggestion, your call :)
…IR resource manager (#30016) Includes/depends on #30777

TLDR: This PR refactors Ray Tune's resource management to use a central AIR resource management package instead of the tightly coupled PlacementGroupManager.

Ray Tune's resource management currently uses a tightly coupled placement group manager. This leads to a number of shortcomings:
- The tight coupling on the manager side (e.g. the PG manager keeps track of trials) prevents re-usability
- The tight coupling on the trial executor side prevents using different resource management strategies (e.g. shared or budget-based)
- It's hard to test independently. Tests for the resource management require a simulated tune setup.

To improve stability, extensibility, and maintainability, this PR moves the resource management logic into a central `ray.air.execution.resources` subpackage. The resource management has a simple API that works with `ResourceRequest`s and `AllocatedResources` to manage requested and assigned resources, respectively. The actual resource manager can then be anything - by default it is a placement group based manager, but this PR also introduces a PoC budget-based manager that can be plugged in.

The PR does not substantially change existing tests, so we can be certain that the new resource model is a fully compatible replacement for the old placement group manager.
…implementation (#30777) Prerequisite to #30016

This PR adds a new Ray AIR resource manager to replace the PlacementGroupManager of Ray Tune. Details can be found in #30016. Specifically, this PR
- Adds the main resource manager abstractions
- Renames (and moves) PlacementGroupFactory to ResourceRequest
- Adds implementations and tests for a placement group based manager and a budget based manager
Why are these changes needed?
Prerequisite to #30016
This PR adds a new Ray AIR resource manager to replace the PlacementGroupManager of Ray Tune. Details can be found in #30016.
Specifically, this PR
- Adds the main resource manager abstractions
- Renames (and moves) PlacementGroupFactory to ResourceRequest
- Adds implementations and tests for a placement group based manager and a budget based manager
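A rough usage sketch of the new package as described above; the import path and method names here are assumptions for illustration, not a verified interface:

# Rough sketch of the request/acquire flow described in this PR. The class and
# method names below are assumptions based on the description above.
from ray.air.execution.resources import (
    PlacementGroupResourceManager,
    ResourceRequest,
)

# One bundle for the trainable actor plus two worker bundles.
request = ResourceRequest([{"CPU": 1}, {"CPU": 4}, {"CPU": 4}])

manager = PlacementGroupResourceManager()  # default, placement-group-based manager
manager.request_resources(request)         # assumed: register the request
if manager.has_resources_ready(request):   # assumed: poll until resources are available
    acquired = manager.acquire_resources(request)
    # ... schedule actors/tasks onto the acquired resources ...
    manager.free_resources(acquired)       # assumed: return the resources when done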
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.