
[air/tune] Internal resource management 1 - Ray AIR resource manager implementation #30777

Merged
merged 66 commits into ray-project:master from tune/refactor-pg-manager-base
Dec 20, 2022

Conversation

krfricke
Contributor

@krfricke krfricke commented Nov 30, 2022

Why are these changes needed?

Prerequisite to #30016

This PR adds a new Ray AIR resource manager to replace the PlacementGroupManager of Ray Tune. Details can be found in #30016.

Specifically, this PR

  • Adds the main resource manager abstractions
  • Renames (and moves) PlacementGroupFactory to ResourceRequest
  • Adds implementations and tests for a placement group based manager and a budget based manager
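
As a rough illustration of the request/acquire/free lifecycle that the abstractions listed above are built around, here is a minimal, self-contained sketch. All names below (ToyResourceRequest, ToyBudgetResourceManager, request_resources, etc.) are hypothetical stand-ins for illustration only and are not the actual classes or methods added in this PR.

```python
# Illustrative sketch only - names are hypothetical, not the API added in this PR.
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToyResourceRequest:
    """A request for a list of resource bundles, e.g. ({"CPU": 0}, {"CPU": 4})."""
    bundles: tuple

    def total(self) -> Dict[str, float]:
        # Sum up resources across all bundles.
        out: Dict[str, float] = {}
        for bundle in self.bundles:
            for key, amount in bundle.items():
                out[key] = out.get(key, 0.0) + amount
        return out


class ToyBudgetResourceManager:
    """Budget-based manager: grants requests as long as a fixed budget allows."""

    def __init__(self, budget: Dict[str, float]):
        self._budget = dict(budget)
        self._pending: List[ToyResourceRequest] = []

    def request_resources(self, request: ToyResourceRequest) -> None:
        self._pending.append(request)

    def has_resources_ready(self, request: ToyResourceRequest) -> bool:
        return all(
            self._budget.get(k, 0.0) >= v for k, v in request.total().items()
        )

    def acquire_resources(self, request: ToyResourceRequest) -> ToyResourceRequest:
        assert self.has_resources_ready(request)
        self._pending.remove(request)
        for k, v in request.total().items():
            self._budget[k] -= v
        return request

    def free_resources(self, acquired: ToyResourceRequest) -> None:
        for k, v in acquired.total().items():
            self._budget[k] = self._budget.get(k, 0.0) + v


manager = ToyBudgetResourceManager(budget={"CPU": 8})
req = ToyResourceRequest(bundles=({"CPU": 0}, {"CPU": 4}, {"CPU": 4}))
manager.request_resources(req)
if manager.has_resources_ready(req):
    acquired = manager.acquire_resources(req)
    # ... run the trial / training workers against the acquired bundles ...
    manager.free_resources(acquired)
```

The same lifecycle would be driven by a placement group based manager in the default case; only the backing mechanism differs.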

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 30 commits October 3, 2022 10:23
commit 7175d9e68136db1cab1c3e1565fae0a02b9eff10
Author: Kai Fricke <[email protected]>
Date:   Mon Oct 3 10:13:19 2022 -0700

    Testing

    Signed-off-by: Kai Fricke <[email protected]>

commit 2dc3bdee204c7d950b0f6c2603cb03f418e207eb
Author: Kai Fricke <[email protected]>
Date:   Mon Oct 3 08:35:51 2022 -0700

    Revert tune.py changes

    Signed-off-by: Kai Fricke <[email protected]>

commit 4498111
Merge: eb6ede0 7167f9c
Author: Kai Fricke <[email protected]>
Date:   Mon Oct 3 08:34:14 2022 -0700

    Merge branch 'master' of https://github.com/ray-project/ray into air/execution/prototype

commit eb6ede0
Author: Kai Fricke <[email protected]>
Date:   Mon Oct 3 08:33:32 2022 -0700

    Move to experimental subpackage

    Signed-off-by: Kai Fricke <[email protected]>

commit 8c81be4
Author: Kai Fricke <[email protected]>
Date:   Sun Oct 2 17:45:04 2022 +0100

    Recover trial

    Signed-off-by: Kai Fricke <[email protected]>

commit 5289a87
Author: Kai Fricke <[email protected]>
Date:   Sun Oct 2 17:07:07 2022 +0100

    Fix tests, add step tests

    Signed-off-by: Kai Fricke <[email protected]>

commit 50abcb2
Author: Kai Fricke <[email protected]>
Date:   Sun Oct 2 16:39:31 2022 +0100

    tune.run

    Signed-off-by: Kai Fricke <[email protected]>

commit acdea1a
Author: Kai Fricke <[email protected]>
Date:   Sun Oct 2 16:08:27 2022 +0100

    Legacy interface cont

    Signed-off-by: Kai Fricke <[email protected]>

commit b2080df
Author: Kai Fricke <[email protected]>
Date:   Fri Sep 30 18:00:08 2022 +0100

    Update resources, legacy wrapper

    Signed-off-by: Kai Fricke <[email protected]>

commit 2768b58
Author: Kai Fricke <[email protected]>
Date:   Fri Sep 30 16:44:26 2022 +0100

    test numerical error

    Signed-off-by: Kai Fricke <[email protected]>

commit 1e177c8
Author: Kai Fricke <[email protected]>
Date:   Thu Sep 29 18:50:12 2022 +0100

    Fix HEBO test

    Signed-off-by: Kai Fricke <[email protected]>

commit 5f6e75e
Author: Kai Fricke <[email protected]>
Date:   Thu Sep 29 15:23:41 2022 +0100

    cont

    Signed-off-by: Kai Fricke <[email protected]>

commit 7715595
Author: Kai Fricke <[email protected]>
Date:   Thu Sep 29 14:53:26 2022 +0100

    Max pending trials

    Signed-off-by: Kai Fricke <[email protected]>

commit 5638109
Author: Kai Fricke <[email protected]>
Date:   Thu Sep 29 14:08:13 2022 +0100

    Overstep

    Signed-off-by: Kai Fricke <[email protected]>

commit 9ee6455
Author: Kai Fricke <[email protected]>
Date:   Thu Sep 29 14:07:01 2022 +0100

    Tune controller result handling

    Signed-off-by: Kai Fricke <[email protected]>

commit fcb8557
Author: Kai Fricke <[email protected]>
Date:   Thu Sep 29 13:03:15 2022 +0100

    Tune controller resources management tests

    Signed-off-by: Kai Fricke <[email protected]>

commit ae757fe
Author: Kai Fricke <[email protected]>
Date:   Thu Sep 29 12:12:48 2022 +0100

    Add docs to unit tests

    Signed-off-by: Kai Fricke <[email protected]>

commit 10d1760
Author: Kai Fricke <[email protected]>
Date:   Thu Sep 29 11:46:05 2022 +0100

    Delete execution_pull

    Signed-off-by: Kai Fricke <[email protected]>

commit 99f7ddd
Merge: ccf8b02 f5143a9
Author: Kai Fricke <[email protected]>
Date:   Thu Sep 29 11:45:49 2022 +0100

    Merge branch 'master' into air/execution/prototype

commit ccf8b02
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 27 16:07:31 2022 +0100

    PG manager tests

    Signed-off-by: Kai Fricke <[email protected]>

commit 2e74eaa
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 27 15:32:09 2022 +0100

    PG manager

    Signed-off-by: Kai Fricke <[email protected]>

commit c4eee47
Merge: 01020dd d456aa2
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 27 11:43:59 2022 +0100

    Merge remote-tracking branch 'upstream/master' into air/execution/prototype

commit 01020dd
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 27 08:55:50 2022 +0100

    PG WIP

    Signed-off-by: Kai Fricke <[email protected]>

commit b1f3cd2
Author: Kai Fricke <[email protected]>
Date:   Mon Sep 26 15:07:45 2022 +0100

    Add fixed resource manager test

    Signed-off-by: Kai Fricke <[email protected]>

commit 2a5d089
Merge: ddbb7ef 9f1ea30
Author: Kai Fricke <[email protected]>
Date:   Mon Sep 26 14:55:38 2022 +0100

    Merge remote-tracking branch 'upstream/master' into air/execution/prototype

commit ddbb7ef
Merge: b685a44 6530635
Author: Kai Fricke <[email protected]>
Date:   Thu Sep 22 17:07:02 2022 +0100

    Merge branch 'master' into air/execution/prototype

commit b685a44
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 20 13:04:32 2022 +0100

    K fold CV

    Signed-off-by: Kai Fricke <[email protected]>

commit 7fbbeed
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 20 10:55:49 2022 +0100

    Resources

    Signed-off-by: Kai Fricke <[email protected]>

commit 0173489
Merge: 66fe655 15883cd
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 20 10:46:25 2022 +0100

    Merge remote-tracking branch 'upstream/master' into air/execution/prototype

commit 66fe655
Merge: 8359ebd 805d5dc
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 20 09:54:19 2022 +0100

    Merge remote-tracking branch 'upstream/master' into air/execution/prototype

commit 8359ebd
Merge: 1e9b3f9 7280ef4
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 20 09:54:13 2022 +0100

    Merge branch 'master' into air/execution/prototype

commit 1e9b3f9
Author: Kai Fricke <[email protected]>
Date:   Fri Sep 9 19:11:50 2022 +0100

    num samples, benchmark

    Signed-off-by: Kai Fricke <[email protected]>

commit 11331eb
Author: Kai Fricke <[email protected]>
Date:   Fri Sep 9 16:49:29 2022 +0100

    uuid

    Signed-off-by: Kai Fricke <[email protected]>

commit bdbff3b
Author: Kai Fricke <[email protected]>
Date:   Fri Sep 9 16:38:06 2022 +0100

    Simple split controller

    Signed-off-by: Kai Fricke <[email protected]>

commit a0b7815
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 6 14:36:22 2022 +0100

    Train example working with sync futures

    Signed-off-by: Kai Fricke <[email protected]>

commit aedd0dc
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 6 12:53:59 2022 +0100

    Make pausing work

    Signed-off-by: Kai Fricke <[email protected]>

commit 8155508
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 6 12:47:18 2022 +0100

    Fix execution for tune

    Signed-off-by: Kai Fricke <[email protected]>

commit 860ffe5
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 6 12:24:48 2022 +0100

    Push-based package

    Signed-off-by: Kai Fricke <[email protected]>

commit dac1637
Author: Kai Fricke <[email protected]>
Date:   Tue Sep 6 11:11:08 2022 +0100

    Separate pull-based package

    Signed-off-by: Kai Fricke <[email protected]>

commit 15ddea1
Author: Kai Fricke <[email protected]>
Date:   Fri Sep 2 12:30:37 2022 +0100

    tune checkpointing works

    Signed-off-by: Kai Fricke <[email protected]>

commit 1358832
Merge: b7c46d8 8692eb6
Author: Kai Fricke <[email protected]>
Date:   Fri Sep 2 11:37:56 2022 +0100

    Merge branch 'master' into air/execution/prototype

commit b7c46d8
Merge: 859cb49 ed29291
Author: Kai Fricke <[email protected]>
Date:   Wed Aug 31 14:36:18 2022 -0700

    Merge remote-tracking branch 'upstream/master' into air/execution/prototype

commit 859cb49
Author: Kai Fricke <[email protected]>
Date:   Wed Aug 31 14:19:21 2022 -0700

    train sequentially working

    Signed-off-by: Kai Fricke <[email protected]>

commit 7498377
Author: Kai Fricke <[email protected]>
Date:   Wed Aug 31 11:37:37 2022 -0700

    train wip

    Signed-off-by: Kai Fricke <[email protected]>

commit f95b103
Author: Kai Fricke <[email protected]>
Date:   Wed Aug 31 10:50:13 2022 -0700

    Update API

    Signed-off-by: Kai Fricke <[email protected]>

commit 0effb2a
Author: Kai Fricke <[email protected]>
Date:   Wed Aug 31 08:43:58 2022 -0700

    typed futures

    Signed-off-by: Kai Fricke <[email protected]>

commit 68e90dc
Author: Kai Fricke <[email protected]>
Date:   Wed Aug 31 07:53:06 2022 -0700

    tune update

    Signed-off-by: Kai Fricke <[email protected]>

commit 515228e
Author: Kai Fricke <[email protected]>
Date:   Wed Aug 31 06:37:39 2022 -0700

    tune update

    Signed-off-by: Kai Fricke <[email protected]>

commit ffef078
Author: Kai Fricke <[email protected]>
Date:   Tue Aug 30 17:24:32 2022 -0700

    tune.run with progress report

    Signed-off-by: Kai Fricke <[email protected]>

commit 5d7ace0
Author: Kai Fricke <[email protected]>
Date:   Tue Aug 30 17:17:34 2022 -0700

    tune example running

    Signed-off-by: Kai Fricke <[email protected]>

commit 3f44ac3
Author: Kai Fricke <[email protected]>
Date:   Tue Aug 30 15:48:38 2022 -0700

    resource manager, tune example

    Signed-off-by: Kai Fricke <[email protected]>

commit 5160de5
Author: Kai Fricke <[email protected]>
Date:   Tue Aug 30 14:20:30 2022 -0700

    tune controller + result

    Signed-off-by: Kai Fricke <[email protected]>

commit 3bd2184
Author: Kai Fricke <[email protected]>
Date:   Tue Aug 30 12:51:12 2022 -0700

    initial structure

    Signed-off-by: Kai Fricke <[email protected]>

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
# Conflicts:
#	python/ray/tune/tests/test_sample.py
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Contributor

@matthewdeng matthewdeng left a comment

Thanks for taking the time to explain to me all the additional context around this, I feel like I learned a lot 😄

In this review, I focused primarily on making sure the APIs/documentation for the existing Placement Group flow are really clear and representative of the resource management lifecycle.

Consequently, I still think the FixedResourceManager flow doesn't adhere to the APIs very well, but I'm okay with not overindexing on this in this PR since it is not being exposed yet.

python/ray/air/execution/resources/request.py
python/ray/air/execution/resources/request.py
python/ray/air/execution/resources/request.py
python/ray/air/execution/resources/resource_manager.py
python/ray/air/execution/resources/resource_manager.py
python/ray/air/execution/resources/resource_manager.py
python/ray/air/execution/resources/placement_group.py
python/ray/air/execution/resources/fixed.py
python/ray/air/execution/resources/fixed.py
Signed-off-by: Kai Fricke <[email protected]>
@krfricke
Contributor Author

Thanks for the review @matthewdeng. I've updated the docstrings throughout. ptal

python/ray/air/execution/resources/fixed.py
# first PG bundle with num_cpus=0. The second object will also be scheduled in
# the first bundle, but will occupy the respective resources. Thus, we start
# counting from i = -1, and bundle = max(i, 0) which will be [0, 0, 1, 2, ...]
if self.resource_request.head_bundle_is_empty:
Member

Can I give a suggestion? Things like head_bundle_is_empty should not live on the request class.
The ResourceRequest class will be purer if it doesn't care about how applications plan to use the resources.
Instead, we can make it a parameter to this annotate_remote_objects() function:

def annotate_remote_objects(
    self, objects: List[RemoteRayObject], empty_head_bundle: bool = False,
) -> List[RemoteRayObject]:
    annotated = []

    def _annotate_obj(obj, bundle, bundle_idx):
        # Copy the bundle so we don't mutate the resource request's bundles.
        bundle = dict(bundle)
        num_cpus = bundle.pop("CPU", 0)
        num_gpus = bundle.pop("GPU", 0)
        memory = bundle.pop("memory", 0.0)

        annotated.append(
            obj.options(
                scheduling_strategy=PlacementGroupSchedulingStrategy(
                    placement_group=self.placement_group,
                    placement_group_bundle_index=bundle_idx,
                    placement_group_capture_child_tasks=True,
                ),
                num_cpus=num_cpus,
                num_gpus=num_gpus,
                memory=memory,
                resources=bundle,
            )
        )

    first_obj = 0
    if empty_head_bundle:
        # Schedule the first object onto the first bundle without
        # occupying any of its resources.
        _annotate_obj(objects[0], {}, 0)
        first_obj = 1

    for i, (obj, bundle) in enumerate(
        zip(objects[first_obj:], self.resource_request.bundles)
    ):
        _annotate_obj(obj, bundle, i)

    return annotated

Contributor Author

Hm, _head_bundle_is_empty is automatically derived from the bundles in the __init__ of the resource request, i.e. it's not something the user passes. When a user passes bundles=[{}, {"CPU": 1}], the property is set to True. So this is just an internal representation, not an API.

I would not want to introduce this to the annotation API, because then the caller would have to analyze the bundles to decide what to pass to the API - this logic should remain in the acquired resource class, which has access to all this information.

I'd be open to leaving the empty bundle in self._bundles, but this would just be a refactor of the internal representation without any major effect. So for now I think it's best to keep it as is. (Note that this is also just copied over from PlacementGroupFactory; there is no new logic here.)
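
To make the derivation concrete, here is a tiny sketch of roughly what that __init__ behavior looks like. SketchResourceRequest and its attributes are illustrative stand-ins, not the actual ResourceRequest code:

```python
# Sketch of the derivation described above; not the actual ResourceRequest code.
class SketchResourceRequest:
    def __init__(self, bundles):
        # An empty first bundle is stripped and only remembered via a flag,
        # because placement groups do not accept empty bundles.
        self._head_bundle_is_empty = bool(bundles) and not bundles[0]
        if self._head_bundle_is_empty:
            bundles = bundles[1:]
        self._bundles = bundles

    @property
    def head_bundle_is_empty(self) -> bool:
        return self._head_bundle_is_empty


# bundles=[{}, {"CPU": 1}] -> head_bundle_is_empty is True, one real bundle remains.
req = SketchResourceRequest([{}, {"CPU": 1}])
assert req.head_bundle_is_empty
assert req._bundles == [{"CPU": 1}]
```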

Member

Ah, sorry, I looked at ResourceRequest more closely and understand what you are trying to do now.
Is there a background story about why we give the head bundle, and the head bundle only, such special treatment?
I would imagine an otherwise more generic framework that supports remote actors that don't require resources, regardless of where they are in the list.

Contributor Author

There is a background story!

This is mostly about distributed training. Let's say we have a Ray Train trainer that starts 2 distributed workers with 4 CPUs each, so 8 CPUs in total. Ever since we've been using Ray Tune as the execution backend, this requires us to start a remote "head" worker (comparable to the RLlib learner) that starts these 2 distributed workers. But because the head worker is also a remote actor, it requires resources.

So now we're in a situation where we would actually need 9 CPUs (1 for the trainable actor, 2*4 for the distributed workers). That's undesirable, as it means we need bigger clusters (especially when tuning). Also, unlike the RLlib trainer, the trainable actor is not doing much at all; it just does a bit of coordination between the distributed workers.

So the solution is to pass num_cpus=0 to this actor.

So now we need to somehow specify resources (a PlacementGroupFactory) that schedules the first actor with 0 CPUs. So what we want is [{"CPU": 0}, {"CPU": 1}, {"CPU": 1}].

At first, we wanted to make this work without any special casing. Placement groups, however, don't accept empty bundles. The first workaround was to pass [{"CPU": 0.01}, {"CPU": 0.99}, {"CPU": 1}] or so, which is obviously not great from a user perspective. The second workaround is what we're currently doing: we don't reserve a placement group bundle for the empty head bundle, but just schedule the first actor onto the first real bundle without occupying its resources.

I agree this is not ideal - e.g. what happens if we pass another empty bundle for one of the workers? At the moment this will just raise an error.

The current solution works well from a user perspective, though. So I'd say let's keep it for now until we feel the need to come up with something better.
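
For readers who want to see the workaround end-to-end, here is a hedged sketch using plain Ray placement group APIs. The Worker actor, the ray.init(num_cpus=8) call, and the 2x4 CPU numbers are illustrative choices matching the example above; this is not the Tune-internal code.

```python
# Hedged sketch of the empty-head-bundle workaround described above.
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_cpus=8)

# Logical request: [{"CPU": 0}, {"CPU": 4}, {"CPU": 4}].
# Placement groups reject empty bundles, so only the non-empty bundles are
# reserved; the head actor is piggybacked onto the first real bundle.
pg = placement_group([{"CPU": 4}, {"CPU": 4}])
ray.get(pg.ready())


@ray.remote
class Worker:
    def ping(self):
        return "ok"


# Head actor: 0 CPUs, scheduled into bundle 0 without consuming its resources.
head = Worker.options(
    num_cpus=0,
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
        placement_group_bundle_index=0,
        placement_group_capture_child_tasks=True,
    ),
).remote()

# Distributed workers: one per bundle (indices 0 and 1), each reserving 4 CPUs.
# Overall the three actors land on bundles [0, 0, 1], as in the comment quoted above.
workers = [
    Worker.options(
        num_cpus=4,
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=i,
        ),
    ).remote()
    for i in range(2)
]
print(ray.get([head.ping.remote()] + [w.ping.remote() for w in workers]))
```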

python/ray/air/execution/resources/resource_manager.py
python/ray/air/execution/resources/placement_group.py
Signed-off-by: Kai Fricke <[email protected]>
Contributor

@matthewdeng matthewdeng left a comment

Thanks for the many iterations on this! A few more naming/documentation clarifications but otherwise looks good to me.

python/ray/air/execution/resources/request.py
python/ray/air/execution/resources/resource_manager.py
python/ray/air/execution/resources/resource_manager.py
python/ray/air/execution/resources/resource_manager.py
python/ray/air/execution/resources/resource_manager.py
python/ray/air/execution/resources/resource_manager.py
Member

@gjoliver gjoliver left a comment

thanks for your patience, 1 minor suggestion, your call :)

python/ray/air/execution/resources/placement_group.py
@krfricke krfricke merged commit edb17fd into ray-project:master Dec 20, 2022
@krfricke krfricke deleted the tune/refactor-pg-manager-base branch December 20, 2022 11:16
krfricke added a commit that referenced this pull request Dec 20, 2022
…IR resource manager (#30016)

Includes/depends on #30777

TLDR: This PR refactors Ray Tune's resource management to use a central AIR resource management package instead of the tightly coupled PlacementGroupManager.

Ray Tune's resource management currently uses a tightly coupled placement group manager. This leads to a number of shortcomings:
- The tight coupling on the manager side (e.g. PG manager keeps track of trials) prevents reusability
- The tight coupling on the trial executor side prevents using different resource management strategies (e.g. shared or budget-based)
- It's hard to test independently. Tests for the resource management require a simulated tune setup.

To improve stability, extensibility, and maintainability, this PR moves the resource management logic into a central `ray.air.execution.resources` subpackage. The resource management has a simple API that works with `ResourceRequest`s and `AllocatedResources` to manage requested and assigned resources, respectively. The actual resource management can then be anything - by default it is a placement group based manager, but this PR also introduces a PoC budget-based manager that can be plugged in.

The PR does not substantially change existing tests, so we can be certain that the new resource model is a fully compatible replacement for the old placement group manager.

Signed-off-by: Kai Fricke <[email protected]>
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
…implementation (#30777)

Prerequisite to #30016

This PR adds a new Ray AIR resource manager to replace the PlacementGroupManager of Ray Tune. Details can be found in #30016.

Specifically, this PR
- Adds the main resource manager abstractions
- Renames (and moves) PlacementGroupFactory to ResourceRequest
- Adds implementations and tests for a placement group based manager and a budget based manager

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
…IR resource manager (#30016)

Includes/depends on #30777

TLDR: This PR refactors Ray Tune's resource management to use a central AIR resource management package instead of the tightly coupled PlacementGroupManager.

Ray Tune's resource management currently uses a tightly coupled placement group manager. This leads to a number of shortcomings:
- The tight coupling on the manager side (e.g. PG manager keeps track of trials) prevents reusability
- The tight coupling on the trial executor side prevents using different resource management strategies (e.g. shared or budget-based)
- It's hard to test independently. Tests for the resource management require a simulated tune setup.

To improve stability, extensibility, and maintainability, this PR moves the resource management logic into a central `ray.air.execution.resources` subpackage. The resource management has a simple API that works with `ResourceRequest`s and `AllocatedResources` to manage requested and assigned resources, respectively. The actual resource management can then be anything - by default it is a placement group based manager, but this PR also introduces a PoC budget-based manager that can be plugged in.

The PR does not substantially change existing tests, so we can be certain that the new resource model is a fully compatible replacement for the old placement group manager.

Signed-off-by: Kai Fricke <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…implementation (ray-project#30777)

Prerequisite to ray-project#30016

This PR adds a new Ray AIR resource manager to replace the PlacementGroupManager of Ray Tune. Details can be found in ray-project#30016.

Specifically, this PR
- Adds the main resource manager abstractions
- Renames (and moves) PlacementGroupFactory to ResourceRequest
- Adds implementations and tests for a placement group based manager and a budget based manager

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Signed-off-by: tmynn <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…IR resource manager (ray-project#30016)

Includes/depends on ray-project#30777

TLDR: This PR refactors Ray Tune's resource management to use a central AIR resource management package instead of the tightly coupled PlacementGroupManager.

Ray Tune's resource management currently uses a tightly coupled placement group manager. This leads to a number of shortcomings:
- The tight coupling on the manager side (e.g. PG manager keeps track of trials) prevents reusability
- The tight coupling on the trial executor side prevents using different resource management strategies (e.g. shared or budget-based)
- It's hard to test independently. Tests for the resource management require a simulated tune setup.

To improve stability, extensibility, and maintainability, this PR moves the resource management logic into a central `ray.air.execution.resources` subpackage. The resource management has a simple API that works with `ResourceRequest`s and `AllocatedResources` to manage requested and assigned resources, respectively. The actual resource management can then be anything - by default it is a placement group based manager, but this PR also introduces a PoC budget-based manager that can be plugged in.

The PR does not substantially change existing tests, so we can be certain that the new resource model is a fully compatible replacement for the old placement group manager.

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: tmynn <[email protected]>