[data] introduce abstract interface for data autoscaling #45002

raulchen · 2024-04-26T22:00:02Z

Why are these changes needed?

Introduce an abstract interface for data autoscaling, making autoscaling behavior easier to customize and extend. Main components:
- Autoscaler: the abstract interface responsible for all autoscaling decisions, including cluster and actor pool autoscaling.
- AutoscalingActorPool: abstract interface that represents an actor pool that can autoscale.
- DefaultAutoscaler: default implementation.
No major code logic changes in this PR, except:
- Fixing a small bug in the actor pool utilization calculation (should be num_active_actors/current_size instead of num_running_actors/current_size).
- ActorPoolMapOperator.incremental_resource_usage no longer considers autoscaling, as we are abstracting autoscaling out of the op. Previously the info wasn't useful either.
- Removing the actor pool autoscaling logic for the bulk executor.
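
To make the utilization fix in the list above concrete, here is a minimal sketch. It is not the code from this PR: `PoolSnapshot` and the zero-size guard are hypothetical, and the meaning given to each counter in the comments is an assumption based on the names used in the description.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Hypothetical stand-in for the counters an actor pool exposes."""

    current_size: int        # assumed: all actors in the pool
    num_running_actors: int  # assumed: actors that have started and are alive
    num_active_actors: int   # assumed: running actors currently executing work


def calculate_actor_pool_util(pool: PoolSnapshot) -> float:
    """Utilization as described in the PR: num_active_actors / current_size."""
    if pool.current_size == 0:
        # Assumption: treat an empty pool as idle instead of dividing by zero.
        return 0.0
    return pool.num_active_actors / pool.current_size


pool = PoolSnapshot(current_size=4, num_running_actors=4, num_active_actors=1)
old_util = pool.num_running_actors / pool.current_size  # the previous formula
new_util = calculate_actor_pool_util(pool)              # the corrected formula
```

Under these assumptions, a pool where every actor is alive but only one is busy would look fully utilized with the old formula and mostly idle with the corrected one.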

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Hao Chen <[email protected]>

bveeramani

Overall LGTM

python/ray/data/_internal/execution/autoscaler/__init__.py

bveeramani · 2024-05-06T02:04:39Z

python/ray/data/_internal/execution/autoscaler/autoscaler.py

+
+
+@DeveloperAPI
+class Autoscaler(metaclass=ABCMeta):


Nit:

New class ABC has ABCMeta as its meta class. Using ABC as a base class has essentially the same effect as specifying metaclass=abc.ABCMeta, but is simpler to type and easier to read.

Suggested change

class Autoscaler(metaclass=ABCMeta):

class Autoscaler(ABC):

thanks. I forgot this again 😅
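
For reference, a small standard-library sketch of the equivalence behind this suggestion. The class names `WithMetaclass` and `WithBaseClass` are made up for illustration and are not the classes in this PR.

```python
from abc import ABC, ABCMeta, abstractmethod


class WithMetaclass(metaclass=ABCMeta):
    @abstractmethod
    def try_trigger_scaling(self):
        ...


class WithBaseClass(ABC):
    @abstractmethod
    def try_trigger_scaling(self):
        ...


# Both spellings produce a class whose metaclass is ABCMeta.
same_metaclass = type(WithMetaclass) is type(WithBaseClass) is ABCMeta


def can_instantiate(cls) -> bool:
    """Return False if instantiation is blocked by an unimplemented abstract method."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

The one visible difference is that `ABC` appears in the method resolution order of `WithBaseClass`; abstract-method enforcement is the same in both cases.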

bveeramani · 2024-05-06T02:05:52Z

python/ray/data/_internal/execution/autoscaler/autoscaler.py

+        self._execution_id = execution_id
+
+    @abstractmethod
+    def try_trigger_scaling(self, scheduling_decision: "SchedulingDecision"):


This method is supposed to trigger autoscaling of both the cluster and actor pools, right? Could we add a note in the docstring? I don't think it's obvious what's getting autoscaled from the name

bveeramani · 2024-05-06T02:06:06Z

python/ray/data/_internal/execution/autoscaler/autoscaling_actor_pool.py

+
+
+@DeveloperAPI
+class AutoscalingActorPool(metaclass=ABCMeta):


Nit:

Suggested change

class AutoscalingActorPool(metaclass=ABCMeta):

class AutoscalingActorPool(ABC):

bveeramani · 2024-05-06T02:08:40Z

python/ray/data/_internal/execution/autoscaler/default_autoscaler.py

+        self._try_scale_up_cluster(scheduling_decision)
+        self._try_scale_up_or_down_actor_pool(scheduling_decision)
+
+    def _actor_pool_util(self, actor_pool: AutoscalingActorPool):


Nit: Could we rename this method to be more descriptive? I wasn't able to infer what the method does from the name _actor_pool_util

will update it to _calculate_actor_pool_util

Oh, does "util" refer to "utilization" in this context? Thought it meant "util" as in "utility"

yeah, I didn't realize "util" is also a common abbreviation for "utility"

bveeramani · 2024-05-06T02:17:46Z

python/ray/data/_internal/execution/interfaces/physical_operator.py

+    def get_autoscaling_actor_pools(self) -> List[AutoscalingActorPool]:
+        """Return a list of `AutoscalingActorPool`s managed by this operator."""
+        return []


Do we expect any operators other than ActorPoolMapOperator to override this method? Feel like it's not ideal to add a method to the PhysicalOperator interface just for one subclass, although I can't think of any alternatives off the top of my head.

we could potentially use some top PhysicalOperator level attribute which describes if the operator relies on actors or tasks, then call this method only for operators for which that attribute is true

streaming aggregation also uses actor pools and may implement this. also, I want to avoid other components (autoscaler and resource manager) depending on the ActorPoolMapOperator; that would make the dependency graph more complex.

use some top PhysicalOperator level attribute which describes if the operator relies on actors or tasks

I think this is redundant, because we can tell this by whether get_autoscaling_actor_pools returns an empty list.
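
A minimal sketch of the design argued for in this thread, assuming simplified stand-in classes. `TaskBasedOperator`, `ActorBasedOperator`, `collect_pools`, and the `name` attribute are hypothetical; only `get_autoscaling_actor_pools` and its empty-list default come from the diff above.

```python
from typing import List


class AutoscalingActorPool:
    """Hypothetical minimal pool; the interface in the PR is abstract."""

    def __init__(self, name: str):
        self.name = name


class PhysicalOperator:
    def get_autoscaling_actor_pools(self) -> List[AutoscalingActorPool]:
        """Return a list of `AutoscalingActorPool`s managed by this operator."""
        return []


class TaskBasedOperator(PhysicalOperator):
    """Hypothetical operator that runs tasks only and keeps the empty default."""


class ActorBasedOperator(PhysicalOperator):
    """Hypothetical operator that owns a single actor pool."""

    def __init__(self):
        self._pool = AutoscalingActorPool("map_pool")

    def get_autoscaling_actor_pools(self) -> List[AutoscalingActorPool]:
        return [self._pool]


def collect_pools(ops: List[PhysicalOperator]) -> List[AutoscalingActorPool]:
    """Gather pools without checking operator types or a separate flag."""
    return [pool for op in ops for pool in op.get_autoscaling_actor_pools()]


pools = collect_pools([TaskBasedOperator(), ActorBasedOperator()])
```

In this sketch the caller never imports a concrete operator class, and an empty list already signals that an operator has no pools to scale.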

bveeramani · 2024-05-06T02:20:18Z

python/ray/data/_internal/execution/autoscaler/default_autoscaler.py

+                        actor_pool, op
+                    )
+                    if should_scale_up and not should_scale_down:
+                        if actor_pool.scale_up(1) == 0:


When would this evaluate to true? Looks like scale_up always returns input value?

if i understand correctly, looks like scale_up is intended to return the number of actors actually added, which could differ from the requested scaleup. but in the current default implementation, looks like we return the input as @bveeramani said

yes, the current implementation always returns true. But I wanted to make the interface more flexible and also make it consistent with scale_down
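
A sketch of why a `scale_up` that returns the number of actors actually added can be useful, assuming a hypothetical pool with a size cap. `BoundedActorPool`, `min_size`, and `max_size` are illustrative and are not the default implementation discussed here.

```python
class BoundedActorPool:
    """Hypothetical pool whose scaling requests can be partially refused."""

    def __init__(self, current_size: int, min_size: int, max_size: int):
        self.current_size = current_size
        self.min_size = min_size
        self.max_size = max_size

    def scale_up(self, num_actors: int) -> int:
        """Request more actors; return how many were actually added."""
        added = max(0, min(num_actors, self.max_size - self.current_size))
        self.current_size += added
        return added

    def scale_down(self, num_actors: int) -> int:
        """Request fewer actors; return how many were actually removed."""
        removed = max(0, min(num_actors, self.current_size - self.min_size))
        self.current_size -= removed
        return removed


pool = BoundedActorPool(current_size=1, min_size=1, max_size=2)
first = pool.scale_up(1)   # room for one more actor
second = pool.scale_up(1)  # pool is at max_size, nothing is added
```

With a pool like this, the `actor_pool.scale_up(1) == 0` check in the diff has a case where it is true, and `scale_up` mirrors `scale_down` in reporting what actually happened.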

scottjlee · 2024-05-06T17:35:20Z

python/ray/data/_internal/execution/autoscaler/autoscaler.py

+    def try_trigger_scaling(self, scheduling_decision: "SchedulingDecision"):
+        """Try trigger autoscaling.
+
+        This method will be called each time when StreamExecutor makes


nit: here and docstring of on_executor_shutdown, to make sure readers don't get confused by a different name

Suggested change

This method will be called each time when StreamExecutor makes

This method will be called each time when StreamingExecutor makes

scottjlee · 2024-05-06T17:53:16Z

python/ray/data/_internal/execution/autoscaler/default_autoscaler.py

+                        actor_pool, op
+                    )
+                    if should_scale_up and not should_scale_down:
+                        if actor_pool.scale_up(1) == 0:


if i understand correctly, looks like scale_up is intended to return the number of actors actually added, which could differ from the requested scaleup. but in the current default implementation, looks like we return the input as @bveeramani said

scottjlee · 2024-05-06T18:12:50Z

python/ray/data/_internal/execution/interfaces/physical_operator.py

+    def get_autoscaling_actor_pools(self) -> List[AutoscalingActorPool]:
+        """Return a list of `AutoscalingActorPool`s managed by this operator."""
+        return []


we could potentially use some top PhysicalOperator level attribute which describes if the operator relies on actors or tasks, then call this method only for operators for which that attribute is true

scottjlee · 2024-05-06T19:11:23Z

python/ray/data/_internal/execution/streaming_executor_state.py

+class OpSchedulingStatus:
+    """The scheduling status of an operator."""
+
+    # Whether the operator is runnable.


what exactly does "runnable" mean here? can we expand on this definition?
also, wondering if it would make sense to connect this with OpState, or combine the two classes together

select_op_to_run checks which ops are runnable, and chooses the best op to run. "runnable" means this. Will add a comment.
also good suggestion to incorporate this class with OpState.

Signed-off-by: Hao Chen <[email protected]>

raulchen · 2024-05-06T23:24:16Z

@bveeramani @scottjlee thanks for your comments. all addressed.

can-anyscale · 2024-05-08T04:27:47Z

@raulchen this broke https://github.com/orgs/anyscale/projects/76/views/1?pane=issue&itemId=54699584, can you help investigate, thanks

## Why are these changes needed?

Fix following bugs introduced by #45002:
* `autoscaler.try_trigger_scaling` not called when `select_op_to_run` returns None.
* scaling up condition on `under_resource_limits`.

`python/ray/data/tests/test_streaming_integration.py::test_e2e_autoscaling_up` should pass after this fix.

## Related issue number

Closes #43481

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

Signed-off-by: Hao Chen <[email protected]>


raulchen requested review from ericl, scv119, c21, amogkam, scottjlee, bveeramani, stephanie-wang and omatthew98 as code owners April 26, 2024 22:00

raulchen marked this pull request as draft April 26, 2024 22:00

raulchen added 15 commits April 29, 2024 16:47

autoscaler interface

e8592ce

Signed-off-by: Hao Chen <[email protected]>

integrate

3f8bfa9

Signed-off-by: Hao Chen <[email protected]>

separate file

50d7217

Signed-off-by: Hao Chen <[email protected]>

increment_resource_usage

b9e0297

Signed-off-by: Hao Chen <[email protected]>

actor pool map

45f69bc

Signed-off-by: Hao Chen <[email protected]>

fix test_actor_pool_map_oprator

18fb475

Signed-off-by: Hao Chen <[email protected]>

lint

6b0fc9d

Signed-off-by: Hao Chen <[email protected]>

fix test_streaming_executor

2253c82

Signed-off-by: Hao Chen <[email protected]>

autoscaling actor pool

c1dcd65

Signed-off-by: Hao Chen <[email protected]>

format

0af38bc

Signed-off-by: Hao Chen <[email protected]>

minor

edb1725

Signed-off-by: Hao Chen <[email protected]>

SchedulingDecision

74f9e13

Signed-off-by: Hao Chen <[email protected]>

fix

3bd0f54

Signed-off-by: Hao Chen <[email protected]>

comments

a853c52

Signed-off-by: Hao Chen <[email protected]>

refine

11f96e0

raulchen force-pushed the autoscaler-interface branch from 43b1e07 to 11f96e0 Compare April 29, 2024 23:48

raulchen added 3 commits April 30, 2024 17:01

fix test_actor_pool

c678dee

Signed-off-by: Hao Chen <[email protected]>

test actor pool scaling

a225fc6

Signed-off-by: Hao Chen <[email protected]>

test cluster scaling

25f47a4

Signed-off-by: Hao Chen <[email protected]>

raulchen changed the title ~~[data] Autoscaling interface~~ [data] introduce abstract interface for data autoscaling May 1, 2024

raulchen marked this pull request as ready for review May 1, 2024 03:05

raulchen added 4 commits May 1, 2024 11:43

fix test_executor_resource_management.py

406e40f

Signed-off-by: Hao Chen <[email protected]>

fix resource_manager

1b6bf5f

Signed-off-by: Hao Chen <[email protected]>

remove bulk test

51ba4e9

Signed-off-by: Hao Chen <[email protected]>

lint

75bdd26

Signed-off-by: Hao Chen <[email protected]>

bveeramani reviewed May 6, 2024

scottjlee reviewed May 6, 2024

raulchen added 4 commits May 6, 2024 14:43

minor

b6fd417

Signed-off-by: Hao Chen <[email protected]>

refine scheduling status

71cacc4

Signed-off-by: Hao Chen <[email protected]>

fix

774464a

Signed-off-by: Hao Chen <[email protected]>

Merge branch 'master' into autoscaler-interface

9c47b7a

bveeramani approved these changes May 7, 2024

scottjlee approved these changes May 7, 2024

raulchen merged commit 2ad4e33 into ray-project:master May 7, 2024
4 checks passed

raulchen deleted the autoscaler-interface branch May 7, 2024 19:05

raulchen mentioned this pull request May 8, 2024

[data] fix bugs introduced by autoscaler refactor #45200

Merged

8 tasks
