[autoscaler] Tweaks to support remote (K8s) operators #17194

Conversation

@DmitriGekhtman (Contributor) commented Jul 19, 2021

Why are these changes needed?

This PR adds some optional functionality to the autoscaler in preparation for future support for remote Kubernetes operators. (Here, remote means running elsewhere than autoscaler/NodeProvider.)

  • Methods are added to the NodeProvider interface to support hooks that are executed before and after autoscaler updates. These hooks can be used for communication with a remote operator (a rough sketch follows this list).
  • A flag is added to switch off NodeUpdater functionality, so that a remote operator can instead manage node state (e.g. by starting a Ray pod with entrypoint "ray start --block"). Note: this also switches off delayed heartbeat node recovery logic. (Failed pods should just crash.)
  • A separate branch of logic is introduced to terminate nodes that have missed autoscaler updates. Ideally this branch could be removed once a worker node healthcheck is implemented.
  • Pending states in the autoscaler summary are redefined as "not completed." This leaves room for remote operators to set additional pending states. It does not change behavior of current NodeProviders.
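
For concreteness, here is a rough sketch of what a NodeProvider geared toward a remote operator could look like with these additions. The class name and hook names (pre_update, post_update) are illustrative assumptions rather than the exact names introduced by this PR.

from ray.autoscaler.node_provider import NodeProvider


class RemoteOperatorNodeProvider(NodeProvider):
    """Illustrative NodeProvider meant to pair with a remote (K8s) operator.

    The operator, not the autoscaler, runs "ray start --block" in each pod,
    so node updaters are disabled and these hooks carry the coordination.
    """

    def pre_update(self):
        # Hypothetical hook: push the desired cluster state to the remote
        # operator before the autoscaler runs an update iteration.
        pass

    def post_update(self):
        # Hypothetical hook: let the remote operator reconcile after the
        # autoscaler finishes its update iteration.
        pass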

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@DmitriGekhtman DmitriGekhtman added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jul 21, 2021
@DmitriGekhtman (Contributor, Author) commented:

Added a couple of tests, removed extraneous stuff. Good to review again.

Comment on lines 101 to 102
disable_node_updaters: Disables node updaters if True.
(For Kubernetes operator usage.)
@ijrsvt (Contributor) commented Jul 21, 2021

Thanks for merging master to include the docstring :)

Comment on lines 1038 to 1046
    def testScaleUpNoUpdaters(self):
        """Repeat of testScaleUp with disable_node_updaters=True.
        Check at the end that no runner calls are made.
        """
        config_path = self.write_config(SMALL_CLUSTER)
        self.provider = MockProvider()
        runner = MockProcessRunner()
        mock_metrics = Mock(spec=AutoscalerPrometheusMetrics())
        autoscaler = StandardAutoscaler(
Contributor:

Can you parametrize testScaleUp with disable_node_updaters?

Contributor Author:

Yeah, since the test logic for the two cases differs by only about two lines.

@DmitriGekhtman (Contributor, Author) commented Jul 21, 2021

unittest doesn't work with pytest.mark.parametrize, so instead I added a helper method that accepts a disable_node_updaters arg and called it from separate tests.
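
A minimal sketch of the helper-method pattern described here (the helper and test names match the diff quoted later in this thread; the test class name and the helper body are abbreviated assumptions):

import unittest


class AutoscalingTest(unittest.TestCase):
    # unittest.TestCase methods can't use pytest.mark.parametrize, so a
    # shared helper takes the flag and two thin test methods call it.
    def ScaleUpHelper(self, disable_node_updaters):
        # Shared setup and assertions, parameterized by the flag
        # (elided in this sketch).
        pass

    def testScaleUp(self):
        self.ScaleUpHelper(disable_node_updaters=False)

    def testScaleUpNoUpdaters(self):
        self.ScaleUpHelper(disable_node_updaters=True)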

Comment on lines 1936 to 1942
    def testTerminateUnhealthyWorkers(self):
        """Test termination of unhealthy workers, when
        autoscaler.disable_node_updaters == True.

        Modified copy-paste of testRecoverUnhealthyWorkers.
        """
        config_path = self.write_config(SMALL_CLUSTER)
Contributor:

Same here, would it be possible to parametrize testRecoverUnhealthyWorkers?

@DmitriGekhtman (Contributor, Author) commented Jul 21, 2021

I'd rather not, since the functionality tested is different and I had to think for more than 30 seconds about how to refactor to unify the two tests :).

@@ -81,7 +80,8 @@ def __init__(
             update_interval_s: int = AUTOSCALER_UPDATE_INTERVAL_S,
             prefix_cluster_info: bool = False,
             event_summarizer: Optional[EventSummarizer] = None,
-            prom_metrics: Optional[AutoscalerPrometheusMetrics] = None):
+            prom_metrics: Optional[AutoscalerPrometheusMetrics] = None,
+            disable_node_updaters: bool = False):
@yiranwang52 (Contributor) commented Jul 21, 2021

Instead of a new parameter, I wonder if this can be assumed to be true when there are no SSH configs, since in that case you can't really operate the Ray process/container, so there are no updates and broken nodes can't be saved.

@DmitriGekhtman (Contributor, Author) commented Jul 21, 2021

I think the idea would be to assume that there are no node updaters if there is no auth field in the provider config. That would work and would keep the monitor and autoscaler __init__ signatures cleaner.
On the other hand, it could be confusing -- for example, the current logic for K8s support inserts an empty auth config to keep everything happy (and uses kubectl exec to execute updates).
I think it's better for now to keep things explicit and make the init signatures uglier.

Contributor:

Okay, up to you. I just prefer no interface change if it's not necessary.

Contributor Author:

I'll put this in the provider field of the config, since it's kind of a property of the NodeProvider. This (a) doesn't change the autoscaler/monitor Python interface and (b) still makes the flag explicit.
(Open to alternatives.)
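
As a hedged illustration of that approach, the flag could live in the provider section of the cluster config and be read with a default, rather than being threaded through __init__. The key names and surrounding fields below are assumptions:

# Example provider section of a cluster config, written here as a Python
# dict; in a cluster YAML this would sit under the top-level "provider" key.
provider_section = {
    "type": "kubernetes",
    "namespace": "ray",  # illustrative provider field
    "disable_node_updaters": True,  # remote operator manages node lifecycle
}

# The autoscaler can then read the flag with a default, leaving existing
# NodeProviders (which omit the key) unaffected:
disable_node_updaters = provider_section.get("disable_node_updaters", False)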

@DmitriGekhtman DmitriGekhtman added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Jul 21, 2021
Comment on lines +1048 to +1053
    def testScaleUp(self):
        self.ScaleUpHelper(disable_node_updaters=False)

    def testScaleUpNoUpdaters(self):
        self.ScaleUpHelper(disable_node_updaters=True)

Contributor:

Thanks!

@DmitriGekhtman (Contributor, Author) commented:

The Travis build issue seems unrelated.

@DmitriGekhtman DmitriGekhtman removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 23, 2021
@ijrsvt (Contributor) left a comment:

LGTM now!

@DmitriGekhtman (Contributor, Author) commented:

Merging, as the Travis issue appears unrelated.

@DmitriGekhtman DmitriGekhtman merged commit e701ded into ray-project:master Jul 23, 2021
@DmitriGekhtman DmitriGekhtman deleted the node-provider-hooks-node-updater-off branch July 23, 2021 18:09