[autoscaler] Tweaks to support remote (K8s) operators #17194

Conversation

@DmitriGekhtman (Contributor) commented Jul 19, 2021

Why are these changes needed?

This PR adds some optional functionality to the autoscaler in preparation for future support for remote Kubernetes operators. (Here, remote means running elsewhere than autoscaler/NodeProvider.)

  • Methods are added to the NodeProvider interface to support hooks that are executed before and after autoscaler updates. These hooks can be used for communication with a remote operator (a rough sketch follows this list).
  • A flag is added to switch off NodeUpdater functionality, so that a remote operator can instead manage node state (e.g. by starting a Ray pod with entrypoint "ray start --block"). Note: this also switches off delayed heartbeat node recovery logic. (Failed pods should just crash.)
  • A separate branch of logic is introduced to terminate nodes that have missed autoscaler updates. Ideally this branch could be removed once a worker node healthcheck is implemented.
  • Pending states in the autoscaler summary are redefined as "not completed." This leaves room for remote operators to set additional pending states. It does not change behavior of current NodeProviders.
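
For concreteness, here is a rough sketch of what a NodeProvider geared toward a remote operator could look like with these additions. The class name and hook names (pre_update, post_update) are illustrative assumptions rather than the exact names introduced by this PR.

from ray.autoscaler.node_provider import NodeProvider


class RemoteOperatorNodeProvider(NodeProvider):
    """Illustrative NodeProvider meant to pair with a remote (K8s) operator.

    The operator, not the autoscaler, runs "ray start --block" in each pod,
    so node updaters are disabled and these hooks carry the coordination.
    """

    def pre_update(self):
        # Hypothetical hook: push the desired cluster state to the remote
        # operator before the autoscaler runs an update iteration.
        pass

    def post_update(self):
        # Hypothetical hook: let the remote operator reconcile after the
        # autoscaler finishes its update iteration.
        pass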

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@DmitriGekhtman DmitriGekhtman added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jul 21, 2021
@DmitriGekhtman (Contributor, Author) commented:

Added a couple of tests, removed extraneous stuff. Good to review again.

Comment on lines 101 to 102
disable_node_updaters: Disables node updaters if True.
(For Kubernetes operator usage.)
@ijrsvt (Contributor) commented Jul 21, 2021

Thanks for merging master to include the docstring :)

Comment on lines 1038 to 1046
    def testScaleUpNoUpdaters(self):
        """Repeat of testScaleUp with disable_node_updaters=True.
        Check at the end that no runner calls are made.
        """
        config_path = self.write_config(SMALL_CLUSTER)
        self.provider = MockProvider()
        runner = MockProcessRunner()
        mock_metrics = Mock(spec=AutoscalerPrometheusMetrics())
        autoscaler = StandardAutoscaler(
Contributor:

Can you parametrize testScaleUp with disable_node_updaters?

Contributor Author:

Yeah, since the test logic for the two cases differs by only about two lines.

@DmitriGekhtman (Contributor, Author) commented Jul 21, 2021

unittest doesn't work with pytest.mark.parametrize, so instead I added a helper method that accepts a disable_node_updaters arg and called it from separate tests.
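
A minimal sketch of the helper-method pattern described here (the helper and test names match the diff quoted later in this thread; the test class name and the helper body are abbreviated assumptions):

import unittest


class AutoscalingTest(unittest.TestCase):
    # unittest.TestCase methods can't use pytest.mark.parametrize, so a
    # shared helper takes the flag and two thin test methods call it.
    def ScaleUpHelper(self, disable_node_updaters):
        # Shared setup and assertions, parameterized by the flag
        # (elided in this sketch).
        pass

    def testScaleUp(self):
        self.ScaleUpHelper(disable_node_updaters=False)

    def testScaleUpNoUpdaters(self):
        self.ScaleUpHelper(disable_node_updaters=True)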

Comment on lines 1936 to 1942
    def testTerminateUnhealthyWorkers(self):
        """Test termination of unhealthy workers, when
        autoscaler.disable_node_updaters == True.

        Modified copy-paste of testRecoverUnhealthyWorkers.
        """
        config_path = self.write_config(SMALL_CLUSTER)
Contributor:

Same here, would it be possible to parametrize testRecoverUnhealthyWorkers?

@DmitriGekhtman (Contributor, Author) commented Jul 21, 2021

I'd rather not, since the functionality tested is different and I had to think for more than 30 seconds about how to refactor to unify the two tests :).

@@ -81,7 +80,8 @@ def __init__(
             update_interval_s: int = AUTOSCALER_UPDATE_INTERVAL_S,
             prefix_cluster_info: bool = False,
             event_summarizer: Optional[EventSummarizer] = None,
-            prom_metrics: Optional[AutoscalerPrometheusMetrics] = None):
+            prom_metrics: Optional[AutoscalerPrometheusMetrics] = None,
+            disable_node_updaters: bool = False):
@yiranwang52 (Contributor) commented Jul 21, 2021

Instead of a new parameter, I wonder if this can be assumed to be true when there are no SSH configs, since in that case you can't really operate the Ray process/container, so there are no updates and broken nodes can't be saved.

@DmitriGekhtman (Contributor, Author) commented Jul 21, 2021

I think the idea would be to assume that there are no node updaters if there is no auth field in the provider config. That would work and would keep the monitor and autoscaler __init__ signatures cleaner.
On the other hand, it could be confusing -- for example, the current logic for K8s support inserts an empty auth config to keep everything happy (and uses kubectl exec to execute updates).
I think it's better for now to keep things explicit and make the init signatures uglier.

Contributor:

Okay, up to you. I just prefer no interface change if it's not necessary.

Contributor Author:

I'll put this in the provider field of the config, since it's kind of a property of the NodeProvider. This (a) doesn't change the autoscaler/monitor Python interface and (b) still makes the flag explicit.
(Open to alternatives.)
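
As a hedged illustration of that approach, the flag could live in the provider section of the cluster config and be read with a default, rather than being threaded through __init__. The key names and surrounding fields below are assumptions:

# Example provider section of a cluster config, written here as a Python
# dict; in a cluster YAML this would sit under the top-level "provider" key.
provider_section = {
    "type": "kubernetes",
    "namespace": "ray",  # illustrative provider field
    "disable_node_updaters": True,  # remote operator manages node lifecycle
}

# The autoscaler can then read the flag with a default, leaving existing
# NodeProviders (which omit the key) unaffected:
disable_node_updaters = provider_section.get("disable_node_updaters", False)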

@DmitriGekhtman DmitriGekhtman added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Jul 21, 2021
Comment on lines +1048 to +1053
    def testScaleUp(self):
        self.ScaleUpHelper(disable_node_updaters=False)

    def testScaleUpNoUpdaters(self):
        self.ScaleUpHelper(disable_node_updaters=True)

Contributor:

Thanks!

@DmitriGekhtman (Contributor, Author) commented:

The Travis build issue seems unrelated.

@DmitriGekhtman DmitriGekhtman removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 23, 2021
@ijrsvt (Contributor) left a comment:

LGTM now!

@DmitriGekhtman (Contributor, Author) commented:

Merging, as the Travis issue appears unrelated.

@DmitriGekhtman DmitriGekhtman merged commit e701ded into ray-project:master Jul 23, 2021
@DmitriGekhtman DmitriGekhtman deleted the node-provider-hooks-node-updater-off branch July 23, 2021 18:09