[autoscaler] Tweaks to support remote (K8s) operators #17194
Conversation
Added a couple tests, removed extraneous stuff. Good to review again.
```python
    disable_node_updaters: Disables node updaters if True.
        (For Kubernetes operator usage.)
```
Thanks for merging master to include the docstring :)
python/ray/tests/test_autoscaler.py (Outdated)
```python
def testScaleUpNoUpdaters(self):
    """Repeat of testScaleUp with disable_node_updaters=True.
    Check at the end that no runner calls are made.
    """
    config_path = self.write_config(SMALL_CLUSTER)
    self.provider = MockProvider()
    runner = MockProcessRunner()
    mock_metrics = Mock(spec=AutoscalerPrometheusMetrics())
    autoscaler = StandardAutoscaler(
```
Can you parametrize testScaleUp with disable_node_updaters?
Yeah, since the test logic for the two cases differs by like two lines.
unittest doesn't work with pytest.mark.parametrize -- instead, I added a helper method that accepts a disable_node_updaters arg and called it from separate tests.
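(Roughly, the pattern looks like the sketch below; the helper body is elided here, and the real logic lives in test_autoscaler.py.)

```python
import unittest


class AutoscalerTest(unittest.TestCase):
    def ScaleUpHelper(self, disable_node_updaters: bool) -> None:
        # Shared scale-up test logic; the two variants differ only in
        # whether node updaters run (real assertions elided here).
        ...

    # pytest.mark.parametrize doesn't apply to unittest.TestCase
    # methods, so each parameter value gets its own named test.
    def testScaleUp(self):
        self.ScaleUpHelper(disable_node_updaters=False)

    def testScaleUpNoUpdaters(self):
        self.ScaleUpHelper(disable_node_updaters=True)
```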
python/ray/tests/test_autoscaler.py (Outdated)
```python
def testTerminateUnhealthyWorkers(self):
    """Test termination of unhealthy workers, when
    autoscaler.disable_node_updaters == True.

    Modified copy-paste of testRecoverUnhealthyWorkers.
    """
    config_path = self.write_config(SMALL_CLUSTER)
```
Same here, would it be possible to parametrize testRecoverUnhealthyWorkers?
Would rather not, since the functionality tested is different and I had to think for more than 30 sec on how to refactor to unify the two tests :).
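(The behavioral split under discussion, sketched with hypothetical helper names rather than the PR's actual code: with updaters disabled there is no way to restart Ray on a node over SSH/kubectl exec, so unhealthy workers get terminated instead of recovered.)

```python
def handle_unhealthy_worker(autoscaler, node_id: str) -> None:
    # Hypothetical sketch: without node updaters, the only safe action
    # is to terminate the node and let the operator/K8s replace it;
    # with updaters, the node can be recovered in place via a restart.
    if autoscaler.disable_node_updaters:
        autoscaler.provider.terminate_node(node_id)
    else:
        autoscaler.spawn_updater(node_id)  # hypothetical recovery path
```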
```diff
@@ -81,7 +80,8 @@ def __init__(
         update_interval_s: int = AUTOSCALER_UPDATE_INTERVAL_S,
         prefix_cluster_info: bool = False,
         event_summarizer: Optional[EventSummarizer] = None,
-        prom_metrics: Optional[AutoscalerPrometheusMetrics] = None):
+        prom_metrics: Optional[AutoscalerPrometheusMetrics] = None,
+        disable_node_updaters: bool = False):
```
Instead of a new parameter, I wonder if this can be assumed to be true if there are no SSH configs, since in that case you cannot really operate the ray process/container -- so no updates, and no way to save broken nodes.
I think the idea would be to assume that there are no node updaters if there is no auth field in the provider config. That would work and would keep the monitor and autoscaler __init__ signatures cleaner.

On the other hand, it could be confusing -- for example, the current logic for K8s support inserts an empty auth config to keep everything happy (and uses kubectl exec to execute updates).

I think it's better for now to keep things explicit and make the init signatures uglier.
okay, up to you. I just prefer no interface change if not necessary.
I'll put this in the provider field of the config, since it's kind of a property of the NodeProvider. This (a) doesn't change the autoscaler/monitor Python interface and (b) still makes the flag explicit.
(Open to alternatives.)
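(Concretely, with the flag in the provider section, reading it could look roughly like this sketch; the key placement follows this thread, and defaulting to False keeps existing configs on the current behavior.)

```python
def node_updaters_disabled(config: dict) -> bool:
    # The flag rides along in the cluster config's provider section;
    # absent means False, i.e. node updaters stay enabled as before.
    return config.get("provider", {}).get("disable_node_updaters", False)
```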
```python
def testScaleUp(self):
    self.ScaleUpHelper(disable_node_updaters=False)

def testScaleUpNoUpdaters(self):
    self.ScaleUpHelper(disable_node_updaters=True)
```
Thanks!
The Travis build issue seems unrelated.
LGTM now!
Merging, as the Travis issue appears unrelated.
Why are these changes needed?
This PR adds some optional functionality to the autoscaler in preparation for future support for remote Kubernetes operators. (Here, remote means running somewhere other than the autoscaler/NodeProvider.)
Methods are added to the NodeProvider interface to support hooks that are executed before and after autoscaler updates. These hooks can be used for communication with a remote operator (a rough sketch follows below).

Note: this also switches off the delayed-heartbeat node recovery logic. (Failed pods should just crash.)
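(As an illustration only -- the method names below are invented here, not the PR's actual interface -- such hooks might look like:)

```python
class NodeProvider:
    """Sketch of the slice of a node provider relevant to update hooks."""

    def pre_update_hook(self) -> None:
        # Hypothetical: called before the autoscaler applies updates,
        # e.g. to let a remote operator prepare or pause reconciliation.
        pass

    def post_update_hook(self) -> None:
        # Hypothetical: called after updates complete, e.g. to report
        # resulting node state back to the remote operator.
        pass
```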
Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.