
Clean up interaction between Autoscaler and Kuberay #23428

Merged: 12 commits merged into ray-project:master from the kuberay-fixes branch on Apr 12, 2022

Conversation

sriram-anyscale (Contributor):

Why are these changes needed?

There are situations where the coordination between the Autoscaler and KubeRay can become confused. This PR, along with a KubeRay PR (ray-project/kuberay#208), addresses these situations.

Examples:

  • The Autoscaler requests that KubeRay delete a specific set of nodes, but before the KubeRay reconciler kicks in, a node dies. This causes KubeRay to delete a random set of nodes instead of the ones specified. This issue is fixed in the KubeRay PR.

  • The Autoscaler requests creation or termination of nodes, but simultaneously another request changes the number of replicas (e.g., through the KubeRay API server). In this case, the _wait_for_pods method never terminates, causing the Autoscaler to get stuck. This PR fixes that issue.

Details on the code changes:

The Autoscaler no longer waits for KubeRay to complete a request (by blocking in _wait_for_pods). Instead, it verifies that the previous request has completed before submitting a new one.

Instead of checking that the number of replicas is correct (as _wait_for_pods did), which is error prone, we now treat KubeRay clearing workersToDelete as the indication that the previous request has completed.

The Autoscaler no longer clears workersToDelete.

The Autoscaler adds a dummy entry to workersToDelete even for createNode requests (which KubeRay will eventually clear), so that future requests can confirm the createNode request has completed.
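
For illustration, here is a rough sketch of that gating pattern as a JSON patch sequence (the paths follow the RayCluster spec; the helper name and exact patch contents are illustrative, not the code in this PR):

# Illustrative sketch only, not the exact code in this PR.
NOT_A_NODE_NAME = "NOT_A_NODE_NAME"  # placeholder dummy ack entry

def build_scale_patch(group_index, new_replica_count, workers_to_delete=None):
    prefix = f"/spec/workerGroupSpecs/{group_index}"
    return [
        # Gate: the whole patch fails unless KubeRay has cleared the previous
        # request's workersToDelete, i.e. the previous request has completed.
        {
            "op": "test",
            "path": prefix + "/scaleStrategy",
            "value": {"workersToDelete": []},
        },
        # Submit the new desired replica count.
        {"op": "replace", "path": prefix + "/replicas", "value": new_replica_count},
        # Record workers to delete; for createNode requests this is a dummy
        # entry that KubeRay eventually clears, serving as an ack marker.
        {
            "op": "replace",
            "path": prefix + "/scaleStrategy",
            "value": {"workersToDelete": workers_to_delete or [NOT_A_NODE_NAME]},
        },
    ]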

Related issue number

none

Checks

Need help on how to run tests

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@DmitriGekhtman (Contributor) left a comment:

Nice, this is definitely an improvement.

-# args:
-# - --enable-leader-election
+ args:
+ - --prioritize-workers-to-delete
Contributor:

diffs of diffs are hard on my head --

What happened here? If I understand right, --prioritize-workers-to-delete is the feature flag on the KubeRay side.
What are the implications of removing --enable-leader-election?

Contributor:

Hey Dmitri, the --enable-leader-election flag is not being removed; it was never active. The flag was commented out, and it's the comment that is being removed :)

It is possible that we can do this better with kustomize.yaml, but I don't know much about it and the patch certainly works for now :) Happy to look a little more into whether we can do without the patch (and it is temporary anyway, since the feature flag will be removed at some point).

Contributor:

Got it. In my Saturday morning grog, I missed the # symbols.

Contributor:

Once everything settles, we can recommend deploying using purely the files hosted in the KubeRay repo.
Patching works while autoscaling support is experimental.

Contributor:

Yes, that sounds great :)

{
"op": "replace",
"path": prefix + "/scaleStrategy",
"value": {"workersToDelete": [NOT_A_NODE_NAME]},
Contributor:

This is a hack in the usual sense: we're achieving a goal (awaiting an ack from the operator) by means of a semantically unrelated action (writing to and reading from workersToDelete).

Basically, we're using workersToDelete to effect a task queue of capacity 1 for the autoscaler's scale requests.

Can we add a todo that we do not intend to do things this way forever?

The correct way is to get an ack by reading something from the CR's status field -- this will require changes to the CRD.
cc @Jeffwan
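
Purely to illustrate the status-based ack idea (the status field used here is hypothetical and does not exist in the current RayCluster CRD):

# Hypothetical sketch: ack via a CR status field instead of workersToDelete.
# "lastProcessedScaleRequestID" is an invented field name; adding something
# like it would require the CRD changes mentioned above.
def previous_request_acked(raycluster: dict, last_request_id: str) -> bool:
    status = raycluster.get("status", {})
    return status.get("lastProcessedScaleRequestID") == last_request_id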

Contributor:

(This PR's method is of course a less error-prone hack than _wait_for_pods.)

@DmitriGekhtman (Contributor):

There's a branch of logic that we're probably not going to test in this PR -- namely, what happens when workersToDelete isn't processed in time and the create/terminate request is rejected.

In an ideal world, how would we test that? WDYT @sriram-anyscale @pcmoritz @Jeffwan?

@DmitriGekhtman (Contributor):

Lint errors:
https://buildkite.com/ray-project/ray-builders-pr/builds/28204#dcbb4a5c-2699-4914-981e-9a4e6be830ad

Unused time import and some lines that are too long.

@DmitriGekhtman (Contributor):

Let's rebase or merge master to validate against the e2e test.

@DmitriGekhtman (Contributor):

There's a branch of logic that we're probably not going to test in this PR -- namely, what happens when workersToDelete isn't processed in time and the create/terminate request is rejected.

In an ideal world, how would we test that?

An idea: Test the methods of KubeRayNodeProvider against the CI kind cluster. Mock out the operator.

{"op": "test", "path": path, "value": group_spec["replicas"]},
# This test makes sure that the previous request to Kuberay has been completed.
{
"op": "test",
@DmitriGekhtman (Contributor) commented on Mar 27, 2022:

It is possible for the autoscaler to issue a terminate_node and create_node in quick succession within one update iteration.

It's also possible for the autoscaler to send multiple create_node requests in quick succession within one update, for example when spinning up workers of different groups.

In these situations, we wouldn't be surprised if the operator failed to reconcile between the scale requests. The eventual outcome would be correct (probably) -- the terminate and create requests would get spaced out between different autoscaler updates.

But the error message in the autoscaler's logs would be confusing. We should probably catch the failed test error and log something meaningful before raising the error.

Life would be easier if, instead of issuing a bunch of terminate and create requests in quick succession, the autoscaler issued one scale request per iteration. The scale request would say which workers to delete and how many workers of each group to add [or how many workers of each group we should have in total].
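
A hypothetical shape for such a per-iteration scale request (not something this PR implements; the names are invented for illustration):

# Hypothetical, for illustration only: one consolidated scale request per
# autoscaler iteration instead of many create/terminate calls.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ScaleRequest:
    # Desired total worker count per group after this iteration.
    desired_workers: Dict[str, int] = field(default_factory=dict)
    # Specific workers (by node name) that must be deleted.
    workers_to_delete: List[str] = field(default_factory=list)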

@sriram-anyscale (Contributor, Author):

I have a few improvements to make on this PR that will clean up the logic/abstractions, and may also respond to @DmitriGekhtman's comments. Will update by end of day.

…Instead it gets the current replica count from ground truth (the actual number of running pods). This removes all need for strong coordination between this autoscaler and the Kuberay operator.

There are some matching changes required elsewhere for which I will need help - once we agree that this approach is correct, we can address these additional required changes.
group_index, group_spec = self._get_worker_group(raycluster, group_name)
group_index, _ = self._get_worker_group(raycluster, group_name)
tag_filters = {TAG_RAY_USER_NODE_TYPE: groupName}
current_replica_count = len(self.non_terminated_nodes(tag_filters))
Contributor:

Just FYI: the fact that we need to call non_terminated_nodes here is slightly unfortunate. Ideally we would only call non_terminated_nodes once per autoscaler iteration; there was probably a place earlier in the autoscaler that read non_terminated_nodes and, as a result, decided that scaling up was the right thing to do, and maybe we could reuse its result. For code simplicity it is probably better to call it again rather than try to thread it through, so let's go with what you have. Overall this is a really great improvement over what we had earlier, and there is no need to try to make it perfect :)

Contributor Author:

Let's chat about this tomorrow - we can figure out how to refactor this. We are also doing something similar in terminate_nodes.

Contributor:

Hmm, yep, same thing in terminate_nodes. An issue is that a list without specifying a resource version is bad for scalability. But in practice it's not that many more calls here.

There are some environment variables we can set to cap the potential number of "create nodes" per autoscaler iteration to the number of worker groups.

Contributor:

It would also be possible to cache the results of the autoscaler's non_terminated_nodes({}) call.

Then create_node and terminate_node would do some modification of the cached values,
which is more or less exactly what the autoscaler already does in the _update method.

This would be safe since everything is behind self._lock.

I think Philipp was suggesting somehow reusing the autoscaler's caching work, but that would be hard.

Anyway, unless there's a simpler approach, it's not clear whether we want to refactor in this PR.
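
A minimal sketch of the caching idea being discussed, assuming the standard NodeProvider methods non_terminated_nodes and node_tags; the class and attribute names are illustrative, not this PR's code:

# Sketch only: cache the per-group worker counts once per autoscaler update
# and adjust them in create_node / terminate_node behind a lock.
import threading
from collections import defaultdict

class CachedGroupCounts:
    def __init__(self, provider):
        self._provider = provider
        self._lock = threading.Lock()
        self._counts = defaultdict(int)  # group name -> running worker count

    def refresh(self):
        # Called once per autoscaler update.
        with self._lock:
            self._counts.clear()
            for node_id in self._provider.non_terminated_nodes({}):
                # "ray-user-node-type" mirrors TAG_RAY_USER_NODE_TYPE above.
                group = self._provider.node_tags(node_id).get("ray-user-node-type")
                if group is not None:
                    self._counts[group] += 1

    def on_create(self, group_name, count):
        with self._lock:
            self._counts[group_name] += count

    def on_terminate(self, group_name, count):
        with self._lock:
            self._counts[group_name] -= count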

@DmitriGekhtman (Contributor) commented on Mar 31, 2022:

There's a lint error (an unused time import in the kuberay node provider):
https://buildkite.com/ray-project/ray-builders-pr/builds/28429#719b51f4-d94f-4297-9975-79477bb79596

Looks good otherwise!

DmitriGekhtman self-assigned this on Apr 11, 2022
@DmitriGekhtman (Contributor):

Hmm kind setup failed with an error indicating that kind was already set up.
https://buildkite.com/ray-project/ray-builders-pr/builds/29304#983eea8c-8c32-46b7-bc71-057478f59f58

+ kind create cluster --wait 120s --config ./ci/travis/kind.config.yaml
ERROR: failed to create cluster: node(s) already exist for a cluster with the name "kind"

Looks like an instance was reused with some dangling state. Probably not a major cause for concern, but cc @simon-mo.

@DmitriGekhtman (Contributor):

After retrying the CI issue described above, we have a completely green run! Hooray, merging.

DmitriGekhtman merged commit 608aa77 into ray-project:master on Apr 12, 2022
sriram-anyscale deleted the kuberay-fixes branch on May 1, 2022