
Clean up interaction between Autoscaler and Kuberay #23428

Merged: 12 commits merged into ray-project:master from the kuberay-fixes branch on Apr 12, 2022

Conversation

sriram-anyscale (Contributor):

Why are these changes needed?

There are situations where the coordination between the Autoscaler and KubeRay can become confused. This PR, along with a KubeRay PR (ray-project/kuberay#208), addresses these situations.

Examples:

  • The Autoscaler requests that KubeRay delete a specific set of nodes, but before the KubeRay reconciler kicks in, a node dies. This causes KubeRay to delete a random set of nodes instead of the ones specified. This issue is fixed in the KubeRay PR.

  • The Autoscaler requests creation or termination of nodes, but simultaneously another request changes the number of replicas (e.g., through the KubeRay API server). In this case, the _wait_for_pods method never terminates, causing the Autoscaler to get stuck. This PR fixes that issue.

Details on the code changes:

The Autoscaler no longer waits for KubeRay to complete a request (by blocking in _wait_for_pods). Instead, it verifies that the previous request has completed before submitting a new one.

Instead of checking that the number of replicas is correct (as _wait_for_pods did), which is error prone, we now treat KubeRay clearing workersToDelete as the indication that the previous request has completed.

The Autoscaler no longer clears workersToDelete.

The Autoscaler adds a dummy entry to workersToDelete even for createNode requests (which KubeRay will eventually clear), so that future requests can confirm the createNode request has completed.
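
For illustration, here is a rough sketch of that gating pattern as a JSON patch sequence (the paths follow the RayCluster spec; the helper name and exact patch contents are illustrative, not the code in this PR):

# Illustrative sketch only, not the exact code in this PR.
NOT_A_NODE_NAME = "NOT_A_NODE_NAME"  # placeholder dummy ack entry

def build_scale_patch(group_index, new_replica_count, workers_to_delete=None):
    prefix = f"/spec/workerGroupSpecs/{group_index}"
    return [
        # Gate: the whole patch fails unless KubeRay has cleared the previous
        # request's workersToDelete, i.e. the previous request has completed.
        {
            "op": "test",
            "path": prefix + "/scaleStrategy",
            "value": {"workersToDelete": []},
        },
        # Submit the new desired replica count.
        {"op": "replace", "path": prefix + "/replicas", "value": new_replica_count},
        # Record workers to delete; for createNode requests this is a dummy
        # entry that KubeRay eventually clears, serving as an ack marker.
        {
            "op": "replace",
            "path": prefix + "/scaleStrategy",
            "value": {"workersToDelete": workers_to_delete or [NOT_A_NODE_NAME]},
        },
    ]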

Related issue number

none

Checks

Need help on how to run tests

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@DmitriGekhtman (Contributor) left a comment:

Nice, this is definitely an improvement.

-# args:
-# - --enable-leader-election
+ args:
+ - --prioritize-workers-to-delete
Contributor:

diffs of diffs are hard on my head --

What happened here? If I understand right, --prioritize-workers-to-delete is the feature flag on the KubeRay side.
What are the implications of removing --enable-leader-election?

Contributor:

Hey Dmitri, the --enable-leader-election flag is not being removed; it was never active. The flag was commented out, and it's the comment that is being removed :)

It is possible that we can do this better with kustomize.yaml, but I don't know much about it and the patch certainly works for now :) Happy to look a little more into whether we can do without the patch (and it is temporary anyway, since the feature flag will be removed at some point).

Contributor:

Got it. In my Saturday morning grog, I missed the # symbols.

Contributor:

Once everything settles, we can recommend deploying using purely the files hosted in the KubeRay repo.
Patching works while autoscaling support is experimental.

Contributor:

Yes, that sounds great :)

{
"op": "replace",
"path": prefix + "/scaleStrategy",
"value": {"workersToDelete": [NOT_A_NODE_NAME]},
Contributor:

This is a hack in the usual sense: we're achieving a goal (awaiting an ack from the operator) by means of a semantically unrelated action (writing to and reading from workersToDelete).

Basically, we're using workersToDelete to effect a task queue of capacity 1 for the autoscaler's scale requests.

Can we add a todo that we do not intend to do things this way forever?

The correct way is to get an ack by reading something from the CR's status field -- this will require changes to the CRD.
cc @Jeffwan
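
Purely to illustrate the status-based ack idea (the status field used here is hypothetical and does not exist in the current RayCluster CRD):

# Hypothetical sketch: ack via a CR status field instead of workersToDelete.
# "lastProcessedScaleRequestID" is an invented field name; adding something
# like it would require the CRD changes mentioned above.
def previous_request_acked(raycluster: dict, last_request_id: str) -> bool:
    status = raycluster.get("status", {})
    return status.get("lastProcessedScaleRequestID") == last_request_id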

Contributor:

(This PR's method is of course a less error-prone hack than _wait_for_pods.)

@DmitriGekhtman (Contributor):

There's a branch of logic that we're probably not going to test in this PR -- namely, what happens when workersToDelete isn't processed in time and the create/terminate request is rejected.

In an ideal world, how would we test that? WDYT @sriram-anyscale @pcmoritz @Jeffwan?

@DmitriGekhtman (Contributor):

Lint errors:
https://buildkite.com/ray-project/ray-builders-pr/builds/28204#dcbb4a5c-2699-4914-981e-9a4e6be830ad

Unused time import and some lines that are too long.

@DmitriGekhtman (Contributor):

Let's rebase or merge master to validate against the e2e test.

@DmitriGekhtman (Contributor):

There's a branch of logic that we're probably not going to test in this PR -- namely, what happens when workersToDelete isn't processed in time and the create/terminate request is rejected.

In an ideal world, how would we test that?

An idea: Test the methods of KubeRayNodeProvider against the CI kind cluster. Mock out the operator.

{"op": "test", "path": path, "value": group_spec["replicas"]},
# This test makes sure that the previous request to Kuberay has been completed.
{
"op": "test",
@DmitriGekhtman (Contributor) commented on Mar 27, 2022:

It is possible for the autoscaler to issue a terminate_node and create_node in quick succession within one update iteration.

It's also possible for the autoscaler to send multiple create_node requests in quick succession within one update, for example when spinning up workers of different groups.

In these situations, we wouldn't be surprised if the operator failed to reconcile between the scale requests. The eventual outcome would be correct (probably) -- the terminate and create requests would get spaced out between different autoscaler updates.

But the error message in the autoscaler's logs would be confusing. We should probably catch the failed test error and log something meaningful before raising the error.

Life would be easier if, instead of issuing a bunch of terminate and create requests in quick succession, the autoscaler issued one scale request per iteration. The scale request would say which workers to delete and how many workers of each group to add [or how many workers of each group we should have in total].
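
A hypothetical shape for such a per-iteration scale request (not something this PR implements; the names are invented for illustration):

# Hypothetical, for illustration only: one consolidated scale request per
# autoscaler iteration instead of many create/terminate calls.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ScaleRequest:
    # Desired total worker count per group after this iteration.
    desired_workers: Dict[str, int] = field(default_factory=dict)
    # Specific workers (by node name) that must be deleted.
    workers_to_delete: List[str] = field(default_factory=list)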

@sriram-anyscale (Contributor, Author):

I have a few improvements to make on this PR that will clean up the logic/abstractions, and may also respond to @DmitriGekhtman's comments. Will update by end of day.

…Instead it gets the current replica count from ground truth (the actual number of running pods). This removes all need for strong coordination between this autoscaler and the Kuberay operator.

There are some matching changes required elsewhere for which I will need help - once we agree that this approach is correct, we can address these additional required changes.
group_index, group_spec = self._get_worker_group(raycluster, group_name)
group_index, _ = self._get_worker_group(raycluster, group_name)
tag_filters = {TAG_RAY_USER_NODE_TYPE: groupName}
current_replica_count = len(self.non_terminated_nodes(tag_filters))
Contributor:

Just FYI: the fact that we need to call non_terminated_nodes here is slightly unfortunate. Ideally we would only call non_terminated_nodes once per autoscaler iteration; there was probably a place earlier in the autoscaler that read non_terminated_nodes and, as a result, decided that scaling up was the right thing to do, and maybe we could reuse its result. For code simplicity it is probably better to call it again rather than try to thread it through, so let's go with what you have. Overall this is a really great improvement over what we had earlier, and there is no need to try to make it perfect :)

Contributor Author:

Let's chat about this tomorrow - we can figure out how to refactor this. We are also doing something similar in terminate_nodes.

Contributor:

Hmm, yep, same thing in terminate_nodes. An issue is that a list without specifying a resource version is bad for scalability. But in practice it's not that many more calls here.

There are some environment variables we can set to cap the potential number of "create nodes" per autoscaler iteration to the number of worker groups.

Contributor:

It would also be possible to cache the results of the autoscaler's non_terminated_nodes({}) call.

Then create_node and terminate_node would do some modification of the cached values,
which is more or less exactly what the autoscaler already does in the _update method.

This would be safe since everything is behind self._lock.

I think Philipp was suggesting somehow reusing the autoscaler's caching work, but that would be hard.

Anyway, unless there's a simpler approach, it's not clear whether we want to refactor in this PR.
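
A minimal sketch of the caching idea being discussed, assuming the standard NodeProvider methods non_terminated_nodes and node_tags; the class and attribute names are illustrative, not this PR's code:

# Sketch only: cache the per-group worker counts once per autoscaler update
# and adjust them in create_node / terminate_node behind a lock.
import threading
from collections import defaultdict

class CachedGroupCounts:
    def __init__(self, provider):
        self._provider = provider
        self._lock = threading.Lock()
        self._counts = defaultdict(int)  # group name -> running worker count

    def refresh(self):
        # Called once per autoscaler update.
        with self._lock:
            self._counts.clear()
            for node_id in self._provider.non_terminated_nodes({}):
                # "ray-user-node-type" mirrors TAG_RAY_USER_NODE_TYPE above.
                group = self._provider.node_tags(node_id).get("ray-user-node-type")
                if group is not None:
                    self._counts[group] += 1

    def on_create(self, group_name, count):
        with self._lock:
            self._counts[group_name] += count

    def on_terminate(self, group_name, count):
        with self._lock:
            self._counts[group_name] -= count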

@DmitriGekhtman (Contributor) commented on Mar 31, 2022:

There's a lint error (an unused time import in the kuberay node provider):
https://buildkite.com/ray-project/ray-builders-pr/builds/28429#719b51f4-d94f-4297-9975-79477bb79596

Looks good otherwise!

DmitriGekhtman self-assigned this on Apr 11, 2022
@DmitriGekhtman (Contributor):

Hmm kind setup failed with an error indicating that kind was already set up.
https://buildkite.com/ray-project/ray-builders-pr/builds/29304#983eea8c-8c32-46b7-bc71-057478f59f58

+ kind create cluster --wait 120s --config ./ci/travis/kind.config.yaml
ERROR: failed to create cluster: node(s) already exist for a cluster with the name "kind"

Looks like an instance was reused with some dangling state. Probably not a major cause for concern, but cc @simon-mo.

@DmitriGekhtman (Contributor):

After retrying the CI issue described above, we have a completely green run! Hooray, merging.

DmitriGekhtman merged commit 608aa77 into ray-project:master on Apr 12, 2022
sriram-anyscale deleted the kuberay-fixes branch on May 1, 2022