
Generate RayCluster Hash on KubeRay Version Change #2320

Merged
merged 10 commits into ray-project:master on Aug 23, 2024

Conversation

@ryanaoleary (Contributor) commented Aug 21, 2024

Why are these changes needed?

This PR adds a new ray.io/kuberay-version annotation to RayClusters created with KubeRay. The annotation is used to detect a KubeRay version change, which can otherwise cause a RayService to restart and erroneously create a new RayCluster due to a hash mismatch. If, when checking shouldPrepareNewRayCluster, the KubeRay version annotation differs and the RayCluster hashes do not match, the controller updates the utils.HashWithoutReplicasAndWorkersToDeleteKey annotation and the KubeRay version annotation, then returns DoNothing so the RayService is not restarted on a version change. This unblocks the version upgrade to KubeRay v1.2.0.
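As a rough illustration of the decision flow described above, here is a minimal, self-contained Go sketch. The annotation keys come from this PR, but the decision type, the operator version constant, and the function shape are assumptions made for the example, not the controller's actual code.

    package main

    import "fmt"

    // clusterAction is a hypothetical stand-in for the controller's decision
    // values; the real reconciler defines its own type.
    type clusterAction string

    const (
        DoNothing         clusterAction = "DoNothing"
        RolloutNewCluster clusterAction = "RolloutNewCluster"
    )

    // Annotation keys from this PR; operatorVersion stands in for
    // utils.KUBERAY_VERSION.
    const (
        kubeRayVersionKey = "ray.io/kuberay-version"
        hashKey           = "ray.io/hash-without-replicas-and-workers-to-delete"
        operatorVersion   = "1.2.0"
    )

    // decideOnHashMismatch sketches the check: a hash mismatch that coincides
    // with a KubeRay version change is treated as an upgrade artifact, so the
    // annotations are refreshed and no new cluster is prepared.
    func decideOnHashMismatch(annotations map[string]string, goalHash string) clusterAction {
        if annotations[hashKey] == goalHash {
            return DoNothing // hashes agree: nothing to do
        }
        if annotations[kubeRayVersionKey] != operatorVersion {
            // The stored hash was computed by a different KubeRay version:
            // refresh both annotations instead of restarting the RayService.
            annotations[hashKey] = goalHash
            annotations[kubeRayVersionKey] = operatorVersion
            return DoNothing
        }
        // A genuine spec change: prepare a new RayCluster.
        return RolloutNewCluster
    }

    func main() {
        anns := map[string]string{hashKey: "stale", kubeRayVersionKey: "1.1.1"}
        fmt.Println(decideOnHashMismatch(anns, "fresh")) // DoNothing; annotations refreshed
        fmt.Println(decideOnHashMismatch(anns, "other")) // RolloutNewCluster
    }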

This PR was tested by following the reproduction script in the linked issue and verifying that the log message "Active RayCluster config doesn't match goal config. RayService operator should prepare a new Ray cluster" does not occur.

Related issue number

Closes #2315

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@ryanaoleary ryanaoleary changed the title Re-generate hash when KubeRay version changes Generate RayCluster Hash on KubeRay Version Change Aug 21, 2024
@ryanaoleary (Contributor, Author) commented Aug 21, 2024

cc: @kevin85421, @andrewsykim

@ryanaoleary (Contributor, Author) commented Aug 21, 2024

@kevin85421 is the desired behavior here to immediately start a zero-downtime upgrade (r.markRestartAndAddPendingClusterName(ctx, rayServiceInstance)) if the KUBERAY_VERSION changes? Or should we just add the updated annotations on a version change but return DoNothing, leaving the RayCluster as-is for that iteration rather than restarting it?

@kevin85421 (Member) replied:

is the desired behavior here to immediately start a zero-downtime upgrade (r.markRestartAndAddPendingClusterName(ctx, rayServiceInstance)) if the KUBERAY_VERSION changes?

No. If the RayService CR's KUBERAY_VERSION annotation (e.g., 1.2.0) differs from the KubeRay operator's KUBERAY_VERSION (e.g., 1.3.0), or doesn't exist, then we should not perform a zero-downtime upgrade until:

  • The ray.io/hash-without-replicas-and-workers-to-delete annotation is updated.
  • The KUBERAY_VERSION annotation is updated.

After these updates, we can use the new ray.io/hash-without-replicas-and-workers-to-delete annotation to determine whether to trigger a zero downtime upgrade.

ray-operator/controllers/ray/rayservice_controller.go (outdated diff excerpt):
    // and new KubeRay version, but do not restart the RayCluster.
    activeKubeRayVersion := activeRayCluster.ObjectMeta.Annotations[utils.KubeRayVersion]
    if activeKubeRayVersion != utils.KUBERAY_VERSION {
        activeRayCluster.ObjectMeta.Annotations[utils.HashWithoutReplicasAndWorkersToDeleteKey] = goalClusterHash
@kevin85421 (Member):

I prefer not to update the input argument (activeRayCluster) inside this function. How about removing the logic for updating annotations here and directly return Update instead?

@ryanaoleary (Contributor, Author):

Done in 5b5fe07; we now just log and return Update, and the cluster hash and KubeRay version are updated in constructRayClusterForRayService.
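After this change the decision function stays read-only. A sketch of the revised shape, reusing the names from the earlier sketch and adding a hypothetical Update value for the controller's existing update decision:

    // Update stands in for the controller's existing "update the cluster"
    // decision value referenced in this thread.
    const Update clusterAction = "Update"

    // decideOnHashMismatchV2: on a version mismatch, return Update rather than
    // mutating the input; the annotations are then refreshed where the cluster
    // object is rebuilt (constructRayClusterForRayService in the real code).
    func decideOnHashMismatchV2(annotations map[string]string, goalHash string) clusterAction {
        if annotations[hashKey] == goalHash {
            return DoNothing
        }
        if annotations[kubeRayVersionKey] != operatorVersion {
            return Update // the update path stamps a fresh hash and version
        }
        return RolloutNewCluster
    }

    // stampAnnotations mirrors what the thread says now happens during cluster
    // construction; a hypothetical helper for illustration.
    func stampAnnotations(annotations map[string]string, clusterHash string) {
        annotations[hashKey] = clusterHash
        annotations[kubeRayVersionKey] = operatorVersion
    }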

@kevin85421 (Member):

Thanks! I will manually test it.

@kevin85421 (Member) left a comment:

Would you mind adding a unit test in rayservice_controller_unit_test.go? Maybe in TestReconcileRayCluster, or you can add a new test if needed.

@ryanaoleary (Contributor, Author):

Would you mind adding a unit test in rayservice_controller_unit_test.go? Maybe in TestReconcileRayCluster, or you can add a new test if needed.

Added a unit test case in f9c61e7
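For flavor, a hedged sketch of such a case; this is not the actual test added in f9c61e7, and it assumes the kuberay Go module's rayv1 API package plus the standard apimachinery types.

    package ray

    import (
        "testing"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

        rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
    )

    // TestKubeRayVersionMismatchSketch builds a fixture RayCluster whose stored
    // hash is stale and whose KubeRay version annotation is older than the
    // operator's. The expectation in the real test is that reconciliation
    // updates the cluster's annotations rather than preparing a new cluster;
    // the reconciler call and assertions are omitted here.
    func TestKubeRayVersionMismatchSketch(t *testing.T) {
        activeCluster := &rayv1.RayCluster{
            ObjectMeta: metav1.ObjectMeta{
                Name:      "active-cluster",
                Namespace: "default",
                Annotations: map[string]string{
                    "ray.io/hash-without-replicas-and-workers-to-delete": "stale-hash",
                    "ray.io/kuberay-version":                             "1.1.1",
                },
            },
        }
        if activeCluster.Annotations["ray.io/kuberay-version"] == "" {
            t.Fatal("fixture must carry a KubeRay version annotation")
        }
    }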

ryanaoleary and others added 2 commits on August 21, 2024
@kevin85421 (Member) left a comment:

Everything else looks good to me! I will manually test it after the comments are addressed.

ray-operator/controllers/ray/rayservice_controller_unit_test.go (excerpt):

        runtimeObjects = append(runtimeObjects, tc.activeCluster.DeepCopy())
    }
    fakeClient := clientFake.NewClientBuilder().WithScheme(newScheme).WithRuntimeObjects(runtimeObjects...).Build()
    r := RayServiceReconciler{
        Client: fakeClient,
        Scheme: newScheme,
@kevin85421 (Member):
why do we need this?

@ryanaoleary (Contributor, Author) replied Aug 22, 2024:

Without setting the Scheme, the test returns a nil pointer error when creating a new RayCluster in constructRayClusterForRayService, which gets called when updating the RayCluster.
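For context, a sketch of the scheme wiring this implies; the import paths are the standard controller-runtime and client-go ones, while the helper function is a hypothetical shape for illustration rather than the PR's actual test setup.

    package ray

    import (
        "k8s.io/apimachinery/pkg/runtime"
        clientgoscheme "k8s.io/client-go/kubernetes/scheme"
        "sigs.k8s.io/controller-runtime/pkg/client"
        clientFake "sigs.k8s.io/controller-runtime/pkg/client/fake"

        rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
    )

    // newFakeClientWithScheme (hypothetical helper) registers core Kubernetes
    // types and the Ray CRDs on a single scheme, then hands that scheme to the
    // fake client and, in the test, to the reconciler. Object construction in
    // the reconcile path can then resolve group/version/kinds instead of
    // dereferencing a nil scheme.
    func newFakeClientWithScheme(objs ...runtime.Object) (client.Client, *runtime.Scheme, error) {
        newScheme := runtime.NewScheme()
        if err := clientgoscheme.AddToScheme(newScheme); err != nil {
            return nil, nil, err
        }
        if err := rayv1.AddToScheme(newScheme); err != nil {
            return nil, nil, err
        }
        fakeClient := clientFake.NewClientBuilder().
            WithScheme(newScheme).
            WithRuntimeObjects(objs...).
            Build()
        return fakeClient, newScheme, nil
    }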

@MortalHappiness (Member):

LGTM

@rueian (Contributor) commented Aug 23, 2024:

LGTM

@kevin85421 (Member) left a comment:

I manually tested this PR by following this issue: #2315 (comment).

  • A zero-downtime upgrade is not triggered accidentally after upgrading.
  • The annotation ray.io/kuberay-version is added to the RayCluster CR correctly.
  • Then, I updated rayVersion from 2.34.0 to 2.100.0 to trigger a zero-downtime upgrade. It was triggered as expected.

@kevin85421 kevin85421 merged commit e7f0c2c into ray-project:master Aug 23, 2024
27 checks passed
kevin85421 pushed a commit to kevin85421/kuberay that referenced this pull request Aug 26, 2024
* Re-generate hash when KubeRay version changes

Signed-off-by: Ryan O'Leary <[email protected]>

* Change logic to DoNothing on KubeRay version mismatch

Signed-off-by: Ryan O'Leary <[email protected]>

* Add KubeRay version annotation to test

Signed-off-by: Ryan O'Leary <[email protected]>

* Move update logic

Signed-off-by: Ryan O'Leary <[email protected]>

* Update rayservice_controller.go

Signed-off-by: ryanaoleary <[email protected]>

* Add unit test

Signed-off-by: Ryan O'Leary <[email protected]>

* Add period

Signed-off-by: Ryan O'Leary <[email protected]>

* Go vet changes

Signed-off-by: Ryan O'Leary <[email protected]>

* Update rayservice_controller_unit_test.go

Signed-off-by: ryanaoleary <[email protected]>

* Address test comments

Signed-off-by: Ryan O'Leary <[email protected]>

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: ryanaoleary <[email protected]>
kevin85421 added a commit that referenced this pull request Aug 27, 2024
…) (#2339)

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: ryanaoleary <[email protected]>
Co-authored-by: ryanaoleary <[email protected]>

Successfully merging this pull request may close these issues.

[Bug] RayService Restarts After Upgrading KubeRay to v1.2.0-rc.0
5 participants