[RayCluster controller] [Bug] Unconditionally reconcile RayCluster every 60s instead of only upon change #850

Merged

Conversation

Contributor

@architkulkarni architkulkarni commented Dec 30, 2022

Signed-off-by: Archit Kulkarni [email protected]

Why are these changes needed?

This PR fixes an issue where RayJobs would never be submitted even if the Ray Cluster was up and running. Debugged this with @brucez-anyscale .

The root cause is the following:

The cluster status is only marked "Ready" if all pods are RUNNING. The cluster state only gets updated when a reconcile happens.

However, the Cluster reconcile doesn't get requeued if all the pods are in [RUNNING, PENDING] state.

In particular, when the head pod becomes PENDING, the cluster never gets reconciled again.

So the cluster is stuck in a state where its status is not "Ready". Since the RayJob reconciler waits for the cluster status to be "Ready" before submission, the job hangs and is never submitted.

The fix originally proposed in this PR was to requeue the cluster reconcile in the case where the cluster pods are not all running yet, introducing a new cluster state "Pending" for that case. After the discussion below, the fix is instead to always requeue the cluster reconciliation (every 300s, for now).
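
For reference, here is a minimal, generic sketch of what "always requeue" means in controller-runtime terms. This is illustrative only, not the actual KubeRay reconciler; the constant value simply mirrors the 300s mentioned above.

package example

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Illustrative value; the operator's DefaultRequeueDuration is defined elsewhere.
const defaultRequeueDuration = 300 * time.Second

type exampleReconciler struct{}

func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... inspect the object, create/update child resources, update status ...

	// Returning ctrl.Result{} means "done until the next watch event arrives".
	// Returning RequeueAfter schedules another reconcile after the delay,
	// even if no event (e.g. a Pod going Pending -> Running) is observed.
	return ctrl.Result{RequeueAfter: defaultRequeueDuration}, nil
}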

As for why the head node state change PENDING -> RUNNING didn't trigger a cluster reconcile before, we think it might be due to an event filter

WithEventFilter(predicate.Or(predicate.GenerationChangedPredicate{}, predicate.LabelChangedPredicate{}, predicate.AnnotationChangedPredicate{})).

which only triggers a reconcile when there is a config, label, or annotation change. However, we can't simply remove this filter: it prevents reconciling in a tight loop, since instance.Status.LastUpdateTime is updated at each reconcile, so every status update would otherwise trigger another reconcile.
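
As context for the follow-up fix, a hedged sketch of how the filter could be scoped to the RayCluster object only (via builder.WithPredicates), so that status changes of owned Pods and Services still enqueue the cluster. The import path, watched types, and overall wiring are assumptions, not the actual KubeRay setup code:

package ray

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1" // import path assumed
)

// SetupWithManager sketch: RayClusterReconciler is the existing controller type.
func (r *RayClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Apply the predicates only to RayCluster events, so updates that merely
		// touch .status (e.g. LastUpdateTime) cannot cause a tight reconcile loop.
		For(&rayiov1alpha1.RayCluster{}, builder.WithPredicates(predicate.Or(
			predicate.GenerationChangedPredicate{},
			predicate.LabelChangedPredicate{},
			predicate.AnnotationChangedPredicate{},
		))).
		// Owned resources are watched without that filter, so a head Pod moving
		// Pending -> Running would still trigger a RayCluster reconcile.
		Owns(&corev1.Pod{}).
		Owns(&corev1.Service{}).
		Complete(r)
}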

Conclusion after discussion on this PR: the event filter issue is a separate but related bug and will be addressed in a follow-up PR, tracked in #872. Once that is fixed, we could possibly consider a patch 0.4.1 release. After this PR, there may be an unnecessary delay of up to 60s between when the cluster is ready and when the job is submitted, which is not ideal.

Related issue number

Closes #845

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
# This script is used to rebuild the KubeRay operator image and load it into the Kind cluster.
# Step 0: Delete the existing Kind cluster
kind delete cluster 

# Step 1: Create a Kind cluster
kind create cluster --image=kindest/node:v1.24.0

# Step 2: Modify KubeRay source code
# For example, add a log "Hello KubeRay" in the function `Reconcile` in `raycluster_controller.go`.

# Step 3: Build a Docker image
#         This command will copy the source code directory into the image, and build it.
# Command: IMG={IMG_REPO}:{IMG_TAG} make docker-build
IMG=kuberay/operator:nightly make docker-build

# Step 4: Load the custom KubeRay image into the Kind cluster.
# Command: kind load docker-image {IMG_REPO}:{IMG_TAG}
kind load docker-image kuberay/operator:nightly

# Step 5: Keep consistency
# If you update RBAC or CRD, you need to synchronize them.
# See the section "Consistency check" for more information.

# Step 6: Install KubeRay operator with the custom image via local Helm chart
# (Path: helm-chart/kuberay-operator)
# Command: helm install kuberay-operator --set image.repository={IMG_REPO} --set image.tag={IMG_TAG} .
helm install kuberay-operator --set image.repository=kuberay/operator --set image.tag=nightly ~/kuberay/helm-chart/kuberay-operator

# Step 7: Submit a RayJob
kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml   

# Step 8: Check the status of the pods
watch kubectl get pods

# When the pods are ready, check `kubectl logs` to see that the job has status SUCCEEDED
  • This PR is not tested :(

@architkulkarni
Contributor Author

Would appreciate any hints on how to add a unit test for this, or whether this is the right approach!

@kevin85421
Member

Would appreciate any hints on how to add a unit test for this, or whether this is the right approach!

Maybe we can refer to the existing tests in [1][2] and write tests to ensure instance.Status.State updates correctly.

[1] https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/raycluster_controller_fake_test.go
[2] https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/raycluster_controller_test.go
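
For what it's worth, a minimal sketch of a table-driven test in that spirit. The helper derivedClusterState below is a hypothetical stand-in for the Ready-vs-Pending decision made in updateStatus (via utils.CheckAllPodsRunnning), not the real KubeRay code:

package ray

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
)

// derivedClusterState is a hypothetical helper mirroring the decision this PR adds:
// Ready when every pod is Running, otherwise Pending.
func derivedClusterState(pods []corev1.Pod) string {
	for _, pod := range pods {
		if pod.Status.Phase != corev1.PodRunning {
			return "Pending"
		}
	}
	return "Ready"
}

func TestDerivedClusterState(t *testing.T) {
	running := corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodRunning}}
	pending := corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodPending}}

	if got := derivedClusterState([]corev1.Pod{running, running}); got != "Ready" {
		t.Errorf("all pods running: expected Ready, got %s", got)
	}
	if got := derivedClusterState([]corev1.Pod{running, pending}); got != "Pending" {
		t.Errorf("one pod pending: expected Pending, got %s", got)
	}
}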

@@ -233,6 +233,9 @@ func (r *RayClusterReconciler) rayClusterReconcile(request ctrl.Request, instanc
             r.Log.Error(err, "Update status error", "cluster name", request.Name)
         }
     }
+    if instance.Status.State == rayiov1alpha1.Pending {
+        return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, nil
+    }
 
     return ctrl.Result{}, nil
Collaborator

We should always requeue. I didn't realize we didn't do that. That's very bad. Reconciliation for any controller should happen periodically even if there isn't a trigger that demands it.

The correct fix for this issue is to change the "successful return" from

return ctrl.Result{}, nil

to

return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, nil

Contributor Author

Oh thanks, that's very good to know. Let me update the PR

Contributor

This is a good point.

@@ -858,6 +861,8 @@ func (r *RayClusterReconciler) updateStatus(instance *rayiov1alpha1.RayCluster)
     } else {
         if utils.CheckAllPodsRunnning(runtimePods) {
             instance.Status.State = rayiov1alpha1.Ready
+        } else {
+            instance.Status.State = rayiov1alpha1.Pending
Collaborator

It's normal for a Ray cluster to gain pods that are in Pending state, particularly if autoscaling is enabled. With this change, the cluster could switch back and forth between Running and Pending. That seems undesirable. (The current reporting of state is not great either.)

Contributor

@DmitriGekhtman
How do we differentiate between Pods that are Pending due to the autoscaler vs Pods that are Pending during the initial cluster setup?

Collaborator

By the following funky logic:
The only cluster State transition in the current logic is "" -> Ready.
If, after the initial cluster setup, we have a Pending pod due to scale change or failure, the State will not return to "".

@DmitriGekhtman DmitriGekhtman left a comment (Collaborator)

I'd leave the state reporting alone for now (maybe track the issue of making it more sensible).

We should change things so that we always requeue, i.e. do
return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, nil

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Dec 30, 2022

As for why the head node state change PENDING -> RUNNING didn't trigger a cluster reconcile before, we think it might be due to an event filter

I am fuzzy on how these filters work, but the way things should work is that
a RayCluster.status change doesn't trigger a RayCluster reconcile
but
a Pod.status change does trigger a reconcile for the RayCluster if the pod is in the Ray cluster.

Currently, it might be that neither event triggers a reconcile.

@architkulkarni
Contributor Author

@DmitriGekhtman Updated to always requeue the RayCluster reconciler. After this, it seems the RayJob reconciler also gets called every few seconds even after the job has succeeded. I'm not exactly sure what is causing the RayJob reconciler to get requeued, but it sounds like this is what we want anyway.

There are ~10 other instances of ctrl.Result{}, nil in the codebase, but I'll leave those alone for now.

@DmitriGekhtman
Collaborator

I'm not exactly sure what is causing the RayJob reconciler to get requeued, but it sounds like this is what we want anyway.

Probably the RayCluster status update is causing that.

@DmitriGekhtman
Collaborator

We could say testing for the fix falls under the umbrella of e2e testing for Ray Jobs which is tracked in #763.

@DmitriGekhtman
Collaborator

I guess by default it triggers once every 10 hrs
https://stackoverflow.com/questions/72033429/how-often-is-reconciliation-loop-run-in-kubebuilder
which seems to me way too long for any use-case.

I'm not sure if once every 2 seconds is too frequent to requeue by default, but I think this should work fine for now.

In a follow-up, we should fix the event filters to ensure that pod changes trigger immediate reconciliation.
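
For reference, the "once every 10 hrs" behavior mentioned above comes from the controller manager's resync period (SyncPeriod). A hedged sketch of how it could be tuned when constructing the manager, with the controller-runtime API available at the time and an illustrative value:

package main

import (
	"os"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Lower the default resync period (10h) so every watched object is
	// re-reconciled periodically even without events. Value is illustrative.
	syncPeriod := 10 * time.Minute

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &syncPeriod,
	})
	if err != nil {
		os.Exit(1)
	}
	_ = mgr // controllers would be registered here before calling mgr.Start(...)
}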

@Jeffwan
Collaborator

Jeffwan commented Dec 30, 2022

One thing I'd like to confirm:

In particular, when the head pod becomes PENDING, the cluster never gets reconciled again.

In this case, is it meaningful to submit the job?

If the head is running and some of the workers are pending, I think moving forward with job submission makes sense. We also need to be careful to honor the gang scheduling setting as well.

@Jeffwan
Collaborator

Jeffwan commented Dec 30, 2022

The latest change will actually reconcile an object that has already reached its end state, which is meaningless. I would say it brings side effects, especially when there are lots of CRs in the cluster. I think there are better ways to solve the problem (make the change specific to the job instead of clusters globally)?

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Dec 31, 2022

The latest change will actually reconcile an object that has already reached its end state, which is meaningless. I would say it brings side effects, especially when there are lots of CRs in the cluster. I think there are better ways to solve the problem (make the change specific to the job instead of clusters globally)?

I think we messed up the event filters in a previous PR -- we should fix the filters so that changes to RayCluster status are ignored but changes to status of other watched resources (such as pods) are not ignored.
I'm not sure how to do that but it should be possible.

About periodically re-triggering reconciliation --
There isn't an end-state for a Ray cluster, but rather a goal state.
If you don't periodically trigger reconciliation, state can get desynchronized due to
missed events (unlikely)
or
bugs in the operator code (more likely).
For that reason, it's a standard behavior of controllers to occasionally reconcile without an event trigger. By default, controllers written using kubebuilder will do this once every 10 hours, which seems to me too infrequent for our needs. Once every two seconds is probably too frequent. We can consider using something in between, maybe in a different PR.
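
One hedged sketch of how "ignore RayCluster status-only changes" can be expressed: status updates do not bump metadata.generation, so an update predicate that compares generations (plus labels and annotations) lets spec changes through while filtering status churn. This is essentially what the GenerationChangedPredicate in the current filter already does; spelling it out with predicate.Funcs just makes the mechanism explicit. The key, as noted above, would be to apply such a predicate to the RayCluster watch only, not to the Pod watch.

package ray

import (
	"reflect"

	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// ignoreStatusOnlyUpdates reconciles only when the spec (generation), labels,
// or annotations change; updates that only touch .status are filtered out.
var ignoreStatusOnlyUpdates = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration() ||
			!reflect.DeepEqual(e.ObjectOld.GetLabels(), e.ObjectNew.GetLabels()) ||
			!reflect.DeepEqual(e.ObjectOld.GetAnnotations(), e.ObjectNew.GetAnnotations())
	},
}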

@DmitriGekhtman
Collaborator

So, again, given the above discussion -- my current opinion is that we should try to solve the issue by fixing the event filter so that pod status changes trigger reconcile.

Separately, we can consider if we want to always requeue after 30 seconds or a minute.

@kevin85421
Member

Reconcile on pod status change but not RayCluster status change

Hi @DmitriGekhtman, I am not sure whether the Pod status change is enough or not. For example, if a ClusterIP service for a head Pod is deleted accidentally and the head Pod's status does not change, the service may not be created again.

@architkulkarni
Contributor Author

Created #872 to track the filter issue.

Once this PR is merged I will open an issue to track increasing 2 seconds to 30 or 60 seconds. We should do it after #872 is fixed, otherwise it will introduce a 30 or 60 second delay at the start of each RayJob.

@architkulkarni
Contributor Author

Reconcile on pod status change but not RayCluster status change

Hi @DmitriGekhtman, I am not sure whether the Pod status change is enough or not. For example, if a ClusterIP service for a head Pod is deleted accidentally and the head Pod's status does not change, the service may not be created again.

I guess with the 2 or 30-60 second automatic reconcile, the service would be created again--does this resolve the concern? So we would have both this automatic reconciling as well as the reconciling upon Pod status change.

@DmitriGekhtman
Collaborator

Reconcile on pod status change but not RayCluster status change

Hi @DmitriGekhtman, I am not sure whether the Pod status change is enough or not. For example, if a ClusterIP service for a head Pod is deleted accidentally and the head Pod's status does not change, the service may not be created again.

I guess with the 2 or 30-60 second automatic reconcile, the service would be created again--does this resolve the concern? So we would have both this automatic reconciling as well as the reconciling upon Pod status change.

Right, I'd recommend both fixing the pod status trigger and requeueing after a minute after successful reconcile.

@DmitriGekhtman DmitriGekhtman left a comment (Collaborator)

Sry, "Requesting changes" again to indicate my intent -- @Jeffwan had a good point that unconditionally triggering reconcile every two seconds is overkill.

Instead we should

  • Unconditionally trigger reconcile once a minute. (Modify the 2 seconds in the currently altered line to 60 seconds.)
  • Make sure pod status changes trigger reconcile, by fixing the event filter.

Once this is fixed, we could possibly consider a patch 0.4.1 release to fix Ray Jobs.

@architkulkarni
Contributor Author

Manually tested so @DmitriGekhtman I think this is ready to be merged. I think it makes sense to fix the filter issue #872 in a second PR instead of this PR because they're conceptually separate changes, but I don't feel strongly either way.

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Jan 21, 2023

Manually tested so @DmitriGekhtman I think this is ready to be merged. I think it makes sense to fix the filter issue #872 in a second PR instead of this PR because they're conceptually separate changes, but I don't feel strongly either way.

Makes sense to leave the filter fix for later. For posterity, could you update the PR title and description to explain the changes?

@architkulkarni architkulkarni changed the title [RayCluster controller] [Bug] Reconcile cluster when cluster pods are Pending [RayCluster controller] [Bug] Unconditionally reconcile RayCluster every 60s instead of only upon change Jan 23, 2023
@architkulkarni
Contributor Author

@DmitriGekhtman Ah thanks for the reminder--should be good to go now.

@akanso
Collaborator

akanso commented Jan 23, 2023

For customers that are using KubeRay and managing hundreds or thousands of clusters, a 60s reconciliation loop would have a big impact on the operator itself and on the traffic to the k8s API server.

I think this should be a configurable attribute. We can default it to 5 minutes; for example, we can pass an env var to the operator that specifies a different frequency instead of hardcoding it.

@architkulkarni
Contributor Author

@akanso That makes sense to me, so something like this? (cc @DmitriGekhtman)

var requeueAfterSeconds int
requeueAfterSeconds, err := strconv.Atoi(os.Getenv("RAYCLUSTER_DEFAULT_RECONCILE_LOOP_S"))
if err != nil {
    requeueAfterSeconds = 5 * 60
}
return ctrl.Result{RequeueAfter: time.Duration(requeueAfterSeconds) * time.Second}, nil

And how would an environment variable typically be passed to the operator?

@akanso
Collaborator

akanso commented Jan 23, 2023

@akanso That makes sense to me, so something like this? (cc @DmitriGekhtman)

var requeueAfterSeconds int
requeueAfterSeconds, err := strconv.Atoi(os.Getenv("RAYCLUSTER_DEFAULT_RECONCILE_LOOP_S"))
if err != nil {
    requeueAfterSeconds = 5 * 60
}
return ctrl.Result{RequeueAfter: time.Duration(requeueAfterSeconds) * time.Second}, nil

And how would an environment variable typically be passed to the operator?

We can simply pass it in the k8s YAML manifest of the operator pod. If it is missing, we default to 5 minutes.
We would also want the code snippet to log a message.
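
A hedged sketch of what that could look like, with the env var read into a helper and a log line when falling back to the default. The helper name and logging shape are illustrative, not the merged code; the operator Deployment would set RAYCLUSTER_DEFAULT_RECONCILE_LOOP_S in its pod spec, and the reconcile would end with return ctrl.Result{RequeueAfter: reconcileLoopPeriod(r.Log.Info)}, nil.

package ray

import (
	"os"
	"strconv"
	"time"
)

// reconcileLoopPeriod reads the RayCluster requeue period from the environment
// and falls back to 5 minutes when the variable is unset or malformed.
// The logf argument is expected to be something like r.Log.Info.
func reconcileLoopPeriod(logf func(msg string, keysAndValues ...interface{})) time.Duration {
	const envVar = "RAYCLUSTER_DEFAULT_RECONCILE_LOOP_S"
	seconds, err := strconv.Atoi(os.Getenv(envVar))
	if err != nil {
		seconds = 5 * 60
		logf("reconcile loop period not set or invalid, using default", "envVar", envVar, "seconds", seconds)
	}
	return time.Duration(seconds) * time.Second
}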

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Jan 23, 2023

A 5-minute default sounds good
(anything on the order of "eventually" is fine, 10 hrs is no good)

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Jan 23, 2023

For customers that are using KubeRay and managing hundreds or thousands of clusters, a 60s reconciliation loop would have a big impact on the operator itself and on the traffic to the k8s API server.

Doesn't reconcile trigger on any pod spec change, though?
(Or is it just the pods that have a RayCluster ownerref?)

@architkulkarni
Contributor Author

Successfully tested manually with the latest change.

@akanso
Collaborator

akanso commented Jan 24, 2023

Doesn't reconcile trigger on any pod spec change, though?

It does. But imagine KubeRay is down or restarting; then it can miss an event, such as the creation, modification, or deletion of a RayCluster. This reconcile logic will make sure missed events are processed. The events already processed will effectively be ignored, since the desired state is already satisfied.

@DmitriGekhtman
Collaborator

I meant that the logic for 5 min vs. 1 min for performance reasons in a large cluster might not make sense, given that a large cluster has many pod changes happening -- these pod changes trigger many reconciles anyway.
(But 5 min sgtm.)

@architkulkarni
Contributor Author

@DmitriGekhtman "Go-build-and-test / Compatibility Test - Nightly (pull_request)" is failing, but I think this is known to be flaky. Maybe we can bypass the test and merge this PR anyway?

@DmitriGekhtman
Collaborator

The failing test is known to be flaky.

@DmitriGekhtman DmitriGekhtman merged commit d1eeaab into ray-project:master Jan 24, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…ery 60s instead of only upon change (ray-project#850)

Have the RayCluster controller reconcile once every 5 minutes.

Signed-off-by: Archit Kulkarni <[email protected]>
Successfully merging this pull request may close these issues.

[Bug] Sample RayJob YAML from documentation hangs at "Initializing"
7 participants