
[RayService] Failed to update error when head pod is gone #374

Closed
simon-mo opened this issue Jul 14, 2022 · 3 comments

@simon-mo
Collaborator

First of all, this error handling block is broken: the logger should log errStatus instead of the global err.

if errStatus := r.Status().Update(ctx, rayServiceInstance); errStatus != nil {
    rayServiceLog.Error(err, "Fail to update status of RayService", "rayServiceInstance", rayServiceInstance)
    return ctrl.Result{}, err
}
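For reference, a minimal sketch of the corrected block, logging (and, as an assumption, also returning) errStatus:

if errStatus := r.Status().Update(ctx, rayServiceInstance); errStatus != nil {
    // Log the status-update error itself rather than the unrelated global err.
    rayServiceLog.Error(errStatus, "Fail to update status of RayService", "rayServiceInstance", rayServiceInstance)
    return ctrl.Result{}, errStatus
}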

But if we do explicitly log the errStatus message, under one failure condition, it will say:

"error": "Operation cannot be fulfilled on rayservices.ray.io \"rayservice-sample\": the object has been modified; please apply your changes to the latest version and try again", "error": "Operation cannot be fulfilled on rayservices.ray.io \"rayservice-sample\": the object has been modified; please apply your changes to the latest version and try again"}

Steps to repro:

  1. create the sample rayservice
  2. kubectl delete pod HEAD_POD
  3. tail the controller logs
@brucez-anyscale
Contributor

Thanks for finding the typo issue. Will fix it soon.
The second issue is quite common: it is caused by the resource version of the CR changing, and the controller will simply try another reconcile loop.
It is a normal case.
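To illustrate (a hedged sketch, not KubeRay's actual code), a common controller-runtime pattern is to treat this conflict as benign and requeue rather than surface it as an error:

if errStatus := r.Status().Update(ctx, rayServiceInstance); errStatus != nil {
    // apierrors is k8s.io/apimachinery/pkg/api/errors.
    if apierrors.IsConflict(errStatus) {
        // Someone else updated the CR since we read it (stale resourceVersion);
        // requeue and reconcile against the latest version next time.
        return ctrl.Result{Requeue: true}, nil
    }
    rayServiceLog.Error(errStatus, "Fail to update status of RayService", "rayServiceInstance", rayServiceInstance)
    return ctrl.Result{}, errStatus
}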

@simon-mo
Collaborator Author

@brucez-anyscale But I would like to understand why the CR would change once, and then change again here, within the same Reconcile loop iteration?

@brucez-anyscale
Contributor

If we do a very simple reconcile loop:
Read a CR, update state, update the CR. I think the issue would still be there; it should be a common case in k8s.
If you have more insights, or you find out that we update the CR twice in the loop, that could fix the issue!
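For context, a minimal sketch of that read-modify-update pattern with one common mitigation, client-go's retry.RetryOnConflict, which re-reads the latest version on each attempt (illustrative only; the status field and value below are hypothetical):

// Uses k8s.io/client-go/util/retry.
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    // Re-read the latest version of the CR on every attempt.
    if err := r.Get(ctx, req.NamespacedName, rayServiceInstance); err != nil {
        return err
    }
    // Update state on the freshly read object (hypothetical field/value).
    rayServiceInstance.Status.ServiceStatus = desiredStatus
    // A stale resourceVersion here returns a Conflict, which RetryOnConflict
    // catches before retrying with the re-read object.
    return r.Status().Update(ctx, rayServiceInstance)
})
if err != nil {
    return ctrl.Result{}, err
}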
