
[RayService] Failed to update error when head pod is gone #374

Closed
simon-mo opened this issue Jul 14, 2022 · 3 comments

@simon-mo
Collaborator

First of all, this error handling block is broken: the logger should log errStatus instead of the global err.

if errStatus := r.Status().Update(ctx, rayServiceInstance); errStatus != nil {
    rayServiceLog.Error(err, "Fail to update status of RayService", "rayServiceInstance", rayServiceInstance)
    return ctrl.Result{}, err
}
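For reference, a minimal sketch of the corrected block, logging (and, as an assumption, also returning) errStatus:

if errStatus := r.Status().Update(ctx, rayServiceInstance); errStatus != nil {
    // Log the status-update error itself rather than the unrelated global err.
    rayServiceLog.Error(errStatus, "Fail to update status of RayService", "rayServiceInstance", rayServiceInstance)
    return ctrl.Result{}, errStatus
}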

But if we do explicitly log the errStatus message, under one failure condition, it will say:

"error": "Operation cannot be fulfilled on rayservices.ray.io \"rayservice-sample\": the object has been modified; please apply your changes to the latest version and try again", "error": "Operation cannot be fulfilled on rayservices.ray.io \"rayservice-sample\": the object has been modified; please apply your changes to the latest version and try again"}

Steps to repro:

  1. create the sample rayservice
  2. kubectl delete pod HEAD_POD
  3. tail the controller logs
@brucez-anyscale
Contributor

Thanks for finding the typo issue. Will fix it soon.
The second issue is quite common: it is caused by the resource version of the CR changing, and the controller will simply try another reconcile loop.
It is a normal case.
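To illustrate (a hedged sketch, not KubeRay's actual code), a common controller-runtime pattern is to treat this conflict as benign and requeue rather than surface it as an error:

if errStatus := r.Status().Update(ctx, rayServiceInstance); errStatus != nil {
    // apierrors is k8s.io/apimachinery/pkg/api/errors.
    if apierrors.IsConflict(errStatus) {
        // Someone else updated the CR since we read it (stale resourceVersion);
        // requeue and reconcile against the latest version next time.
        return ctrl.Result{Requeue: true}, nil
    }
    rayServiceLog.Error(errStatus, "Fail to update status of RayService", "rayServiceInstance", rayServiceInstance)
    return ctrl.Result{}, errStatus
}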

@simon-mo
Collaborator Author

@brucez-anyscale But I would like to understand why the CR would change once, and then change again here, within the same Reconcile loop iteration?

@brucez-anyscale
Contributor

If we do a very simple reconcile loop:
Read a CR, update state, update the CR. I think the issue would still be there; it should be a common case in k8s.
If you have more insights, or you find out that we update the CR twice in the loop, that could fix the issue!
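For context, a minimal sketch of that read-modify-update pattern with one common mitigation, client-go's retry.RetryOnConflict, which re-reads the latest version on each attempt (illustrative only; the status field and value below are hypothetical):

// Uses k8s.io/client-go/util/retry.
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    // Re-read the latest version of the CR on every attempt.
    if err := r.Get(ctx, req.NamespacedName, rayServiceInstance); err != nil {
        return err
    }
    // Update state on the freshly read object (hypothetical field/value).
    rayServiceInstance.Status.ServiceStatus = desiredStatus
    // A stale resourceVersion here returns a Conflict, which RetryOnConflict
    // catches before retrying with the re-read object.
    return r.Status().Update(ctx, rayServiceInstance)
})
if err != nil {
    return ctrl.Result{}, err
}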
