
feat: update RayCluster .status.reason field with pod creation error #639

Merged: 3 commits merged into ray-project:master from dxia/patch12 on Nov 29, 2022

Conversation

davidxia (Contributor) commented Oct 17, 2022:

Why are these changes needed?

Makes RayCluster errors related to Pod creation more apparent to the user.

Example before, when a Pod can't be created because a ResourceQuota is exceeded:

```
kubectl get rayclusters dxia-test2 -o yaml
...
status:
  state: failed

kubectl describe rayclusters dxia-test2
...
Status:
  State:   failed
```

Example after, when a Pod can't be created because a ResourceQuota is exceeded:

```
kubectl get rayclusters dxia-test2 -o yaml
...
status:
  reason: 'pods "dxia-test2-head-lbvdc" is forbidden: exceeded quota: quota, requested:
    limits.cpu=15,requests.cpu=15, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10'
  state: failed

kubectl describe rayclusters dxia-test2
...
Status:
  Reason:  pods "dxia-test2-head-9mdm5" is forbidden: exceeded quota: quota, requested: limits.cpu=15,requests.cpu=15, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
  State:   failed
```

Related issue number

fixes #603

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

davidxia (Contributor, Author) commented:

This requires an update to the RayCluster CRD which I know you like to avoid. The only other way is to set this info as an annotation or label. I think this case is worth updating the CRD to have a new field. Just have to get it right the first time. :) Lmk what you think.

davidxia force-pushed the dxia/patch12 branch 2 times, most recently from 66c75a6 to ad2ecac on October 17, 2022 20:27
DmitriGekhtman (Collaborator) commented:

> This requires an update to the RayCluster CRD which I know you like to avoid. The only other way is to set this info as an annotation or label. I think this case is worth updating the CRD to have a new field. Just have to get it right the first time. :) Lmk what you think.

Addition of optional fields is safe for backwards compatibility.
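For illustration, a minimal sketch of what such an optional field could look like (type and field names are illustrative, not the exact kuberay definitions):

```go
// Sketch only: the rough shape of an optional status field like the one this
// PR adds. Type and field names are illustrative, not the exact kuberay code.
package v1alpha1

// ClusterState mirrors constants such as rayiov1alpha1.Failed used elsewhere
// in this PR.
type ClusterState string

const Failed ClusterState = "failed"

// Because Reason is optional (omitempty, no default required), existing
// stored RayCluster objects and older clients keep working, which is why
// adding it is backwards compatible.
type RayClusterStatus struct {
	State  ClusterState `json:"state,omitempty"`
	Reason string       `json:"reason,omitempty"`
}
```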

davidxia (Contributor, Author) commented:

For comparison, here's the behavior of a K8s Deployment when it can't create Pods due to an exhausted ResourceQuota.

Listing Deployments shows whether each is READY with actual/desired Pod counts.

```
kubectl get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/1     0            0           11s
```

Getting the YAML shows a list of conditions with reason and message.

```
kubectl get deployments nginx-deployment -o yaml
...
status:
  conditions:
  - lastTransitionTime: "2022-10-18T03:16:51Z"
    lastUpdateTime: "2022-10-18T03:16:51Z"
    message: Created new replica set "nginx-deployment-79499496f5"
    reason: NewReplicaSetCreated
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-10-18T03:16:51Z"
    lastUpdateTime: "2022-10-18T03:16:51Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2022-10-18T03:16:51Z"
    lastUpdateTime: "2022-10-18T03:16:51Z"
    message: 'pods "nginx-deployment-79499496f5-ndzdk" is forbidden: exceeded quota:
      quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60,
      limited: limits.cpu=10,requests.cpu=10'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  observedGeneration: 1
  unavailableReplicas: 1
```

Similar detailed info is available for the underlying ReplicaSet.

```
kubectl get rs
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-79499496f5   1         0         0       49s

kubectl get rs nginx-deployment-79499496f5 -o yaml
...
status:
  conditions:
  - lastTransitionTime: "2022-10-18T03:16:51Z"
    message: 'pods "nginx-deployment-79499496f5-ndzdk" is forbidden: exceeded quota:
      quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60,
      limited: limits.cpu=10,requests.cpu=10'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  observedGeneration: 1
  replicas: 0
```

Describing the ReplicaSet shows events about the root cause.

```
kubectl describe rs nginx-deployment-79499496f5
...
Conditions:
  Type             Status  Reason
  ----             ------  ------
  ReplicaFailure   True    FailedCreate
Events:
  Type     Reason        Age                From                   Message
  ----     ------        ----               ----                   -------
  Warning  FailedCreate  63s                replicaset-controller  Error creating: pods "nginx-deployment-79499496f5-ndzdk" is forbidden: exceeded quota: quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
  Warning  FailedCreate  63s                replicaset-controller  Error creating: pods "nginx-deployment-79499496f5-df9qh" is forbidden: exceeded quota: quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
  Warning  FailedCreate  63s                replicaset-controller  Error creating: pods "nginx-deployment-79499496f5-bdcjs" is forbidden: exceeded quota: quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
  Warning  FailedCreate  63s                replicaset-controller  Error creating: pods "nginx-deployment-79499496f5-rlkgq" is forbidden: exceeded quota: quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
  Warning  FailedCreate  63s                replicaset-controller  Error creating: pods "nginx-deployment-79499496f5-qz62l" is forbidden: exceeded quota: quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
  Warning  FailedCreate  63s                replicaset-controller  Error creating: pods "nginx-deployment-79499496f5-bch2m" is forbidden: exceeded quota: quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
  Warning  FailedCreate  63s                replicaset-controller  Error creating: pods "nginx-deployment-79499496f5-td4ml" is forbidden: exceeded quota: quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
  Warning  FailedCreate  62s                replicaset-controller  Error creating: pods "nginx-deployment-79499496f5-4nvkp" is forbidden: exceeded quota: quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
  Warning  FailedCreate  62s                replicaset-controller  Error creating: pods "nginx-deployment-79499496f5-k98jh" is forbidden: exceeded quota: quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
  Warning  FailedCreate  22s (x5 over 60s)  replicaset-controller  (combined from similar events): Error creating: pods "nginx-deployment-79499496f5-z2bb8" is forbidden: exceeded quota: quota, requested: limits.cpu=11,requests.cpu=11, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
```

Might be nice to add a READY column with actual/desired Pods for RayCluster and also events?

DmitriGekhtman (Collaborator) commented:

I think a RayCluster is most like a ReplicaSet. Maybe we could do Desired, Current, Ready with a status.conditions list?
IMO, status.reason is too coarse to describe individual worker failures.
Thoughts, @Jeffwan @akanso?
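For reference, a conditions-based status along these lines might look roughly like the sketch below (field names are hypothetical; this PR keeps the simpler `.status.reason` approach):

```go
// Sketch only — one possible conditions-based status, following the upstream
// metav1.Condition convention. Field names are hypothetical; this PR does not
// add conditions.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type RayClusterConditionsStatus struct {
	// Replica counts analogous to a ReplicaSet's DESIRED/CURRENT/READY columns.
	DesiredWorkerReplicas int32 `json:"desiredWorkerReplicas,omitempty"`
	CurrentWorkerReplicas int32 `json:"currentWorkerReplicas,omitempty"`
	ReadyWorkerReplicas   int32 `json:"readyWorkerReplicas,omitempty"`
	// Conditions can carry per-failure detail, e.g. type "ReplicaFailure"
	// with reason "FailedCreate" and the quota error as the message, much
	// like the ReplicaSet example above.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```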

```
@@ -214,6 +214,9 @@ func (r *RayClusterReconciler) rayClusterReconcile(request ctrl.Request, instanc
if updateErr := r.updateClusterState(instance, rayiov1alpha1.Failed); updateErr != nil {
	r.Log.Error(updateErr, "RayCluster update state error", "cluster name", request.Name)
}
if updateErr := r.updateClusterReason(instance, err.Error()); updateErr != nil {
	r.Log.Error(updateErr, "RayCluster update reason error", "cluster name", request.Name)
```
A collaborator commented on the diff above:

This is a good start. We may want to expose more for other states.

Jeffwan (Collaborator) commented Oct 20, 2022:

> I think a RayCluster is most like a ReplicaSet. Maybe we could do Desired, Current, Ready with a status.conditions list?
> IMO, status.reason is too coarse to describe individual worker failures.
> Thoughts, @Jeffwan @akanso?

Agree, conditions would be great to have. The status field alone can't show the user the whole state machine, but conditions can.

Jeffwan (Collaborator) commented Oct 20, 2022:

@davidxia Is this PR ready? If so, please remove the WIP status.

davidxia (Contributor, Author) commented:

@Jeffwan would you like me to implement conditions in this PR instead of status or do you want conditions to be in addition to status and implemented in a future PR?

Jeffwan (Collaborator) commented Oct 31, 2022:

@davidxia I think a future PR sounds good. You've marked the PR as WIP, so I wasn't sure whether you wanted to implement it in the same PR or not.

DmitriGekhtman (Collaborator) commented:

This is a good start. Before merging, it would be great to add a simple unit test.

As a follow-up for a different PR, we could emit an event with the RayCluster as subject -- I think that would get the error into the kubectl describe output.
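A minimal sketch of that follow-up idea, assuming the reconciler is wired with a `record.EventRecorder` (the helper and reason names are illustrative):

```go
// Sketch only: emitting a Kubernetes Event with the RayCluster as the
// involved object, assuming the reconciler is given a record.EventRecorder
// (controller-runtime managers provide one via mgr.GetEventRecorderFor).
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

type podCreateReporter struct {
	recorder record.EventRecorder
}

// reportPodCreateError records a Warning event on the RayCluster object, so
// the full error shows up under "Events:" in `kubectl describe raycluster`.
func (p podCreateReporter) reportPodCreateError(rayCluster runtime.Object, err error) {
	p.recorder.Eventf(rayCluster, corev1.EventTypeWarning, "FailedCreate",
		"failed to create Pod: %v", err)
}
```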

davidxia (Contributor, Author) commented:

Thanks, I'll add some test(s) and mark as ready for review. Any tips for how to make the "Go-build-and-test / Compatibility Test - 2.0.0 (pull_request)" pass?

DmitriGekhtman (Collaborator) commented:

> Any tips for how to make the "Go-build-and-test / Compatibility Test - 2.0.0 (pull_request)" pass?

I'd ignore it. We need to smooth out some tests of experimental features.

davidxia force-pushed the dxia/patch12 branch 2 times, most recently from ac82334 to 0ec6d6a on November 1, 2022 15:15
davidxia marked this pull request as ready for review on November 1, 2022 15:19
davidxia (Contributor, Author) commented Nov 1, 2022:

Thanks, added a unit test and marked as ready for review. (follow up issue "ray-operator emit RayCluster events and .status.conditions[] for Pod creation errors")

I tested manually on my K8s cluster (deploy the new operator image, apply a ResourceQuota that's already exceeded, try to create a RayCluster in that namespace) and noticed the following difference in behavior. With the v0.3.0 release, the reconcile loop retries once every second and backs off pretty quickly if there are errors. With this change, for some reason the operator tries to reconcile a failed RayCluster every 100ms with no backoff.

See the before and after logs here. Any idea why this is happening and/or if it's an issue? I can't see what changes I made here that would alter this behavior.

DmitriGekhtman (Collaborator) commented:

> With the v0.3.0 release, the reconcile loop retries once every second and backs off pretty quickly if there are errors. With this change, for some reason the operator tries to reconcile a failed RayCluster every 100ms with no backoff.

Do you see the problematic behavior in current master? Is it possible another commit caused it?

DmitriGekhtman (Collaborator) commented Nov 4, 2022:

@kevin85421 if you have some time, could you check to see if the undesirable reconciler behavior described a couple of comments up occurs in master?
cc @sihanwang41

DmitriGekhtman (Collaborator) commented:

I've opened an issue to track investigation of the overactive reconciliation loop: #686

DmitriGekhtman (Collaborator) commented:

> I've opened an issue to track investigation of the overactive reconciliation loop: #686

Seems master is fine. I hope we can help debug the PR branch so that it can be merged.

davidxia (Contributor, Author) commented:

@DmitriGekhtman after rebasing this PR on latest master, building image, and testing the new operator on my cluster, I can't repro the original tight reconcile loop. Now it behaves as we expect. 🤷‍♂️

davidxia (Contributor, Author) commented:

actually nvm, I can still repro it when I update the RayCluster CRD. Looking more...

```
@@ -214,6 +214,9 @@ func (r *RayClusterReconciler) rayClusterReconcile(request ctrl.Request, instanc
if updateErr := r.updateClusterState(instance, rayiov1alpha1.Failed); updateErr != nil {
	r.Log.Error(updateErr, "RayCluster update state error", "cluster name", request.Name)
}
if updateErr := r.updateClusterReason(instance, err.Error()); updateErr != nil {
```
davidxia (Contributor, Author) commented on this diff:

@DmitriGekhtman The reconcile loop doesn't cease because the err.Error() string here is always different. It begins with pods "POD_NAME" is forbidden: exceeded quota: quota... and POD_NAME keeps changing. So updating the .status.reason queues the RayCluster for another reconciliation which updates .status.reason which queues it again, on and on.

I'll make the string here stable, which should fix it.

DmitriGekhtman (Collaborator) replied:

Ah, that's interesting! Outputting the same error sgtm.

davidxia (Contributor, Author) replied:

@DmitriGekhtman this is ugly and brittle. Is there a way to have the reconcile loop ignore changes to .status.reason, or perhaps .status altogether? I want to do something like this. I think the relevant code is here and/or here? But I'm not sure and need help.

DmitriGekhtman (Collaborator) replied:

Hm, maybe we could just set the reason to something like "Pod reconcile failed" and then display the full error by emitting an event, which would be easily accessible via kubectl describe.

davidxia (Contributor, Author) replied:

Sgtm, I’ll take a look soon

DmitriGekhtman (Collaborator) replied:

Let's use Or to also watch for changes to Labels and Annotations.

CR annotations are currently taken into account by the operator.
Labels are not taken into account for now, but they might be later.
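Roughly, the combined predicate could be wired up like the sketch below using controller-runtime's `predicate.Or` (the setup function name and the RayCluster import path are illustrative, not the exact kuberay code):

```go
// Sketch only — how GenerationChangedPredicate can be combined with label and
// annotation predicates on the controller builder.
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1" // illustrative import path
)

func setupRayClusterController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayiov1alpha1.RayCluster{}).
		// Requeue on spec changes (generation bumps) OR label/annotation
		// changes; pure status updates are ignored, which breaks the
		// status-update feedback loop described above.
		WithEventFilter(predicate.Or(
			predicate.GenerationChangedPredicate{},
			predicate.LabelChangedPredicate{},
			predicate.AnnotationChangedPredicate{},
		)).
		Complete(r)
}
```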

DmitriGekhtman (Collaborator) replied:

The con of ignoring status changes is that extraneous status changes are not quickly fixed. I think that's a minor issue, though.

davidxia (Contributor, Author) replied:

Thanks, updated. I also included emitting the event since it's small and straightforward. Lmk if there are any good examples of how to test the event or the label/annotation update-triggers-reconciliation logic.

davidxia (Contributor, Author) replied:

Tested on my cluster. Updating labels or annotations triggers reconciliation as expected. Events also show up as expected. See gist

DmitriGekhtman (Collaborator) replied:

Nice, thanks!

…sn't change

by adding `predicate.GenerationChangedPredicate` to the reconciler.

> This predicate will skip update events that have no change in the object's
> metadata.generation field. The metadata.generation field of an object is
> incremented by the API server when writes are made to the spec field of an
> object. This allows a controller to ignore update events where the spec is
> unchanged, and only the metadata and/or status fields are changed.

https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/predicate#GenerationChangedPredicate

Without this change, the controller gets stuck in a tight loop repeatedly
updating the same RayCluster when a ResourceQuota is exhausted: the controller
updates the RayCluster .status, gets an event about the forbidden Pod with a
different name, updates the .status again, gets another event, and so on.

and emit event on RayCluster for Pod reconciliation errors.
DmitriGekhtman (Collaborator) left a review:

Thanks, this is a great improvement!

davidxia (Contributor, Author) commented:

Lmk if this can be included in upcoming 0.4.0 release.

DmitriGekhtman (Collaborator) commented:

> Lmk if this can be included in upcoming 0.4.0 release.

It will probably be included.
Let's wait a business day or two for more feedback, though.

DmitriGekhtman merged commit 3a64712 into ray-project:master on Nov 29, 2022
davidxia deleted the dxia/patch12 branch on November 29, 2022 14:37
davidxia (Contributor, Author) commented:

Thanks so much!

davidxia added commits to davidxia/kuberay that referenced this pull request (Jan 23–25, 2023):
ray-project#639 accidentally applied event filters for child resources Pods and Services.
This change does not filter Pod or Service related events. This means Pod
updates will trigger RayCluster reconciliation.

closes ray-project#872
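For context on that fix, controller-runtime can scope predicates to the primary resource only, e.g. with `builder.WithPredicates` on `For(...)`, so that owned Pod and Service events are not filtered; a hedged sketch (setup function name and RayCluster import path are illustrative):

```go
// Sketch only — scoping the predicate to RayCluster events so child Pod and
// Service events still trigger reconciliation, which is the behavior the
// follow-up change restores.
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1" // illustrative import path
)

func setupRayClusterControllerScoped(mgr ctrl.Manager, r reconcile.Reconciler) error {
	filter := predicate.Or(
		predicate.GenerationChangedPredicate{},
		predicate.LabelChangedPredicate{},
		predicate.AnnotationChangedPredicate{},
	)
	return ctrl.NewControllerManagedBy(mgr).
		// The filter applies only to RayCluster events here...
		For(&rayiov1alpha1.RayCluster{}, builder.WithPredicates(filter)).
		// ...so owned Pod and Service events are not filtered, unlike
		// .WithEventFilter(filter), which applies to every watched type.
		Owns(&corev1.Pod{}).
		Owns(&corev1.Service{}).
		Complete(r)
}
```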
kevin85421 pushed a commit that referenced this pull request Jan 29, 2023
#639 accidentally applied event filters for child resources Pods and Services. This change does not filter Pod or Service related events. This means Pod updates will trigger RayCluster reconciliation.
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023 (ray-project#639)

* feat: update RayCluster `.status.reason` field with pod creation error

Makes RayCluster errors related to Pod creation more apparent to the user.
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
ray-project#639 accidentally applied event filters for child resources Pods and Services. This change does not filter Pod or Service related events. This means Pod updates will trigger RayCluster reconciliation.
Successfully merging this pull request may close these issues.

[Feature][ray-operator] Make pod creation errors accessible
3 participants