[Bug] Pod updates should trigger RayCluster reconciliation #872

Closed
architkulkarni opened this issue Jan 18, 2023 · 1 comment · Fixed by #882
Labels
bug Something isn't working · P2 Important issue, but not time critical

Comments

architkulkarni (Contributor) commented Jan 18, 2023

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

See #850 (comment) and related comments on that PR.

Currently, pod state changes don't trigger cluster reconciliation. This is due to an event filter:

WithEventFilter(predicate.Or(predicate.GenerationChangedPredicate{}, predicate.LabelChangedPredicate{}, predicate.AnnotationChangedPredicate{})).

We should stop filtering out Pod status changes while continuing to filter out RayCluster status changes. We also need to take care to avoid reconciling in a tight loop: instance.Status.LastUpdateTime is updated at each reconciliation, so an unfiltered RayCluster status update would immediately trigger another reconcile.
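
For illustration, here is a minimal sketch of one way to achieve this: attach the predicates to the primary RayCluster resource via builder.WithPredicates instead of applying them controller-wide with WithEventFilter, so that RayCluster status-only updates stay filtered while Pod and Service events from owned resources get through. The reconciler type and the RayCluster API import path are assumptions here; the actual change is in #882 and may differ.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	// Assumed import path for the RayCluster API types.
	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// RayClusterReconciler is a placeholder for the real reconciler in ray-operator.
type RayClusterReconciler struct {
	client.Client
}

func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The real reconciliation logic (Pods, Services, status updates) is elided.
	return ctrl.Result{}, nil
}

// SetupWithManager scopes the generation/label/annotation predicates to the
// primary RayCluster resource only: RayCluster status-only updates are still
// ignored, but Pod and Service events from owned resources now reach the
// reconciler, so Pod updates trigger RayCluster reconciliation.
func (r *RayClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayiov1alpha1.RayCluster{}, builder.WithPredicates(predicate.Or(
			predicate.GenerationChangedPredicate{},
			predicate.LabelChangedPredicate{},
			predicate.AnnotationChangedPredicate{},
		))).
		Owns(&corev1.Pod{}).
		Owns(&corev1.Service{}).
		Complete(r)
}
```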

Reproduction script

Same reproduction as in #845. That issue will soon be closed by #850 with a separate fix (always reconciling every 60 seconds), but the present issue is still valid. Fixing it would eliminate the delay of up to 60 seconds between the cluster becoming ready and the Ray job being submitted, since the submission is currently only triggered by the next periodic reconcile.
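
For context, the interim fix in #850 boils down to an unconditional periodic requeue. Below is a minimal sketch of that pattern (the function name is hypothetical and this is not the actual #850 code):

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// reconcileWithPeriodicResync sketches the periodic-reconcile approach:
// returning RequeueAfter makes the controller run Reconcile again after 60
// seconds even if no watched event arrives, so the worst-case delay between
// the cluster becoming ready and the next reconcile (and hence job
// submission) is bounded by the requeue interval rather than eliminated.
func reconcileWithPeriodicResync(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The actual reconciliation work (Pods, Services, job submission) is elided.
	return ctrl.Result{RequeueAfter: 60 * time.Second}, nil
}
```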

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
architkulkarni added the bug and P2 labels on Jan 18, 2023
davidxia added a commit to davidxia/kuberay that referenced this issue on Jan 23, 2023:

ray-project#639 accidentally applied event filters to the child resources Pods and Services. This change stops filtering Pod- and Service-related events, so Pod updates will trigger RayCluster reconciliation.

closes ray-project#872
@davidxia (Contributor)

@architkulkarni @DmitriGekhtman thanks for noticing this! I just proposed a fix in #882. I don't know how to test it manually though. Lmk if it looks good.

davidxia added further commits to davidxia/kuberay that referenced this issue on Jan 25, 2023 (same change as above).