fix: retry watcher failure causes infinite loop #8786
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8786      +/-   ##
==========================================
- Coverage   47.69%   47.69%   -0.01%
==========================================
  Files        1049     1049
  Lines      167230   167233       +3
  Branches     2239     2241       +2
==========================================
- Hits        79766    79765       -1
- Misses      87306    87310       +4
  Partials      158      158
lgtm. i'm still not sure what caused the retry watcher to crash, but it definitely did, and so this is the best option we have right now.
(cherry picked from commit e873381)
Description
When the k8s RetryWatcher fails, we receive an endless stream of nil events, which pins the CPU at 100%. Since a RetryWatcher failure likely means something has gone badly wrong, we now panic when this happens.
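The failure mode described above can be illustrated with plain Go channels. This is a minimal sketch, not the code from this PR: it assumes the nil events come from receiving on a channel that has been closed (in Go, such a receive returns a zero value immediately instead of blocking), which the PR does not confirm. `Event`, `ErrWatcherClosed`, `drainNaive`, and `consume` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a hypothetical stand-in for a watch event; a nil Object mirrors
// the "nil events" mentioned in the description.
type Event struct {
	Object interface{}
}

// ErrWatcherClosed is a hypothetical sentinel error for a closed event channel.
var ErrWatcherClosed = errors.New("watcher channel closed")

// drainNaive performs exactly limit receives without checking whether the
// channel is still open. Once the channel is closed, every receive returns
// immediately with a zero-value Event, so the loop spins instead of blocking.
// limit exists only so that this demonstration terminates.
func drainNaive(events <-chan Event, limit int) int {
	nilEvents := 0
	for i := 0; i < limit; i++ {
		ev := <-events
		if ev.Object == nil {
			nilEvents++
		}
	}
	return nilEvents
}

// consume handles events until the channel is closed, then returns an error
// so that the caller can fail loudly (for example by panicking) instead of
// spinning on zero values.
func consume(events <-chan Event, handle func(Event)) error {
	for {
		ev, ok := <-events
		if !ok {
			return fmt.Errorf("consume: %w", ErrWatcherClosed)
		}
		handle(ev)
	}
}
```

With a closed channel, `drainNaive` counts one nil event per iteration without ever blocking, which is the shape of a busy loop. `consume` uses the two-value receive form to detect the closed channel and stops on the first such receive; whether to panic or attempt recovery at that point is left to the caller.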
Test Plan
Manually tested: as long as the cluster works normally without crashing, this fix works. The bug this fixes occurs infrequently, so we'll land a test PR after this one.