fix: retry watcher failure causes infinite loop #8786

Merged
1 commit merged on Feb 1, 2024

Conversation

eecsliu
Contributor

@eecsliu eecsliu commented Feb 1, 2024

Description

When the k8s retrywatcher fails, we receive an endless stream of nil events, which pins the CPU at 100%. Since a retrywatcher failure likely means something has gone seriously wrong, we now panic when it happens.
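
To illustrate the failure mode, here is a minimal sketch that assumes the nil events come from receiving on a closed result channel (in Go, a receive on a closed channel returns the zero value immediately, every time). The `event` type, `drain`, and `errWatcherClosed` are hypothetical names for illustration, not the actual code in `informer.go`. The sketch returns an error on closure; a caller could panic on that error, as this PR does.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a watch event. Its zero value has a
// nil Object, which is what a receive on a closed channel would yield.
type event struct {
	Object any
}

// errWatcherClosed is a hypothetical sentinel reporting that the watcher's
// result channel was closed.
var errWatcherClosed = fmt.Errorf("watcher result channel closed unexpectedly")

// drain handles events until the channel is closed, then reports the closure
// instead of looping forever on zero-value events.
func drain(events <-chan event, handle func(event)) error {
	for {
		ev, ok := <-events
		if !ok {
			// Without this check, every receive on the closed channel would
			// return at once with a zero-value event and the loop would spin.
			return errWatcherClosed
		}
		handle(ev)
	}
}

func main() {
	events := make(chan event, 2)
	events <- event{Object: "pod-a"}
	events <- event{Object: "pod-b"}
	close(events) // simulate the watcher failing

	handled := 0
	err := drain(events, func(event) { handled++ })
	fmt.Println(handled, err)
}
```

The two-value receive (`ev, ok := <-events`) is what separates a genuine event from a closed channel; a plain `ev := <-events` cannot tell them apart.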

Test Plan

Manually tested: with this change applied, the cluster continues to work normally without crashing. The bug this fixes occurs infrequently, and we'll land a test in a follow-up PR.

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@eecsliu eecsliu requested a review from a team as a code owner February 1, 2024 21:19
@cla-bot cla-bot bot added the cla-signed label Feb 1, 2024

netlify bot commented Feb 1, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit c93c758
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/65bc0ade15517d000843625e

@eecsliu eecsliu requested review from stoksc and removed request for amandavialva01 February 1, 2024 21:19
@eecsliu eecsliu added the to-cherry-pick Pull requests that need to be cherry-picked into the current release label Feb 1, 2024

codecov bot commented Feb 1, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (36a2e29) 47.69% compared to head (c93c758) 47.69%.
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8786      +/-   ##
==========================================
- Coverage   47.69%   47.69%   -0.01%     
==========================================
  Files        1049     1049              
  Lines      167230   167233       +3     
  Branches     2239     2241       +2     
==========================================
- Hits        79766    79765       -1     
- Misses      87306    87310       +4     
  Partials      158      158              
Flag Coverage Δ
backend 43.09% <50.00%> (-0.01%) ⬇️
harness 64.32% <ø> (ø)
web 42.54% <ø> (ø)

Files Coverage Δ
master/internal/rm/kubernetesrm/informer.go 78.57% <50.00%> (-1.68%) ⬇️

... and 2 files with indirect coverage changes

Contributor

@stoksc stoksc left a comment

lgtm. i'm still not sure what caused the retry watcher to crash, but it definitely did, so this is the best option we have right now.

@eecsliu eecsliu merged commit e873381 into main Feb 1, 2024
86 of 96 checks passed
@eecsliu eecsliu deleted the k8s-retrywatcher-fail branch February 1, 2024 23:15
dai-release bot pushed a commit that referenced this pull request Feb 1, 2024
Labels
cla-signed to-cherry-pick Pull requests that need to be cherry-picked into the current release
2 participants