Improve safe time to reboot handling and more #214
Skipping CI for Draft Pull Request. |
/test 4.15-openshift-e2e |
e7fda78
to
1259ec5
Compare
/test 4.15-openshift-e2e |
5caab5f
to
be3ac95
Compare
/test 4.15-openshift-e2e |
be3ac95
to
2aca387
Compare
I left some comments, but didn't finish the review. I'll come back tomorrow
/test 4.15-openshift-e2e |
/test 4.15-openshift-e2e |
bf6ffbe
to
a7fd2de
Compare
/test 4.15-openshift-e2e |
Signed-off-by: Marc Sluiter <[email protected]>
- wrap long log lines
- improve comment
Signed-off-by: Marc Sluiter <[email protected]>
- return an error if config isn't set yet
- in case of an error, retry instead of using hard coded fallback
- set client and logger in main.go
- renaming to avoid package name duplication
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
- deduplicate batch size calculations
- better structure and comments for reboot duration calculation
- fix usage of MaxTimeForNoPeersResponse in calculation
- unit tests for both
Signed-off-by: Marc Sluiter <[email protected]>
7ddf56a
to
a487c23
Compare
/test 4.15-openshift-e2e |
Decreases risk of introducing issues with the MaxTimeForNoPeersResponse usage change Signed-off-by: Marc Sluiter <[email protected]>
/test 4.15-openshift-e2e |
/retest is one of the new tests flaky...? /test 4.12-test |
crap, retest was wrong for the github workflow of course 🙈 |
Signed-off-by: Marc Sluiter <[email protected]>
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: clobrano, slintes
all discussions resolved /hold cancel |
Why we need this PR
There are several issues in the current implementation of handling the safe time to reboot.
Context: the safe time to reboot determines how long we wait until we assume that the node has rebooted and no workloads are running anymore. After that, we continue with the remediation process and accelerate workload rescheduling by deleting pods, or by applying the OutOfService taint to use the ungraceful node shutdown feature.
That time can be set in the SNRConfig (SafeTimeToAssumeNodeRebootedSeconds). We also calculate a minimum, which depends on other values in the config, as well as on cluster size and watchdog timeout. The minimum time is calculated by all SNR agents on pod start, stored by the unhealthy node's agent on the SNR CR during remediation, and then used by both the unhealthy node's agent and the manager for further remediation.
The issues are:
a) the agents crashloop in case the specified time is lower than the calculated minimum.
b) our own default value can be too low.
c) the calculated time might not be accurate anymore during remediation because of cluster size change.
d) the unhealthy node's agent might not be able to store its calculated time during remediation, so the manager won't use it and will fall back to the specified value.
e) even if the agent were able to store the value, there is a race condition with the manager, which wants to store the specified value.
f) both the unhealthy node's agent and the manager run most of the remediation code in parallel, which leads to unnecessary conflicts on resource updates and to code execution flows that are hard to understand and maintain.
Changes made
These changes fix a+c+d+e:
This fixes a+b:
This fixes f:
More fixes done during development:
MaxTimeForNoPeersResponse correctly
Which issue(s) this PR fixes
Fixes ECOPROJECT-1875
and more
Supersedes #197
Test plan