retry cordon + drain if fail, keep lock #486
Conversation
Force-pushed from 5ea94fa to 459f7e4
Seems like this will help retry some apiserver errors
source := rand.NewSource(time.Now().UnixNano())
tick := delaytick.New(source, period)
source = rand.NewSource(time.Now().UnixNano())
tick = delaytick.New(source, period)
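For context, here's a minimal sketch (not the PR's actual code) of what a delaytick-driven retry around drain could look like; drainNode, client, node, and period are assumed names here:

source := rand.NewSource(time.Now().UnixNano())
for range delaytick.New(source, period) {
    // drainNode is a hypothetical helper wrapping the k8s drain call.
    if err := drainNode(client, node); err != nil {
        log.Warnf("drain failed, will retry on the next tick: %v", err)
        continue
    }
    break // drain succeeded; proceed with the reboot
}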
Should this respect drain-timeout? Before, if a customer set a drain timeout they would eventually move forward. Now they never will?
It's a good question. Is the drain-timeout configuration meant to be taken literally (as in: just try this long to drain, and fail if it takes longer), or is it used in a more opinionated sense, like "only wait this long for the drain, but definitely reboot the node no matter what once this timeout is reached"?
The docs suggest it's just a wrapper on top of the k8s drain API, not an opinion on whether or not to proceed if the drain times out:
--drain-timeout duration timeout after which the drain is aborted (default: 0, infinite time)
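Under the literal reading, the timeout is just a deadline on the drain call itself and says nothing about rebooting anyway. A rough sketch using k8s.io/kubectl/pkg/drain (client, drainTimeout, and nodeName are assumptions here, not kured's actual wiring):

drainer := &drain.Helper{
    Client:              client,
    Ctx:                 context.Background(),
    Timeout:             drainTimeout, // 0 means wait indefinitely
    IgnoreAllDaemonSets: true,
    Out:                 os.Stdout,
    ErrOut:              os.Stderr,
}
if err := drain.RunNodeDrain(drainer, nodeName); err != nil {
    // Literal reading: surface the error (and retry later)
    // rather than rebooting regardless.
    log.Warnf("drain aborted: %v", err)
}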
Force-pushed from c1c613e to 016a177
This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).
Force-pushed from 016a177 to 93d6a78
@evrardjp if you have some cycles, could you take a look at this PR? I think it's a useful evolution of the existing assumptions of the kured contract, which implements "reboot node" as a transaction that includes "cordon/drain + actually reboot".
@ckotzbauer do we feel comfortable merging this?
I would say yes @jackfrancis
@ckotzbauer Cool, will do. Let's make sure we add explicit notes about this evolution of the retry loop in the 1.10.0 release notes.
This PR adds retry logic to each cordon/uncordon/drain operation, to account for possible failures.
Here's a summary of these changes:
The above essentially ensures that we don't leave a node in an "Unschedulable" state indefinitely. More importantly, if some general cluster pathology is preventing uncordon from working, we don't release the lock and propagate the same issue across all nodes.
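As a rough illustration of that lock behavior (uncordonNode, releaseLock, and the surrounding names are hypothetical, not the PR's actual identifiers):

// Hold the lock until uncordon actually succeeds, so a cluster-wide
// problem can't spread via other nodes acquiring the lock.
source := rand.NewSource(time.Now().UnixNano())
for range delaytick.New(source, period) {
    if err := uncordonNode(client, node); err != nil {
        log.Warnf("uncordon failed, holding lock and retrying: %v", err)
        continue
    }
    releaseLock(lock) // only release once the node is schedulable again
    break
}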
This may address some of the edge cases documented here:
#63