retry cordon + drain if fail, keep lock #486

Merged
merged 1 commit into kubereboot:main from retry-cordon-drain on May 6, 2022

Conversation

@jackfrancis (Collaborator) commented Dec 16, 2021

This PR adds retry logic to each cordon/uncordon/drain operation, to account for possible failures.

Here's a summary of these changes:

  • If we're just instantiating and we already hold the lock (this is the scenario that occurs when kured has just rebooted the node and the pod comes back online):
    • if we fail to get the k8s client for any reason, we retry in ~1 minute (continuously until we succeed)
    • if we fail to uncordon the node, we retry in ~1 minute (continuously until we succeed)
    • if we fail to delete any annotations, we retry in ~1 minute (continuously until we succeed)

The above essentially ensures that we don't leave a node in an "Unschedulable" state indefinitely. More importantly, if some general cluster pathology is preventing uncordon from working, we don't release the lock and propagate that pathology across all nodes, reproducing the same issue everywhere.
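For illustration, here is a minimal sketch of that retry-until-success startup path. This is not kured's actual code; getClient, uncordon, and deleteAnnotations are hypothetical placeholders for the real client-go calls.

```go
package sketch

import (
	"errors"
	"log"
	"time"
)

// Hypothetical placeholders for the real operations (kured talks to the
// Kubernetes API via client-go; these stubs just stand in for those calls).
func getClient() (interface{}, error)            { return nil, errors.New("not implemented") }
func uncordon(client interface{}) error          { return errors.New("not implemented") }
func deleteAnnotations(client interface{}) error { return errors.New("not implemented") }

// recoverAfterReboot retries each startup step roughly every minute until it
// succeeds, so a freshly rebooted node is never left Unschedulable indefinitely.
func recoverAfterReboot() {
	for {
		client, err := getClient()
		if err != nil {
			log.Printf("error getting k8s client: %v; retrying in ~1m", err)
			time.Sleep(time.Minute)
			continue
		}
		if err := uncordon(client); err != nil {
			log.Printf("error uncordoning node: %v; retrying in ~1m", err)
			time.Sleep(time.Minute)
			continue
		}
		if err := deleteAnnotations(client); err != nil {
			log.Printf("error deleting annotations: %v; retrying in ~1m", err)
			time.Sleep(time.Minute)
			continue
		}
		return
	}
}
```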

  • If we fail to add a necessary annotation when a pending reboot is required, we retry the loop and don't proceed to reboot until the annotation succeeds
  • If we fail to cordon and drain the node, and we aren't in a "force reboot" configuration, then we release the reboot lock, uncordon the node, and go to the back of the line. The next time we acquire the lock and observe that a reboot is needed, we retry the cordon and drain; if it keeps failing, we keep retrying indefinitely. In a "force reboot" scenario, by contrast, we continue onwards and reboot the node despite the failed cordon and drain and any workloads still running on it, which is how the "force reboot" feature is documented to work (see the sketch after this list).
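And a similarly hedged sketch of that last bullet's control flow. Again, this is not the actual main loop; acquireLock, releaseLock, cordonAndDrain, uncordonNode, and rebootNode are hypothetical stand-ins.

```go
package sketch

import "log"

// Hypothetical stand-ins for the real lock, drain, and reboot operations.
func acquireLock() bool     { return false }
func releaseLock()          {}
func cordonAndDrain() error { return nil }
func uncordonNode()         {}
func rebootNode()           {}

// maybeReboot shows the intended flow: without force-reboot, a failed
// cordon/drain gives the lock back so the whole transaction is retried on a
// later pass; with force-reboot, we reboot anyway.
func maybeReboot(forceReboot bool) {
	if !acquireLock() {
		return // someone else holds the reboot lock; try again next period
	}
	if err := cordonAndDrain(); err != nil {
		if !forceReboot {
			log.Printf("cordon/drain failed: %v; uncordoning, releasing lock, retrying later", err)
			uncordonNode()
			releaseLock()
			return
		}
		// Force-reboot: continue even though workloads may still be running.
		log.Printf("cordon/drain failed: %v; rebooting anyway (force reboot)", err)
	}
	rebootNode()
}
```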

This may address some of the edge cases documented here:

#63

@paulgmiller left a comment:

Seems like this will help retry some apiserver errors

```go
source := rand.NewSource(time.Now().UnixNano())
tick := delaytick.New(source, period)
source = rand.NewSource(time.Now().UnixNano())
tick = delaytick.New(source, period)
```

should this respect drain-timeout? Before, if a customer set a drain timeout they would eventually move forward. Now they never will?

@jackfrancis (Collaborator, Author) replied:

It's a good question. Is the drain-timeout configuration meant to be taken literally (as in: just try only this long to drain, and fail if it takes longer) or is it used in a more opinionated context like "only wait this long to drain nodes, but definitely reboot the node no matter what if this timeout is reached"?

The docs suggest it's just a wrapper on top of the k8s drain API, and not an opinion of whether or not to proceed if drain times out:

--drain-timeout duration timeout after which the drain is aborted (default: 0, infinite time)
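For context, here is a rough sketch of how such a drain timeout is typically wired into the upstream k8s.io/kubectl/pkg/drain helper. This is not kured's exact code, and the specific Helper field choices here are assumptions; the point is that the timeout only bounds the drain call itself and carries no opinion about whether to reboot afterwards.

```go
package sketch

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode bounds only the drain operation; deciding what to do when it
// returns a timeout error is left entirely to the caller.
func drainNode(client kubernetes.Interface, nodeName string, drainTimeout time.Duration) error {
	helper := &drain.Helper{
		Ctx:                 context.Background(),
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1,
		Timeout:             drainTimeout, // 0 means wait indefinitely
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	return drain.RunNodeDrain(helper, nodeName)
}
```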

@jackfrancis force-pushed the retry-cordon-drain branch 2 times, most recently from c1c613e to 016a177, on December 16, 2021 at 22:26
@github-actions (bot):

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

@jackfrancis (Collaborator, Author):

@evrardjp if you have some cycles could you take a look at this PR? I think it's a useful evolution of the existing assumptions of the kured contract which implements "reboot node" as a transaction that includes "cordon/drain + actually reboot".

@ckotzbauer ckotzbauer added this to the 1.10.0 milestone Apr 2, 2022
@jackfrancis (Collaborator, Author):

@ckotzbauer do we feel comfortable merging this?

@ckotzbauer (Member):

I would say yes @jackfrancis

@jackfrancis (Collaborator, Author):

@ckotzbauer Cool, will do. Let's make sure we add explicit notes about this evolution of the retry loop to the 1.10.0 release notes.

@jackfrancis jackfrancis merged commit d965e7f into kubereboot:main May 6, 2022
@jackfrancis jackfrancis deleted the retry-cordon-drain branch May 6, 2022 19:19