retry cordon + drain if fail, keep lock #486

Merged
merged 1 commit into kubereboot:main from retry-cordon-drain on May 6, 2022

Conversation

@jackfrancis (Collaborator) commented Dec 16, 2021

This PR adds retry logic to each cordon/uncordon/drain operation, to account for possible failures.

Here's a summary of these changes:

  • If we're just instantiating and we already hold the lock (this is the scenario that occurs when kured has just rebooted the node and the pod comes back online):
    • if we fail to get the k8s client for any reason, we retry in ~1 minute (continuously until we succeed)
    • if we fail to uncordon the node, we retry in ~1 minute (continuously until we succeed)
    • if we fail to delete any annotations, we retry in ~1 minute (continuously until we succeed)

The above essentially ensures that we don't leave a node in an "Unschedulable" state indefinitely. More importantly, if some general cluster pathology is preventing uncordon from working, we don't release the lock and propagate that pathology across all nodes, reproducing the same issue everywhere.
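For illustration, here is a minimal sketch of that retry-until-success startup path. This is not kured's actual code; getClient, uncordon, and deleteAnnotations are hypothetical placeholders for the real client-go calls.

```go
package sketch

import (
	"errors"
	"log"
	"time"
)

// Hypothetical placeholders for the real operations (kured talks to the
// Kubernetes API via client-go; these stubs just stand in for those calls).
func getClient() (interface{}, error)            { return nil, errors.New("not implemented") }
func uncordon(client interface{}) error          { return errors.New("not implemented") }
func deleteAnnotations(client interface{}) error { return errors.New("not implemented") }

// recoverAfterReboot retries each startup step roughly every minute until it
// succeeds, so a freshly rebooted node is never left Unschedulable indefinitely.
func recoverAfterReboot() {
	for {
		client, err := getClient()
		if err != nil {
			log.Printf("error getting k8s client: %v; retrying in ~1m", err)
			time.Sleep(time.Minute)
			continue
		}
		if err := uncordon(client); err != nil {
			log.Printf("error uncordoning node: %v; retrying in ~1m", err)
			time.Sleep(time.Minute)
			continue
		}
		if err := deleteAnnotations(client); err != nil {
			log.Printf("error deleting annotations: %v; retrying in ~1m", err)
			time.Sleep(time.Minute)
			continue
		}
		return
	}
}
```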

  • If we fail to add a necessary annotation when a pending reboot is required, we retry the loop and don't proceed to reboot until the annotation succeeds
  • If we fail to cordon and drain the node, and we aren't in a "force reboot" configuration, then we release the reboot lock, uncordon the node, and go to the back of the line. The next time we acquire the lock and observe that a reboot is needed, we retry the cordon and drain; if it keeps failing, we keep retrying indefinitely. In a "force reboot" scenario, by contrast, we continue onwards and reboot the node despite the failed cordon and drain and any workloads still running on it, which is how the "force reboot" feature is documented to work (see the sketch after this list).
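And a similarly hedged sketch of that last bullet's control flow. Again, this is not the actual main loop; acquireLock, releaseLock, cordonAndDrain, uncordonNode, and rebootNode are hypothetical stand-ins.

```go
package sketch

import "log"

// Hypothetical stand-ins for the real lock, drain, and reboot operations.
func acquireLock() bool     { return false }
func releaseLock()          {}
func cordonAndDrain() error { return nil }
func uncordonNode()         {}
func rebootNode()           {}

// maybeReboot shows the intended flow: without force-reboot, a failed
// cordon/drain gives the lock back so the whole transaction is retried on a
// later pass; with force-reboot, we reboot anyway.
func maybeReboot(forceReboot bool) {
	if !acquireLock() {
		return // someone else holds the reboot lock; try again next period
	}
	if err := cordonAndDrain(); err != nil {
		if !forceReboot {
			log.Printf("cordon/drain failed: %v; uncordoning, releasing lock, retrying later", err)
			uncordonNode()
			releaseLock()
			return
		}
		// Force-reboot: continue even though workloads may still be running.
		log.Printf("cordon/drain failed: %v; rebooting anyway (force reboot)", err)
	}
	rebootNode()
}
```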

This may address some of the edge cases documented here:

#63

@paulgmiller left a comment:

Seems like this will help retry some apiserver errors

```go
source := rand.NewSource(time.Now().UnixNano())
tick := delaytick.New(source, period)
source = rand.NewSource(time.Now().UnixNano())
tick = delaytick.New(source, period)
```

should this respect drain-timeout? Before, if a customer set a drain timeout they would eventually move forward. Now they never will?

@jackfrancis (Collaborator, Author) replied:

It's a good question. Is the drain-timeout configuration meant to be taken literally (as in: just try only this long to drain, and fail if it takes longer) or is it used in a more opinionated context like "only wait this long to drain nodes, but definitely reboot the node no matter what if this timeout is reached"?

The docs suggest it's just a wrapper on top of the k8s drain API, and not an opinion of whether or not to proceed if drain times out:

--drain-timeout duration timeout after which the drain is aborted (default: 0, infinite time)
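For context, here is a rough sketch of how such a drain timeout is typically wired into the upstream k8s.io/kubectl/pkg/drain helper. This is not kured's exact code, and the specific Helper field choices here are assumptions; the point is that the timeout only bounds the drain call itself and carries no opinion about whether to reboot afterwards.

```go
package sketch

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode bounds only the drain operation; deciding what to do when it
// returns a timeout error is left entirely to the caller.
func drainNode(client kubernetes.Interface, nodeName string, drainTimeout time.Duration) error {
	helper := &drain.Helper{
		Ctx:                 context.Background(),
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1,
		Timeout:             drainTimeout, // 0 means wait indefinitely
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	return drain.RunNodeDrain(helper, nodeName)
}
```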

@jackfrancis force-pushed the retry-cordon-drain branch 2 times, most recently from c1c613e to 016a177, on December 16, 2021 at 22:26
@github-actions (bot):

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

@jackfrancis (Collaborator, Author):

@evrardjp if you have some cycles could you take a look at this PR? I think it's a useful evolution of the existing assumptions of the kured contract which implements "reboot node" as a transaction that includes "cordon/drain + actually reboot".

@ckotzbauer ckotzbauer added this to the 1.10.0 milestone Apr 2, 2022
@jackfrancis (Collaborator, Author):

@ckotzbauer do we feel comfortable merging this?

@ckotzbauer (Member):

I would say yes @jackfrancis

@jackfrancis (Collaborator, Author):

@ckotzbauer Cool, will do. Let's make sure we add explicit notes about this evolution of the retry loop to the 1.10.0 release notes.

@jackfrancis jackfrancis merged commit d965e7f into kubereboot:main May 6, 2022
@jackfrancis jackfrancis deleted the retry-cordon-drain branch May 6, 2022 19:19