
Set leave_on_terminate=true for servers and hardcode maxUnavailable=1 #3000

Merged 1 commit on Jan 16, 2024

Commits on Jan 16, 2024

  1. Set leave_on_terminate=true for servers and hardcode maxUnavailable=1

    When leave_on_terminate=false (default), rolling the statefulset is
    disruptive because the new servers come up with the same node IDs but
    different IP addresses. They can't join the server cluster until the old
    server's node ID is marked as failed by serf. During this time, they continually
    start leader elections because they don't know there's a leader. When
    they eventually join the cluster, their election term is higher, and so
    they trigger a leadership swap. The leadership swap happens at the same
    time as the next node to be rolled is being stopped, and so the cluster
    can end up without a leader.
    
    With leave_on_terminate=true, the stopping server cleanly leaves the
    cluster, so the new server can join smoothly, even though it has the
    same node ID as the old server. This increases the speed of the rollout
    and in my testing eliminates the period without a leader.
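A minimal sketch of the server agent setting this describes (`leave_on_terminate` is the real Consul option; the surrounding config is illustrative, not taken from this PR):

```hcl
# Consul server agent config (HCL) - illustrative sketch.
# leave_on_terminate makes the server gracefully leave the cluster on
# SIGTERM, so its node ID is released before the replacement pod starts.
server             = true
leave_on_terminate = true
```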
    
    The downside of this change is that when a server leaves gracefully, it
    also reduces the number of raft peers. The number of peers is used to
    calculate the quorum size, so this can unexpectedly change the fault
    tolerance of the cluster. When running with an odd number of servers, 1
    server leaving the cluster does not affect quorum size. E.g. 5 servers
    => quorum 3, 4 servers => quorum still 3. During a rollout, Kubernetes
    only stops 1 server at a time, so the quorum won't change. During a
    voluntary disruption event, e.g. a node being drained, Kubernetes uses
    the pod disruption budget to determine how many pods in a statefulset
    can be made unavailable at a time. That's why this change hardcodes
    the pod disruption budget's maxUnavailable to 1.
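The quorum arithmetic above can be sketched quickly (this is the standard Raft majority formula, not code from this PR):

```python
def quorum(peers: int) -> int:
    """Raft quorum size: a strict majority of the current peer set."""
    return peers // 2 + 1

# With 5 servers, quorum is 3; if one server leaves gracefully,
# the peer set shrinks to 4 but quorum is still 3.
print(quorum(5))  # 3
print(quorum(4))  # 3

# With an even starting count, a graceful leave does lower quorum:
print(quorum(6))  # 4
print(quorum(5))  # 3
```

This is why a rollout that removes one server at a time from an odd-sized cluster keeps fault tolerance intact, and why the PDB caps voluntary disruptions at one pod.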
    
    Also set autopilot min_quorum to the minimum quorum size, and disable
    autopilot upgrade migration since that feature is intended for
    blue/green deploys.
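A sketch of the autopilot section this implies (`min_quorum` and `disable_upgrade_migration` are real Consul autopilot options; the value shown assumes a 3-server cluster and is illustrative):

```hcl
# Autopilot section of the Consul server config - illustrative sketch.
autopilot {
  # Prevent autopilot from pruning dead servers below quorum
  # (2 is the quorum of a 3-server cluster).
  min_quorum = 2

  # Upgrade migration is for blue/green server deploys, which this
  # rollout strategy does not use.
  disable_upgrade_migration = true
}
```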
    lkysow committed Jan 16, 2024