KCP rollout with new config gets stuck when there is an unhealthy APIServer node #10093

jessehu · 2024-02-03T15:00:40Z

What steps did you take and what happened?

1. Create a 1 CP or 3 CP cluster successfully without turning on MHC.
2. Update the existing KCP with wrong apiserver params (e.g. add a space between two names in clusterConfiguration.apiServer.extraArgs.tls-cipher-suites).
   - The new CP Node becomes ready, but its APIServer cannot start, and an APIServerPodHealthy condition with status False is added to the CP Machine (shown in picture 1).
   - As expected, the KCP never becomes ready.
3. If the KCP is then updated with correct apiserver params (the same as the params used when creating the cluster), the CP Machine with the unhealthy APIServer is deleted and the KCP becomes ready, as expected.
4. If the KCP is instead updated with correct apiserver params (different from the params used when creating the cluster), KCP tries to delete the oldest ready CP Machine (not the one with the unhealthy APIServer) and fails at the KCP preflight check. So the KCP never becomes ready (not as expected).
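To reproduce step 2 deliberately (or to catch it before applying a KCP change), the malformed value can be detected with a simple check. This is a minimal sketch: `findMalformedCipherSuites` is a hypothetical helper, not part of Cluster API or kubeadm, and it only looks for empty entries and stray whitespace in a comma-separated list.

```go
package main

import (
	"fmt"
	"strings"
)

// findMalformedCipherSuites is a hypothetical helper. It splits a
// comma-separated tls-cipher-suites value and returns every entry that is
// empty or carries leading/trailing whitespace.
func findMalformedCipherSuites(value string) []string {
	var bad []string
	for _, name := range strings.Split(value, ",") {
		if name == "" || name != strings.TrimSpace(name) {
			bad = append(bad, name)
		}
	}
	return bad
}

func main() {
	good := "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
	// Same list with a space after the comma, as in step 2.
	broken := "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"

	fmt.Printf("good:   %q\n", findMalformedCipherSuites(good))
	fmt.Printf("broken: %q\n", findMalformedCipherSuites(broken))
}
```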

What did you expect to happen?

In step 4, KCP controller should delete the Machine with unhealthy APIServer, then the KCP rollout can succeed.

Cluster API version

1.5.2

Kubernetes version

1.25.15

Anything else you would like to add?

The root cause is that either reconcileUnhealthyMachines() for MHC or upgradeControlPlane(ctx, controlPlane, machinesNeedingRollout) / selectMachineForScaleDown() does not consider a Machine CR whose APIServerPodHealthy condition is False.

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

sbueringer · 2024-04-04T14:11:12Z

/triage accepted

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 3, 2024

Levi080513 mentioned this issue Feb 26, 2024

🐛 Delete out of date machines with unhealthy control plane component conditions when rolling out KCP #10196

Merged

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 4, 2024

k8s-ci-robot closed this as completed in #10196 Apr 11, 2024
