KCP rollout with new config will stuck when there is an unhealthy APIServer node #10093

Closed
jessehu opened this issue Feb 3, 2024 · 1 comment · Fixed by #10196
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

jessehu (Contributor) commented Feb 3, 2024

What steps did you take and what happened?

  1. Create a cluster with 1 or 3 control plane (CP) nodes successfully, without enabling MHC.
  2. Update the existing KCP with an invalid apiserver parameter (e.g. add a space between two names in clusterConfiguration.apiServer.extraArgs.tls-cipher-suites).
    • The new CP Node becomes ready, but its APIServer cannot start, and the APIServerPodHealthy condition with status False is added to the CP Machine (shown in picture 1).
    • The KCP never becomes ready, as expected.
  3. If the KCP is then updated with correct apiserver parameters (the same as those used when creating the cluster), the CP Machine with the unhealthy APIServer is deleted and the KCP becomes ready, as expected.
  4. If the KCP is instead updated with correct apiserver parameters (different from those used when creating the cluster), KCP tries to delete the oldest ready CP Machine (not the one with the unhealthy APIServer) and fails at the KCP preflight check. The KCP therefore never becomes ready (not as expected).
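
To illustrate why the stray space in step 2 is enough to break the APIServer, here is a minimal Go sketch. It assumes the tls-cipher-suites value is split on commas without trimming whitespace (an assumption about the parsing, not a copy of kube-apiserver's code) and checks each entry against the cipher suite names known to Go's crypto/tls package. The helper name `unknownCipherSuites` and the two example cipher names are chosen only for illustration.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"strings"
)

// unknownCipherSuites splits a comma-separated tls-cipher-suites value without
// trimming (assumed behavior) and returns every entry that does not match a
// cipher suite name known to Go's crypto/tls package.
// Hypothetical helper, not part of Cluster API or kube-apiserver.
func unknownCipherSuites(value string) []string {
	known := map[string]bool{}
	for _, cs := range tls.CipherSuites() {
		known[cs.Name] = true
	}
	for _, cs := range tls.InsecureCipherSuites() {
		known[cs.Name] = true
	}

	var unknown []string
	for _, name := range strings.Split(value, ",") {
		if !known[name] {
			unknown = append(unknown, name)
		}
	}
	return unknown
}

func main() {
	good := "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
	// Same list, but with a space after the comma, as in step 2.
	bad := "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"

	fmt.Printf("good: %q\n", unknownCipherSuites(good))
	fmt.Printf("bad:  %q\n", unknownCipherSuites(bad))
}
```

With the space, the second entry keeps its leading blank and no longer matches any known name, which is consistent with the APIServer failing to start on the new CP Node.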

[Image: picture 1]

What did you expect to happen?

In step 4, the KCP controller should delete the Machine with the unhealthy APIServer so that the KCP rollout can succeed.

Cluster API version

1.5.2

Kubernetes version

1.25.15

Anything else you would like to add?

The root cause is that either reconcileUnhealthyMachines() (for MHC) or upgradeControlPlane(ctx, controlPlane, machinesNeedingRollout) / selectMachineForScaleDown() does not consider Machine CRs whose APIServerPodHealthy condition is False.

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 3, 2024
sbueringer (Member) commented

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 4, 2024