
🐛 Delete out of date machines with unhealthy control plane component conditions when rolling out KCP #10196

Merged: 14 commits into kubernetes-sigs:main on Apr 11, 2024

Conversation

@Levi080513 (Contributor)

What this PR does / why we need it:

When rolling out KCP with a new configuration, delete out-of-date machines with unhealthy control plane component conditions first, so that an unhealthy out-of-date machine cannot block the rollout indefinitely.

Fixes #10093
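
For context on the mechanism: per the PR title and the files touched (util/collections/machine_filters.go, controlplane/kubeadm/internal/control_plane.go, controlplane/kubeadm/internal/controllers/scale.go), KCP learns to recognize machines whose control plane component conditions are unhealthy and to prefer them when choosing an out-of-date machine to delete during a rollout. Below is a minimal Go sketch of such a machine filter in the style of CAPI's util/collections; names and signatures here are illustrative assumptions, not the exact merged code.

// Sketch only: an illustrative filter in the spirit of this PR's change to
// util/collections/machine_filters.go. Func is collections' standard filter
// type: type Func func(machine *clusterv1.Machine) bool.
package collections

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// HasUnhealthyControlPlaneComponents matches Machines reporting Status=False
// on any control plane component condition: API server, controller manager,
// scheduler, and (for managed etcd) the etcd pod and member conditions.
func HasUnhealthyControlPlaneComponents(isEtcdManaged bool) Func {
	componentConditions := []clusterv1.ConditionType{
		controlplanev1.MachineAPIServerPodHealthyCondition,
		controlplanev1.MachineControllerManagerPodHealthyCondition,
		controlplanev1.MachineSchedulerPodHealthyCondition,
	}
	if isEtcdManaged {
		componentConditions = append(componentConditions,
			controlplanev1.MachineEtcdPodHealthyCondition,
			controlplanev1.MachineEtcdMemberHealthyCondition,
		)
	}
	return func(machine *clusterv1.Machine) bool {
		if machine == nil {
			return false
		}
		for _, c := range componentConditions {
			// A condition present with Status=False is exactly how the
			// APIServerPodHealthy failure in the test below surfaces
			// on the Machine.
			if conditions.IsFalse(machine, c) {
				return true
			}
		}
		return false
	}
}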

Test

  1. Create a CAPI cluster with 1 control plane (CP) node and 1 worker node.
kubectl get cluster,kcp,machine -n default | grep hw-sks-test-unhealthy-cp
cluster.cluster.x-k8s.io/hw-sks-test-unhealthy-cp   Provisioned   8m2s   
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane   hw-sks-test-unhealthy-cp   true          true                   1          1       1         0             8m2s   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-9nbvm             hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-9nbvm   elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e   Running        7m58s   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp    hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running        8m1s    v1.25.15

  2. Trigger a rollout by updating the KCP: add a space between two cipher suite names in KCP.spec.kubeadmConfigSpec.clusterConfiguration.apiServer.extraArgs.tls-cipher-suites (the stray space produces an invalid cipher suite name, so the new API server will fail to start).
kubectl get cluster,kcp,machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp
NAME                                                PHASE         AGE   VERSION
cluster.cluster.x-k8s.io/hw-sks-test-unhealthy-cp   Provisioned   18m   

NAME                                                                                      CLUSTER                    INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane   hw-sks-test-unhealthy-cp   true          true                   2          2       1         0             18m   v1.25.15

NAME                                                                            CLUSTER                    NODENAME                                      PROVIDERID                                   PHASE     AGE     VERSION
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-9nbvm            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-9nbvm   elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e   Running   17m     v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-lwm6b            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-lwm6b   elf://ebacfd9a-fd18-488f-959a-35a4fe2275fe   Running   7m14s   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp   hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running   17m     v1.25.15
  3. The new CP Node becomes ready, but its APIServer cannot start, so the APIServerPodHealthy condition on the new CP Machine is set to False with reason PodFailed. The unhealthy component blocks further progress, so the rollout stalls and the KCP never becomes Ready.
kubectl get machine hw-sks-test-unhealthy-cp-controlplane-lwm6b -n default -ojson | jq '.status.conditions'
[
  {
    "lastTransitionTime": "2024-02-26T10:20:01Z",
    "status": "True",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:41Z",
    "message": "CrashLoopBackOff",
    "reason": "PodFailed",
    "severity": "Error",
    "status": "False",
    "type": "APIServerPodHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:12:11Z",
    "status": "True",
    "type": "BootstrapReady"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:06Z",
    "status": "True",
    "type": "ControllerManagerPodHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:09Z",
    "status": "True",
    "type": "EtcdMemberHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:07Z",
    "status": "True",
    "type": "EtcdPodHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:20:01Z",
    "status": "True",
    "type": "InfrastructureReady"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:24Z",
    "status": "True",
    "type": "NodeHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:15:24Z",
    "status": "True",
    "type": "SchedulerPodHealthy"
  }
]

kubectl get -n default kcp hw-sks-test-unhealthy-cp-controlplane -ojson | jq '.status.conditions'
[
  {
    "lastTransitionTime": "2024-02-26T10:12:12Z",
    "message": "Rolling 1 replicas with outdated spec (1 replicas up to date)",
    "reason": "RollingUpdateInProgress",
    "severity": "Warning",
    "status": "False",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2024-02-26T10:03:01Z",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2024-02-26T10:01:27Z",
    "status": "True",
    "type": "CertificatesAvailable"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:10Z",
    "message": "Following machines are reporting control plane errors: hw-sks-test-unhealthy-cp-controlplane-lwm6b",
    "reason": "ControlPlaneComponentsUnhealthy",
    "severity": "Error",
    "status": "False",
    "type": "ControlPlaneComponentsHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:10Z",
    "status": "True",
    "type": "EtcdClusterHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:01:48Z",
    "status": "True",
    "type": "MachinesCreated"
  },
  {
    "lastTransitionTime": "2024-02-26T10:26:33Z",
    "status": "True",
    "type": "MachinesReady"
  },
  {
    "lastTransitionTime": "2024-02-26T10:12:12Z",
    "message": "Rolling 1 replicas with outdated spec (1 replicas up to date)",
    "reason": "RollingUpdateInProgress",
    "severity": "Warning",
    "status": "False",
    "type": "MachinesSpecUpToDate"
  },
  {
    "lastTransitionTime": "2024-02-26T10:12:12Z",
    "message": "Scaling down control plane to 1 replicas (actual 2)",
    "reason": "ScalingDown",
    "severity": "Warning",
    "status": "False",
    "type": "Resized"
  }
]

  4. Update the KCP again: remove the space previously added in KCP.spec.kubeadmConfigSpec.clusterConfiguration.apiServer.extraArgs.tls-cipher-suites, and also delete one of the cipher suites so that the current configuration still differs from the one the machines were originally created with.
  5. KCP first deletes the CP Machine that is both unhealthy and out of date (see the Go sketch after these steps):
kubectl get machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp
NAME                                                                            CLUSTER                    NODENAME                                      PROVIDERID                                   PHASE      AGE   VERSION
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-9nbvm            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-9nbvm   elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e   Running    34m   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-lwm6b            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-lwm6b   elf://ebacfd9a-fd18-488f-959a-35a4fe2275fe   Deleting   24m   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp   hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running    34m   v1.25.15

  6. Then KCP creates a new, up-to-date CP Machine.
 kubectl get machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp 
NAME                                                   CLUSTER                    NODENAME                                      PROVIDERID                                   PHASE     AGE    VERSION
hw-sks-test-unhealthy-cp-controlplane-9nbvm            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-9nbvm   elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e   Running   37m    v1.25.15
hw-sks-test-unhealthy-cp-controlplane-d8ffd            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-d8ffd   elf://955ce3f7-3fde-4119-a885-09fc3ccd4e6e   Running   2m9s   v1.25.15
hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp   hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running   37m    v1.25.15
  7. Finally, KCP deletes the remaining out-of-date machine, even though it is in the Ready state.
kubectl get machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp -w
NAME                                                   CLUSTER                    NODENAME                                      PROVIDERID                                   PHASE      AGE     VERSION
hw-sks-test-unhealthy-cp-controlplane-9nbvm            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-9nbvm   elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e   Deleting   40m     v1.25.15
hw-sks-test-unhealthy-cp-controlplane-d8ffd            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-d8ffd   elf://955ce3f7-3fde-4119-a885-09fc3ccd4e6e   Running    4m37s   v1.25.15
hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp   hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running    40m     v1.25.15

  8. The Cluster, KCP, and Machines are all Ready.
kubectl get cluster,kcp,machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp
NAME                                                PHASE         AGE   VERSION
cluster.cluster.x-k8s.io/hw-sks-test-unhealthy-cp   Provisioned   41m   

NAME                                                                                      CLUSTER                    INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane   hw-sks-test-unhealthy-cp   true          true                   1          1       1         0             41m   v1.25.15

NAME                                                                            CLUSTER                    NODENAME                                      PROVIDERID                                   PHASE     AGE     VERSION
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-d8ffd            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-d8ffd   elf://955ce3f7-3fde-4119-a885-09fc3ccd4e6e   Running   6m18s   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp   hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running   41m     v1.25.15
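
The deletion order observed in steps 5-7 follows from preferring unhealthy machines among the out-of-date ones when KCP picks the next machine to scale down. A rough sketch of that selection, reusing the hypothetical filter from the description above (the real logic lives in controlplane/kubeadm/internal/controllers/scale.go and controlplane/kubeadm/internal/control_plane.go and differs in detail):

// Sketch only: prefer an out-of-date machine with unhealthy control plane
// components; otherwise fall back to the oldest out-of-date machine.
// Assumes the collections helpers shown in the filter sketch above.
func selectMachineForScaleDown(outdatedMachines collections.Machines, isEtcdManaged bool) *clusterv1.Machine {
	unhealthy := outdatedMachines.Filter(collections.HasUnhealthyControlPlaneComponents(isEtcdManaged))
	if len(unhealthy) > 0 {
		// Deleting the unhealthy machine first (step 5) unblocks a rollout
		// that would otherwise stay stuck behind the failing health checks.
		return unhealthy.Oldest()
	}
	return outdatedMachines.Oldest()
}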

/area provider/control-plane-kubeadm

@k8s-ci-robot k8s-ci-robot added area/provider/control-plane-kubeadm Issues or PRs related to KCP cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 26, 2024
@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 26, 2024
@k8s-ci-robot (Contributor)

Hi @Levi080513. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neolit123 (Member) left a comment

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 26, 2024
@Levi080513 Levi080513 changed the title 🐛 Prioritize deletion of abnormal outdated CP machines when scaling down KCP 🐛 Delete out of date machines with unhealthy control plane component conditions when rolling out KCP Feb 28, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 29, 2024
@fabriziopandini (Member)

/assign

cc @vincepri @enxebre for an opinion

@jessehu (Contributor) commented Mar 1, 2024

LGTM! Waiting for CAPI team to take a look. Thanks!

Review threads (resolved):
util/collections/machine_filters.go
controlplane/kubeadm/internal/control_plane.go (3 threads)
@Levi080513 Levi080513 requested a review from vincepri March 7, 2024 07:49
@vincepri (Member) left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 7, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 11, 2024
@fabriziopandini (Member)

Great work and thank you for taking care of all our comments, appreciated
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 11, 2024
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 9734518eccbaf837f4c5e0ba1127232c4bbca143

@sbueringer (Member)

Thank you very much!!

/approve

@sbueringer (Member)

/cherry-pick release-1.7

@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.7 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer (Member)

/cherry-pick release-1.6

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 11, 2024
@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.6 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer (Member)

/cherry-pick release-1.5

@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.5 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer (Member)

Let's see if we get lucky with the automated cherry-picks

@k8s-ci-robot k8s-ci-robot merged commit c88e464 into kubernetes-sigs:main Apr 11, 2024
21 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.8 milestone Apr 11, 2024
@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #10421

In response to this:

/cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-infra-cherrypick-robot

@sbueringer: #10196 failed to apply on top of branch "release-1.6":

Applying: Prioritize deletion of abnormal outdated CP machines when scaling down KCP
Using index info to reconstruct a base tree...
M	controlplane/kubeadm/internal/control_plane.go
M	controlplane/kubeadm/internal/controllers/remediation_test.go
M	controlplane/kubeadm/internal/controllers/scale.go
M	controlplane/kubeadm/internal/controllers/scale_test.go
M	docs/proposals/20191017-kubeadm-based-control-plane.md
M	util/collections/machine_filters.go
Falling back to patching base and 3-way merge...
Auto-merging util/collections/machine_filters.go
CONFLICT (content): Merge conflict in util/collections/machine_filters.go
Auto-merging docs/proposals/20191017-kubeadm-based-control-plane.md
Auto-merging controlplane/kubeadm/internal/controllers/scale_test.go
Auto-merging controlplane/kubeadm/internal/controllers/scale.go
Auto-merging controlplane/kubeadm/internal/controllers/remediation_test.go
Auto-merging controlplane/kubeadm/internal/control_plane.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Prioritize deletion of abnormal outdated CP machines when scaling down KCP
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-infra-cherrypick-robot

@sbueringer: #10196 failed to apply on top of branch "release-1.5":

Applying: Prioritize deletion of abnormal outdated CP machines when scaling down KCP
Using index info to reconstruct a base tree...
M	controlplane/kubeadm/internal/control_plane.go
M	controlplane/kubeadm/internal/controllers/remediation_test.go
M	controlplane/kubeadm/internal/controllers/scale.go
M	controlplane/kubeadm/internal/controllers/scale_test.go
M	docs/proposals/20191017-kubeadm-based-control-plane.md
M	util/collections/machine_filters.go
Falling back to patching base and 3-way merge...
Auto-merging util/collections/machine_filters.go
CONFLICT (content): Merge conflict in util/collections/machine_filters.go
Auto-merging docs/proposals/20191017-kubeadm-based-control-plane.md
Auto-merging controlplane/kubeadm/internal/controllers/scale_test.go
Auto-merging controlplane/kubeadm/internal/controllers/scale.go
Auto-merging controlplane/kubeadm/internal/controllers/remediation_test.go
Auto-merging controlplane/kubeadm/internal/control_plane.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Prioritize deletion of abnormal outdated CP machines when scaling down KCP
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jessehu (Contributor) commented Apr 11, 2024

Thanks a lot @sbueringer @fabriziopandini for your review and patience!
The auto cherry-pick failed for release-1.6 and release-1.5. My team member @Levi080513 can create a new PR for release-1.6 separately if needed.

@sbueringer (Member) commented Apr 11, 2024

Sounds good! Feel free to go ahead with the cherry-pick(s). Also to 1.5 if you want; if you don't need it there, it's fine to only cherry-pick into 1.6.

Labels
approved · area/provider/control-plane-kubeadm · cncf-cla: yes · lgtm · ok-to-test · size/L · tide/merge-method-squash
Development

Successfully merging this pull request may close these issues.

KCP rollout with new config will get stuck when there is an unhealthy APIServer node (#10093)