🐛 requeue KCP object if ControlPlaneComponentsHealthyCondition is not yet true #9032


chrischdi
Member

What this PR does / why we need it:

When the cluster is in the following state:

  • The KCP in general is ready and
  • All of its conditions exist and are true, except for ControlPlaneComponentsHealthyCondition, and
  • KCP reconciles and the reconcile is a no-op

then the controller reaches the end of the reconcile function and returns ctrl.Result{}, nil.

When the relevant workload cluster pods (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) later become ready and report their ready state inside the workload cluster, no new event is emitted for the KCP object.

The KCP controller therefore has to wait for a different change to one of the watched objects, or for the resync period to elapse, before it can mark the condition as true.

This delays provisioning when the preflight checks for MachineSets are active. It also leads to flaky tests, because the test timeout can be reached before the resync period.

This PR removes the delay by requeueing in this special case.
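
To make the delay concrete, here is a minimal Go sketch of the timing. It is a simplified model, not the controller's code: it assumes the last no-op reconcile happened at t=0, that later reconciles only happen at fixed intervals, and that no other watch events arrive. The 10 minute resync period is the default mentioned in the review; the 20 second requeue interval and the 90 second pod readiness time are illustrative assumptions, and `nextReconcileAt` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// defaultResync is the default resync period mentioned in the review.
	defaultResync = 10 * time.Minute
	// assumedRequeue is an illustrative requeue interval (assumption).
	assumedRequeue = 20 * time.Second
	// podsReadyAt is an illustrative time at which the control plane
	// pods become ready, measured from the last no-op reconcile (assumption).
	podsReadyAt = 90 * time.Second
)

// nextReconcileAt returns the first periodic reconcile at or after readyAt,
// assuming reconciles happen at period, 2*period, 3*period, ...
func nextReconcileAt(readyAt, period time.Duration) time.Duration {
	ticks := (readyAt + period - 1) / period // integer ceiling division
	return ticks * period
}

func main() {
	fmt.Println("without requeue:", nextReconcileAt(podsReadyAt, defaultResync))
	fmt.Println("with requeue:   ", nextReconcileAt(podsReadyAt, assumedRequeue))
}
```

Under these assumptions the healthy state is observed after 10 minutes without the requeue and after 100 seconds with it, which is why a test timeout shorter than the resync period can expire first.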

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Fixes #8786

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 21, 2023
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 21, 2023
Comment on lines 235 to 239
// Make sure KCP gets requeued if ControlPlaneComponentsHealthyCondition is still false.
// Otherwise KCP would only get requeued when KCP or the Cluster gets a change or via reaching the resyncperiod.
// That would lead to a delay in provisioning MachineDeployments when preflight checks are enabled.
// The alternative solution to this requeue would be watching the relevant pods inside each workload
// cluster which would be very expensive.
Contributor

@killianmuldoon killianmuldoon Jul 21, 2023

Suggested change
// Make sure KCP gets requeued if ControlPlaneComponentsHealthyCondition is still false.
// Otherwise KCP would only get requeued when KCP or the Cluster gets a change or via reaching the resyncperiod.
// That would lead to a delay in provisioning MachineDeployments when preflight checks are enabled.
// The alternative solution to this requeue would be watching the relevant pods inside each workload
// cluster which would be very expensive.
// Make KCP requeue if ControlPlaneComponentsHealthyCondition is false so we can check for control plane component status without waiting for a full resync (by default 10 minutes). Only requeue if there is no error, Requeue or RequeueAfter and the object does not have a deletion timestamp.
// Otherwise this condition can lead to a delay in provisioning MachineDeployments when MachineSet preflight checks are enabled.
// The alternative solution to this requeue would be watching the relevant pods inside each workload cluster which would be very expensive.
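
The guard described in the suggested comment can be sketched with Go's deferred function and named return values. This is a simplified illustration, not the code from this PR: `Result` is a stand-in for controller-runtime's `ctrl.Result`, `kcpState`, `reconcile` and `errReconcile` are hypothetical names, the condition is reduced to a boolean, and the 20 second interval is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Result is a simplified stand-in for controller-runtime's ctrl.Result.
type Result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// IsZero reports whether no requeue was requested.
func (r Result) IsZero() bool { return r == (Result{}) }

// kcpState holds the two facts the guard inspects (hypothetical, simplified).
type kcpState struct {
	Deleting          bool // the object has a deletion timestamp
	ComponentsHealthy bool // ControlPlaneComponentsHealthyCondition is true
}

// assumedRequeueInterval is an illustrative value (assumption).
const assumedRequeueInterval = 20 * time.Second

// errReconcile is a sample error used in the demo below.
var errReconcile = errors.New("reconcile failed")

// reconcile runs inner and then adjusts the named result in a deferred
// function: it requests a requeue only if there is no error, no Requeue or
// RequeueAfter already set, no deletion in progress, and the control plane
// components are not yet healthy.
func reconcile(kcp kcpState, inner func() (Result, error)) (res Result, reterr error) {
	defer func() {
		if reterr == nil && res.IsZero() && !kcp.Deleting && !kcp.ComponentsHealthy {
			res = Result{RequeueAfter: assumedRequeueInterval}
		}
	}()
	return inner()
}

func main() {
	noop := func() (Result, error) { return Result{}, nil }
	failing := func() (Result, error) { return Result{}, errReconcile }

	res, err := reconcile(kcpState{ComponentsHealthy: false}, noop)
	fmt.Println("unhealthy, no-op:", res.RequeueAfter, err)

	res, err = reconcile(kcpState{ComponentsHealthy: true}, noop)
	fmt.Println("healthy, no-op:  ", res.RequeueAfter, err)

	res, err = reconcile(kcpState{ComponentsHealthy: false}, failing)
	fmt.Println("unhealthy, error:", res.RequeueAfter, err)
}
```

Checking `IsZero` first keeps the guard from overriding a requeue that the reconcile logic already requested, and skipping the error case leaves retry behavior to the controller's normal error handling.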

@sbueringer
Member

Looks good to me +/- the nits above. I would say let's get those fixed and then merge before the weekend so we get some CI coverage.

I would also propose to cherry-pick onto release-1.5.

I think overall the change is safe because we just requeue a bit more while control plane components are unhealthy

@chrischdi chrischdi force-pushed the pr-kcp-requeue-condition-components branch from bf812d8 to 205a1f5 Compare July 21, 2023 12:45
@chrischdi
Member Author

Updated the comments and moved the IsZero check up. I hope I caught all of the suggestions 👍

@chrischdi
Member Author

/area provider/control-plane-kubeadm

@k8s-ci-robot k8s-ci-robot added the area/provider/control-plane-kubeadm Issues or PRs related to KCP label Jul 21, 2023
Contributor

@killianmuldoon killianmuldoon left a comment

Thanks!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 21, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 16b2fd21a507788b34975152ae972c85889cee59

@killianmuldoon
Contributor

/retest

@sbueringer
Member

Thx!!

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2023
@sbueringer
Member

/cherry-pick release-1.5

@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.5 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

/cherry-pick release-1.4

@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.4


@sbueringer
Member

/cherry-pick release-1.3

@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.3 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.3


@sbueringer
Member

Let's see if it's cherry-pick'able in 1.4 and 1.3 as well

@k8s-ci-robot k8s-ci-robot merged commit 4dd60f5 into kubernetes-sigs:main Jul 21, 2023
10 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.6 milestone Jul 21, 2023
@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #9034

In response to this:

/cherry-pick release-1.3


@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #9035

In response to this:

/cherry-pick release-1.5


@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #9036

In response to this:

/cherry-pick release-1.4


Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/provider/control-plane-kubeadm: Issues or PRs related to KCP
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
size/S: Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test flaking with Timed out waiting for 1 nodes to be created for MachineDeployment
5 participants