🐛 requeue KCP object if ControlPlaneComponentsHealthyCondition is not yet true #9032
Conversation
// Make sure KCP gets requeued if ControlPlaneComponentsHealthyCondition is still false.
// Otherwise KCP would only get requeued when KCP or the Cluster gets a change or via reaching the resync period.
// That would lead to a delay in provisioning MachineDeployments when preflight checks are enabled.
// The alternative solution to this requeue would be watching the relevant pods inside each workload
// cluster which would be very expensive.
Suggested change:

// Make KCP requeue if ControlPlaneComponentsHealthyCondition is false so we can check for control plane
// component status without waiting for a full resync (by default 10 minutes). Only requeue if there is no
// error, Requeue or RequeueAfter and the object does not have a deletion timestamp.
// Otherwise this condition can lead to a delay in provisioning MachineDeployments when MachineSet
// preflight checks are enabled.
// The alternative solution to this requeue would be watching the relevant pods inside each workload
// cluster which would be very expensive.
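For context, a minimal sketch of the guarded requeue this comment describes, in controller-runtime terms. The helper name `requeueIfComponentsUnhealthy`, the 20-second interval, and the exact wiring are illustrative assumptions, not the PR's actual code:

```go
package controllers

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// requeueIfComponentsUnhealthy is an illustrative helper, not the PR's
// actual code. It turns a clean "done" result into a delayed requeue while
// ControlPlaneComponentsHealthyCondition is not yet true, so the controller
// re-checks control plane component health without waiting for a watch
// event or the full resync period.
func requeueIfComponentsUnhealthy(kcp *controlplanev1.KubeadmControlPlane, res ctrl.Result, err error) (ctrl.Result, error) {
	// Leave the result untouched if the reconcile errored, already
	// requested a requeue, or the object is being deleted.
	if err != nil || res.Requeue || res.RequeueAfter > 0 || !kcp.DeletionTimestamp.IsZero() {
		return res, err
	}
	if !conditions.IsTrue(kcp, controlplanev1.ControlPlaneComponentsHealthyCondition) {
		// The interval is an assumed value for illustration only.
		return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
	}
	return res, err
}
```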
Looks good to me +/- the nits above. I would say let's get those fixed and then merge before the weekend so we get some CI coverage. I would also propose to cherry-pick onto release-1.5. I think overall the change is safe because we just requeue a bit more while control plane components are unhealthy.
Force-pushed from bf812d8 to 205a1f5
Updated the comments + moved the
/area provider/control-plane-kubeadm
Thanks!
/lgtm
LGTM label has been added. Git tree hash: 16b2fd21a507788b34975152ae972c85889cee59
/retest
Thx!! /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sbueringer. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/cherry-pick release-1.5
@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.5 in a new PR and assign it to you.
/cherry-pick release-1.4
@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you.
/cherry-pick release-1.3
@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.3 in a new PR and assign it to you.
Let's see if it's cherry-pick'able in 1.4 and 1.3 as well
@sbueringer: new pull request created: #9034
@sbueringer: new pull request created: #9035
@sbueringer: new pull request created: #9036
What this PR does / why we need it:
When the cluster is in a state where ControlPlaneComponentsHealthyCondition is not yet true and the reconcile otherwise completes without an error or an explicit requeue, the controller reaches the end of the reconcile function and does a return ctrl.Result{}, nil.
When the relevant workload pods (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) later become ready and report their ready state inside the workload cluster, no new event gets injected for the KCP object. The KCP controller has to wait for a different change to the watched objects, or for the resync period, before it marks the condition true.
This delays provisioning when the preflight checks for MachineSets are active, which also leads to flaky tests because the test timeout is reached before the resync period.
This PR removes the delay by ensuring a requeue in this special case.
Which issue(s) this PR fixes:
Fixes #8786
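To make the event gap concrete, here is a minimal sketch of what the two return shapes mean to controller-runtime. This is illustrative only; the 10-minute resync figure comes from the comment discussed above, and the 20-second delay is an assumed value:

```go
package controllers

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// done signals "nothing left to do": the object is only reconciled again
// when a watched object changes or the resync period (10 minutes by
// default in this project, per the comment above) fires.
func done() (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// checkAgainSoon re-adds the request to the workqueue after a delay,
// independent of watch events, so a condition that flips inside the
// workload cluster is picked up without adding an extra event source.
func checkAgainSoon() (ctrl.Result, error) {
	return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
}
```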