-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ KCP adopts existing machines #2489
✨ KCP adopts existing machines #2489
Conversation
@sethp-nr: You must be a member of the kubernetes-sigs/cluster-api-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Maintainers and have them propose you as an additional delegate for this responsibility. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/milestone v0.3.0 |
/assign to review later |
6c81d4a
to
c37afb5
Compare
OK, I think this is RFR – I left the hold label on there so we can make a judgement call about whether it needs to be feature-gated after a review cycle or two. |
cb435e1
to
17da6e3
Compare
// We do an uncached full quorum read against the KCP to avoid re-adopting Machines the garbage collector just intentionally orphaned | ||
// See https://github.com/kubernetes/kubernetes/issues/42639 | ||
uncached := controlplanev1.KubeadmControlPlane{} | ||
err := r.uncachedClient.Get(ctx, client.ObjectKey{Namespace: kcp.Namespace, Name: kcp.Name}, &uncached) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we adopt a Machine that's going to be deleted, wouldn't we get a notification and KCP reconciles again afterwards?
I guess what I'm trying to understand is the actual problem this extra client is trying to solve, the comment explains part of it, but I'm missing why this would be a problem given that the controller will keep reconciling the resource
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an in-depth explanation of the flow in the test suite, but the issue has to do with a deleted KCP possibly re-adopting (and then deleting) dependents that we don't want cleaned up.
There is another, different race with orphaning dependents (see #2505) but from my brief survey of k/k's adoption code it all does a check like this one before it actually takes ownership of anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a common pattern for adoption, we likely should create an issue to track standardizing on this approach. I also wonder if it would be worth it to try and get support for this more directly in controller-runtime somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call – post-release when I file the follow-up "let's talk about deletes" issue for 2505 there I'll make sure to include this part.
controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go
Outdated
Show resolved
Hide resolved
controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go
Outdated
Show resolved
Hide resolved
controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go
Outdated
Show resolved
Hide resolved
controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go
Outdated
Show resolved
Hide resolved
err := r.Client.Update(ctx, &ss) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to use Patch instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to using patch here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Patch that's implemented is strategic merge. We'd need to use a different strategy to do a targeted remove of the old controller reference before adding the new one.
if err != nil { | ||
logger.Error(err, "failed to retrieve control plane machines for cluster") | ||
return ctrl.Result{}, err | ||
} | ||
|
||
adoptableMachines := controlPlaneMachines.Filter(internal.AdoptableControlPlaneMachines(cluster.Name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this behavior should be opt-in rather than having it run by default, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One would already need to create KCP resource to trigger the adoption, or do you think that an additional safeguard would be needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure yet, that's a good point though, it's opt-in if you create and link the object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that felt like the opt-in edge to me. Otherwise, the KCP creates a new, separate control plane that's "for" the same cluster as the old one but otherwise independent.
@@ -93,10 +94,34 @@ func OwnedControlPlaneMachines(controlPlaneName string) MachineFilter { | |||
if controllerRef == nil { | |||
return false | |||
} | |||
return controllerRef.Kind == "KubeadmControlPlane" && controllerRef.Name == controlPlaneName | |||
return controllerRef.Kind == "KubeadmControlPlane" && controllerRef.Name == owner.GetName() && controllerRef.UID == owner.GetUID() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming that the addition of UID here isn't introducing any edge cases around management cluster pivoting and that adoption should reconcile the UID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a very good point. I don't think the adoption code will reconcile the UID properly as it's currently written, but maybe clusterctl move
knows how to handle that already? Otherwise, I'd expect the GC to cause trouble for pivoting anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that clusterctl move does update owner references as objects are moved across:
UID: ownerNode.newUID, // Use the owner's newUID read from the target management cluster (instead of the UID read during discovery). |
That is an impressive piece of code!
I think what that means for us is that UIDs are valid in post-clusterctl v2 workflows. Is there a support matrix where we might need to consider a clusterctl v1 pivot with a KCP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we necessarily need to consider the v1 pivot, but I do think that we need to consider the use case of someone backing up and restoring resources from backup, which I'm not sure that tools in that space would necessarily do the right thing with the UID there.
It seems like the controller-runtime code handles overwriting the controller ref if the api version or the UID are out of date, and I suspect we should likely do the same (even if part of a different workflow than the comparison here)
// Usage: managementCluster.GetMachinesForCluster(ctx, cluster, OwnedControlPlaneMachines(controlPlane.Name)) | ||
func OwnedControlPlaneMachines(controlPlaneName string) MachineFilter { | ||
// Usage: managementCluster.GetMachinesForCluster(ctx, cluster, OwnedControlPlaneMachines(controlPlane)) | ||
func OwnedControlPlaneMachines(owner metav1.Object) func(machine *clusterv1.Machine) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: based on the behavior, I'm wondering if this should be renamed OwnedMachines
or should also apply the ControlPlaneMachines filter added below
}, | ||
ControlPlaneMachines(clusterName), | ||
func(machine *clusterv1.Machine) bool { | ||
return metav1.GetControllerOf(machine) == nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of the UID check introduced above, I suspect that this should also take into account the case of the controller referencing the kind/name of the controller ref matching the kcp but the UID not matching.
controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go
Outdated
Show resolved
Hide resolved
err := r.Client.Update(ctx, &ss) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to using patch here
controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go
Outdated
Show resolved
Hide resolved
controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go
Outdated
Show resolved
Hide resolved
controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go
Outdated
Show resolved
Hide resolved
controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go
Outdated
Show resolved
Hide resolved
8ced727
to
5b0e5e9
Compare
5b0e5e9
to
ab53134
Compare
I can rebase past the e2e removal easily enough. For api changes, it sounds like I should be adding new methods & deprecating old ones? |
Yeah that seems the way to go |
6c5538d
to
891f4e8
Compare
Ok, I introduced new methods in 891f4e8 – if that looks ok, I'll squash it in. |
Looking good. |
Doing another pass |
TypeMeta: metav1.TypeMeta{ | ||
APIVersion: clusterv1.GroupVersion.String(), | ||
Kind: "Cluster", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because PointsToObject
checks Kind and group (though not version) per Jason's suggestion to have it match on the object kind / group / name triple.
managementClusterUncached internal.ManagementCluster | ||
|
||
uncachedClient client.Reader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we dedupe uncachedClient
with managementClusterUncached
? We can add more methods to internal.ManagementCluster
if needed
test/framework/convenience.go
Outdated
// TypeToKindChecked returns the Kind without the package prefix. Pass in a pointer to a struct | ||
// This will panic if used incorrectly. | ||
func TypeToKindChecked(i runtime.Object) string { | ||
return reflect.ValueOf(i).Elem().Type().Name() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TypeToKindChecked returns the Kind without the package prefix. Pass in a pointer to a struct | |
// This will panic if used incorrectly. | |
func TypeToKindChecked(i runtime.Object) string { | |
return reflect.ValueOf(i).Elem().Type().Name() | |
} | |
// ObjectToKind returns the Kind without the package prefix. Pass in a pointer to a struct | |
// This will panic if used incorrectly. | |
func ObjectToKind(i runtime.Object) string { | |
return reflect.ValueOf(i).Elem().Type().Name() | |
} |
util/util.go
Outdated
} | ||
|
||
// Returns true if ref refers to obj | ||
func refersTo(ref *metav1.OwnerReference, obj Object) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This internal method and the other one are almost the same :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, interface vs struct in the second position. I'll move 'em next to each other at least.
util/util.go
Outdated
type Object interface { | ||
metav1.Object | ||
GroupVersionKind() schema.GroupVersionKind | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this interface private? There is kubernetes-sigs/controller-runtime#898 upstream that we might be able to use at some point, although I don't want to expose this interface just yet (and make it an API)
util/util.go
Outdated
// PointsToObject returns true if any of the owner references point to the given target | ||
func PointsToObject(obj metav1.Object, target Object) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PointsToObject
is a little too generic, should this be IsOwnedByObject
?
5a11862
to
22d7846
Compare
22d7846
to
78b1795
Compare
@sethp-nr: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
78b1795
to
bf58224
Compare
util/util.go
Outdated
type object interface { | ||
metav1.Object | ||
GroupVersionKind() schema.GroupVersionKind | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we rebase on master, we should be able to consume controllerutil.Object
from kubernetes-sigs/controller-runtime#898
/milestone v0.3.7 |
The KCP controller identifies Machines that belong to the control plane of an existing cluster and adopts them, including finding PKI materials that may be owned by the machine's bootstrap config and pivoting their ownership to the KCP as well. Prior to adopting machines (which, if unsuccessful, will block the KCP from taking any management actions), it runs a number of safety checks including: - Ensuring the KCP has not been deleted (to prevent re-adoption of orphans, though this process races with the garbage collector) - Checking that the machine's bootstrap provider was KubeadmConfig - Verifying that the Machine is no further than one minor version off of the KCP's spec Additionally, we set set a "best guess" value for the kubeadm.controlplane.cluster.x-k8s.io/hash on the adopted machine as if it were generated by a KCP in the past. The intent is that a KCP will adopt machines matching its "spec" (to the best of its ability) without modification, which in practice works well for adopting machines with the same spec'd version. We make an educated guess at what parts of the kubeadm config the operator considers important, and feed that into the hash function. The goal here is to be able to: 1. Create a KCP over existing machines with ~the same configuration (for then new api/equality's package definition of the same) and have the KCP make no changes 2. Create a KCP over existing machines with a different configuration and have the KCP upgrade those machines to the new config Finally, replaces PointsTo and introduces IsControlledBy, both of which search through the owner reference list of their first argument (similar to metav1.IsControlledBy). Unlike the metav1 function, they're looking for a match on name plus api group (not version) and kind. PointsTo returns true if any OwnerReference matches the pointee, whereas IsControlledBy only returns true if the sole permitted (by the API) controller reference matches. fix: handle JoinConfiguration hashing refactor: util should not check UIDs This is a behavior change not least for tests, which typically do not set either Kind or GroupVersion on various objects that end up acting as referents. Co-Authored-By: mnguyen <[email protected]> Co-Authored-By: Jason DeTiberus <[email protected]> Co-authored-by: Vince Prignano <[email protected]>
bf58224
to
f792d74
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: chuckha, sethp-nr, vincepri The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
What this PR does / why we need it:
A mostly-functional prototype for having the KCP controller identify
Machines that belong to the control plane of an existing cluster and
adopt them.
Additionally, we set set a "best guess" value for the
kubeadm.controlplane.cluster.x-k8s.io/hash on the adopted machine as if
it were generated by a KCP in the past.
Co-authored-by: mnguyen [email protected]
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #2214
TODO
The bits of this work that are still in flight are:
FilterableMachineCollection
a validating webhook that rejects two major version bumps to a KCPSee KCP: validate version changes are no further than one minor version #2578/hold
/milestone v0.3.0