
CAPI won't delete Nodes in "managed" control planes because there are no control plane machines #3631

Closed
rudoi opened this issue Sep 11, 2020 · 20 comments · Fixed by #3673
Labels: kind/bug, lifecycle/active

@rudoi
Contributor

rudoi commented Sep 11, 2020

What steps did you take and what happened:

  • Create a cluster with the EKS provider in CAPA
  • Create some Machines
  • Delete those Machines
  • Observe Nodes not going away, weird scheduling behavior, etc.

What did you expect to happen:

Deleting Machines deletes Nodes too.

Anything else you would like to add:

I'm fairly certain the check here is the issue:

machines, err := getActiveMachinesInCluster(ctx, r.Client, machine.Namespace, machine.Labels[clusterv1.ClusterLabelName])
if err != nil {
	return err
}

// Whether or not it is okay to delete the NodeRef depends on the
// number of remaining control plane members and whether or not this
// machine is one of them.
switch numControlPlaneMachines := len(util.GetControlPlaneMachines(machines)); {
case numControlPlaneMachines == 0:
	// Do not delete the NodeRef if there are no remaining members of
	// the control plane.
	return errNoControlPlaneNodes
default:
	// Otherwise it is okay to delete the NodeRef.
	return nil
}

There are no control plane Machines in an EKS cluster, so numControlPlaneMachines is always 0 and the NodeRef is never deleted.

Environment:

  • Cluster-api version: latest
  • Minikube/KIND version: v0.8
  • Kubernetes version: (use kubectl version): v1.17.9
  • OS (e.g. from /etc/os-release): Amazon Linux 2

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 11, 2020
@vincepri
Member

This is indeed a bug / use case we never thought of. Looking at the code, we don't seem to take into account that a control plane provider might be set; maybe we should check whether there are remaining control plane nodes in the actual cluster instead of checking Machines.
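A rough sketch of that alternative check, counting control plane Nodes in the workload cluster by their role label rather than counting Machines. This is illustrative only: the node type below is a simplified stand-in (a real implementation would list corev1.Node objects via a workload-cluster client), and the role label is an assumption.

```go
package main

import "fmt"

// node is a simplified stand-in for corev1.Node; a real implementation
// would list Nodes from the workload cluster with a client-go client.
type node struct {
	Name   string
	Labels map[string]string
}

// countControlPlaneNodes counts Nodes carrying the control plane role
// label, regardless of whether any Machine objects back them. In a
// managed cluster (EKS/GKE/AKS) this would also return 0, so a label
// check alone would not be enough there either.
func countControlPlaneNodes(nodes []node) int {
	count := 0
	for _, n := range nodes {
		if _, ok := n.Labels["node-role.kubernetes.io/master"]; ok {
			count++
		}
	}
	return count
}

func main() {
	nodes := []node{
		{Name: "cp-0", Labels: map[string]string{"node-role.kubernetes.io/master": ""}},
		{Name: "worker-0", Labels: map[string]string{}},
	}
	fmt.Println(countControlPlaneNodes(nodes)) // 1
}
```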

@dthorsen
Contributor

/assign

@dthorsen
Contributor

dthorsen commented Sep 22, 2020

Trying to think of the best approach here. Is this section of code necessary at all? As I understand it, by the time we reach this code the infrastructure will be deleted regardless of the result of this check, so this code just seems to result in orphaned Nodes.

@vincepri
Member

Copying my comment from the PR:

We can't assume the Kind here is always going to be of type KubeadmControlPlane, there might be other infrastructure providers, or single-machine control plane nodes (outside of KCP) that are using this feature. The change here makes it opt in only for KCP clusters, which is a behavioral breaking change.
Alternatively, we could have a status field that reports the number of available control plane machines, and the machine controller can use it here. We'd have to make this field fail gracefully, to support the current logic in case the field isn't available or doesn't exist.

@dthorsen These checks were added a while back to make sure we wouldn't get stuck trying to delete a node if the control plane nodes were being deleted, or if there weren't any more machines available.

I'd try to see if we can agree on a status field and have a fallback that uses the current logic

@dthorsen
Contributor

dthorsen commented Sep 23, 2020

@vincepri What do you think about a status field on the control plane objects, like nodesRegistered: true, that we could use to gate this section? We could make AWSManagedControlPlane set this new status field to false, but have it act as true if unset, so that we do not change default behavior while managed control plane objects can opt out of this logic.

@vincepri
Member

🤔 That could be interesting. One question before we proceed, does EKS or other cloud provider expose the number of control plane nodes available somehow? Or is it all done behind the scenes?

@dthorsen
Contributor

EKS, GKE, and AKS do not expose the details of the number of control plane nodes or anything like that. The control plane is totally provided as a managed service and how they design it is internal to those organizations. Another possible name for the status field could be something like managedControlPlane: true. Such a status field may be useful for similar reasons elsewhere in the code that we have not yet discovered.

@ncdc
Contributor

ncdc commented Sep 24, 2020

I like the managedControlPlane idea. Where would that live? On the provider-specific ManagedControlPlane struct in a status field?

@dthorsen
Contributor

I like the managedControlPlane idea. Where would that live? On the provider-specific ManagedControlPlane struct in a status field?

Yes exactly. I have most of the code written now, just testing some things out locally.

@vincepri
Member

Given that this is internal, we could also use an annotation on the control plane referenced object itself, that signals cluster api that this is an externally managed control plane.

@dthorsen
Contributor

Given that this is internal, we could also use an annotation on the control plane referenced object itself, that signals cluster api that this is an externally managed control plane.

This seems like something that should be set by the controller though as opposed to the user. That is why I gravitated toward a status field.

@ncdc
Contributor

ncdc commented Sep 24, 2020

Wouldn't using an actual field make more sense / haven't we been trying to get away from API-in-annotations?

@ncdc
Contributor

ncdc commented Sep 24, 2020

Definitely should be set by the controller

@vincepri
Member

I asked myself, "is this useful for users to see/know?" and couldn't come up with a clear answer. Also, is being a managed control plane something that should be in status?

It all seems internal to Cluster API, rather than something we'd want to expose, but I'm ok either way.

One other benefit of having it as an annotation is using ObjectMeta in code: Cluster API wouldn't have to write custom code to read status.<field>, like we do today for the other fields, and could instead use the unstructured objects' metadata helper methods.

@dthorsen
Contributor

dthorsen commented Sep 24, 2020

Was thinking something like this in util.go

// IsManagedControlPlane returns a bool indicating whether the control plane referenced
// in the passed ObjectReference is a managed control plane
func IsManagedControlPlane(controlPlaneRef *corev1.ObjectReference) bool {
	controlPlane := ObjectReferenceToUnstructured(*controlPlaneRef)
	managed, found, err := unstructured.NestedBool(controlPlane.Object, "status", "managedControlPlane")
	if err != nil || !found {
		return false
	}
	return managed
}
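To illustrate the fail-gracefully behavior this helper relies on (missing or malformed fields fall back to false), here is a self-contained sketch of the NestedBool lookup semantics on a plain nested map. nestedBool below is a local mock of unstructured.NestedBool, not the apimachinery function, and the field name uses the externalManagedControlPlane spelling settled on later in this thread:

```go
package main

import "fmt"

// nestedBool mimics the lookup behavior of unstructured.NestedBool on a
// plain nested map: ok is false if any path segment is missing, a
// non-map value is traversed, or the leaf is not a bool.
func nestedBool(obj map[string]interface{}, fields ...string) (bool, bool) {
	var cur interface{} = obj
	for _, f := range fields {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return false, false
		}
		cur, ok = m[f]
		if !ok {
			return false, false
		}
	}
	b, ok := cur.(bool)
	return b, ok
}

// isExternalManagedControlPlane mirrors the helper above: a missing or
// malformed field falls back to false, preserving existing behavior for
// control plane providers that never set it.
func isExternalManagedControlPlane(obj map[string]interface{}) bool {
	managed, found := nestedBool(obj, "status", "externalManagedControlPlane")
	if !found {
		return false
	}
	return managed
}

func main() {
	managed := map[string]interface{}{
		"status": map[string]interface{}{"externalManagedControlPlane": true},
	}
	fmt.Println(isExternalManagedControlPlane(managed))                   // true
	fmt.Println(isExternalManagedControlPlane(map[string]interface{}{})) // false
}
```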

@vincepri
Member

vincepri commented Sep 24, 2020

LGTM. Maybe call it IsExternalManagedControlPlane, given that "managed" could mean a number of things?

@dthorsen
Contributor

LGTM, maybe have IsExternalManagedControlPlane? Given that managed could mean a number of things

Works for me

@ncdc
Contributor

ncdc commented Sep 24, 2020

/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Sep 24, 2020
@dthorsen
Contributor

PR #3673 is ready for review and should fix this. I will open a follow-up PR to the AWS provider to add the externalManagedControlPlane status field once this is merged.

@vincepri
Member

/milestone v0.3.10
