
Implement clusterctl delete cluster #406

Merged 1 commit into kubernetes-sigs:master on Jul 17, 2018

Conversation

@spew (Contributor) commented Jun 26, 2018

What this PR does / why we need it:
This PR implements the command clusterctl delete cluster.

Release note:

Add a `delete cluster` command to `clusterctl`. 

@kubernetes/kube-deploy-reviewers

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 26, 2018
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: spew

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 26, 2018
@spew spew force-pushed the clusterctl-delete branch 2 times, most recently from 27bfd42 to 05b9e2c Compare June 26, 2018 22:54
@spew (Contributor Author) commented Jun 26, 2018

/assign @roberthbailey @k4leung4 @mkjelland

defer closeClient(externalClient, "external")

glog.Info("Applying Cluster API stack to external cluster")
err = d.applyClusterAPIStack(externalClient)

Contributor:

if err := d.apply...(); err != nil { ... }

Contributor Author (@spew):

Fixed all instances of this -- I need to burn this pattern into my brain :P Also forgot that you had moved the Create(...) method to this style in another PR.
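
For reference, a minimal, self-contained sketch of the style the reviewer is asking for. The stand-in function below is hypothetical; only the inline `if err := ...; err != nil` shape matters:

```go
package main

import (
	"fmt"
	"log"
)

// applyClusterAPIStack is a hypothetical stand-in for a call like
// d.applyClusterAPIStack(externalClient) that returns only an error.
func applyClusterAPIStack() error {
	return nil
}

func run() error {
	// Reviewed style: scope err to the if statement instead of assigning it
	// on one line and checking it on the next.
	if err := applyClusterAPIStack(); err != nil {
		return fmt.Errorf("unable to apply cluster api stack to external cluster: %v", err)
	}
	return nil
}

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
}
```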

}

glog.Info("Deleting Cluster API Provider Components from internal cluster")
err = internalClient.Delete(d.providerComponents)

Contributor:

if err := ...; err != nil { ... }

Contributor Author (@spew):

Changed

}

glog.Info("Copying objects from internal cluster to external cluster")
err = pivot(internalClient, externalClient)

Contributor:

if err := ...; err != nil { ... }

Contributor Author (@spew):

Changed

}

glog.Info("Deleting objects from external cluster")
err = deleteObjects(externalClient)

Contributor:

if err := ...; err != nil { ... }

Contributor Author (@spew):

changed

if err != nil {
return nil, fmt.Errorf("unable to get internal cluster kubeconfig: %v", err)
}

err = d.writeKubeconfig(internalKubeconfig)
err = d.writeKubeconfig(internalKubeconfig, kubeconfigOutput)

Contributor:

if err := ...; err != nil { ... }

Contributor Author (@spew):

Changed

errors = append(errors, err.Error())
}
glog.Infof("Deleting machine sets")
err = client.DeleteMachineSetObjects()

Contributor:

if err := ...; err != nil { ... }

Contributor Author (@spew):

Changed

func deleteObjects(client ClusterClient) error {
var errors []string
glog.Infof("Deleting machine deployments")
err := client.DeleteMachineDeploymentObjects()

Contributor:

if err := ...; err != nil { ... }

Contributor Author (@spew):

Changed

errors = append(errors, err.Error())
}
glog.Infof("Deleting clusters")
err = client.DeleteClusterObjects()

Contributor:

if err := ...; err != nil { ... }

Contributor Author (@spew):

Changed
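
Piecing the hunks above together, the delete-everything helper appears to take roughly this shape. This is a sketch reconstructed from the snippets only; the package name and the final error aggregation (joining the collected errors into one) are assumptions, since those parts are not visible in the diff:

```go
package clusterdeployer // package name assumed; not shown in the hunks

import (
	"fmt"
	"strings"

	"github.com/golang/glog"
)

// ClusterClient is reduced here to the delete methods used below; the real
// interface in the PR has more methods.
type ClusterClient interface {
	DeleteMachineDeploymentObjects() error
	DeleteMachineSetObjects() error
	DeleteMachineObjects() error
	DeleteClusterObjects() error
}

// deleteObjects keeps going after individual failures and reports everything
// that went wrong at the end, rather than stopping at the first error.
func deleteObjects(client ClusterClient) error {
	var errors []string
	glog.Infof("Deleting machine deployments")
	if err := client.DeleteMachineDeploymentObjects(); err != nil {
		errors = append(errors, err.Error())
	}
	glog.Infof("Deleting machine sets")
	if err := client.DeleteMachineSetObjects(); err != nil {
		errors = append(errors, err.Error())
	}
	glog.Infof("Deleting machines")
	if err := client.DeleteMachineObjects(); err != nil {
		errors = append(errors, err.Error())
	}
	glog.Infof("Deleting clusters")
	if err := client.DeleteClusterObjects(); err != nil {
		errors = append(errors, err.Error())
	}
	if len(errors) > 0 {
		return fmt.Errorf("error(s) encountered deleting objects: [%v]", strings.Join(errors, ", "))
	}
	return nil
}
```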

@@ -433,3 +528,10 @@ func containsMasterRole(roles []clustercommon.MachineRole) bool {
}
return false
}

func closeClient(client ClusterClient, name string) {
err := client.Close()

Contributor:

if err := ...; err != nil { ... }

Contributor Author (@spew):

Changed

co.CleanupExternalCluster)
return d.Create(c, m, pcsFactory)
err = d.Create(c, m, pd, co.KubeconfigOutput, pcsFactory)

Contributor:

if you aren't going to log this error, just do return d.Create(...) like it was before.

Contributor Author (@spew):

Whoops, not sure why I did this.

@roberthbailey (Contributor):

I added a first round of comments, but don't block merging on me.

@k4leung4 (Contributor):

lgtm


func (d *ClusterDeployer) Delete(internalClient ClusterClient) error {
glog.Info("Creating external cluster")
externalClient, cleanupExternalCluster, err := d.createExternalCluster()

Contributor:

What happens if the external cluster already exists?

Contributor Author (@spew), Jun 27, 2018:

I believe this will fail -- much like the CreateCluster command. However, I'll test this manually to make sure we understand the behavior and report back in this comment.

Contributor Author (@spew), Jun 27, 2018:

I just tried this and this is the behavior:

$ minikube start
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Loading cached images from config file.

$ clusterctl delete cluster -p provider-components.yaml
I0627 14:17:22.206083   51664 clusterdeployer.go:194] Creating external cluster
F0627 14:27:56.001433   51664 delete_cluster.go:47] could not create external cluster: could not create external control plane: error running command 'minikube start --bootstrapper=kubeadm': exit status 1

Given that we do nothing special in clusterctl create cluster if there is an existing minikube cluster, I'd prefer to punt on this problem, as it is an existing issue and not a change in the architecture.


Contributor:

Can we skip the external cluster creation if the minikube cluster already exists (pretty much the minikube status result)? Not needed in this PR, but an issue+TODO would do.

Contributor Author (@spew):

I created a less prescriptive issue -- I think there are some other possible solutions. Here it is: #413
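
For context, a rough sketch of the kind of pre-check suggested above: skip (or fail fast on) external cluster creation when minikube status reports an existing cluster. The helper and its wiring are hypothetical, not how this PR actually behaves:

```go
package main

import (
	"fmt"
	"os/exec"
)

// minikubeRunning is a hypothetical pre-check: `minikube status` exits
// non-zero when no cluster is running, so a zero exit status is treated
// here as "a minikube cluster already exists".
func minikubeRunning() bool {
	return exec.Command("minikube", "status").Run() == nil
}

func main() {
	if minikubeRunning() {
		fmt.Println("minikube cluster already exists; skipping external cluster creation")
		return
	}
	fmt.Println("no existing minikube cluster; would create the external cluster here")
}
```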


glog.Info("Deleting Cluster API Provider Components from internal cluster")
if err = internalClient.Delete(d.providerComponents); err != nil {
glog.Infof("Error while shutting down provider components on internal cluster: %v", err)

Contributor:

nit: "shutting down" and deleting are different things, no?

Contributor Author (@spew):

I don't think so; the underlying action in the clusterclient for this method is a kubectl delete, which according to this doc (thanks for the link btw!) attempts a graceful shutdown of the pods:

https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

I could change the log to be more accurate but a bit less user-friendly, as follows:

Error while executing kubectl delete on provider components on internal cluster: %v

What do you think?

Contributor Author (@spew):

What about "error while removing provider components from internal cluster: %v"

Contributor:

+1

Contributor Author (@spew):

Will update with the new messaging.


glog.Info("Deleting objects from external cluster")
if err = deleteObjects(externalClient); err != nil {
return fmt.Errorf("unable to finish deleting objects in external cluster, resources may have been leaked: %v", err)

Contributor:

this is a pretty scary log line. Can you provide some guidance here as to what resources might have leaked?

Contributor Author (@spew):

It's hard to build a generalized line of text here because the underlying resources are sort of provider-specific. For example, in GCP, when cluster deletion fails it means that you may have 'leaked' a firewall rule. However, that would likely not be true for a vSphere implementation.

Do you have any ideas on how to improve the messaging?

Contributor:

I don't have a good solution right now, but perhaps we could make a generic statement ("VMs, firewall rules, LB rules etc") for now?

Contributor Author (@spew), Jun 29, 2018:

I'm not sure that is actually clearer, because the resources in question are extremely specific to a given provider. For example, "firewall rule" is a GCP concept; in AWS there are "security groups". "VMs" might be safe, but in AWS land those would be called instances. As for load balancers, within AWS alone there are at least 3 different types with different APIs (application, network, and classic).

I see two options:

  1. Add another method to the ProviderDeployer interface (the interface that is specific to the provider / cloud), something like GetDeleteMachineErrorMessage(...) string, with one such method each for Clusters, Machines, MachineSets, and MachineDeployments. Each would return a provider-specific message describing the kinds of resources that can be leaked (a rough sketch follows this comment).

  2. Change the message to this, substituting 'google' or 'aws' or the like for the provider name:

fmt.Errorf("unable to finish deleting objects in external cluster, the associated resources specific to the %v provider may have been leaked: %v", providerName, err)

Do you think we should implement either of these or leave things how they are?
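
To make option 1 above concrete, a purely hypothetical sketch of the methods it would add to the existing ProviderDeployer interface. None of these methods exist in the PR; GetDeleteMachineErrorMessage is only the name floated in the comment:

```go
package clusterdeployer

// providerLeakMessages sketches (hypothetically) the methods option 1 would
// add to the provider-specific ProviderDeployer interface: each provider
// describes what may leak when deletion of a given object type fails,
// e.g. firewall rules on GCP or security groups on AWS.
type providerLeakMessages interface {
	GetDeleteClusterErrorMessage() string
	GetDeleteMachineErrorMessage() string
	GetDeleteMachineSetErrorMessage() string
	GetDeleteMachineDeploymentErrorMessage() string
}
```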

errors = append(errors, err.Error())
}
glog.Infof("Deleting machines")
if err := client.DeleteMachineObjects(); err != nil {

Contributor:

Is this safe considering some machines will be deleted by deletion of MachineSets?

Contributor Author (@spew):

It's safe in the sense that we are simply calling the 'delete all' method in the cluster API. It is up to the cluster API to properly handle machines that already have a deletion in progress due to the fact that their parent object has been deleted.

Contributor:

It's still not great, because attempting to delete a machine twice (or three times) could spit out a couple of error logs, which would be misleading unless the user knows exactly how this code works, and exactly how MachineSets work. I think here we should delete all Machines that are not in a MachineSet.

Contributor Author (@spew), Jun 29, 2018:

There is no actual deletion going on here; there is just a call to the cluster-api "delete all" methods, i.e. delete all machines, delete all machine sets, delete all machine deployments. Those are asynchronous methods that don't actually do the deletion -- the deletion itself is eventually reconciled by the controllers. In this case, the underlying deletion for machines & machine sets is done by the machine controller.

I'm not convinced there actually is a problem or that the cluster API would do the wrong thing. Rather than put a lot of complicated deletion logic in clusterctl it seems better to have the controllers do the right thing.

Contributor Author (@spew), Jun 29, 2018:

I synced with @k4leung4 and he confirmed that there is not an issue here; there are a couple of scenarios / points that we discussed:

  1. Deleting existing Machines?: DeleteCollection(...) (the actual delete methods being used) does not return errors when there are already objects in the process of being deleted (i.e. marked as being deleted). We are not passing in any machines; we are simply calling a "delete all" method.
  2. Concurrent deletes?: We are calling the DeleteCollection(...) method and passing in the propagation policy DeletePropagationForeground. This means the delete call will block until all child objects (i.e. machines that belong to a machine set) are deleted. This comes from the comments for DeletePropagationForeground. I think what that actually means is that they are marked for deletion in etcd. I copied the doc below (a sketch of such a call follows the quote).
  3. Deleting Machines before MachineSets: Even if we did things in the wrong order and called DeleteCollection on Machines, and that method actually deleted machines associated with a MachineSet (I've not yet drilled into whether that would even happen), all that would occur is that the controller would attempt to recreate the machine, and then, when DeleteCollection was called on the machine sets, the machines would be deleted again.

Here is the DeletePropagationForeground doc for reference:

	// The object exists in the key-value store until the garbage collector
	// deletes all the dependents whose ownerReference.blockOwnerDeletion=true
	// from the key-value store.  API sever will put the "foregroundDeletion"
	// finalizer on the object, and sets its deletionTimestamp.  This policy is
	// cascading, i.e., the dependents will be deleted with Foreground.
	DeletePropagationForeground DeletionPropagation = "Foreground"
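
For illustration, a sketch of what a foreground-cascading "delete all" call looks like against a generated v1alpha1 clientset. The import path, clientset accessor, and DeleteCollection signature are assumptions about the code-generated client of that era, not lines from this PR:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	clientset "sigs.k8s.io/cluster-api/pkg/client/clientset_generated/clientset"
)

// deleteAllMachineSets issues a collection delete with foreground propagation:
// the MachineSets stay in the key-value store (with the "foregroundDeletion"
// finalizer) until their owned Machines with blockOwnerDeletion=true are gone,
// so the cascade is driven by the garbage collector and controllers, not by
// clusterctl itself.
func deleteAllMachineSets(kubeconfig, namespace string) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return fmt.Errorf("unable to build client config: %v", err)
	}
	cs, err := clientset.NewForConfig(cfg)
	if err != nil {
		return fmt.Errorf("unable to create cluster-api clientset: %v", err)
	}
	foreground := metav1.DeletePropagationForeground
	return cs.ClusterV1alpha1().MachineSets(namespace).DeleteCollection(
		&metav1.DeleteOptions{PropagationPolicy: &foreground},
		metav1.ListOptions{},
	)
}
```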

@spew (Contributor Author) commented Jul 2, 2018

Updated the PR with a new error message as per comment above

@roberthbailey roberthbailey assigned karan and unassigned roberthbailey Jul 13, 2018
@roberthbailey (Contributor):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2018
@k8s-ci-robot k8s-ci-robot merged commit babe910 into kubernetes-sigs:master Jul 17, 2018
jayunit100 pushed a commit to jayunit100/cluster-api that referenced this pull request on Jan 31, 2020: default vmfolder set for cloudconfig (…-fix)
Labels

approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.