Implement clusterctl delete cluster #406
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: spew. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
27bfd42 to 05b9e2c (Compare)
/assign @roberthbailey @k4leung4 @mkjelland
defer closeClient(externalClient, "external")

glog.Info("Applying Cluster API stack to external cluster")
err = d.applyClusterAPIStack(externalClient)
if err := d.apply...(); err != nil { ... }
Fixed all instances of this -- I need to burn this pattern into my brain :P Also forgot that you had moved the Create(...) method to this style in another PR.
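For reference, a minimal sketch of the suggested form applied to this call (the error wording here is an assumption, not the exact text from the PR):

// err is scoped to the if statement, so no shared err variable is needed.
if err := d.applyClusterAPIStack(externalClient); err != nil {
	return fmt.Errorf("unable to apply cluster api stack to external cluster: %v", err)
}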
}

glog.Info("Deleting Cluster API Provider Components from internal cluster")
err = internalClient.Delete(d.providerComponents)
if err := ...; err != nil { ... }
Changed
}

glog.Info("Copying objects from internal cluster to external cluster")
err = pivot(internalClient, externalClient)
if err := ...; err != nil { ... }
Changed
}

glog.Info("Deleting objects from external cluster")
err = deleteObjects(externalClient)
if err := ...; err != nil { ... }
changed
if err != nil {
	return nil, fmt.Errorf("unable to get internal cluster kubeconfig: %v", err)
}

-	err = d.writeKubeconfig(internalKubeconfig)
+	err = d.writeKubeconfig(internalKubeconfig, kubeconfigOutput)
if err := ...; err != nil { ... }
Changed
errors = append(errors, err.Error())
}
glog.Infof("Deleting machine sets")
err = client.DeleteMachineSetObjects()
if err := ...; err != nil { ... }
Changed
func deleteObjects(client ClusterClient) error {
var errors []string
glog.Infof("Deleting machine deployments")
err := client.DeleteMachineDeploymentObjects()
if err := ...; err != nil { ... }
Changed
errors = append(errors, err.Error())
}
glog.Infof("Deleting clusters")
err = client.DeleteClusterObjects()
if err := ...; err != nil { ... }
Changed
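Putting the hunks above together, deleteObjects with the suggested inline checks would look roughly like this (a sketch; the aggregated error message at the end is an assumption, and it needs fmt, strings, and glog imported):

func deleteObjects(client ClusterClient) error {
	var errors []string
	// Each failure is collected rather than returned immediately, so one
	// failed delete does not stop the remaining kinds from being deleted.
	glog.Infof("Deleting machine deployments")
	if err := client.DeleteMachineDeploymentObjects(); err != nil {
		errors = append(errors, err.Error())
	}
	glog.Infof("Deleting machine sets")
	if err := client.DeleteMachineSetObjects(); err != nil {
		errors = append(errors, err.Error())
	}
	glog.Infof("Deleting machines")
	if err := client.DeleteMachineObjects(); err != nil {
		errors = append(errors, err.Error())
	}
	glog.Infof("Deleting clusters")
	if err := client.DeleteClusterObjects(); err != nil {
		errors = append(errors, err.Error())
	}
	if len(errors) > 0 {
		return fmt.Errorf("error(s) encountered deleting objects from external cluster: [%v]", strings.Join(errors, ", "))
	}
	return nil
}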
@@ -433,3 +528,10 @@ func containsMasterRole(roles []clustercommon.MachineRole) bool {
}
return false
}

func closeClient(client ClusterClient, name string) {
err := client.Close()
if err := ...; err != nil { ... }
Changed
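With the suggested pattern applied, closeClient ends up as a small wrapper along these lines (the log wording is an assumption):

func closeClient(client ClusterClient, name string) {
	// A failed Close is only logged; there is nothing else to do at this point.
	if err := client.Close(); err != nil {
		glog.Errorf("could not close %v client: %v", name, err)
	}
}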
clusterctl/cmd/create_cluster.go
Outdated
co.CleanupExternalCluster)
-	return d.Create(c, m, pcsFactory)
+	err = d.Create(c, m, pd, co.KubeconfigOutput, pcsFactory)
if you aren't going to log this error, just do return d.Create(...) like it was before.
Whoops, not sure why I did this.
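For clarity, the suggested direct-return form of that line, as a sketch:

// Return the error from Create directly instead of assigning and returning it separately.
return d.Create(c, m, pd, co.KubeconfigOutput, pcsFactory)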
I added a first round of comments, but don't block merging on me.
lgtm
func (d *ClusterDeployer) Delete(internalClient ClusterClient) error {
glog.Info("Creating external cluster")
externalClient, cleanupExternalCluster, err := d.createExternalCluster()
What happens if the external cluster already exists?
I believe this will fail -- much like the CreateCluster command. However, I'll test this manually to make sure we understand the behavior and report back in this comment.
I just tried this and this is the behavior:
$ minikube start
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Loading cached images from config file.
$ clusterctl delete cluster -p provider-components.yaml
I0627 14:17:22.206083 51664 clusterdeployer.go:194] Creating external cluster
F0627 14:27:56.001433 51664 delete_cluster.go:47] could not create external cluster: could not create external control plane: error running command 'minikube start --bootstrapper=kubeadm': exit status 1
Given that we do nothing special in clusterctl create cluster if there is an existing minikube cluster, I'd prefer to punt on this problem, as it is an existing issue and not a change in the architecture.
Can we skip the external cluster creation if the minikube cluster already exists (pretty much what minikube status reports)? Not needed in this PR, but an issue + TODO would do.
I created a less prescriptive issue -- I think there are some other possible solutions. Here it is: #413
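To make the idea concrete, a rough sketch of the kind of check being discussed (hypothetical; the helper name is made up, and the actual follow-up is tracked in #413):

// externalClusterExists is a hypothetical helper: `minikube status` exits
// non-zero when no local cluster is running, so a zero exit code suggests an
// existing external cluster that could be reused instead of recreated.
// Requires: import "os/exec"
func externalClusterExists() bool {
	return exec.Command("minikube", "status").Run() == nil
}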
glog.Info("Deleting Cluster API Provider Components from internal cluster")
if err = internalClient.Delete(d.providerComponents); err != nil {
glog.Infof("Error while shutting down provider components on internal cluster: %v", err)
nit: "shutting down" and deleting are different things, no?
I don't think so; the underlying action in the clusterclient for this method is a kubectl delete, which according to this doc (thanks for the link btw!) attempts a graceful shutdown of the pods:
https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
I could change the log to be more accurate, but a bit less user friendly, as follows:
Error while executing kubectl delete on provider components on internal cluster: %v
What do you think?
What about "error while removing provider components from internal cluster: %v"
+1
Will update with the new messaging.
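So the updated line would presumably read something like this, combined with the inline error-check style from earlier in the review:

if err := internalClient.Delete(d.providerComponents); err != nil {
	glog.Infof("error while removing provider components from internal cluster: %v", err)
}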
glog.Info("Deleting objects from external cluster")
if err = deleteObjects(externalClient); err != nil {
return fmt.Errorf("unable to finish deleting objects in external cluster, resources may have been leaked: %v", err)
this is a pretty scary log line. Can you provide some guidance here as to what resources might have leaked?
It's hard to build a generalized line of text here because the underlying resources are sort of provider specific. For example, in GCP when the cluster deletion fails it means that you may have 'leaked' a firewall rule. However, that would likely not be true for a vsphere implementation.
Do you have any ideas on how to improve the messaging?
I don't have a good solution right now, but perhaps we could make a generic statement ("VMs, firewall rules, LB rules etc") for now?
I'm not sure that is actually clearer, because the resources in question are extremely specific to a given provider. For example, "firewall rule" is a GCP concept; in AWS the equivalent is "security groups". "VMs" might be safe, but in AWS land those would be called instances. As for load balancers, AWS alone has at least three different types with different APIs (application, network, and classic).
I see two options:

1. Add another method to the ProviderDeployer interface (the interface that is specific to the provider / cloud). The method could be something like GetDeleteMachineErrorMessage(...) string, with one variant each for Clusters, Machines, MachineSets, and MachineDeployments. Each would return a provider-specific message describing the kinds of resources that can be leaked.
2. Change the message to the following, substituting 'google', 'aws', or the like for the provider name:
fmt.Errorf("unable to finish deleting objects in external cluster, the associated resources specific to the %v provider may have been leaked: %v", providerName, err)

Do you think we should implement either of these or leave things how they are?
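To illustrate option 1, the interface additions might look roughly like this (hypothetical: the method names extrapolate from GetDeleteMachineErrorMessage above, and the real ProviderDeployer interface has other methods that are not shown):

// Hypothetical provider-supplied messages describing which provider-specific
// resources may be leaked when deleting the corresponding object kind fails.
type ProviderDeployer interface {
	GetDeleteClusterErrorMessage() string
	GetDeleteMachineErrorMessage() string
	GetDeleteMachineSetErrorMessage() string
	GetDeleteMachineDeploymentErrorMessage() string
}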
errors = append(errors, err.Error())
}
glog.Infof("Deleting machines")
if err := client.DeleteMachineObjects(); err != nil {
Is this safe considering some machines will be deleted by deletion of MachineSets?
It's safe in the sense that we are simply calling the 'delete all' method in the cluster API. It is up to the cluster API to properly handle machines that already have a deletion in progress due to the fact that their parent object has been deleted.
It's still not great, because attempting to delete a machine twice (or three times) could emit a couple of error logs, which would be misleading unless the user knows exactly how this code works and exactly how MachineSets work. I think here we should only delete the Machines that are not in a MachineSet.
There is no actual deletion going on here; there is just a call to the cluster-api "delete all" methods, i.e. delete all machines, delete all machine sets, delete all machine deployments. Those are asynchronous methods that don't actually do the deletion -- the deletion itself is eventually reconciled by the controllers. In this case, the underlying deletion for machines and machine sets is done by the machine controller.
I'm not convinced there actually is a problem or that the cluster API would do the wrong thing. Rather than put a lot of complicated deletion logic in clusterctl, it seems better to have the controllers do the right thing.
I synced with @k4leung4 and he confirmed that there is not an issue here. There are a couple of scenarios/points that we discussed:
- Deleting existing Machines? DeleteCollection(...) (the actual delete method being used) does not return errors when objects are already in the process of being deleted (i.e. marked as being deleted). We are not passing in any machines; we are simply calling a "delete all" method.
- Concurrent deletes? We are calling the DeleteCollection(...) method and passing in the propagation policy DeletePropagationForeground. This means the delete call will block until all child objects (i.e. machines that belong to a machine set) are deleted. This comes from the comments for DeletePropagationForeground; I think what that actually means is that they are marked for deletion in etcd. I copied the doc below.
- Deleting Machines before MachineSets? Even if we did things in the wrong order and called DeleteCollection on Machines, and that method actually deleted machines associated with a MachineSet (I've not yet drilled into whether that would even happen), all that would occur is that the controller would attempt to recreate the machine, and then when DeleteCollection was called on the machine sets the machines would be deleted again.
Here is the DeletePropagationForeground doc for reference:
// The object exists in the key-value store until the garbage collector
// deletes all the dependents whose ownerReference.blockOwnerDeletion=true
// from the key-value store. API server will put the "foregroundDeletion"
// finalizer on the object, and sets its deletionTimestamp. This policy is
// cascading, i.e., the dependents will be deleted with Foreground.
DeletePropagationForeground DeletionPropagation = "Foreground"
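For reference, a minimal sketch of a foreground-propagation DeleteCollection call in the client-go style described above (the clientset variable, group/version accessor, and namespace are illustrative rather than the exact clusterctl client code):

// Requires: import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
policy := metav1.DeletePropagationForeground
// An empty ListOptions selects everything, i.e. a "delete all machine sets" call.
err := clusterAPIClient.ClusterV1alpha1().MachineSets(namespace).DeleteCollection(
	&metav1.DeleteOptions{PropagationPolicy: &policy}, // cascade; wait for dependent Machines
	metav1.ListOptions{},
)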
Updated the PR with a new error message as per comment above.
/lgtm
…-fix default vmfolder set for cloudconfig
What this PR does / why we need it:
This PR implements the clusterctl delete cluster command.
Release note:
@kubernetes/kube-deploy-reviewers