Deleting Antrea should do a complete cleanup #181

antoninbas · 2019-12-04T02:00:25Z

Describe the bug
When running kubectl delete -f antrea.yml, one would expect everything created by Antrea to be deleted, including the OVS bridge on each Node, along with the gateway interface. Failure do so may cause networking issues if someone decides to deploy another CNI after deleting Antrea.

Versions:
Antrea version:

===> Version information <===
VERSION: v0.2.0-dev
GIT_SHA: 4a446d4
GIT_TREE_STATE: clean
RELEASE_STATUS: unreleased
DOCKER_IMG_VERSION: v0.2.0-dev-4a446d4

Additional context

We cannot simply delete the bridge, etc. when the agent exists. For example, we may want to preserve connectivity during some Antrea upgrades. Note that currently this is not possible since all flows are deleted when the openvswitch daemon restarts. See these resources:

email thread: https://mail.openvswitch.org/pipermail/ovs-discuss/2019-November/049448.html
submitted patch: https://mail.openvswitch.org/pipermail/ovs-dev/2019-November/364811.html

@jianjuns and I discussed this cleanup issue in a separate PR: Fix Kind support (Linux hosts only) #137 (comment). He suggested having a cleanup daemonset dedicated to this purpose. See also these K8s issues:

The text was updated successfully, but these errors were encountered:

jianjuns · 2019-12-04T19:27:40Z

I think we could either have a separate DaemonSet or add a ConfigMap option to indicate antrea-agent to do the cleanup (the two approaches are not very different from each other, but with the ConfigMap way agent could watch the option in runtime when we support runtime config changes later).

antoninbas · 2019-12-19T21:55:00Z

I'd like to start working on this. I like the DaemonSet approach more since it works even if there is an issue (e.g. buggy agent) or if Antrea has already been deleted from the cluster. Someone might have done kubectl delete -f antrea.yml already, and we need to be able to perform the cleanup without the Antrea agent / controller running.

My proposal is as follows:

define a YAML manifest antrea-cleanup.yaml to take care of cleaning up artifacts created by Antrea
the manifest would define one K8s Job in charge of performing the cleanup, plus necessary RBAC permissions
the Job would be in charge of creating the cleanup DaemonSet, monitoring the status of the cleanup process and deleting the DaemonSet once cleanup is complete

A few considerations:

there is no notion of a DaemonSet that runs to completion and doesn't restart Pods (see Run job on each node once to help with setup kubernetes/kubernetes#64623). But we can use a Controller and have each DaemonSet Pod return completion status to the Controller using CRDs.
the Job should probably return an error if Antrea is still running (and recommend deleting Antrea with kubectl delete -f antrea.yml). Although another option is to have the Job also delete all K8s resources with label app=antrea. I'm open to both solutions.
the DaemonSet needs to be aware of the Antrea agent configuration (as the OVS bridge name / gateway interface name may not be the default ones). This is a bit tricky: if Antrea is not running anymore, the ConfigMap is not available anymore. Rather than ask the user to modify a ConfigMap in the cleanup manifest (antrea-cleanup.yaml), I would rather modify the Antrea agent to have it save its config (antrea-agent.conf) on the host under /var/run/antrea/. I think this is a very non-intrusive change, very much in spirit with how we use /var/run/antrea/ for other things (e.g. OVSDB) that we do not need to persist across reboots.

@jianjuns @salv-orlando @tnqn any thoughts or concerns?

jianjuns · 2019-12-19T22:26:57Z

Questions:

Do you plan to use agent bin to do the clean up or a separate bin?
For shared resources like Controller CRD who will do the clean up?

antoninbas · 2019-12-19T23:56:01Z

No, I was planning to use different binaries (one for the DaemonSet and one for the "orchestrator", which will be run by the K8s Job and talk to the apiserver).
I think the "orchestrator" could take care of this easily, what do you think?

jianjuns · 2019-12-20T00:48:06Z

I was thinking about running the same agent and controller bins, to avoid introducing new bins. But I saw there are benefits of using a separate bin too, that could be executed standalone, including executed manually. In this sense, do you think it is good to make it part of CLI, and so user could run the CLI manually to clean up?
Also note on Windows, we could not clean up Node state with DS.

jianjuns · 2019-12-20T00:49:04Z

Just to add, it might be helpful to be able to clean up node after K8s is down.

antoninbas · 2019-12-20T01:05:43Z

I think it's better to introduce new bins as I like the idea of separating the cleanup logic from the core functionality (rather than adding a new mode of operation for existing binaries).

I definitely think we should have the ability to run the binary directly as a process. I would like to organize the code as a library that can be used by the CLI & or in a DaemonSet. Even though kubectl apply -f antrea-cleanup.yaml may not work on Windows, it's a nice K8s-native way to do cleanup that doesn't require to download any binaries manually.

This is still very much a work-in-progress, I'm opening this PR to gather feedback on the approach. For someone deleting Antrea, the steps would be as follows: * `kubectl delete -f <path to antrea.yml>` * `kubectl apply -f <path to antrea-cleanup.yml>` * check that job has completed with `kubectl -n kube-system get jobs` * `kubectl delete -f <path to antrea-cleanup.yml>` The cleanup manifest creates a DaemonSet that will perform the necessary deletion tasks on each Node. After the tasks have been completed, the "status" is reported to the cleanup controller through a custom resource. Once the controller has received enough statuses (or after a timeout of 1 minutes) the controller job completes and the user can delete the cleanup manifest. Known remaining items: * place cleanup binaries (antrea-cleanup-agent and antrea-cleanup-controller) in a separate docker image to avoid increasing the size of the main Antrea docker image * generate manifest with kustomize? * find a way to test this as part of CI? * update documentation * additional cleanup tasks: as of now we only take care of deleting the OVS bridge * place cleanup CRD in non-default namespace * use kubebuilder instead of the code generator directly (related to antrea-io#16); we probably want to punt this task to a future PR. See antrea-io#181

antoninbas · 2020-01-20T19:53:52Z

Removing this from 0.3.0, as PR #307 still needs a lot of work.

salv-orlando · 2020-03-25T15:28:41Z

Unfortunately there weren't many review in the past few days. @antoninbas do you think we can still make a final push for 0.5.0 inclusion, or do we just move this issue to 0.6.0?

antoninbas · 2020-04-28T21:08:00Z

Pushing this out to next release

Delete the unrequired method of deleting record from record map without lock. Add a method to get the flow updated time for flow given flow key

github-actions · 2022-01-05T00:27:02Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

github-actions · 2022-04-06T00:35:32Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

github-actions · 2022-07-13T00:41:42Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

github-actions · 2022-10-12T01:05:01Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

antoninbas added the bug label Dec 4, 2019

antoninbas added this to the Antrea v0.3.0 release milestone Dec 19, 2019

antoninbas self-assigned this Dec 19, 2019

antoninbas mentioned this issue Jan 10, 2020

WIP: Add yaml manifest to cleanup all Antrea resources #307

Closed

antoninbas modified the milestones: Antrea v0.3.0 release, Antrea v0.4.0 release Jan 20, 2020

McCodeman added the kind/bug Categorizes issue or PR as related to a bug. label Jan 29, 2020

antoninbas mentioned this issue Feb 3, 2020

Preserve OVS data path flows when Antrea is restarted gracefully #355

Closed

salv-orlando removed this from the Antrea v0.4.0 release milestone Feb 12, 2020

antoninbas added this to the Antrea v0.5.0 release milestone Feb 12, 2020

salv-orlando modified the milestones: Antrea v0.5.0 release, Antrea v0.6.0 release Mar 25, 2020

antoninbas modified the milestones: Antrea v0.6.0 release, Antrea v0.7.0 release Apr 28, 2020

antoninbas modified the milestones: Antrea v0.7.0 release, Antrea v0.8.0 release May 29, 2020

antoninbas removed this from the Antrea v0.8.0 release milestone Jun 17, 2020

antoninbas mentioned this issue Dec 11, 2020

Antrea CNI not cleaned up after deletion #1659

Closed

antoninbas added this to the Antrea v0.13.0 release milestone Dec 11, 2020

antoninbas added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Dec 11, 2020

antoninbas modified the milestones: Antrea v0.13.0 release, Antrea v0.14.0 release Jan 15, 2021

antoninbas removed this from the Antrea v1.0 release milestone Apr 7, 2021

antoninbas removed the bug label Apr 19, 2021

antoninbas added the good first issue Good for newcomers label Oct 2, 2021

antoninbas removed their assignment Oct 2, 2021

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2022

antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2022

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 6, 2022

antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 13, 2022

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 13, 2022

antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 13, 2022

antoninbas mentioned this issue Sep 29, 2022

Pod traffic doesn't go through OVS when Antrea agent is in networkPolicyOnly mode #4228

Closed

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 12, 2022

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deleting Antrea should do a complete cleanup #181

Deleting Antrea should do a complete cleanup #181

antoninbas commented Dec 4, 2019

jianjuns commented Dec 4, 2019

antoninbas commented Dec 19, 2019

jianjuns commented Dec 19, 2019

antoninbas commented Dec 19, 2019

jianjuns commented Dec 20, 2019

jianjuns commented Dec 20, 2019

antoninbas commented Dec 20, 2019

antoninbas commented Jan 20, 2020

salv-orlando commented Mar 25, 2020

antoninbas commented Apr 28, 2020

github-actions bot commented Jan 5, 2022

github-actions bot commented Apr 6, 2022

github-actions bot commented Jul 13, 2022

github-actions bot commented Oct 12, 2022

Deleting Antrea should do a complete cleanup #181

Deleting Antrea should do a complete cleanup #181

Comments

antoninbas commented Dec 4, 2019

jianjuns commented Dec 4, 2019

antoninbas commented Dec 19, 2019

jianjuns commented Dec 19, 2019

antoninbas commented Dec 19, 2019

jianjuns commented Dec 20, 2019

jianjuns commented Dec 20, 2019

antoninbas commented Dec 20, 2019

antoninbas commented Jan 20, 2020

salv-orlando commented Mar 25, 2020

antoninbas commented Apr 28, 2020

github-actions bot commented Jan 5, 2022

github-actions bot commented Apr 6, 2022

github-actions bot commented Jul 13, 2022

github-actions bot commented Oct 12, 2022