IP may leak after agent restart #4326

Closed
wenyingd opened this issue Oct 21, 2022 · 10 comments · Fixed by #5660
Assignees: antoninbas
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

Comments

@wenyingd (Contributor) commented Oct 21, 2022

Describe the bug

In the existing reconcile logic, antrea-agent removes OVS ports, OpenFlow entries, and interface store items for Pods that were deleted while the agent was down. However, the logic does not call IPAM to release the allocated IPs, which may lead to a leak of IP resources.
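
For illustration, a minimal Go sketch of the gap described above; all names (PodInterface, IPAMDriver, reconcileDeletedPods, ovsDelete) are hypothetical and do not correspond to Antrea's actual types or functions:

// Hypothetical sketch of the reconcile gap; names do not match Antrea's code.
package main

import "log"

// PodInterface is a stale interface store entry for a Pod that was deleted
// while the agent was down.
type PodInterface struct {
    PodNamespace, PodName, ContainerID, IP string
}

// IPAMDriver abstracts the IPAM plugin (e.g. host-local or AntreaIPAM).
type IPAMDriver interface {
    Release(containerID string) error
}

// reconcileDeletedPods cleans up node-local state for stale Pod interfaces.
func reconcileDeletedPods(stale []PodInterface, ovsDelete func(containerID string) error, ipam IPAMDriver) {
    for _, itf := range stale {
        // Existing behavior: remove the OVS port, OpenFlow entries, and the
        // interface store item.
        if err := ovsDelete(itf.ContainerID); err != nil {
            log.Printf("OVS cleanup failed for %s/%s: %v", itf.PodNamespace, itf.PodName, err)
            continue
        }
        // Missing step described in this issue: without this call the
        // allocated IP is never returned to IPAM and leaks.
        if err := ipam.Release(itf.ContainerID); err != nil {
            log.Printf("Failed to release IP %s: %v", itf.IP, err)
        }
    }
}

func main() {
    reconcileDeletedPods(nil, func(string) error { return nil }, nil)
}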

To Reproduce

Expected

Actual behavior

Versions:

Antrea: main branch

Additional context

wenyingd added the kind/bug label on Oct 21, 2022.
@github-actions (bot)

This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 90 days.

github-actions bot added the lifecycle/stale label on Jan 20, 2023.
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 21, 2023.
tnqn removed the lifecycle/stale label on May 31, 2023.
tnqn reopened this on May 31, 2023.
@github-actions (bot)

This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 90 days.

github-actions bot added the lifecycle/stale label on Aug 30, 2023.
antoninbas removed the lifecycle/stale label on Oct 30, 2023.
@antoninbas (Contributor)

I was taking a look at this old issue and I tried to reproduce the IP leakage.
Based on my tests and on the kubelet code, it seems very difficult to reproduce this situation, even with the following steps:

  1. schedule a workload Pod on nodeX
  2. remove the antrea-agent Pod from nodeX by editing the DaemonSet's affinity
  3. force delete the workload Pod with kubectl
  4. wait for a few minutes
  5. kill the kubelet service (it will restart immediately)
  6. revert the change to the antrea-agent DaemonSet

After a few seconds (less than a minute), the new antrea-agent Pod receives a CmdDel command for the old workload Pod, and the IP is released.

It seems that the kubelet garbage collector is taking care of removing the old sandbox. When I increase kubelet log verbosity, I see:

Oct 30 22:20:02 kind-worker kubelet[37935]: I1030 22:20:02.300464   37935 kuberuntime_gc.go:175] "Removing sandbox" sandboxID="4eb3e402b4d2342278fdf12e5d3b33a206363f4ec93af271020a8736604a8442"
Oct 30 22:20:02 kind-worker kubelet[37935]: E1030 22:20:02.322027   37935 kuberuntime_gc.go:180] "Failed to stop sandbox before removing" err="rpc error: code = Unknown desc = failed to destroy network for sandbox \"4eb3e402b4d2342278fdf12e5d3b33a206363f4ec93af271020a8736604a8442\": plugin type=\"antrea\" failed (delete): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/antrea/cni.sock: connect: connection refused\"" sandboxID="4eb3e402b4d2342278fdf12e5d3b33a206363f4ec93af271020a8736604a8442"

kubelet is able to list all sandboxes from the container runtime, and the container runtime I am using (containerd v1.7.1) will not remove the sandbox until the network has been deleted:

root@kind-worker:/# crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
5a64643d69ed8       32 minutes ago      NotReady            toolbox-rhc2g       default             0                   (default)
4eb3e402b4d23       34 minutes ago      NotReady            toolbox-qfnfr       default             1                   (default)
b49dac301f1fe       2 hours ago         Ready               kube-proxy-kz7zx    kube-system         0                   (default)

It is possible that some older versions of containerd were not that robust.

@tnqn do you think we should have a more defensive posture in Antrea and add logic to proactively release IPs on agent (re)start?

@tnqn (Member) commented Oct 31, 2023

@antoninbas I have tried the following K8s versions deployed via kind and haven't managed to reproduce the IP leak with the steps you described:

  • v1.26.6
  • v1.24.15
  • v1.23.17
  • v1.22.17
  • v1.21.14

So I feel this case may have been taken care of in the normal workflow, but it's hard to say it's handled correctly in abnormal cases. For example, containerd previously had this issue: containerd/containerd#5569, and I remember you were investigating another leak with newer containerd versions in the last few weeks, for which we don't know the root cause yet? And there could be other CRIs which implement the cleanup differently. Therefore, I tend to agree that, instead of relying on kubelet/container runtime to be bug free regarding CNI cleanup, it's more robust to perform a conservative garbage collection on agent start. We should log such cleanup explicitly, as we don't actually expect it unless there is a bug in the container runtime or Antrea.

@antoninbas (Contributor)

I remember you were investigating another leak with newer containerd versions in the last few weeks, for which we don't know the root cause yet?

The person who reported the issue confirmed that when using a more recent version of containerd, the issue cannot be reproduced anymore (perhaps because of this fix: containerd/containerd#5904). With an older containerd version, they were able to reproduce the issue consistently.

Therefore, I tend to agree that, instead of relying on kubelet/container runtime to be bug free regarding CNI cleanup, it's more robust to perform a conservative garbage collection on agent start.

Sounds good to me. I can take a stab at it.
It probably won't be possible to validate that code with e2e testing, but we can have a unit test.

@antoninbas antoninbas self-assigned this Oct 31, 2023
@antoninbas (Contributor)

@tnqn I wrote some PoC code to see if I could come up with a viable approach: fbfa402

However, it feels very much like an anti-pattern: we have to craft the CNI arguments from "scratch" in order to invoke the IPAM plugin. It can work (both when delegating to host-local IPAM and when using AntreaIPAM), but it feels like a hack.

The code I wrote uses the interfaceStore (populated from the persisted OVSDB) to determine for which containers we should do "IPAM garbage collection". An alternative approach would be to maintain and persist a CNI result cache, which would be updated whenever we invoke the IPAM plugin (ideally, AntreaIPAM too). We can use libcni for this. With this approach, we can use the cached results to easily generate DEL commands for Pods that no longer exist. This may be a cleaner approach (I need to check if it works), but it's probably a bit more disruptive.
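
For illustration only, here is a rough sketch of what "crafting the CNI arguments from scratch" could look like when delegating to host-local: exec the plugin binary directly with a DEL command, passing a network configuration on stdin and the standard CNI_* environment variables. This is not the PoC code from fbfa402; the config contents, paths, and names below are assumptions made for the example.

// Illustrative sketch only: craft a CNI DEL for the host-local IPAM plugin
// for a stale container. Paths and the config JSON are assumptions.
package main

import (
    "bytes"
    "fmt"
    "os/exec"
)

func hostLocalDel(containerID, ifName string) error {
    // Minimal network config passed on stdin; a real implementation would
    // reuse the IPAM section from the CNI config present on the Node.
    netConf := []byte(`{
  "cniVersion": "0.4.0",
  "name": "antrea",
  "ipam": {"type": "host-local", "ranges": [[{"subnet": "10.10.0.0/24"}]]}
}`)

    cmd := exec.Command("/opt/cni/bin/host-local")
    cmd.Stdin = bytes.NewReader(netConf)
    // Per the CNI contract, the command and identifiers are passed through
    // environment variables.
    cmd.Env = []string{
        "CNI_COMMAND=DEL",
        "CNI_CONTAINERID=" + containerID,
        "CNI_NETNS=", // the netns is usually gone for a deleted Pod
        "CNI_IFNAME=" + ifName,
        "CNI_PATH=/opt/cni/bin",
    }
    if out, err := cmd.CombinedOutput(); err != nil {
        return fmt.Errorf("host-local DEL failed: %v, output: %s", err, out)
    }
    return nil
}

func main() {
    if err := hostLocalDel("4eb3e402b4d23", "eth0"); err != nil {
        fmt.Println(err)
    }
}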

Going back to the relevance of this issue, it does feel inconsistent to treat interfaces and IP addresses differently. Either we assume that the container runtime (+ kubelet) will always eventually invoke CNI DEL, in which case that part of the reconciliation logic is not really needed at all. Or we assume that a resource leak is possible under some conditions, and we have to take care of all resources consistently.

@tnqn (Member) commented Nov 1, 2023

I remember you were investigating another leak with newer containerd versions in the last few weeks, for which we don't know the root cause yet?

The person who reported the issue confirmed that when using a more recent version of containerd, the issue cannot be reproduced anymore (perhaps because of this fix: containerd/containerd#5904). With an older containerd version, they were able to reproduce the issue consistently.

I took a look at the issue and the PR; they seem to ensure that CNI will be called repeatedly until it returns success. But I remember that in our case there was no CNI invocation, so it looks different. Do you know with which version the issue can be reproduced and with which it cannot? According to the enhancements that have been made in containerd, it does seem CNI cleanup is ensured now.

However, it feels very much like an anti-pattern: we have to craft the CNI arguments from "scratch" in order to invoke the IPAM plugin. It can work (both when delegating to host-local IPAM and when using AntreaIPAM), but it feels like a hack.

I was thinking of something more hacky: accessing the IPAM dir and deleting IPs that shouldn't exist. I remember we offered a script for users to do the cleanup when they encountered the IP leak: https://github.com/antrea-io/antrea/blob/main/hack/gc-host-local.sh.

Going back to the relevance of this issue, it does feel inconsistent to treat interfaces and IP addresses differently. Either we assume that the container runtime (+ kubelet) will always eventually invoke CNI DEL, in which case that part of the reconciliation logic is not really needed at all. Or we assume that a resource leak is possible under some conditions, and we have to take care of all resources consistently.

Agree. When we added this logic, kubelet and the container runtime were not that robust, although the logic itself is not complete. But looking at the related issues in containerd and their fixes, I'm also wondering whether it's still worth making the logic more complex to cover this case.

If we want to keep the logic for now, how about just using the simple (but hacky) approach to release known leaked IPs bypassing the IPAM plugin, without adding more cache/storage? It will be very specific to the host-local plugin and could break if host-local changes its implementation, but it's the only case we have encountered in production, and this is a workaround anyway. If IP leaks still happen, the right direction is still patching and upgrading the container runtime.

@antoninbas (Contributor)

I took a look at the issue and the PR; they seem to ensure that CNI will be called repeatedly until it returns success. But I remember that in our case there was no CNI invocation, so it looks different. Do you know with which version the issue can be reproduced and with which it cannot? According to the enhancements that have been made in containerd, it does seem CNI cleanup is ensured now.

Yes, that PR seemed like a bit of a stretch, so maybe it was some other patch.
The issue could be reproduced consistently with containerd 1.6.6, but could not be reproduced with containerd 1.6.18.
Once it became apparent that the issue was caused by containerd, not Antrea, and that it could not be reproduced with a more recent version of containerd, I didn't try to identify the exact patch.
We also have this in Antrea now, which avoids that issue as well: #5548

If we want to keep the logic for now, how about just using the simple (but hacky) approach to release known leaked IPs bypassing the IPAM plugin, without adding more cache/storage?

Hmm, I was thinking it would be nice to have a unified approach which also works with AntreaIPAM. Don't you think the same issue can happen there if there is a missing CNI DEL? I.e., the IPPool CR has leaked IPs.

@tnqn (Member) commented Nov 2, 2023

If we want to keep the logic for now, how about just using the simple (but hacky) approach to release known leaked IPs bypassing the IPAM plugin, without adding more cache/storage?

Hmm, I was thinking it would be nice to have a unified approach which also works with AntreaIPAM. Don't you think the same issue can happen there if there is a missing CNI DEL? I.e., the IPPool CR has leaked IPs.

I remember it has been taken care of by antrea-controller:

// Inspect all IPPools for stale IP Address entries.
// This may happen if controller was down during StatefulSet delete event.
// If such entry is found, enqueue cleanup event for this StatefulSet.
func (c *AntreaIPAMController) cleanupStaleAddresses() {
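
To make the idea in this excerpt concrete, here is an illustrative-only sketch of a stale-address scan over an IPPool; the types and field names are hypothetical and do not reflect Antrea's actual CRD API:

// Illustrative sketch of scanning an IP pool for addresses whose owning
// StatefulSet no longer exists; types are hypothetical, not Antrea's API.
package main

import "fmt"

type ipAllocation struct {
    IP               string
    OwnerStatefulSet string // "namespace/name", empty if not StatefulSet-owned
}

type ipPool struct {
    Name        string
    Allocations []ipAllocation
}

// staleStatefulSets returns StatefulSets that still own IPs in the pool but
// no longer exist, so a cleanup event can be enqueued for each of them.
func staleStatefulSets(pool ipPool, exists func(name string) bool) []string {
    var stale []string
    seen := map[string]bool{}
    for _, a := range pool.Allocations {
        owner := a.OwnerStatefulSet
        if owner == "" || seen[owner] {
            continue
        }
        seen[owner] = true
        if !exists(owner) {
            stale = append(stale, owner)
        }
    }
    return stale
}

func main() {
    pool := ipPool{Name: "pool1", Allocations: []ipAllocation{{IP: "10.2.0.5", OwnerStatefulSet: "default/db"}}}
    fmt.Println(staleStatefulSets(pool, func(string) bool { return false })) // [default/db]
}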

@antoninbas
Copy link
Contributor

OK, thanks for the link. If AntreaIPAM has its own GC, then I think it's acceptable to have a solution which is specific to host-local.

@antoninbas antoninbas added this to the Antrea v1.15 release milestone Nov 2, 2023
antoninbas added commits to antoninbas/antrea that referenced this issue on Nov 3, Nov 4, Nov 13, and Nov 14, 2023, with the same commit message as the merged commit below.
antoninbas added a commit that referenced this issue Nov 15, 2023
During CNIServer reconciliation, we perform host-local IPAM garbage
collection (GC) by comparing the set of IPs allocated to local Pods with
the set of IPs currently reserved by the plugin. We release any IP
reserved by the plugin that is not in use by a local Pod. The purpose is
to avoid leaking IP addresses when there is a bug in the container
runtime, which has happened in the past.

Two key design choices were made:
* We do not invoke CNI DEL to release IPs, instead we access the
  host-local data which is persisted on the Node, and modify it as
  needed.
* We do not rely on the interface store (as persisted to OVSDB) to
  determine the set of IPs that may have been leaked. In case of an
  Antrea bug, it could be possible (although unlikely) for an IP to
  still be allocated by host-local but be missing from the interface
  store. Instead, we list all allocated IPs from the host-local data (an
  allocated IP corresponds to one disk file).

This approach is essentially the same as our existing script:
https://github.com/antrea-io/antrea/blob/main/hack/gc-host-local.sh

Fixes #4326

Signed-off-by: Antonin Bas <[email protected]>
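
As a rough illustration of the GC described in this commit message (not the actual Antrea implementation), and assuming the host-local plugin keeps one reservation file per allocated IP under a per-network data directory such as /var/lib/cni/networks/antrea, a GC pass on agent start could look like this sketch:

// Sketch of host-local IPAM GC on agent start; not Antrea's actual code.
// Assumption: host-local keeps one file per allocated IP under dataDir,
// named after the IP (e.g. /var/lib/cni/networks/antrea/10.10.0.5).
package main

import (
    "log"
    "net"
    "os"
    "path/filepath"
)

// gcHostLocalIPs removes reservation files for IPs that are not in use by
// any local Pod. podIPs is the set of IPs currently assigned to local Pods.
func gcHostLocalIPs(dataDir string, podIPs map[string]bool) error {
    entries, err := os.ReadDir(dataDir)
    if err != nil {
        return err
    }
    for _, e := range entries {
        if e.IsDir() || net.ParseIP(e.Name()) == nil {
            // Skip subdirectories and bookkeeping files (e.g. lock files);
            // only names that parse as IP addresses are reservations.
            continue
        }
        if podIPs[e.Name()] {
            continue // IP is in use by a local Pod, keep it
        }
        // Log explicitly: a leak is not expected unless there is a bug in
        // the container runtime or in Antrea.
        log.Printf("Releasing leaked host-local IP %s", e.Name())
        if err := os.Remove(filepath.Join(dataDir, e.Name())); err != nil {
            log.Printf("Failed to remove reservation file for %s: %v", e.Name(), err)
        }
    }
    return nil
}

func main() {
    inUse := map[string]bool{"10.10.0.5": true} // example set of local Pod IPs
    if err := gcHostLocalIPs("/var/lib/cni/networks/antrea", inUse); err != nil {
        log.Fatal(err)
    }
}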