Restarting aws-node leaks all pod IPs after #371 due to use of incorrect container ID #712
Fixing this without just going back to being Docker-only seems tricky. The only unique identifier available through the CNI is the infra container ID, which as far as I can tell is just flat-out not available through the Kubernetes API. That would seem to imply that the only option would be to go back to talking to the container runtime again (and adding support for the CRI socket protocol). Support for non-Docker runtimes is important to us (we'd switch CNIs first, if we had no other choice); I'm happy to help out with implementation if you'd like. Let me know what you think about my proposed solution!
@drakedevel Thanks a lot for the detailed bug report! I agree that checking with the container runtime is probably the only way to be sure. Are you proposing to add code to talk to the CRI over its socket?
@mogren Yep, pretty much! I think we actually only need a single CRI call for this.

When I originally submitted this report I was worried there would need to be two codepaths -- one for Docker using the Docker API and one for everything else using the CRI -- but a single CRI codepath should cover both, since the kubelet exposes a CRI socket for Docker as well.

Another thought: the CRI interface actually returns enough info to fully populate the datastore, which might let us drop the extra call to the API server at startup.

Also, a quick release logistics / branch management question for you: how will this work interact with the 1.6 branch? Assuming this blocks 1.6.0, would the plan be to land CRI support on master and backport to 1.6? Or just revert 1.6 to Docker-only and include CRI support in 1.7?
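(For illustration only: a minimal Go sketch of what listing pod sandboxes over the CRI gRPC socket could look like. The package and function names, socket handling, and `v1alpha2` API version are assumptions, not the plugin's eventual implementation.)

```go
package crisketch

import (
	"context"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

// listSandboxIDs returns a map from pod UID to sandbox (infra container) ID,
// as reported by the container runtime over its CRI socket, e.g.
// "unix:///var/run/dockershim.sock" or "unix:///var/run/crio/crio.sock".
func listSandboxIDs(socket string) (map[string]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, socket, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	resp, err := client.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{
		Filter: &runtimeapi.PodSandboxFilter{
			State: &runtimeapi.PodSandboxStateValue{State: runtimeapi.PodSandboxState_SANDBOX_READY},
		},
	})
	if err != nil {
		return nil, err
	}

	ids := make(map[string]string, len(resp.Items))
	for _, sb := range resp.Items {
		// sb.Id is the sandbox / infra-container ID -- the same value the kubelet
		// passes to CNI plugins as K8S_POD_INFRA_CONTAINER_ID.
		ids[sb.Metadata.Uid] = sb.Id
	}
	return ids, nil
}
```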
@drakedevel Unless it proves to be way too much work, I'd rather hold off on releasing 1.6 until the CRI support is merged, just to fix this issue. We know that removing the Docker dependency is quite a big change, and that we will need to really test all kinds of corner cases. Getting rid of the extra call to the API server at startup would be great, but that's something that could wait until v1.6.1 or so. Thanks a lot for taking a look at this!
@mogren Sounds good on both counts! I should have time to start working on this today; I'll start a draft PR once there's anything to put in it 😄
@drakedevel Looking at this comment from 2 years ago, I wonder what the idea was here. Is the issue that the container ID will no longer be empty in the data store, since it's fetched from the kubelet, and then this code path will never succeed?
@mogren I think that's right -- digging in, that codepath dates from before the Docker code was added. On startup, the datastore got populated from the Kubernetes API with a nil container ID, so that codepath was useful. Once the Docker code started populating the container ID on startup, it stopped working (but as long as the container ID is right, the fallback is unnecessary). (Probably for the best that it's dead -- I shudder to think of the issues you could run into with StatefulSets or other non-unique pod names.)
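(A hypothetical illustration of the fallback being discussed, with all names invented rather than taken from the actual datastore code: a lookup keyed on pod identity plus container ID, with a second lookup that only matches when the stored container ID is empty.)

```go
package datastoresketch

// podKey identifies an allocation by pod identity and container ID.
type podKey struct {
	Namespace, Name, ContainerID string
}

// dataStore maps allocations to pod IPs (heavily simplified).
type dataStore struct {
	pods map[podKey]string // value is the assigned IP
}

// unassign releases the IP for a pod. The second lookup is the fallback under
// discussion: it can only match when the entry was stored with an empty
// container ID, which stops happening once startup records real container IDs.
func (ds *dataStore) unassign(namespace, name, containerID string) (string, bool) {
	if ip, ok := ds.pods[podKey{namespace, name, containerID}]; ok {
		delete(ds.pods, podKey{namespace, name, containerID})
		return ip, true
	}
	if ip, ok := ds.pods[podKey{namespace, name, ""}]; ok {
		delete(ds.pods, podKey{namespace, name, ""})
		return ip, true
	}
	return "", false
}
```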
Fix for this has been merged and is available in the v1.6.0-rc5 build. |
For context: we're running 1.5.4 of this CNI plugin on our staging cluster (Kubernetes 1.16.2, CRI-O, non-EKS) with a local backport of #371 for CRI-O support. Everything works pretty well, except that we've experienced a significant number of instances of pods failing to start due to node IP exhaustion, even though we have pod limits in place that should prevent this. The introspection interface reveals IPs allocated to long-dead pods. Restarting `aws-node` releases those IPs, but more always crop up eventually.

The root cause is some apparent confusion between the `K8S_POD_INFRA_CONTAINER_ID` argument provided to the CNI plugin and the value of the Pod `status.containerStatuses[].containerID` field, introduced in #371. As the name implies, `K8S_POD_INFRA_CONTAINER_ID` refers to the infrastructure container, typically named `k8s_POD_*` in Docker. The `containerStatuses` field, meanwhile, refers to the actual application containers, named `k8s_<name>_*` in Docker. The two IDs are never the same, because they necessarily refer to different containers.

This only causes major issues after the `aws-node` process is restarted. When the process starts up after a restart, it checks the Kubernetes API to learn what containers are on the system, and records them in the datastore. At this point, the container ID is populated from the pod status. When a pod is later deleted, the CNI deletion request will fail because the infrastructure container ID in the request won't match the datastore, and the IP will be leaked until the next `aws-node` restart.
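(To make the distinction concrete, here is a minimal Go sketch, with invented names and not taken from this plugin's code, that contrasts the two IDs: the infra/sandbox container ID a CNI plugin is handed by the kubelet, and the application container IDs reported in the pod's `status.containerStatuses`. It assumes a recent client-go and an already configured clientset.)

```go
package idsketch

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// printIDs contrasts the ID a CNI plugin is given with the IDs the API reports.
func printIDs(client kubernetes.Interface, namespace, podName string) error {
	// What a CNI plugin sees on ADD/DEL: the infra ("pause") container ID. The
	// kubelet passes it as CNI_CONTAINERID and as K8S_POD_INFRA_CONTAINER_ID
	// inside CNI_ARGS.
	infraID := os.Getenv("CNI_CONTAINERID")

	// What the Kubernetes API reports: the application containers, e.g.
	// "docker://390..." for the coredns container itself.
	pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), podName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, cs := range pod.Status.ContainerStatuses {
		fmt.Printf("app container %q has ID %s; the CNI request carried infra ID %s\n",
			cs.Name, cs.ContainerID, infraID)
	}
	return nil
}
```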
This issue readily reproduces with 1.6.0-rc4 on default EKS with Docker. I did the following steps (using `eksctl 0.8.0`):

Where `cluster.yaml` is:

I then SSH'd into one of the nodes, where a couple of `coredns` pods were running. This was the introspection output:

`docker ps` (limited to coredns) yielded:

Note that the container IDs in the introspection output (`93e...` and `c32...`) correspond to the `k8s_POD_` containers (running `/pause`).

I then patched the DaemonSet to use 1.6.0-rc4, as described in the release notes:
The introspection output changed, with this diff:

Note that the new IDs are both formatted differently and refer to different containers: they now refer to `390...` and `893...`, which are the `k8s_coredns_` containers!

We can now trigger a leak by deleting one of these pods:
`docker ps` (limited to coredns) yields:

And this is the diff to the introspection output:

Note the distinct lack of `-` lines there -- `192.168.63.20` is still listed, which was allocated to the pod we just deleted!
`ipamd.log` confirms the reason:

It received a deletion request for `coredns-6f647f5754-gpm7t` with container ID `93e...` (which is the infrastructure container). But after the restart, the container ID in the datastore was `docker://390...` instead, so the deletion failed.
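(A tiny illustration of the mismatch shown in `ipamd.log`, with the IDs shortened as in the excerpts above: even stripping the `docker://` prefix cannot make the two values match, because they name different containers.)

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	requestID := "93e..."         // infra container ID from the CNI DEL request
	storedID := "docker://390..." // app container ID recorded from pod status at startup

	match := strings.TrimPrefix(storedID, "docker://") == requestID
	fmt.Println("delete request matches datastore entry:", match) // prints false -> the IP leaks
}
```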
I've attached the `aws-cni-support.tar.gz` generated by the support script at the end of this process, as well as a tarball of the complete `docker ps` and introspection files I was showing diffs of. (I spun up this cluster to reproduce this bug; there's nothing private anywhere.)

aws-cni-support.tar.gz
introspection-docker.tar.gz