Restarting aws-node leaks all pod IPs after #371 due to use of incorrect container ID #712

Closed
drakedevel opened this issue Nov 9, 2019 · 8 comments
Labels: bug, needs investigation, priority/P0 (Highest priority. Someone needs to actively work on this.)

Comments

@drakedevel
Contributor

For context: we're running 1.5.4 of this CNI plugin on our staging cluster (Kubernetes 1.16.2, CRI-O, non-EKS) with a local backport of #371 for CRI-O support. Everything works pretty well, except we've experienced a significant number of instances of pods failing to start due to node IP exhaustion, even though we have pod limits in place that should prevent this. The introspection interface reveals IPs allocated to long-dead pods. Restarting aws-node releases those IPs, but more always crop up eventually.

The root cause is some apparent confusion between the K8S_POD_INFRA_CONTAINER_ID argument provided to the CNI plugin and the value of the Pod status.containerStatuses[].containerID field, introduced in #371.

As the name implies, the K8S_POD_INFRA_CONTAINER_ID refers to the infrastructure container, typically named k8s_POD_* in Docker. The containerStatuses field, meanwhile, refers to the actual application containers, named k8s_<name>_* in Docker. The two IDs are never the same, because they necessarily refer to different containers.

This only causes major issues after the aws-node process is restarted. When the process starts up after a restart, it checks the Kubernetes API to learn what containers are on the system, and records them in the datastore. At this point, the container ID is populated from the pod status. When a pod is later deleted, the CNI deletion request will fail because the infrastructure container ID in the request won't match the datastore, and the IP will be leaked until the next aws-node restart.
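
To make the mismatch concrete, here's a minimal illustration (hypothetical Go, not the actual datastore code) of why the delete can never succeed after a restart:

package main

import "fmt"

// Simplified stand-in for the ipamd datastore: allocations keyed by
// (pod name, namespace, container ID).
type key struct{ name, namespace, containerID string }

func main() {
    // After an aws-node restart, the entry is rebuilt from the Kubernetes API,
    // so the container ID comes from status.containerStatuses[].containerID
    // (the application container).
    allocations := map[key]string{
        {"coredns-6f647f5754-gpm7t", "kube-system",
            "docker://390bb768d7bd8d86cdeff9b10d0ca949731ce2f9b4ee1d93bb63b28f07378eac"}: "192.168.63.20",
    }

    // The CNI DEL request carries K8S_POD_INFRA_CONTAINER_ID, the ID of the
    // pause/sandbox container, which is a different container entirely.
    del := key{"coredns-6f647f5754-gpm7t", "kube-system",
        "93e629d853d9f70464dd5b54987533ffe87930cb28a6414ddb48c94b69845e85"}

    if _, ok := allocations[del]; !ok {
        // The "datastore: unknown pod" path: the IP stays allocated.
        fmt.Println("delete failed, 192.168.63.20 leaked")
    }
}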

This issue readily reproduces with 1.6.0-rc4 on a default EKS cluster with Docker. I took the following steps (using eksctl 0.8.0):

$ eksctl create cluster -f cluster.yaml

Where cluster.yaml is:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: adrake-cni-repro
  region: us-west-2

nodeGroups:
  - name: ng-1
    instanceType: m5.xlarge
    desiredCapacity: 2
    ssh:
      allow: true 

I then SSH'd into one of the nodes, where a couple of coredns pods were running. This was the introspection output:

{
    "coredns-6f647f5754-gpm7t_kube-system_93e629d853d9f70464dd5b54987533ffe87930cb28a6414ddb48c94b69845e85": {
        "DeviceNumber": 0,
        "IP": "192.168.63.20"
    },
    "coredns-6f647f5754-nzs66_kube-system_c32c237fb5293478da938e388f6e22486790facba0d178ff24271829865976cf": {
        "DeviceNumber": 0,
        "IP": "192.168.37.12"
    }
}

docker ps (limited to coredns) yielded:

CONTAINER ID        IMAGE                                                              COMMAND                  CREATED             STATUS              PORTS               NAMES
8933be46aa1d        602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/coredns           "/coredns -conf /etc…"   25 minutes ago      Up 25 minutes                           k8s_coredns_coredns-6f647f5754-nzs66_kube-system_83eeac05-0278-11ea-9fc6-06571088971c_0
390bb768d7bd        602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/coredns           "/coredns -conf /etc…"   25 minutes ago      Up 25 minutes                           k8s_coredns_coredns-6f647f5754-gpm7t_kube-system_83f10b53-0278-11ea-9fc6-06571088971c_0
93e629d853d9        602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause-amd64:3.1   "/pause"                 25 minutes ago      Up 25 minutes                           k8s_POD_coredns-6f647f5754-gpm7t_kube-system_83f10b53-0278-11ea-9fc6-06571088971c_0
c32c237fb529        602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause-amd64:3.1   "/pause"                 25 minutes ago      Up 25 minutes                           k8s_POD_coredns-6f647f5754-nzs66_kube-system_83eeac05-0278-11ea-9fc6-06571088971c_0

Note that the container IDs in the introspection output (93e... and c32...) correspond to the k8s_POD_ containers (running /pause).

I then patched the DaemonSet to use 1.6.0-rc4, as described in the release notes:

$ kubectl patch daemonset aws-node \
-n kube-system \
-p '{"spec": {"template": {"spec": {"containers": [{"image": "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0-rc4","name":"aws-node"}]}}}}'

The introspection output changed, with this diff:

--- pre-introspect.json	2019-11-08 23:08:32.017719083 +0000
+++ post-introspect.json	2019-11-08 23:12:51.484877866 +0000
@@ -1,9 +1,9 @@
 {
-    "coredns-6f647f5754-gpm7t_kube-system_93e629d853d9f70464dd5b54987533ffe87930cb28a6414ddb48c94b69845e85": {
+    "coredns-6f647f5754-gpm7t_kube-system_docker://390bb768d7bd8d86cdeff9b10d0ca949731ce2f9b4ee1d93bb63b28f07378eac": {
         "DeviceNumber": 0,
         "IP": "192.168.63.20"
     },
-    "coredns-6f647f5754-nzs66_kube-system_c32c237fb5293478da938e388f6e22486790facba0d178ff24271829865976cf": {
+    "coredns-6f647f5754-nzs66_kube-system_docker://8933be46aa1d5f06957bd44d4cd5be2d313f8f2c5bc2b5300eb2a91b87077d23": {
         "DeviceNumber": 0,
         "IP": "192.168.37.12"
     }

Note that the new IDs are both formatted differently and refer to different containers: they now point at 390... and 893..., which are the k8s_coredns_ (application) containers!

We can now trigger a leak by deleting one of these pods:

$ kubectl delete pod -n kube-system coredns-6f647f5754-gpm7t

docker ps (limited to coredns) yields:

CONTAINER ID        IMAGE                                                              COMMAND                  CREATED             STATUS              PORTS               NAMES
a8a1641783f5        a4de3c20b947                                                       "/coredns -conf /etc…"   52 seconds ago      Up 51 seconds                           k8s_coredns_coredns-6f647f5754-hjrtp_kube-system_80851bfe-027d-11ea-9fc6-06571088971c_0
de613afeed3e        602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause-amd64:3.1   "/pause"                 52 seconds ago      Up 51 seconds                           k8s_POD_coredns-6f647f5754-hjrtp_kube-system_80851bfe-027d-11ea-9fc6-06571088971c_0
8933be46aa1d        602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/coredns           "/coredns -conf /etc…"   31 minutes ago      Up 31 minutes                           k8s_coredns_coredns-6f647f5754-nzs66_kube-system_83eeac05-0278-11ea-9fc6-06571088971c_0
c32c237fb529        602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause-amd64:3.1   "/pause"                 31 minutes ago      Up 31 minutes                           k8s_POD_coredns-6f647f5754-nzs66_kube-system_83eeac05-0278-11ea-9fc6-06571088971c_0

And this is the diff to the introspection output:

--- post-introspect.json	2019-11-08 23:12:51.484877866 +0000
+++ restart-introspect.json	2019-11-08 23:15:00.590447687 +0000
@@ -3,6 +3,10 @@
         "DeviceNumber": 0,
         "IP": "192.168.63.20"
     },
+    "coredns-6f647f5754-hjrtp_kube-system_de613afeed3eabfe151f11507eaefff6e5ed738776c37c80bb0eef28627e92e7": {
+        "DeviceNumber": 2,
+        "IP": "192.168.62.24"
+    },
     "coredns-6f647f5754-nzs66_kube-system_docker://8933be46aa1d5f06957bd44d4cd5be2d313f8f2c5bc2b5300eb2a91b87077d23": {
         "DeviceNumber": 0,
         "IP": "192.168.37.12"

Note the distinct lack of removed (-) lines there: 192.168.63.20, which was allocated to the pod we just deleted, is still listed!

ipamd.log confirms the reason:

2019-11-08T23:14:24.104Z [INFO]         Received DelNetwork for IP <nil>, Pod coredns-6f647f5754-gpm7t, Namespace kube-system, Container 93e629d853d9f70464dd5b54987533ffe87930cb28a6414ddb48c94b69845e85
2019-11-08T23:14:24.104Z [DEBUG]        UnassignPodIPv4Address: IP address pool stats: total:28, assigned 2, pod(Name: coredns-6f647f5754-gpm7t, Namespace: kube-system, Container 93e629d853d9f70464dd5b54987533ffe87930cb28a6414ddb48c94b69845e85)
2019-11-08T23:14:24.104Z [WARN]         UnassignPodIPv4Address: Failed to find pod coredns-6f647f5754-gpm7t namespace kube-system Container 93e629d853d9f70464dd5b54987533ffe87930cb28a6414ddb48c94b69845e85
2019-11-08T23:14:24.104Z [DEBUG]        UnassignPodIPv4Address: IP address pool stats: total:28, assigned 2, pod(Name: coredns-6f647f5754-gpm7t, Namespace: kube-system, Container )
2019-11-08T23:14:24.104Z [WARN]         UnassignPodIPv4Address: Failed to find pod coredns-6f647f5754-gpm7t namespace kube-system Container 
2019-11-08T23:14:24.104Z [INFO]         Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod

It received a deletion request for coredns-6f647f5754-gpm7t with container ID 93e... (which is the infrastructure container). But, after the restart, the container ID in the datastore had the ID docker://390... instead, so the deletion failed.

I've attached the aws-cni-support.tar.gz generated by the support script at the end of this process, as well as a tarball of the complete docker ps and introspection files I was showing diffs of. (I spun up this cluster just to reproduce this bug; there's nothing private anywhere.)

aws-cni-support.tar.gz
introspection-docker.tar.gz

@drakedevel
Contributor Author

Fixing this without just going back to being Docker-only seems tricky. The only unique identifier available through the CNI is the infra container ID, which as far as I can tell is simply not available through the Kubernetes API. That seems to imply the only option is to go back to talking to the container runtime directly (and add support for the CRI socket protocol).
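
For reference, the CNI invocation only ever carries the sandbox ID. Here's a minimal sketch of what a plugin sees on DEL, going by the CNI conventions (illustrative, not this plugin's actual entry point):

package main

import (
    "fmt"
    "os"
    "strings"
)

// A CNI plugin gets its request through environment variables and stdin.
// The only container identifier in the request is the sandbox (pause)
// container: CNI_CONTAINERID, echoed in CNI_ARGS as
// K8S_POD_INFRA_CONTAINER_ID. The application container IDs from
// status.containerStatuses[] are never part of it.
func main() {
    fmt.Println("sandbox container:", os.Getenv("CNI_CONTAINERID"))

    // CNI_ARGS looks roughly like:
    // IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=coredns-...;K8S_POD_INFRA_CONTAINER_ID=93e6...
    for _, kv := range strings.Split(os.Getenv("CNI_ARGS"), ";") {
        if strings.HasPrefix(kv, "K8S_POD_") {
            fmt.Println(kv)
        }
    }
}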

Support for non-Docker runtimes is important to us (we'd switch CNIs first, if we had no other choice); I'm happy to help out with implementation if you'd like. Let me know what you think about my proposed solution!

@mogren added the bug, needs investigation, and priority/P0 (Highest priority. Someone needs to actively work on this.) labels on Nov 11, 2019
@mogren
Contributor

mogren commented Nov 11, 2019

@drakedevel Thanks a lot for the detailed bug report! I agree that checking with the container runtime is probably the only way to be sure. Are you proposing to add code that talks to the CRI using the ListContainers method, instead of the Docker-specific API that was previously used?

@drakedevel
Contributor Author

@mogren Yep, pretty much! I think we actually only need ListPodSandbox -- that gets us at least the infra (sandbox) container ID as well as the pod name/namespace/UID, which I believe is everything we're currently using. (The simplest implementation would be basically a drop-in replacement for pkg/docker.)

When I originally submitted this report, I was worried there would need to be two codepaths -- one for Docker using docker.sock (as on the 1.5 branch), and one using the CRI interface. However, I'd completely forgotten about dockershim.sock, so I'm pretty sure we can just use that and have a single codepath that goes through the CRI, as sketched below.
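
Roughly what I have in mind, as a sketch only (the cri-api import version, socket path, and error handling here are assumptions, not the eventual implementation):

package main

import (
    "context"
    "fmt"
    "log"
    "net"
    "time"

    "google.golang.org/grpc"
    cri "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
    // dockershim on a stock EKS node; containerd/CRI-O expose their own sockets.
    const criSocket = "/var/run/dockershim.sock"

    conn, err := grpc.Dial(criSocket,
        grpc.WithInsecure(),
        grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
            return (&net.Dialer{}).DialContext(ctx, "unix", addr)
        }))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    client := cri.NewRuntimeServiceClient(conn)
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // List only the ready sandboxes; each one gives us the infra container ID
    // plus the pod name/namespace/UID -- the tuple the datastore needs.
    resp, err := client.ListPodSandbox(ctx, &cri.ListPodSandboxRequest{
        Filter: &cri.PodSandboxFilter{
            State: &cri.PodSandboxStateValue{State: cri.PodSandboxState_SANDBOX_READY},
        },
    })
    if err != nil {
        log.Fatal(err)
    }

    for _, sandbox := range resp.Items {
        // sandbox.Id is the same ID a later CNI DEL request will carry.
        fmt.Printf("%s/%s uid=%s sandbox=%s\n",
            sandbox.Metadata.Namespace, sandbox.Metadata.Name,
            sandbox.Metadata.Uid, sandbox.Id)
    }
}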

Another thought: the CRI interface actually returns enough info to fully populate the K8SPodInfo struct. That seems like it makes it unnecessary to talk to the Kubernetes API during startup, which eliminates a lot of complexity, not to mention the watch on all pods! (You'd still need API access for allocating new ENIs if you're using a custom network config, but that's a comparatively rare operation.) I wouldn't do this in the first PR but it seems like valuable follow-up work if I'm not missing something.

Also, a quick release-logistics / branch-management question for you: how will this work interact with the 1.6 branch? Assuming this blocks 1.6.0, would the plan be to land CRI support on master and backport it to 1.6? Or just revert 1.6 to Docker-only and include CRI support in 1.7?

@mogren
Contributor

mogren commented Nov 11, 2019

@drakedevel Unless it proves to be way too much work, I'd rather hold off on releasing 1.6 until the CRI support is merged, just to fix this issue. We know that removing the Docker dependency is quite a big change, and we will need to thoroughly test all kinds of corner cases.

Getting rid of the extra call to the API server at startup would be great, but that's something that could wait until v1.6.1 or so. Thanks a lot for taking a look at this!

@drakedevel
Contributor Author

@mogren Sounds good on both counts! I should have time to start working on this today; I'll start a draft PR once there's anything to put in it 😄

@mogren
Contributor

mogren commented Nov 15, 2019

@drakedevel Looking at this comment from 2 years ago, I wonder what the idea was here. Is the issue that the container ID will no longer be empty in the datastore, since it's fetched from the kubelet, and so this code path will never succeed?

https://github.com/aws/amazon-vpc-cni-k8s/blame/c74841a229e7f6a369858b073d9feaa1ee890146/ipamd/rpc_handler.go#L95-L100

@drakedevel
Contributor Author

drakedevel commented Nov 20, 2019

@mogren I think that's right -- digging in, that codepath dates from before the Docker code was added. On startup, the datastore got populated from the Kubernetes API with a nil container ID, so that codepath was useful. Once the Docker code started populating the container ID on startup it stopped working (but as long as the container ID is right, the fallback is unnecessary).

(Probably for the best it's dead -- I shudder to think of the issues you could run into with StatefulSets or other non-unique pod names)
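
For posterity, here's a simplified paraphrase of that lookup order (hypothetical types, not the repo's actual datastore API), matching the two UnassignPodIPv4Address attempts in the log above:

package main

import "fmt"

type podKey struct{ name, namespace, containerID string }

// First try the container ID from the CNI request, then retry with an empty
// container ID. The fallback only ever matched entries populated from the
// API server with no container ID; once startup recovery records a
// (different, application) container ID, both lookups miss and the IP is
// never released.
func unassignWithFallback(ds map[podKey]string, name, namespace, containerID string) (string, bool) {
    if ip, ok := ds[podKey{name, namespace, containerID}]; ok {
        return ip, true
    }
    if ip, ok := ds[podKey{name, namespace, ""}]; ok { // "Container " (empty) in the log
        return ip, true
    }
    return "", false // surfaces as "datastore: unknown pod"
}

func main() {
    ds := map[podKey]string{
        {"coredns-6f647f5754-gpm7t", "kube-system", "docker://390bb768d7bd"}: "192.168.63.20",
    }
    if _, ok := unassignWithFallback(ds, "coredns-6f647f5754-gpm7t", "kube-system", "93e629d853d9"); !ok {
        fmt.Println("both lookups missed; IP leaked")
    }
}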

@mogren
Contributor

mogren commented Jan 20, 2020

The fix for this has been merged and is available in the v1.6.0-rc5 build.

@mogren closed this as completed on Jan 20, 2020