Timeout waiting for IPAM to start causing perpetual failure (1.6.1, EKS 1.15) #1055

Closed
InAnimaTe opened this issue Jun 24, 2020 · 27 comments

@InAnimaTe commented Jun 24, 2020

We've been plagued for a while (at least through 1.5.x, maybe even before) with the CNI dying (not sure how) and never coming back up. This sort of issue causes a node to keep accepting workloads that can't provide any value because networking is broken.

We also have #1054 open at present, which covers successful restarts and our attempt to understand why we're restarting in the first place.

For one of the nodes where this happened, we cordoned it and blocked CA from touching it for observation. All of the information below relates to the aws-node instance running on that node, aws-node-mq5bk.

We had case 7079142701 open with AWS, where they suggested #1019 (issue #1011) would fix the problem but couldn't assist beyond that. We've also reached out to our TAM to set up a meeting with your CNI team, and he hasn't responded in over a week.

Therefore, I'm writing this ticket to aggregate the available information from what I'm seeing, because I don't think this problem is quite like the others (see the list at the bottom). Please @mogren, you're my only hope!

I'd really like to verify with engineers that our problem is indeed solved by #1019. We've been told by support to "just upgrade™" before and still saw issues.

Logs K8s provides (kubectl logs):

Starting IPAM daemon in the background ... ok.
ERROR: logging before flag.Parse: E0624 20:52:52.156133       6 memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1: <nil>
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

The support engineer found these errors in the logs (provided via eks-logs-collector.sh)
(ipamd.log):

- {"level":"error","ts":"2020-06-04T13:37:58.660Z","caller":"ipamd/ipamd.go:962","msg":"Failed to retrieve interfaces data from instance metadata RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/:  dial tcp 169.254.169.254:80: connect: connection refused"}
- {"level":"error","ts":"2020-06-04T19:07:03.780Z","caller":"aws-k8s-agent/main.go:30","msg":"Initialization failure: ipamd init: failed to retrieve attached ENIs info"}

(plugin.log):

- {"level":"error","ts":"2020-06-09T04:16:03.253Z","caller":"routed-eni-cni-plugin/cni.go:252","msg":"Error received from DelNetwork gRPC call for pod edge-platform-gateway-66f6864f48-5wbvn namespace default sandbox c7bda6671561a7bfa978017c7f66b440df7991a2efe15b8ebc2ff3bf9c6a26b8: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused\""}

A unique readiness probe failure from around the time I think the container crashed and never came back (~Jun 3/4):

Events from the Pod kube-system/aws-node-mq5bk
1 Unhealthy: Readiness probe failed: runtime: failed to create new OS thread (have 8 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

runtime stack:
runtime.throw(0x5653e61c1951, 0x9)
/usr/local/go/src/runtime/panic.go:774 +0x74
runtime.newosproc(0xc000111500)
/usr/local/go/src/runtime/os_linux.go:153 +0x1bc
runtime.newm1(0xc000111500)
/usr/local/go/src/runtime/proc.go:1853 +0xde
runtime.newm(0x0, 0xc000030500)
/usr/local/go/src/runtime/proc.go:1832 +0x93
runtime.startm(0xc000030500, 0xc000111100)
/usr/local/go/src/runtime/proc.go:1969 +0x12c
runtime.handoffp(0xc000030500)
/usr/local/go/src/runtime/proc.go:1996 +0x54
runtime.entersyscallblock_handoff()
/usr/local/go/src/runtime/proc.go:2937 +0x32
runtime.systemstack(0x0)
/usr/local/go/src/runtime/asm_amd64.s:370 +0x63
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1146

goroutine 23 [syscall]:
runtime.notetsleepg(0x5653e6964660, 0x3b5d979a, 0x0)
/usr/local/go/src/runtime/lock_futex.go:227 +0x38 fp=0xc00004cf60 sp=0xc00004cf30 pc=0x5653e5da1658
runtime.timerproc(0x5653e6964640)
/usr/local/go/src/runtime/time.go:311 +0x2f5 fp=0xc00004cfd8 sp=0xc00004cf60 pc=0x5653e5de2565
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00004cfe0 sp=0xc00004cfd8 pc=0x5653e5df1821
created by runtime.(*timersBucket).addtimerLocked
/usr/local/go/src/runtime/time.go:169 +0x110

goroutine 1 [select]:
google.golang.org/grpc.(*ClientConn).WaitForStateChange(0xc000110e00, 0x5653e6643e60, 0xc00008e840, 0x0, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:502 +0x103
google.golang.org/grpc.DialContext(0x5653e6643e60, 0xc00008e840, 0x7ffff42f3510, 0x6, 0xc000090800, 0x3, 0x4, 0x0, 0x0, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:322 +0xca2
main.main()
/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/grpc-health-probe/main.go:99 +0x344

goroutine 20 [syscall]:
os/signal.signal_recv(0x0)
/usr/local/go/src/runtime/sigqueue.go:147 +0x9e
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:23 +0x24
created by os/signal.init.0
/usr/local/go/src/os/signal/signal_unix.go:29 +0x43

goroutine 22 [chan receive]:
main.main.func1(0xc00008e780, 0xc000095980)
/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/grpc-health-probe/main.go:79 +0x43
created by main.main
/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/grpc-health-probe/main.go:78 +0x11d

goroutine 24 [runnable]:
google.golang.org/grpc.(*ccBalancerWrapper).watcher(0xc000090940)
/go/pkg/mod/google.golang.org/[email protected]/balancer_conn_wrappers.go:67
created by google.golang.org/grpc.newCCBalancerWrapper
/go/pkg/mod/google.golang.org/[email protected]/balancer_conn_wrappers.go:60 +0x16f

goroutine 25 [chan receive]:
net.(*netFD).connect.func1(0xc000092900, 0xc000092960, 0xc0000829b0, 0xc000104380)
/usr/local/go/src/net/fd_unix.go:117 +0x51
net.(*netFD).connect(0xc000104380, 0x5653e6643e60, 0xc00008ea20, 0x0, 0x0, 0x5653e663b8a0, 0xc0000b66c0, 0x5653e663b8a0, 0xc0000261c0, 0x0, ...)
/usr/local/go/src/net/fd_unix.go:173 +0x42e
net.(*netFD).dial(0xc000104380, 0x5653e6643e60, 0xc00008ea20, 0x5653e6646160, 0x0, 0x5653e6646160, 0xc00011e6f0, 0x0, 0x0, 0x0)
/usr/local/go/src/net/sock_posix.go:149 +0x101
net.socket(0x5653e6643e60, 0xc00008ea20, 0x5653e61bd657, 0x3, 0x2, 0x1, 0x0, 0x0, 0x5653e6646160, 0x0, ...)
/usr/local/go/src/net/sock_posix.go:70 +0x1c2
net.internetSocket(0x5653e6643e60, 0xc00008ea20, 0x5653e61bd657, 0x3, 0x5653e6646160, 0x0, 0x5653e6646160, 0xc00011e6f0, 0x1, 0x0, ...)
/usr/local/go/src/net/ipsock_posix.go:141 +0x143
net.(*sysDialer).doDialTCP(0xc000104300, 0x5653e6643e60, 0xc00008ea20, 0x0, 0xc00011e6f0, 0x5653e659e440, 0x5653e697ce80, 0x0)
/usr/local/go/src/net/tcpsock_posix.go:65 +0xc4
net.(*sysDialer).dialTCP(0xc000104300, 0x5653e6643e60, 0xc00008ea20, 0x0, 0xc00011e6f0, 0xc000095bc0, 0x10, 0x10)
/usr/local/go/src/net/tcpsock_posix.go:61 +0xd9
net.(*sysDialer).dialSingle(0xc000104300, 0x5653e6643e60, 0xc0000

Wed Jun 03 2020 09:29:22 GMT-0400 (EDT)

I then see this liveness probe failure about 20 seconds later:

 1 Unhealthy: Liveness probe failed: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "process_linux.go:101: executing setns process caused \"exit status 1\"": unknown

Here are the container details in its current state:

Containers:
  aws-node:
    Container ID:   docker://b1fc0367d03d6172a225b9d34018d8532e49aa2437a4b608a655487519b712e3
    Image:          602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.6.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni@sha256:d50a182475c5ee6c18c3b81b01aa649367f30fb0dc60f7a619dcdbf45e10b3a3
    Port:           61678/TCP
    Host Port:      61678/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 24 Jun 2020 16:52:52 -0400
      Finished:     Wed, 24 Jun 2020 16:53:28 -0400
    Ready:          False
    Restart Count:  5203
    Requests:
      cpu:      10m
    Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
    Environment:
      AWS_VPC_K8S_CNI_RANDOMIZESNAT:  prng
      MINIMUM_IP_TARGET:              15
      WARM_IP_TARGET:                 7
      AWS_VPC_K8S_CNI_LOGLEVEL:       DEBUG
      AWS_VPC_K8S_CNI_VETHPREFIX:     eni
      AWS_VPC_ENI_MTU:                9001
      MY_NODE_NAME:                    (v1:spec.nodeName)
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /host/var/log from log-dir (rw)
      /var/run/docker.sock from dockersock (rw)
      /var/run/dockershim.sock from dockershim (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from aws-node-token-mkxkm (ro)

Other suggested issues:

  1. 1.6.1 couldn't get resource list for custom.metrics.k8s.io/v1beta1 #950
  2. IPAM connectivity failed when upgrading from v1.5.5 to 1.6.0 #872
  3. Couldn't get resource list for external.metrics.k8s.io/v1beta1 #486, specifically #486 (comment) and his follow-up #486 (comment) - Could this really just be IAM? (we use eksctl, so this is hard to believe)
  4. CNI pod crashloopbackoff due to incorrect instance metadata while fetching ENI #1008
@InAnimaTe changed the title from "Timeout waiting for IPAM to start (1.6.1, EKS 1.15)" to "Timeout waiting for IPAM to start causing perpetual node networking failure (1.6.1, EKS 1.15)" on Jun 24, 2020
@InAnimaTe changed the title from "Timeout waiting for IPAM to start causing perpetual node networking failure (1.6.1, EKS 1.15)" to "Timeout waiting for IPAM to start causing perpetual failure (1.6.1, EKS 1.15)" on Jun 24, 2020
@mogren (Contributor) commented Jun 25, 2020

This is curious indeed:

- {"level":"error","ts":"2020-06-04T13:37:58.660Z","caller":"ipamd/ipamd.go:962","msg":"Failed to retrieve interfaces data from instance metadata RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/:  dial tcp 169.254.169.254:80: connect: connection refused"}

Basically, the instance metadata service is refusing the call. It could again be because of throttling, but this time on the instance metadata service rather than the EC2 API. The CNI does not call instance metadata very often, on average around once every 5 seconds. Is there some other service or pod on the node that makes calls a lot more frequently?
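
As a quick sanity check from the affected node (a minimal sketch, not commands taken from this thread), the same metadata path from the error can be probed directly:

# Hit the exact path from the ipamd error above; a healthy IMDS returns the ENI MACs.
curl -sS --max-time 5 http://169.254.169.254/latest/meta-data/network/interfaces/macs/

# Repeat a handful of times to see whether refusals are intermittent (throttling)
# or constant (something blocking 169.254.169.254 on this node).
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 \
    http://169.254.169.254/latest/meta-data/network/interfaces/macs/
  sleep 1
done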

Also:

1 Unhealthy: Readiness probe failed: runtime: failed to create new OS thread (have 8 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

This suggests that the node has a lot of processes running, or for some reason has very low limits set. Is the load on these nodes very high?
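
A rough way to check that on the node itself (a sketch, assuming SSH access to the instance):

# Max user processes for the current shell's user -- the limit the Go runtime hit above.
ulimit -u

# Approximate number of threads currently running on the node.
ps -eLf --no-headers | wc -l

# The limits actually applied to the running ipamd process (aws-k8s-agent), if it is up.
cat /proc/"$(pgrep -f aws-k8s-agent | head -n 1)"/limits | grep -i 'max processes'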

@InAnimaTe (Author)

This is curious indeed:

- {"level":"error","ts":"2020-06-04T13:37:58.660Z","caller":"ipamd/ipamd.go:962","msg":"Failed to retrieve interfaces data from instance metadata RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/:  dial tcp 169.254.169.254:80: connect: connection refused"}

Basically, the instance metadata service is refusing the call. It could again be because of throttling, but this time on the instance metadata service rather than the EC2 API. The CNI does not call instance metadata very often, on average around once every 5 seconds. Is there some other service or pod on the node that makes calls a lot more frequently?

I don't think there's anything else on-node that would be using that metadata service. The DaemonSets we have are nginx, datadog, consul and nodelocaldns, and of course kube-proxy. We do leverage Cluster Autoscaler (a single-instance Deployment) as well, which calls the API around ASGs, but I don't think that should interfere here.
This is also what Support mentioned; they even had me curl the metadata service, which worked perfectly every time...

Also:

1 Unhealthy: Readiness probe failed: runtime: failed to create new OS thread (have 8 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

This suggests that the node has a lot of processes running, or for some reason has very low limits set. Is the load on these nodes very high?

If you feel strongly enough about this being part of the issue, I can do more research on the state of the node when the CNI restarted; it's still in this restarting state. I still have this node available, so please feel free to give me a slew of tests you'd like to see done. I feel like this has to do with some long-term change that's happened somehow. Additionally, the node has been drained and cordoned so only the DaemonSets are running on it, hence less load while the CNI is still in a restarting state.

@ALutchko

Looks like I have a similar issue. In /var/log/aws-routed-eni/ipamd.log I see:

{"level":"error","ts":"2020-07-31T09:47:45.014Z","caller":"ipamd/ipamd.go:322","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-01f111154 eni-0822227]: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 111....111"}

despite having both ec2:DescribeNetworkInterfaces and sts:AssumeRoleWithWebIdentity explicitly attached to the node role. Please note "caused by: [nothing here]".
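
Since that is a WebIdentityErr, the SDK is trying to assume a role via IRSA (a service-account role) rather than the node role, so the node role's permissions may not even be in play. A quick way to inspect that path (a sketch assuming the usual IRSA setup; the pod name is a placeholder):

# Check whether the aws-node service account is annotated with an IAM role for IRSA.
kubectl -n kube-system get sa aws-node -o yaml | grep eks.amazonaws.com/role-arn

# Confirm the web-identity environment variables were injected into the running pod.
kubectl -n kube-system exec aws-node-xxxxx -c aws-node -- \
  env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'

If the annotation is present, an AccessDenied on sts:AssumeRoleWithWebIdentity usually points at that role's trust policy (the OIDC provider and service-account condition) rather than at permissions attached to the node role.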

@SaranBalaji90 (Contributor)

@InAnimaTe Sorry for being late here. If you're still having the issue, we can jump on a call and debug it.

@SaranBalaji90 (Contributor)

@InAnimaTe Did you upload your logs somewhere? I tried to find the link but couldn't find it.

@InAnimaTe (Author)

Thanks for responding @SaranBalaji90 !!

I did not upload logs, just the excerpts I've shared. The logs are in our support tickets, so I could pull them out if need be.

We haven't upgraded, but the issue is much less of a problem now. I would be interested in possibly getting on a call to talk through this; however, I'm not entirely sure it makes sense if newer versions are already known to fix this.

@SaranBalaji90 (Contributor)

Reopened #486 to discuss further on this.

@InAnimaTe (Author)

Reopened #486 to discuss further on this.

Sounds good. I have not yet tested whether this still occurs on newer CNI versions and have no real way to replicate it. It seems to be completely random, and it hasn't happened to us in a little while anyway.

@SaranBalaji90 (Contributor)

Synced up with @InAnimaTe offline too. Just to update everyone here, we haven't made any changes to ipamd to fix this issue. Hopefully driving #486 to completion will help us to find the actual issue here.

@nigelellis

FWIW, I'm running EKS 1.17 and amazon-k8s-cni:v1.6.3. When I attempt to roll my CNI DaemonSet (for example, changing the DaemonSet node tags), I've frequently seen this issue:

Copying portmap binary ... Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

The only solution I've found has been to recycle the node, which makes this super painful.

@SaranBalaji90 (Contributor)

@nigelellis Can you also paste the ipamd logs?

@nigelellis

@SaranBalaji90 how do I get the ipamd logs?

@mogren (Contributor) commented Sep 4, 2020

@nigelellis https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni is the relevant doc.

If you have access to the node, running sudo bash /opt/cni/bin/aws-cni-support.sh should give you an archive with all relevant logs. (Note, you might want to send this through AWS support or email, depending on the size of the log file)
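
For a quick look before collecting the full archive, the ipamd log can also be read directly (a sketch; the pod name is a placeholder, and the paths are the same ones referenced elsewhere in this thread):

# On the node itself:
sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log

# Or through the aws-node pod, since the host's /var/log is mounted at /host/var/log:
kubectl -n kube-system exec -it aws-node-xxxxx -- \
  tail -n 200 /host/var/log/aws-routed-eni/ipamd.log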

@nigelellis commented Sep 4, 2020 via email

@nigelellis commented Sep 4, 2020

I was able to get a repro. The final log line is Getting running pod sandboxes from "unix:///var/run/dockershim.sock":

[ipamd.log]
{"level":"debug","ts":"2020-09-04T00:25:30.883Z","caller":"awsutils/awsutils.go:253","msg":"Found subnet-id: subnet-0b64fcb84bf6c844a "}
{"level":"debug","ts":"2020-09-04T00:25:30.883Z","caller":"awsutils/awsutils.go:253","msg":"Found vpc-ipv4-cidr-block: 10.10.0.0/16 "}
{"level":"debug","ts":"2020-09-04T00:25:30.884Z","caller":"awsutils/awsutils.go:253","msg":"Found VPC CIDR: 10.10.0.0/16"}
{"level":"debug","ts":"2020-09-04T00:25:30.884Z","caller":"ipamd/ipamd.go:308","msg":"Start node init"}
{"level":"info","ts":"2020-09-04T00:25:30.884Z","caller":"wait/wait.go:133","msg":"Will attempt to clean up AWS CNI leaked ENIs after waiting 2m21s."}
{"level":"debug","ts":"2020-09-04T00:25:30.884Z","caller":"awsutils/awsutils.go:948","msg":"Total number of interfaces found: 2 "}
{"level":"debug","ts":"2020-09-04T00:25:30.884Z","caller":"awsutils/awsutils.go:443","msg":"Found ENI MAC address: 0a:ab:ee:cb:ac:d3"}
{"level":"debug","ts":"2020-09-04T00:25:30.885Z","caller":"awsutils/awsutils.go:457","msg":"Using device number 0 for primary eni: eni-00719958e493c64e0"}
{"level":"debug","ts":"2020-09-04T00:25:30.885Z","caller":"awsutils/awsutils.go:443","msg":"Found ENI: eni-00719958e493c64e0, MAC 0a:ab:ee:cb:ac:d3, device 0"}
{"level":"debug","ts":"2020-09-04T00:25:30.886Z","caller":"awsutils/awsutils.go:463","msg":"Found CIDR 10.10.192.0/19 for ENI 0a:ab:ee:cb:ac:d3"}
{"level":"debug","ts":"2020-09-04T00:25:30.886Z","caller":"awsutils/awsutils.go:463","msg":"Found IP addresses [10.10.196.9 10.10.216.43 10.10.197.8 10.10.203.104 10.10.198.142 10.10.209.14 10.10.220.239 10.10.198.111 10.10.211.130 10.10.200.228 10.10.207.120 10.10.209.152 10.10.205.254 10.10.205.191 10.10.192.253] on ENI 0a:ab:ee:cb:ac:d3"}
{"level":"debug","ts":"2020-09-04T00:25:30.886Z","caller":"awsutils/awsutils.go:443","msg":"Found ENI MAC address: 0a:f0:80:9d:5c:8d"}
{"level":"debug","ts":"2020-09-04T00:25:30.887Z","caller":"awsutils/awsutils.go:443","msg":"Found ENI: eni-09d5b0a5b7cd21df3, MAC 0a:f0:80:9d:5c:8d, device 2"}
{"level":"debug","ts":"2020-09-04T00:25:30.888Z","caller":"awsutils/awsutils.go:463","msg":"Found CIDR 10.10.192.0/19 for ENI 0a:f0:80:9d:5c:8d"}
{"level":"debug","ts":"2020-09-04T00:25:30.888Z","caller":"awsutils/awsutils.go:463","msg":"Found IP addresses [10.10.222.18 10.10.218.168 10.10.195.238 10.10.218.12 10.10.203.45 10.10.215.99 10.10.192.96 10.10.209.192 10.10.199.128 10.10.221.199 10.10.217.229 10.10.210.158 10.10.205.94 10.10.214.159 10.10.198.16] on ENI 0a:f0:80:9d:5c:8d"}
{"level":"debug","ts":"2020-09-04T00:25:31.116Z","caller":"ipamd/ipamd.go:308","msg":"DescribeAllENIs success: ENIs: 2, tagged: 1"}
{"level":"info","ts":"2020-09-04T00:25:31.116Z","caller":"ipamd/ipamd.go:354","msg":"Setting up host network... "}
{"level":"debug","ts":"2020-09-04T00:25:31.116Z","caller":"networkutils/network.go:236","msg":"Trying to find primary interface that has mac : 0a:ab:ee:cb:ac:d3"}
{"level":"debug","ts":"2020-09-04T00:25:31.116Z","caller":"networkutils/network.go:236","msg":"Discovered interface: lo, mac: "}
{"level":"debug","ts":"2020-09-04T00:25:31.117Z","caller":"networkutils/network.go:236","msg":"Discovered interface: eth0, mac: 0a:ab:ee:cb:ac:d3"}
{"level":"info","ts":"2020-09-04T00:25:31.117Z","caller":"networkutils/network.go:236","msg":"Discovered primary interface: eth0"}
{"level":"debug","ts":"2020-09-04T00:25:31.117Z","caller":"ipamd/ipamd.go:354","msg":"Setting RPF for primary interface: net/ipv4/conf/eth0/rp_filter"}
{"level":"debug","ts":"2020-09-04T00:25:31.117Z","caller":"networkutils/network.go:263","msg":"Found the Link that uses mac address 0a:ab:ee:cb:ac:d3 and its index is 2 (attempt 1/5)"}
{"level":"debug","ts":"2020-09-04T00:25:31.118Z","caller":"networkutils/network.go:314","msg":"Setup Host Network: loading existing iptables nat SNAT exclusion rules"}
{"level":"debug","ts":"2020-09-04T00:25:31.125Z","caller":"networkutils/network.go:314","msg":"host network setup: found potentially stale SNAT rule for chain AWS-SNAT-CHAIN-0: [-N AWS-SNAT-CHAIN-0]"}
{"level":"debug","ts":"2020-09-04T00:25:31.125Z","caller":"networkutils/network.go:314","msg":"host network setup: found potentially stale SNAT rule for chain AWS-SNAT-CHAIN-0: [-A AWS-SNAT-CHAIN-0 ! -d 10.10.0.0/16 -m comment --comment AWS SNAT CHAIN -j AWS-SNAT-CHAIN-1]"}
{"level":"debug","ts":"2020-09-04T00:25:31.127Z","caller":"networkutils/network.go:314","msg":"host network setup: found potentially stale SNAT rule for chain AWS-SNAT-CHAIN-1: [-N AWS-SNAT-CHAIN-1]"}
{"level":"debug","ts":"2020-09-04T00:25:31.127Z","caller":"networkutils/network.go:314","msg":"host network setup: found potentially stale SNAT rule for chain AWS-SNAT-CHAIN-1: [-A AWS-SNAT-CHAIN-1 -m comment --comment AWS, SNAT -m addrtype ! --dst-type LOCAL -j SNAT --to-source 10.10.196.9 --random]"}
{"level":"debug","ts":"2020-09-04T00:25:31.127Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: iptables -N AWS-SNAT-CHAIN-0 -t nat"}
{"level":"debug","ts":"2020-09-04T00:25:31.129Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: iptables -N AWS-SNAT-CHAIN-1 -t nat"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: iptables -A POSTROUTING -m comment --comment \"AWS SNAT CHAIN\" -j AWS-SNAT-CHAIN-0"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: iptables -A AWS-SNAT-CHAIN-0 ! -d {10.10.0.0/16 %!s(bool=false)} -t nat -j AWS-SNAT-CHAIN-1"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: synchronising SNAT stale rules"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: stale rule found: nat/AWS-SNAT-CHAIN-0 rule [0] AWS-SNAT-CHAIN-0"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: active rule found: nat/AWS-SNAT-CHAIN-0 rule [1] AWS-SNAT-CHAIN-0"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: stale rule found: nat/AWS-SNAT-CHAIN-1 rule [0] AWS-SNAT-CHAIN-1"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: active rule found: nat/AWS-SNAT-CHAIN-1 rule [1] AWS-SNAT-CHAIN-1"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"iptableRules: [nat/POSTROUTING rule first SNAT rules for non-VPC outbound traffic nat/AWS-SNAT-CHAIN-0 rule [0] AWS-SNAT-CHAIN nat/AWS-SNAT-CHAIN-1 rule last SNAT rule for non-VPC outbound traffic nat/AWS-SNAT-CHAIN-0 rule [0] AWS-SNAT-CHAIN-0 nat/AWS-SNAT-CHAIN-1 rule [0] AWS-SNAT-CHAIN-1]"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : first SNAT rules for non-VPC outbound traffic"}
{"level":"debug","ts":"2020-09-04T00:25:31.132Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : [0] AWS-SNAT-CHAIN"}
{"level":"debug","ts":"2020-09-04T00:25:31.133Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : last SNAT rule for non-VPC outbound traffic"}
{"level":"debug","ts":"2020-09-04T00:25:31.135Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : [0] AWS-SNAT-CHAIN-0"}
{"level":"debug","ts":"2020-09-04T00:25:31.136Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : [0] AWS-SNAT-CHAIN-1"}
{"level":"debug","ts":"2020-09-04T00:25:31.137Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : connmark for primary ENI"}
{"level":"debug","ts":"2020-09-04T00:25:31.138Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : connmark restore for primary ENI"}
{"level":"debug","ts":"2020-09-04T00:25:31.139Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : rule for primary address 10.10.196.9"}
{"level":"debug","ts":"2020-09-04T00:25:31.140Z","caller":"ipamd/ipamd.go:308","msg":"Discovered ENI eni-00719958e493c64e0, trying to set it up"}
{"level":"debug","ts":"2020-09-04T00:25:31.140Z","caller":"ipamd/ipamd.go:742","msg":"DataStore Add an ENI eni-00719958e493c64e0"}
{"level":"info","ts":"2020-09-04T00:25:31.140Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.216.43 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.197.8 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.203.104 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.198.142 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.209.14 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.220.239 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.198.111 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.211.130 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.200.228 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.207.120 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.209.152 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.205.254 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.205.191 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.192.253 to datastore"}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:756","msg":"IP Address Pool stats: total: 14, assigned: 0"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:308","msg":"ENI eni-00719958e493c64e0 set up."}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:308","msg":"Discovered ENI eni-09d5b0a5b7cd21df3, trying to set it up"}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:742","msg":"DataStore Add an ENI eni-09d5b0a5b7cd21df3"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"networkutils/network.go:679","msg":"Setting up network for an ENI with IP address 10.10.222.18, MAC address 0a:f0:80:9d:5c:8d, CIDR 10.10.192.0/19 and route table 2"}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"networkutils/network.go:692","msg":"Found the Link that uses mac address 0a:f0:80:9d:5c:8d and its index is 4 (attempt 1/5)"}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"networkutils/network.go:679","msg":"Setting up ENI's primary IP 10.10.222.18"}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"networkutils/network.go:679","msg":"Deleting existing IP address 10.10.222.18/19 eth1"}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"networkutils/network.go:679","msg":"Adding IP address 10.10.222.18/19"}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"networkutils/network.go:679","msg":"Setting up ENI's default gateway 10.10.192.1"}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"retry/retry.go:69","msg":"Successfully added/replaced route to be 10.10.192.1/0"}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"retry/retry.go:69","msg":"Successfully added/replaced route to be 0.0.0.0/0"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.218.168 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.195.238 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.218.12 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.203.45 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.215.99 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.192.96 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.209.192 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.199.128 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.221.199 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.217.229 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.210.158 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.205.94 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.214.159 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.198.16 to datastore"}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:756","msg":"IP Address Pool stats: total: 28, assigned: 0"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:308","msg":"ENI eni-09d5b0a5b7cd21df3 set up."}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:439","msg":"Getting running pod sandboxes from \"unix:///var/run/dockershim.sock\""}

The pod logs returned:

Copying portmap binary ... Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

I was tailing the ipamd log, and the container exited with:

2020-09-03 17:26:06,800 - ERROR: command terminated with exit code 137

Here's the podspec:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/limit-ranger: 'LimitRanger plugin set: memory request for container
      aws-node; cpu, memory limit for container aws-node'
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2020-09-02T20:54:17Z"
  generateName: aws-node-
  labels:
    controller-revision-hash: 88769d4f9
    k8s-app: aws-node
    pod-template-generation: "9"
  name: aws-node-569fm
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: aws-node
    uid: 87a32608-ed1a-11e8-8d8d-026b59889896
  resourceVersion: "183343173"
  selfLink: /api/v1/namespaces/kube-system/pods/aws-node-569fm
  uid: 5a4e21da-b7dd-4f0f-a5fa-2c3039a31a04
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - ip-10-10-196-9.us-west-2.compute.internal
  automountServiceAccountToken: true
  containers:
  - env:
    - name: AWS_VPC_K8S_CNI_LOGLEVEL
      value: DEBUG
    - name: MY_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.4
    imagePullPolicy: Always
    name: aws-node
    ports:
    - containerPort: 61678
      hostPort: 61678
      name: metrics
      protocol: TCP
    resources:
      limits:
        cpu: 300m
        memory: 300Mi
      requests:
        cpu: 10m
        memory: 300Mi
    securityContext:
      allowPrivilegeEscalation: true
      privileged: true
      readOnlyRootFilesystem: false
      runAsGroup: 0
      runAsNonRoot: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host/opt/cni/bin
      mountPropagation: None
      name: cni-bin-dir
    - mountPath: /host/etc/cni/net.d
      mountPropagation: None
      name: cni-net-dir
    - mountPath: /host/var/log
      mountPropagation: None
      name: log-dir
    - mountPath: /var/run/docker.sock
      mountPropagation: None
      name: dockersock
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: aws-node-token-vw5dm
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  nodeName: ip-10-10-196-9.us-west-2.compute.internal
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: aws-node
  serviceAccountName: aws-node
  shareProcessNamespace: false
  terminationGracePeriodSeconds: 30
  tolerations:
  - operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /opt/cni/bin
      type: ""
    name: cni-bin-dir
  - hostPath:
      path: /etc/cni/net.d
      type: ""
    name: cni-net-dir
  - hostPath:
      path: /var/log
      type: ""
    name: log-dir
  - hostPath:
      path: /var/run/docker.sock
      type: ""
    name: dockersock
  - name: aws-node-token-vw5dm
    secret:
      defaultMode: 420
      secretName: aws-node-token-vw5dm
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-09-02T20:54:17Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-09-04T00:31:35Z"
    message: 'containers with unready status: [aws-node]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-09-04T00:31:35Z"
    message: 'containers with unready status: [aws-node]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-09-02T20:54:17Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://3fd7469cd1288c1ca964f50730d8d489692c046094d2f05c81615f07de4caec7
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.4
    imageID: docker-pullable://602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni@sha256:e01133675b0ac9857392593aaaf62e56dd8a28c1bf6a23eac34cd577a9c2da20
    lastState:
      terminated:
        containerID: docker://3fd7469cd1288c1ca964f50730d8d489692c046094d2f05c81615f07de4caec7
        exitCode: 1
        finishedAt: "2020-09-04T00:31:35Z"
        reason: Error
        startedAt: "2020-09-04T00:30:59Z"
    name: aws-node
    ready: false
    restartCount: 7
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=aws-node pod=aws-node-569fm_kube-system(5a4e21da-b7dd-4f0f-a5fa-2c3039a31a04)
        reason: CrashLoopBackOff
  hostIP: 10.10.196.9
  phase: Running
  podIP: 10.10.196.9
  podIPs:
  - ip: 10.10.196.9
  qosClass: Burstable
  startTime: "2020-09-02T20:54:17Z"

To reproduce this issue:

  1. deployed a daemonset running amazon-k8s-cni:v1.6.4 and created a fresh node in the ASG
  2. picked the node and edited the aws-node pod spec via kubectl edit pod aws-node-xyz, bumping the image tag version from amazon-k8s-cni:v1.6.3 to amazon-k8s-cni:v1.6.4 (a DaemonSet-wide equivalent is sketched below). FWIW I don't think the version makes any difference
  3. ran log tail:
kubectl -n kube-system exec -it aws-node-XXX -- tail -f /host/var/log/aws-routed-eni/ipamd.log | tee ipamd.log

I'm running EKS kubelet v1.17.9-eks-4c6976
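
A DaemonSet-wide equivalent of step 2 (a sketch of the general approach, not the exact command used for this repro):

# Roll every aws-node pod to the new image instead of editing a single pod in place.
kubectl -n kube-system set image daemonset/aws-node \
  aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.4

# Watch the rollout progress.
kubectl -n kube-system rollout status daemonset/aws-node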

@mogren (Contributor) commented Sep 4, 2020

Thanks a lot @nigelellis, this should help us reproduce the issue.

Is this the final line in the logs?

{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:439","msg":"Getting running pod sandboxes from \"unix:///var/run/dockershim.sock\""}

So reading from the docker shim is hanging and locking up ipamd, is that correct? What does docker version say?

@nigelellis commented Sep 4, 2020

@mogren yes, that's the last line in the log before the pod terminates with exit code 137. My read is that it dies while reading from Docker.

I'm running the stock AWS image amazon-k8s-cni:v1.6.4. The kubelet is running EKS v1.17.9-eks-4c6976.

Looks like it's running Docker docker://19.3.6 (found via kubectl get node -o yaml | grep containerRuntimeVersion).

Does that help?

@nigelellis commented Sep 10, 2020

@mogren any idea on when this issue might be resolved? LMK if I can help with more logs, etc. Thanks.

@mogren (Contributor) commented Sep 13, 2020

@nigelellis Hey, sorry for the late reply on this. In the pod spec you pasted, I don't see unix:///var/run/dockershim.sock being mounted into the aws-node pod. The docker.sock is a Docker-specific API that was only used by CNI v1.5.x and earlier.

The missing lines are aws-k8s-cni.yaml#L141-L142 and aws-k8s-cni.yaml#L156-L158.

This was mentioned in the release notes for v1.6.0 and the next three releases, for anyone updating from earlier versions. Sorry for not noticing this right away. Could you please try adding the /var/run/dockershim.sock mount?
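
For anyone hitting the same gap, one way to add the missing pieces without re-applying the full manifest is to patch the DaemonSet (a sketch assuming the stock aws-node spec; the canonical lines are in the aws-k8s-cni.yaml referenced above):

# Add the dockershim.sock hostPath volume and mount it into the aws-node container.
kubectl -n kube-system patch daemonset aws-node --patch '
spec:
  template:
    spec:
      containers:
      - name: aws-node
        volumeMounts:
        - name: dockershim
          mountPath: /var/run/dockershim.sock
      volumes:
      - name: dockershim
        hostPath:
          path: /var/run/dockershim.sock
'

Since this changes the pod template, the DaemonSet rolls its pods, and the restarted containers should then be able to reach the dockershim socket during init.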

@nigelellis commented Sep 13, 2020 via email

@nigelellis

@mogren - I can confirm that updating the daemonset to mount dockershim.sock addressed the crashlooping issue we experienced when rolling the aws-node pods. I was able to bump to 1.6.4 without any problems. Thank you for the help, very much appreciated -- a good lesson for me to validate all config/manifest changes when bumping versions.

@mogren (Contributor) commented Sep 14, 2020

Thanks a lot for confirming @nigelellis!

There will be a v1.7.2 out very soon with a lot of fixes. The v1.7.x versions now have an init container as well, so the config has changed a lot since v1.6.x.

@nigelellis commented Sep 14, 2020

@mogren will 1.7 be recommended for EKS 1.17? I still see 1.6.3 on the EKS upgrade guide.

@mogren (Contributor) commented Sep 14, 2020

@nigelellis Yes, as soon as we make it the default for new clusters. 🙂

@mogren (Contributor) commented Sep 16, 2020

@InAnimaTe Are you still seeing this issue? Could you please try using v1.7.3?

@InAnimaTe (Author)

@InAnimaTe Are you still seeing this issue? Could you please try using v1.7.3?

Hey @mogren, we have not seen this anymore in the past couple of months in any of our environments. We are planning to upgrade to 1.7.2/3 in the next few weeks, so I'll be looking out for these issues, but please don't keep this open on my account. I can reopen it if I see clear evidence of this happening again.

Thanks!

@mogren (Contributor) commented Sep 23, 2020

Great, thanks a lot @InAnimaTe, and please open a new ticket if you see any issues!
