ENI warming is delayed for at least 1 minute, probably caused by #480 #525

Closed
uthark opened this issue Jul 10, 2019 · 28 comments

@uthark
Contributor

uthark commented Jul 10, 2019

We also found that the #480 changes cause the following interesting behavior:

  1. A new node joins the cluster.
  2. The API server assigns new pods to it (particularly from system daemonsets, like aws-cni itself, fluentd, kube-node-monitor and some others that have tolerations).
  3. The aws-cni plugin starts and immediately discovers that there are pods without an IP address. It then waits 12 * 5 s = 1 minute before trying to assign IP addresses to these pods.
  4. It looks like ENI warming starts only after this timeout, at least according to the logs we got.

Relevant log entries from IPAMD:

2019-07-10T20:30:30Z [INFO]     Starting L-IPAMD v1.5.0  ...
...
2019-07-10T20:30:30Z [DEBUG]    Using WARM_ENI_TARGET 4
...
2019-07-10T20:31:32Z [INFO]     Serving metrics on port 61678
2019-07-10T20:31:32Z [INFO]     Serving introspection endpoints on 127.0.0.1:61679
...
2019-07-10T20:31:34Z [INFO]     Received DelNetwork for IP <nil>, Pod kube-node-monitor-ckbsh, Namespace default, Container 9a8410ed9f541e1961f47b5c797fe10cd3d006c09301ba88ccb3c389c51a6017
2019-07-10T20:31:34Z [INFO]     Received AddNetwork for NS /proc/38956/ns/net, Pod kube-node-monitor-ckbsh, NameSpace default, Container 35265389458c297371c825f2ae3f5277583c9c5ab2acebc599ba02a3bb3b6920, ifname eth0
2019-07-10T20:31:34Z [DEBUG]    AssignIPv4Address: IP address pool stats: total: 0, assigned 0
2019-07-10T20:31:34Z [DEBUG]    AssignPodIPv4Address: Skip ENI eni-036df2e82333ebf90 that does not have available addresses
2019-07-10T20:31:34Z [ERROR]    DataStore has no available IP addresses
...
2019-07-10T20:31:35Z [DEBUG]    IP pool stats: total = 0, used = 0, c.maxIPsPerENI = 29
2019-07-10T20:31:35Z [DEBUG]    IP pool is too low: available (0) < ENI target (4) * addrsPerENI (29)
2019-07-10T20:31:35Z [DEBUG]    Starting to increase IP pool size
2019-07-10T20:31:35Z [DEBUG]    Skip the primary ENI for need IP check
...
2019-07-10T20:31:35Z [INFO]     Created a new ENI: eni-0091e4a9e7fa4ecb3

Also, during this 1 minute, all gRPC requests failed (we see errors like "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused" during startup).
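
For context, the WARM_ENI_TARGET value in the log above is set through an environment variable on the aws-node DaemonSet. A minimal sketch of the relevant container fragment (the image reference is a placeholder):

# Fragment of the aws-node DaemonSet pod spec. WARM_ENI_TARGET=4 asks ipamd
# to keep four ENIs' worth of secondary IPs warmed ahead of pod demand.
containers:
  - name: aws-node
    image: <account>.dkr.ecr.<region>.amazonaws.com/amazon-k8s-cni:v1.5.0
    env:
      - name: WARM_ENI_TARGET
        value: "4"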

@uthark
Contributor Author

uthark commented Jul 10, 2019

Also, related to #330

@uthark changed the title from "ENI warming is delayed for at least for 1 minutes, probably caused by #480" to "ENI warming is delayed for at least for 1 minute, probably caused by #480" on Jul 11, 2019
@uthark
Contributor Author

uthark commented Jul 11, 2019

Another issue caused by the long startup is that all pods will fail with FailedCreatePodSandBox:

Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "08929bf2e0cd09f9e163925d0bd7de8bc04c01311dc65202f2400ed7665aa771" network for pod "app-server-7c845659bf-dsh52": NetworkPlugin cni failed to set up pod "app-server-7c845659bf-dsh52" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "08929bf2e0cd09f9e163925d0bd7de8bc04c01311dc65202f2400ed7665aa771" network for pod "classic-app-server-7c845659bf-dsh52": NetworkPlugin cni failed to teardown pod "app-server-7c845659bf-dsh52" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]

@mogren mogren self-assigned this Jul 13, 2019
@mogren
Contributor

mogren commented Jul 13, 2019

Thanks @uthark, this is a good find and something we have to investigate.

@MrSaints

MrSaints commented Jul 15, 2019

Setting restartPolicy: OnFailure gets around this for now, though not ideal for one-off jobs where a failure may be desired / intentional.
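
For illustration, a minimal sketch of where that setting sits in a one-off Job spec (names and image are placeholders):

# Hypothetical one-off Job. restartPolicy: OnFailure lets the kubelet retry
# the pod after the early sandbox/CNI failures instead of leaving it failed.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job            # placeholder
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker         # placeholder
          image: alpine:3.10
          command: ["sh", "-c", "echo done"]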

@uthark
Contributor Author

uthark commented Jul 15, 2019

We set restartPolicy to Always, so we see a lot of FailedCreatePodSandbox events from the kubelet during the first minute.

@uthark
Contributor Author

uthark commented Jul 15, 2019

@mogren Why does the CNI plugin wait for 1 minute?
From #480 description:

If ipamd gets restarted while a pod is registering it's newly assigned IP with the apiserver, we might miss the assigned IP because of the eventual consistency of Kubernetes.

As of 1.9, Kubernetes uses quorum reads from etcd (kubernetes/kubernetes#53717), so why wait?

I'd like to submit a PR to make the number of retries configurable; would you accept such a PR?

@mogren
Contributor

mogren commented Jul 17, 2019

@uthark The issue we saw with v1.4.1 back in May was this: a pod, demo-pod, gets added to the node, ipamd assigns it an IP, let's say 10.1.1.1, and just at that time the CNI gets killed. When a new ipamd comes up and tries to re-sync its internal data store by comparing the IPs of the current pods on the node with the IPs assigned to the attached ENIs, we noticed multiple times that the most recently added pod, demo-pod in this case, did not have any IP. That in turn could lead to ipamd unassigning that IP from the node and returning it to the EC2 control plane, which then assigned the IP to another node, which gave it to another pod. Now you have a state where two pods in the same cluster have the IP 10.1.1.1 assigned. If you are not using WARM_IP_TARGET, this should not happen nearly as often, but it is still possible, since a full ENI could be detached even though a pod is using an IP that belonged to it.

I guess the root cause of this issue is that we make the node available for scheduling pods before the CNI is actually ready.

@uthark
Contributor Author

uthark commented Jul 17, 2019

Yes, what we ended up doing is using lifecycle hooks to taint/untaint the node.
Updating the startup script to write the CNI config for the kubelet only after the CNI is actually ready would help solve the issue during node boot. It will not solve the case when the CNI plugin restarts, though. I can submit a PR with taints using lifecycle hooks, if you're interested.
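
A rough sketch of that approach, assuming a custom aws-node image with kubectl baked in; the taint key is illustrative, and a real postStart script would likely wait until ipamd is actually serving before removing the taint:

# Fragment of a patched aws-node DaemonSet (illustrative, not the upstream manifest).
# postStart removes the startup taint once the CNI container is up;
# preStop re-adds it so new pods are not scheduled while the CNI is down.
containers:
  - name: aws-node
    image: <my-custom-repo>/amazon-vpc-cni-k8s:v1.5.0   # custom image with kubectl
    env:
      - name: MY_NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c",
            "kubectl taint nodes \"$MY_NODE_NAME\" example.com/cni-not-ready:NoSchedule- || true"]
      preStop:
        exec:
          command: ["/bin/sh", "-c",
            "kubectl taint nodes \"$MY_NODE_NAME\" example.com/cni-not-ready=true:NoSchedule --overwrite"]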

@uthark
Contributor Author

uthark commented Jul 17, 2019

Also, a question: what if the CNI plugin stored the used IP addresses in some kind of CRD, or even as a tag on the ENI?
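
Purely as a hypothetical illustration of the CRD idea (nothing like this exists in the project today; the group, kind, and fields are made up), it could look roughly like:

# Hypothetical custom resource recording which pod holds which secondary IP on
# an ENI, so a restarted ipamd could rebuild its data store without guessing.
apiVersion: cni.example.com/v1alpha1   # made-up group/version
kind: ENIAssignment                    # made-up kind
metadata:
  name: eni-036df2e82333ebf90          # ENI ID from the log above
  namespace: kube-system
spec:
  eniID: eni-036df2e82333ebf90
  assignments:
    - ip: 10.1.1.1                     # example IP from the earlier comment
      podName: demo-pod                # example pod from the earlier comment
      podNamespace: default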

@mogren mogren added this to the v1.6 milestone Jul 18, 2019
@t0ny-peng

This is affecting our production now. The problem can be traced back to #282.

@sc250024

What @left4taco said. This needs to be fixed!

@dcherniv

Same here, @sc250024 and @left4taco.
In general, the state of affairs with EKS is rather appalling compared to GKE.

@uthark
Contributor Author

uthark commented Jul 23, 2019

@sc250024

sc250024 commented Jul 23, 2019

Thanks @uthark! Are we supposed to edit the VPC CNI DaemonSet with that additional script?

The code of a multi-billion dollar cloud provider, in the end saved by a shell script 🙄

@uthark
Contributor Author

uthark commented Jul 23, 2019

We build our own image that has kubectl and the hook in it, and we update the DaemonSet manifest to use the hook.

@sc250024

sc250024 commented Jul 24, 2019

@uthark I tried what you described, but a node doesn't have permission to taint itself in a default AWS EKS setup. I'm guessing you also gave the nodes extra permissions in the aws-auth ConfigMap?

Events:
  Type     Reason               Age                From                                                 Message
  ----     ------               ----               ----                                                 -------
  Normal   Scheduled            60s                default-scheduler                                    Successfully assigned kube-system/aws-node-f9xk5 to <internal-node-ip>.eu-west-1.compute.internal
  Normal   Pulling              27s (x2 over 59s)  kubelet, <internal-node-ip>.eu-west-1.compute.internal  pulling image "<my-custom-repo>/amazon-vpc-cni-k8s:v1.5.0"
  Normal   Killing              27s                kubelet, <internal-node-ip>.eu-west-1.compute.internal  Killing container with id docker://aws-node:FailedPostStartHook
  Normal   Pulled               26s (x2 over 58s)  kubelet, <internal-node-ip>.eu-west-1.compute.internal  Successfully pulled image "<my-custom-repo>/amazon-vpc-cni-k8s:v1.5.0"
  Normal   Created              26s (x2 over 58s)  kubelet, <internal-node-ip>.eu-west-1.compute.internal  Created container
  Normal   Started              26s (x2 over 58s)  kubelet, <internal-node-ip>.eu-west-1.compute.internal  Started container
  Warning  FailedPostStartHook  25s (x2 over 58s)  kubelet, <internal-node-ip>.eu-west-1.compute.internal  Exec lifecycle hook ([/bin/sh -c /usr/bin/lifecycle_hook.sh postStart]) for Container "aws-node" in Pod "aws-node-f9xk5_kube-system(639cd5e3-ae02-11e9-b573-06b8a900565e)" failed - error: command '/bin/sh -c /usr/bin/lifecycle_hook.sh postStart' exited with 1: Error from server (Forbidden): nodes "<internal-node-ip>.eu-west-1.compute.internal" is forbidden: User "system:serviceaccount:kube-system:aws-node" cannot patch resource "nodes" in API group "" at the cluster scope
, message: "Error from server (Forbidden): nodes \"<internal-node-ip>.eu-west-1.compute.internal\" is forbidden: User \"system:serviceaccount:kube-system:aws-node\" cannot patch resource \"nodes\" in API group \"\" at the cluster scope\n"
  Warning  FailedPreStopHook  25s (x2 over 58s)  kubelet, <internal-node-ip>.eu-west-1.compute.internal  Exec lifecycle hook ([/bin/sh -c /usr/bin/lifecycle_hook.sh preStop]) for Container "aws-node" in Pod "aws-node-f9xk5_kube-system(639cd5e3-ae02-11e9-b573-06b8a900565e)" failed - error: command '/bin/sh -c /usr/bin/lifecycle_hook.sh preStop' exited with 1: Error from server (Forbidden): nodes "<internal-node-ip>.eu-west-1.compute.internal" is forbidden: User "system:serviceaccount:kube-system:aws-node" cannot patch resource "nodes" in API group "" at the cluster scope
, message: "Error from server (Forbidden): nodes \"<internal-node-ip>.eu-west-1.compute.internal\" is forbidden: User \"system:serviceaccount:kube-system:aws-node\" cannot patch resource \"nodes\" in API group \"\" at the cluster scope\n"
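
The Forbidden errors above indicate that the aws-node service account is not allowed to patch Node objects. A minimal RBAC sketch that would grant that for the kubectl-based hook approach (the ClusterRole name is illustrative; whether to widen aws-node's permissions this way is a judgment call):

# Illustrative ClusterRole/ClusterRoleBinding letting the aws-node service
# account patch nodes, which `kubectl taint` needs from the lifecycle hook.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aws-node-taint-patcher       # illustrative name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aws-node-taint-patcher
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: aws-node-taint-patcher
subjects:
  - kind: ServiceAccount
    name: aws-node
    namespace: kube-system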

@sc250024

@uthark I actually got your script to work, but it doesn't work 100% of the time. A bit of a bummer, really.

@uthark
Contributor Author

uthark commented Jul 28, 2019 via email

@sc250024

sc250024 commented Jul 28, 2019

EDIT 2019-07-29: @uthark I actually saw why this happens. In the case of our GitLab runners, the images we use for CI are very small (usually Alpine Linux). Sometimes Kubernetes downloads the image for a particular GitLab job before the CNI image has been downloaded, and then tries to run that CI job (i.e. the GitLab CI job is in the ContainerCreating state) before the CNI pod has a chance to taint the node. It doesn't happen often, but it does happen enough.

@uthark As far as I can tell, it's unpredictable. But the error message is always the same:

Warning FailedCreatePodSandBox 25s kubelet, .eu-west-1.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "" network for pod "runner-7dpmmrh-project--concurrent-05zx5f": NetworkPlugin cni failed to set up pod "runner-7dpmmrh-project--concurrent-05zx5f_gitlab-runner" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "" network for pod "runner-7dpmmrh-project--concurrent-05zx5f": NetworkPlugin cni failed to teardown pod "runner-7dpmmrh-project--concurrent-05zx5f_gitlab-runner" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]

Same error as you encountered above. Sometimes it works, and sometimes it doesn't. We use this primarily to scale up / scale down GitLab runner nodes.

@robertsheehy-wf

We are experiencing this exact issue in our production EKS environment.

What seems to happen is that ipamd gets stuck in a loop by the logic introduced in PR #480.

@mogren What's supposed to be happening here? In our scenario the pods never receive their pod IP, so ipamd never finishes starting up, resulting in TCP connection errors from our CNI.

@mogren
Contributor

mogren commented Jul 30, 2019

Hi @robertsheehy-wf,

I explained some of the background for this change in an earlier comment. That said, it should only retry for at most 60 seconds, then continue and assume that the pods don't have an IP. Why this doesn't happen in all cases is something we still have to investigate.

@robertsheehy-wf

robertsheehy-wf commented Jul 30, 2019

Thanks, sorry, I should have read more closely.

Granted, my knowledge is limited, but is there maybe a better strategy than waiting 60 seconds?

Can we look at the pods more closely to understand why a pod doesn't have an IP? Or maybe skip the wait loop if we know the CNI has never been run on the node (since this is a problem we see on node startup)?

@acesir

acesir commented Jul 31, 2019

This is currently affecting us in all our environments and has completely broken our autoscaling in production. We schedule dynamic numbers of batch jobs in Kubernetes and autoscale our nodes based on taints/labels/requests. When the nodes autoscale, 99% of the pods fail and get stuck with sandbox creation failures / CNI errors.

mogren pushed commits to mogren/amazon-vpc-cni-k8s that referenced this issue Jul 31, 2019
@robertsheehy-wf

We’ve found that this is a problem when starting new nodes in a cluster. What happens is that a pod gets scheduled at the same time as the CNI on a new node, and this is where the CNI seems to falter. If the CNI is scheduled concurrently with another pod, it sometimes gets into a state where it is unsure how to handle the pod, and the pod never receives an IP. This is then exacerbated by the fact that the CNI waits 60 seconds for the eventual consistency of Kubernetes. In this situation the CNI is blocked for 60 seconds while nothing happens, because in reality the pod was never assigned an IP by anyone.

This effectively means the node is unusable for 60 seconds. This is a huge problem if you are cycling nodes in huge batches.

What we’ve concluded is the CNI should always come first on a host before any other pods. We had assumed the CNI could gracefully handle this scenario.

To resolve this we looked at managing taints/tolerations like @uthark had done. The one hangup we had with his approach was that we didn't want to manage a fork of this repo.

This is where we realized that several of our daemonsets had tolerations which tolerated all taints:

      tolerations:
        - operator: "Exists"

We had done this for our own reasons, but it meant that for every node we brought up there was a high probability we would hit this issue, because our daemonset pods were being stood up concurrently with the CNI.

Upon further investigation we found that EKS already had a taint we could use on new nodes: the node.kubernetes.io/not-ready:NoSchedule taint. It is always removed after the CNI has started.

After fixing our daemonset tolerations we hit this issue a lot less often, since the CNI now had time to start before other pods were scheduled on the node.
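
As an illustration, a toleration block that tolerates only the specific taints a daemonset actually needs, instead of everything, so the pod still respects node.kubernetes.io/not-ready:NoSchedule and waits for the CNI (the taint keys below are just examples):

# Tolerate only specific, known taints instead of `operator: Exists` for everything;
# the pod then respects node.kubernetes.io/not-ready:NoSchedule on new nodes.
tolerations:
  - key: "node-role.kubernetes.io/master"   # example: still run on masters
    operator: "Exists"
    effect: "NoSchedule"
  - key: "dedicated"                        # example custom taint
    operator: "Equal"
    value: "logging"
    effect: "NoSchedule"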

@uthark
Contributor Author

uthark commented Jul 31, 2019

@robertsheehy-wf Yeah, we did the exact same thing.

mogren pushed a commit that referenced this issue Aug 1, 2019
@MrSaints

MrSaints commented Aug 1, 2019

The linked commit appears to have alleviated the problem, at least for us. We don't see our pods getting stuck.

@acesir

acesir commented Aug 2, 2019

The latest release, 1.5.1 RC1, solved this issue for us: autoscaled nodes no longer get pods stuck with sandbox errors. The linked commit plus #548 were the two patches we primarily needed.

mogren pushed commits to mogren/amazon-vpc-cni-k8s that referenced this issue Aug 2, 2019
mogren pushed a commit that referenced this issue Aug 2, 2019
@mogren
Contributor

mogren commented Aug 22, 2019

Reduced the wait time and added a fix to not retry too long for force-detached ENIs.

@mogren mogren closed this as completed Aug 22, 2019