
Cannot scale new nodes with CNI version 1.5.3 #625

Closed
cgkades opened this issue Sep 25, 2019 · 17 comments
Labels
bug priority/P1 Must be staffed and worked currently or soon. Is a candidate for next release

Comments

@cgkades

cgkades commented Sep 25, 2019

When trying to scale new nodes after creating an EKS cluster, the new nodes will not leave the NotReady state. The aws-node pod is in a CrashLoopBackOff state.

This seems to be caused by the file 10-aws.conflist not being placed on the node. It looks like the install script had the line that copied the file removed, and that logic was moved into main.go. The problem is that execution never reaches the code that copies the file.

bash-4.2# cat /host/var/log/aws-routed-eni/ipamd.log.2019-09-25-20
2019-09-25T20:01:32.800Z [INFO] Starting L-IPAMD v1.5.3  ...
2019-09-25T20:02:02.803Z [INFO] Testing communication with server
2019-09-25T20:02:32.803Z [INFO] Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
2019-09-25T20:02:32.803Z [ERROR]        Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout
2019-09-25T20:02:33.901Z [INFO] Starting L-IPAMD v1.5.3  ...
2019-09-25T20:03:03.903Z [INFO] Testing communication with server

I apologize, but I wasn't able to dig in far enough to know why it needs 10-aws.conflist in /etc/cni/net.d first, but that does seem to be the issue.
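
For anyone hitting the same symptom, a minimal check and stopgap on an affected node, assuming the default host paths from the aws-node manifest (the /app/10-aws.conflist location inside the image is an assumption based on this repo's Dockerfile):

# Confirm the CNI config is missing; without it, kubelet keeps the node NotReady.
ls -l /etc/cni/net.d/

# Stopgap: copy the config shipped inside the running aws-node container.
CID=$(docker ps -q --filter "name=aws-node")
docker cp "$CID":/app/10-aws.conflist /etc/cni/net.d/10-aws.conflist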

@matthsmi

To expand on what @cgkades is saying...

What we've found is that k8sapi.CreateKubeClient() fails with Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout (https://github.com/aws/amazon-vpc-cni-k8s/blob/release-1.5.3/main.go#L47-L51).

As such, it never gets to copy over /etc/cni/net.d/10-aws.conflist. If we manually create 10-aws.conflist, everything is fine and aws-node starts up.

This behavior isn't present in 1.5.1 and appears to have been introduced in commit 7fd7c93.

@mogren mogren added bug priority/P1 Must be staffed and worked currently or soon. Is a candidate for next release labels Sep 25, 2019
@mogren
Contributor

mogren commented Sep 25, 2019

@matthsmi @cgkades Thanks for opening a new ticket. (#283 is not related to this case.)

The change to not copy the config until ipamd is running happened in commit f9acdeb, and was done to avoid scheduling pods to the node until we have ENIs and IPs available. If ipamd cannot talk to the Kubernetes API server, it will not copy the plugin binary or the config file, since copying them is what moves the node into the Ready state.

The main issue is that your nodes cannot reach your API server. If this happens again, could you please run the log-collector-script and send the results through support, together with the cluster ARN?
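
A quick way to reproduce the failing call from an affected node (172.20.0.1 is the kubernetes service ClusterIP in this thread; substitute your own):

# Hits the same endpoint ipamd queries; hanging and then timing out matches the ipamd log above.
curl -k --connect-timeout 5 https://172.20.0.1:443/version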

@cgkades
Author

cgkades commented Sep 26, 2019

@mogren This happens with every cluster we launch. I'll work with @matthsmi to see if we can get the logs you need.

@marcel4303

Having the same issue as @cgkades.
I've informed AWS Support and attached the results of the log-collector-script, as @mogren requested.

@matthsmi

matthsmi commented Oct 8, 2019

Thanks for the update, @marcel4303. I'm glad we're not the only ones experiencing this, and thanks for opening a ticket with AWS support, too.

We also have an internal high-priority ticket open with AWS. In the meantime, we've been using AWS CNI 1.5.1. We'll be sure to update this issue with relevant details as we hear back internally.

@mogren
Contributor

mogren commented Oct 8, 2019

Looking at the reports I have received so far, the main issue seems to be the DNS call the CNI makes when trying to reach the API server. The first two lines in the ipamd logs...

2019-09-25T20:01:32.800Z [INFO] Starting L-IPAMD v1.5.3  ...
2019-09-25T20:02:02.803Z [INFO] Testing communication with server

...usually show that it took around 30s to run kubeClient := k8sclient.GetKubeClient() on nodes that have issues. That looks a lot like a 30-second timeout. That code comes from the operator-framework: it tries to read the KUBERNETES_SERVICE_HOST environment variable, and if that fails, it does a net.LookupHost("kubernetes.default.svc"). For some reason, on these nodes, that takes 30s to time out.

Is there some custom DNS setup on these nodes? What does /etc/resolv.conf look like?
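
A rough way to reproduce those two lookups from inside an affected aws-node container (assumes getent and a bourne shell are available in the image; replace <aws-node-pod> with the pod name):

kubectl -n kube-system exec <aws-node-pod> -- env | grep KUBERNETES_SERVICE_HOST
# If the variable is unset, the client falls back to DNS; time that lookup:
kubectl -n kube-system exec <aws-node-pod> -- sh -c 'time getent hosts kubernetes.default.svc'
kubectl -n kube-system exec <aws-node-pod> -- cat /etc/resolv.conf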

@marcel4303

You're welcome.
I also tried downgrading the CNI version to 1.5.1. That results in Ready nodes, but the pods then run into issues and remain stuck in ContainerCreating:

Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "f9b30e13de653234be1dec6d91a4fa7fcd9b9aaddfec191f6d8b839d2722b0c4" network for pod "coffee-7dbb5795f6-ckvrc": NetworkPlugin cni failed to set up pod "coffee-7dbb5795f6-ckvrc_cafe" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "f9b30e13de653234be1dec6d91a4fa7fcd9b9aaddfec191f6d8b839d2722b0c4" network for pod "coffee-7dbb5795f6-ckvrc": NetworkPlugin cni failed to teardown pod "coffee-7dbb5795f6-ckvrc_cafe" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
  Normal   SandboxChanged          5s (x93 over 20m)  kubelet, ip-172-31-30-47.eu-central-1.compute.internal  Pod sandbox changed, it will be killed and re-created.
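
The dial tcp 127.0.0.1:50051 errors above are the CNI plugin failing to reach ipamd's local gRPC endpoint. A quick check on the node (port taken from this log; assumes iproute2 is installed):

# Nothing listening on 50051 means aws-k8s-agent never came up on this node.
ss -tlnp | grep 50051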

@blakelead

FYI, we also had this issue with CNI 1.5.4.
Downgrading to 1.5.1 before scaling nodes resolved it for us.

@mogren
Contributor

mogren commented Oct 10, 2019

Does this happen on worker nodes using the 1.13.7 kubelet? The reason I ask is that there was an upstream bug:

If a pod's restart policy is "Never", and kubelet failed to create the PodSandbox (for whatever reason) once, the pod will be stuck at "ContainerCreating" forever.

1.13.8 and later have the fix: kubernetes/kubernetes#79398
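
To see which kubelet version each node is running (a standard kubectl query, not specific to this issue):

kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion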

@yannick

yannick commented Jan 3, 2020

I seem to be hit by this as well. Using kops, I bring up a cluster, e.g. with kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2019-09-26 on kubernetesVersion: 1.16.4.
The masters come up fine, but with the nodes it's hit and miss. I also tried imageName: "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0-rc5".

@mogren
Contributor

mogren commented Jan 20, 2020

@yannick Are the pods crash-looping because they are unable to talk to the API server, or is it some other issue?

@matthsmi

I've made some more progress with this, today using v1.6.0 of the aws-node CNI.

To restate what I'm seeing: when I initially spin up a cluster and associate the instances in an ASG with the EKS cluster, the nodes join the cluster, aws-node starts, and they become Ready.

When I try to expand the ASG and add new instances, these new instances are able to join the cluster, but aws-node does not start up. You can see them in the cluster via kubectl get nodes, but they are in the NotReady state.

When aws-k8s-agent tries to start up, it reaches this point: https://github.com/aws/amazon-vpc-cni-k8s/blob/release-1.6/pkg/k8sapi/discovery.go#L89

The kubeClient endpoint is https://172.20.0.1:443, which is not yet available to the system, since no kube-proxy is running at that time. No kube-proxy will start because no CNI files are in place yet, because aws-k8s-agent hasn't bootstrapped. If aws-k8s-agent doesn't bootstrap in a timely fashion, the container's entrypoint.sh fails and exits without copying any CNI files.

In my testing, I shuffled things around in entrypoint.sh:

diff --git a/scripts/entrypoint.sh b/scripts/entrypoint.sh
index d7dc7f50..636b02d3 100755
--- a/scripts/entrypoint.sh
+++ b/scripts/entrypoint.sh
@@ -58,16 +58,6 @@ echo -n "starting IPAM daemon in background ... "
 ./aws-k8s-agent > "$AGENT_LOG_PATH" 2>&1 &
 echo "ok."
 
-echo -n "checking for IPAM connectivity ... "
-
-if ! wait_for_ipam; then
-    echo " failed."
-    echo "timed out waiting for IPAM daemon to start."
-    exit 1
-fi
-
-echo "ok."
-
 echo -n "copying CNI plugin binaries and config files ... "
 
 cp portmap "$HOST_CNI_BIN_PATH"
@@ -84,6 +74,16 @@ if [[ -f "$HOST_CNI_CONFDIR_PATH/aws.conf" ]]; then
     rm "$HOST_CNI_CONFDIR_PATH/aws.conf"
 fi
 
+echo -n "checking for IPAM connectivity ... "
+
+if ! wait_for_ipam; then
+    echo " failed."
+    echo "timed out waiting for IPAM daemon to start."
+    exit 1
+fi
+
+echo "ok."
+
 # bring the aws-k8s-agent process back into the foreground
 echo "foregrounding IPAM daemon ... "
 fg %1 >/dev/null 2>&1 || { echo "failed (process terminated)" && cat "$AGENT_LOG_PATH" && exit 1; }

I moved the wait_for_ipam check to after the CNI files get copied. At that point kube-proxy starts up, aws-k8s-agent gets its act together, and things start working.

I'm not sure what the adverse effects of this are from the IPAM perspective, but my initial testing looks good, and I've been able to scale an ASG in and out with no issues using the modified entrypoint.sh.

I'm going to continue testing, but this is where I'm at.

Do you have any thoughts on this, @mogren?

Thank you!

@jaypipes
Contributor

jaypipes commented Mar 2, 2020

The kubeClient endpoint is: https://172.20.0.1:443, which is not yet available to the system, since no kube-proxy is running at that time. No kube-proxy will start since no CNI files are in place yet, since the aws-k8s-agent hasn't bootstrapped. If aws-k8s-agent doesn't bootstrap in a timely fashion, the container entrypoint.sh will fail out, and not copy any CNI files.

AFAIK, the above statement is not correct. kube-proxy does not have anything to do with the aws-node Pod's ability to contact the Kubernetes API server. kube-proxy enables routing for a Kubernetes Service object's backend Endpoints (other Pods).

The timeout waiting for IPAMd (aws-k8s-agent) to come up is 30 seconds. My suspicion is that there is a problem with DNS resolution, not kube-proxy, that is causing the IPAMd startup to take longer than 30 seconds. On startup, IPAMd attempts to communicate with the Kubernetes API server using the k8sclient.GetKubeClient() function from the [email protected] release (as @mogren notes above).

Is there some custom DNS setup on these nodes? What is the value of your KUBERNETES_SERVICE_HOST env var? What does your /etc/resolv.conf look like?

@matthsmi

matthsmi commented Mar 2, 2020

Hi @jaypipes ,

Thanks for the input. It seems I missed the comment from @mogren above.

That said, I'm still confident that this isn't a DNS resolution issue.

I put some additional debug lines inside the k8s-agent and had it print out the kubeconfig returned by k8sclient.GetKubeConfig() (which should also be what k8sclient.GetKubeClient() uses).

Here's the Host that aws-node is trying to use:

kubeConfig=&rest.Config{Host:"https://172.20.0.1:443 ", APIPath:"", ContentConfig:rest.ContentConfig{AcceptContentTypes:"",.... stuff removed}

There shouldn't be any DNS resolution happening there.

This also matches up to the ClusterIP that the aws-cni is trying to speak to:

% k get service
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   172.20.0.1   <none>        443/TCP   73m
% 

That said, here is what aws-node is attempting to use for KUBERNETES_SERVICE_* and for DNS:

# docker inspect d356efd553b9 | jq '.[0].Config.Env'                           
[
  "AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG",
 [...]
  "KUBERNETES_SERVICE_HOST=172.20.0.1",
  "KUBERNETES_SERVICE_PORT=443",
[...]
  "KUBERNETES_SERVICE_PORT_HTTPS=443",
[...]
  "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
]
# 
# cat /etc/resolv.conf 
nameserver 169.254.169.253
search us-west-2.compute.internal
# 

If aws-node is attempting to reach the API server via that ClusterIP, then kube-proxy has to be working for the connection to succeed.
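
One way to verify that on the node itself, assuming kube-proxy runs in its default iptables mode (where it programs the KUBE-SERVICES chain):

# No rule matching 172.20.0.1 here means kube-proxy has not programmed the ClusterIP yet.
sudo iptables -t nat -L KUBE-SERVICES -n | grep 172.20.0.1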

@acegrader33

The kubeClient endpoint is: https://172.20.0.1:443, which is not yet available to the system, since no kube-proxy is running at that time. No kube-proxy will start since no CNI files are in place yet, since the aws-k8s-agent hasn't bootstrapped.

I believe that adding tolerations to our kube-proxy daemonset that allow it to start on nodes with NotReady status fixed this error for us.

tolerations:
  - operator: "Exists"
    effect: "NoExecute"
  - operator: "Exists"
    effect: "NoSchedule"
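
One way to apply those tolerations, assuming kube-proxy runs as a daemonset named kube-proxy in kube-system (the EKS default); note that this JSON patch replaces any tolerations already on the daemonset:

kubectl -n kube-system patch daemonset kube-proxy --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/tolerations", "value": [
    {"operator": "Exists", "effect": "NoExecute"},
    {"operator": "Exists", "effect": "NoSchedule"}]}]'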

@jaypipes
Contributor

Sorry for the delayed response, @matthsmi! You are of course correct. We suspect that this is the same issue as #872 and #865, where we believe the root cause is, as you allude to, the aws-node daemonset failing to start when kube-proxy has not (yet) fully come online.

We are going to add a configurable timeout to the entrypoint.sh script and raise the default above the current 30 seconds. We hope this will address the issues.
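
A sketch of overriding that timeout once the change ships (IPAMD_TIMEOUT_SECONDS is the variable named in the commits below; the 120-second value is illustrative):

kubectl -n kube-system set env daemonset/aws-node IPAMD_TIMEOUT_SECONDS=120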

jaypipes added a commit to jaypipes/amazon-vpc-cni-k8s that referenced this issue Mar 18, 2020
Adds a configurable timeout to the aws-k8s-agent (ipamd) startup in the
entrypoint.sh script. Increases the default timeout from ~30 seconds to
60 seconds.

Users can set the IPAMD_TIMEOUT_SECONDS environment variable to change
the timeout.

Related: aws#625, aws#865 aws#872
mogren pushed a commit that referenced this issue Jun 24, 2020
* add configurable timeout for ipamd startup

Adds a configurable timeout to the aws-k8s-agent (ipamd) startup in the
entrypoint.sh script. Increases the default timeout from ~30 seconds to
60 seconds.

Users can set the IPAMD_TIMEOUT_SECONDS environment variable to change
the timeout.

Related: #625, #865 #872

* This is a local gRPC call, so just try every 1 second indefinitely

Since we have a liveness probe restarting the pod, we can rely on that to kill the pod.

Co-authored-by: Claes Mogren <[email protected]>
bnapolitan added a commit to bnapolitan/amazon-vpc-cni-k8s that referenced this issue Jul 1, 2020
@mogren
Contributor

mogren commented Jul 8, 2020

This all seems to be related to issues with kube-proxy; closing.

@mogren mogren closed this as completed Jul 8, 2020