
Containers stuck in ContainerCreating after configuring CNI Custom Networking on extended CIDR #527

Closed
yrotilio opened this issue Jul 12, 2019 · 15 comments
Labels
feature request priority/P2 Low priority, nice to have.

Comments

@yrotilio

Hi,
We have an issue with CNI custom networking on an extended CIDR: after a node's first boot, Pods that were pending for scheduling can get stuck.

For example, with a simple nginx workload of 10 replicas, after the nodes' first boot we see:

> kubectl get pods
NAME                     READY   STATUS              RESTARTS   AGE
nginx-64f497f8fd-2sdl2   1/1     Running             0          5m
nginx-64f497f8fd-7m868   1/1     Running             0          5m
nginx-64f497f8fd-87xjc   1/1     Running             0          5m
nginx-64f497f8fd-8tc2g   1/1     Running             0          5m
nginx-64f497f8fd-8xfgz   1/1     Running             0          5m
nginx-64f497f8fd-gszkq   1/1     Running             0          5m
nginx-64f497f8fd-lz426   1/1     Running             0          5m
nginx-64f497f8fd-rzspt   0/1     ContainerCreating   0          5m
nginx-64f497f8fd-wh6sz   1/1     Running             0          5m
nginx-64f497f8fd-wtx5n   0/1     ContainerCreating   0          5m

For Pods stuck in ContainerCreating, the event shown is FailedCreatePodSandBox:

> kubectl describe pod nginx-64f497f8fd-wtx5n
    Warning  FailedCreatePodSandBox  2m55s (x4 over 3m5s)    kubelet, ip-10-156-7-10.eu-west-3.compute.internal  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e864a95102ede98274f377a3df4a694be814be3c9c5c3cf5b2b66b9eb8bcaa1f" network for pod "nginx-64f497f8fd-wtx5n": NetworkPlugin cni failed to set up pod "nginx-64f497f8fd-wtx5n_default" network: add cmd: failed to assign an IP address to container

The only way we've found to solve that issue is to delete stuck Pods.

Kubernetes version: 1.11
Amazon CNI version: 1.5.0

@dennisme

What are the aws-vpc-cni logs for those containers that are stuck in ContainerCreating?

@MrSaints

I've been coming across a similar problem; it seems related to #525.

@yrotilio
Author

What are the aws-vpc-cni logs for those containers that are stuck in ContainerCreating?

Hi @dennisme, thanks for looking at my issue.

  • The aws-node pods do not log any useful information:
===== Starting installing AWS-CNI =========
===== Starting amazon-k8s-agent ===========
time="2019-07-12T14:06:16Z" level=error msg="failed to initialize service object for operator metrics: OPERATOR_NAME must be set"
  • ipamd.log is far more interesting:
2019-07-15T09:59:57Z [INFO]     Received AddNetwork for NS /proc/20430/ns/net, Pod nginx-64f497f8fd-5txtg, NameSpace default, Container e2aba20d48d38c1af40ea3667fad3ab45c86fba860973b64afe90824c1a5596c, ifname eth0
2019-07-15T09:59:57Z [INFO]     Received DelNetwork for IP <nil>, Pod nginx-64f497f8fd-5txtg, Namespace default, Container e2aba20d48d38c1af40ea3667fad3ab45c86fba860973b64afe90824c1a5596c
2019-07-15T09:59:57Z [DEBUG]    UnassignPodIPv4Address: IP address pool stats: total:10, assigned 10, pod(Name: nginx-64f497f8fd-5txtg, Namespace: default, Container e2aba20d48d38c1af40ea3667fad3ab45c86fba860973b64afe90824c1a5596c)
2019-07-15T09:59:57Z [WARN]     UnassignPodIPv4Address: Failed to find pod nginx-64f497f8fd-5txtg namespace default Container e2aba20d48d38c1af40ea3667fad3ab45c86fba860973b64afe90824c1a5596c
2019-07-15T09:59:57Z [DEBUG]    UnassignPodIPv4Address: IP address pool stats: total:10, assigned 10, pod(Name: nginx-64f497f8fd-5txtg, Namespace: default, Container )
2019-07-15T09:59:57Z [WARN]     UnassignPodIPv4Address: Failed to find pod nginx-64f497f8fd-5txtg namespace default Container
  • plugin.log
2019-07-15T09:58:17Z [INFO]     Received CNI add request: ContainerID(7ff6bcd77b500414427ed7a91016d74a201364fa9797c6df3861931c13fe610a) Netns(/proc/12213/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=nginx-64f497f8fd-5txtg;K8S_POD_INFRA_CONTAINER_ID=7ff6bcd77b500414427ed7a91016d74a201364fa9797c6df3861931c13fe610a) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-07-15T09:58:17Z [ERROR]    Failed to assign an IP address to pod nginx-64f497f8fd-5txtg, namespace default container 7ff6bcd77b500414427ed7a91016d74a201364fa9797c6df3861931c13fe610a
2019-07-15T09:58:17Z [INFO]     Received CNI del request: ContainerID(7ff6bcd77b500414427ed7a91016d74a201364fa9797c6df3861931c13fe610a) Netns(/proc/12213/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=nginx-64f497f8fd-5txtg;K8S_POD_INFRA_CONTAINER_ID=7ff6bcd77b500414427ed7a91016d74a201364fa9797c6df3861931c13fe610a) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-07-15T09:58:17Z [ERROR]    Failed to process delete request for pod nginx-64f497f8fd-5txtg namespace default container 7ff6bcd77b500414427ed7a91016d74a201364fa9797c6df3861931c13fe610a: <nil>

@dennisme

dennisme commented Jul 16, 2019

I'm seeing IP address pool stats: total:10, assigned 10. What instance type are you using? Are there pods in other namespaces using IPs?

@yrotilio
Author

I'm using t3.medium instances on that cluster right now.

I confirm that there are other pods in other namespaces using IPs, but I can assure you there are sufficient ENIs for all the pods in all namespaces on the cluster, for two reasons:

  1. Before I apply the custom CNI configuration, all the Pods are in the Running state. It's when the nodes are terminated and recreated to apply the custom CNI that pods get stuck in ContainerCreating. I usually have only 2 nodes in that cluster, but I've reproduced the same issue with 3 nodes.
  2. If I delete a pod stuck in ContainerCreating, the recreated one gets assigned an IP in the desired subnet and becomes Running.

But I guess the error message you highlighted is indeed why the CNI plugin does not manage to assign an IP to that pod.
What I don't understand is how the custom CNI configuration can affect how the scheduler handles an IP shortage, and why it isn't able to unstick those pods.

I'll do some rollout-specific tests tomorrow, with and without the custom CNI configuration, to gather more data.

@yrotilio
Author

Here are the results of my latest tests:

  • Without the CNI custom configuration: whether I delete all my nodes at the same time or one by one, the scheduler manages to reschedule the pods onto nodes without any of them getting stuck in ContainerCreating.

  • With the CNI custom configuration and the same pod population, whether I delete all my nodes at the same time or one by one (rollout), the scheduler keeps scheduling pods onto nodes in such a way that some of them get stuck in ContainerCreating until I delete them.

I've had a look at #525 (comment), but I'm not sure whether the workaround recently discussed there applies to my case. I don't use cluster-autoscaler yet, so I believe I can't use lifecycle hooks anyway.

Anyone got an idea?

@mogren
Contributor

mogren commented Jul 31, 2019

Hi @yrotilio, sorry for the late reply.

How many pods get successfully scheduled on the node before you start seeing this issue? If you use custom network configuration, you lose the first ENI, so in order not to schedule too many pods onto the nodes, you also need to change the --max-pods parameter to take that into account.
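
As a rough illustration, here is the arithmetic for the reporter's t3.medium nodes as a small shell sketch. The 3 ENI / 6 IPv4-per-ENI limits and the default max-pods value are assumptions taken from the EC2 and EKS AMI documentation, not from this thread:

ENIS=3          # assumed t3.medium limit
IPS_PER_ENI=6   # assumed t3.medium limit
# Default max-pods shipped with the EKS AMI: one IP per ENI is the ENI's own, +2 (commonly for host-network pods)
echo "default max-pods:               $(( ENIS * (IPS_PER_ENI - 1) + 2 ))"      # 17
# With custom networking, the first ENI can no longer be used for pods
echo "pod IPs with custom networking: $(( (ENIS - 1) * (IPS_PER_ENI - 1) ))"    # 10

With a default max-pods of 17, the kubelet can admit more pods than the 10 secondary IPs ipamd actually has, which matches the "total:10, assigned 10" pool stats in the logs above.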

@yrotilio
Author

Hi @mogren, thanks a lot for your contribution; it seems like you're right!

After testing on a t3.small node (3 ENIs * 4 IPs = 12 IPs max), pods stuck in ContainerCreating appeared after scheduling more than 8 pods on the node, which matches the lost-ENI scenario.

Further testing shows that setting the --max-pods parameter to (<max_eni_per_instance_type>-1)*<max_ip_per_eni> on a given node solves the issue on that node.

Yet my problem is not solved!

  • The --max-pods parameter seems to be set by the EKS-optimized AMI from a static txt file.
  • I'm not sure how I can smartly update that value through the user-data of my launch configuration, based on the max number of ENIs/IPs for each instance type, which only lives in a table on a web page.

I guess I'll try the easy way and patch only one instance type with a static number for now (roughly as sketched below).

PS: I don't think the loss of one ENI when applying the custom network config is documented.
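
For what it's worth, a minimal user-data sketch of that kind of static patch, assuming the standard EKS-optimized AMI whose bootstrap.sh accepts --use-max-pods and --kubelet-extra-args. The cluster name and the value 12 (computed with the formula above for a t3.medium, assuming 3 ENIs and 6 IPs per ENI) are placeholders:

#!/bin/bash
# Skip the AMI's automatic max-pods lookup and pass a static value instead.
# 12 = (3 - 1) * 6 for a t3.medium once the first ENI is lost to custom networking.
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=12'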

@benabineri

We've run into this as well. The way I've solved it for the moment is by generating a new eni-max-pods.txt with the correct[1] values and (using Terraform) passing that into the user-data to overwrite the existing file before bootstrap.sh runs (roughly as sketched below).

  1. Not exactly correct, because it depends on how many DaemonSets with host networking you're running in the cluster, so I've erred on the lower side. Better to have a pod or two of spare capacity on the node than to have two pods stuck in ContainerCreating.
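
A rough sketch of that pattern in user-data, assuming the EKS-optimized AMI keeps its lookup table at /etc/eks/eni-max-pods.txt; the values below are illustrative custom-networking numbers, not authoritative:

#!/bin/bash
# Overwrite the AMI's static lookup table with custom-networking-aware values
# *before* bootstrap.sh reads it; only the instance types you actually use matter.
cat > /etc/eks/eni-max-pods.txt <<'EOF'
t3.small 8
t3.medium 12
m5.large 20
EOF
/etc/eks/bootstrap.sh my-cluster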

@RaymondKYLiu

RaymondKYLiu commented Nov 4, 2019

The max-pods formula changes if you configure CNI custom networking.
The new formula is maxPods = (numInterfaces - 1) * (maxIpv4PerInterface - 1) + 2

For a t3.small node, max pods changes to 8. The per-instance-type values become the following (a small sketch of the formula follows the table):

a1.medium 5
a1.large 20
a1.xlarge 44
a1.2xlarge 44
a1.4xlarge 205
a1.metal 205
c1.medium 7
c1.xlarge 44
c3.large 20
c3.xlarge 44
c3.2xlarge 44
c3.4xlarge 205
c3.8xlarge 205
c4.large 20
c4.xlarge 44
c4.2xlarge 44
c4.4xlarge 205
c4.8xlarge 205
c5.large 20
c5.xlarge 44
c5.2xlarge 44
c5.4xlarge 205
c5.9xlarge 205
c5.12xlarge 205
c5.18xlarge 688
c5.24xlarge 688
c5.metal 688
c5d.large 20
c5d.xlarge 44
c5d.2xlarge 44
c5d.4xlarge 205
c5d.9xlarge 205
c5d.18xlarge 688
c5n.large 20
c5n.xlarge 44
c5n.2xlarge 44
c5n.4xlarge 205
c5n.9xlarge 205
c5n.18xlarge 688
c5n.metal 688
cc2.8xlarge 205
cr1.8xlarge 205
d2.xlarge 44
d2.2xlarge 44
d2.4xlarge 205
d2.8xlarge 205
f1.2xlarge 44
f1.4xlarge 205
f1.16xlarge 345
g2.2xlarge 44
g2.8xlarge 205
g3s.xlarge 44
g3.4xlarge 205
g3.8xlarge 205
g3.16xlarge 688
g4dn.xlarge 20
g4dn.2xlarge 20
g4dn.4xlarge 20
g4dn.8xlarge 44
g4dn.12xlarge 205
g4dn.16xlarge 688
h1.2xlarge 44
h1.4xlarge 205
h1.8xlarge 205
h1.16xlarge 688
hs1.8xlarge 205
i2.xlarge 44
i2.2xlarge 44
i2.4xlarge 205
i2.8xlarge 205
i3.large 20
i3.xlarge 44
i3.2xlarge 44
i3.4xlarge 205
i3.8xlarge 205
i3.16xlarge 688
i3.metal 688
i3en.large 20
i3en.xlarge 44
i3en.2xlarge 44
i3en.3xlarge 44
i3en.6xlarge 205
i3en.12xlarge 205
i3en.24xlarge 688
i3en.metal 688
m1.small 5
m1.medium 7
m1.large 20
m1.xlarge 44
m2.xlarge 44
m2.2xlarge 89
m2.4xlarge 205
m3.medium 7
m3.large 20
m3.xlarge 44
m3.2xlarge 89
m4.large 11
m4.xlarge 44
m4.2xlarge 44
m4.4xlarge 205
m4.10xlarge 205
m4.16xlarge 205
m5.large 20
m5.xlarge 44
m5.2xlarge 44
m5.4xlarge 205
m5.8xlarge 205
m5.12xlarge 205
m5.16xlarge 688
m5.24xlarge 688
m5.metal 688
m5a.large 20
m5a.xlarge 44
m5a.2xlarge 44
m5a.4xlarge 205
m5a.8xlarge 205
m5a.12xlarge 205
m5a.16xlarge 688
m5a.24xlarge 688
m5ad.large 20
m5ad.xlarge 44
m5ad.2xlarge 44
m5ad.4xlarge 205
m5ad.12xlarge 205
m5ad.24xlarge 688
m5d.large 20
m5d.xlarge 44
m5d.2xlarge 44
m5d.4xlarge 205
m5d.8xlarge 205
m5d.12xlarge 205
m5d.16xlarge 688
m5d.24xlarge 688
m5d.metal 688
m5dn.large 20
m5dn.xlarge 44
m5dn.2xlarge 44
m5dn.4xlarge 205
m5dn.8xlarge 205
m5dn.12xlarge 205
m5dn.16xlarge 688
m5dn.24xlarge 688
m5n.large 20
m5n.xlarge 44
m5n.2xlarge 44
m5n.4xlarge 205
m5n.8xlarge 205
m5n.12xlarge 205
m5n.16xlarge 688
m5n.24xlarge 688
p2.xlarge 44
p2.8xlarge 205
p2.16xlarge 205
p3.2xlarge 44
p3.8xlarge 205
p3.16xlarge 205
p3dn.24xlarge 688
r3.large 20
r3.xlarge 44
r3.2xlarge 44
r3.4xlarge 205
r3.8xlarge 205
r4.large 20
r4.xlarge 44
r4.2xlarge 44
r4.4xlarge 205
r4.8xlarge 205
r4.16xlarge 688
r5.large 20
r5.xlarge 44
r5.2xlarge 44
r5.4xlarge 205
r5.8xlarge 205
r5.12xlarge 205
r5.16xlarge 688
r5.24xlarge 688
r5.metal 688
r5a.large 20
r5a.xlarge 44
r5a.2xlarge 44
r5a.4xlarge 205
r5a.8xlarge 205
r5a.12xlarge 205
r5a.16xlarge 688
r5a.24xlarge 688
r5ad.large 20
r5ad.xlarge 44
r5ad.2xlarge 44
r5ad.4xlarge 205
r5ad.12xlarge 205
r5ad.24xlarge 688
r5d.large 20
r5d.xlarge 44
r5d.2xlarge 44
r5d.4xlarge 205
r5d.8xlarge 205
r5d.12xlarge 205
r5d.16xlarge 688
r5d.24xlarge 688
r5d.metal 688
r5dn.large 20
r5dn.xlarge 44
r5dn.2xlarge 44
r5dn.4xlarge 205
r5dn.8xlarge 205
r5dn.12xlarge 205
r5dn.16xlarge 688
r5dn.24xlarge 688
r5n.large 20
r5n.xlarge 44
r5n.2xlarge 44
r5n.4xlarge 205
r5n.8xlarge 205
r5n.12xlarge 205
r5n.16xlarge 688
r5n.24xlarge 688
t1.micro 3
t2.nano 3
t2.micro 3
t2.small 8
t2.medium 12
t2.large 24
t2.xlarge 30
t2.2xlarge 30
t3.nano 3
t3.micro 3
t3.small 8
t3.medium 12
t3.large 24
t3.xlarge 44
t3.2xlarge 44
t3a.nano 3
t3a.micro 3
t3a.small 5
t3a.medium 12
t3a.large 24
t3a.xlarge 44
t3a.2xlarge 44
u-6tb1.metal 118
u-9tb1.metal 118
u-12tb1.metal 118
u-18tb1.metal 688
u-24tb1.metal 688
x1.16xlarge 205
x1.32xlarge 205
x1e.xlarge 20
x1e.2xlarge 44
x1e.4xlarge 44
x1e.8xlarge 44
x1e.16xlarge 205
x1e.32xlarge 205
z1d.large 20
z1d.xlarge 44
z1d.2xlarge 44
z1d.3xlarge 205
z1d.6xlarge 205
z1d.12xlarge 688
z1d.metal 688
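
A small shell sketch of that formula, for recomputing any single value; the ENI and IPs-per-ENI figures passed in below are assumptions taken from the EC2 instance limits:

# maxPods = (numInterfaces - 1) * (maxIpv4PerInterface - 1) + 2
max_pods_custom_networking() {
  local enis=$1 ips_per_eni=$2
  echo $(( (enis - 1) * (ips_per_eni - 1) + 2 ))
}
max_pods_custom_networking 3 4    # t3.small  -> 8
max_pods_custom_networking 3 6    # t3.medium -> 12
max_pods_custom_networking 4 15   # m5.xlarge -> 44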

@jaypipes
Contributor

jaypipes commented Nov 5, 2019

@RaymondKYLiu please note that there is a default maximum number of pods per node that is currently hard-coded in Kubernetes at 110:

https://github.com/kubernetes/kubernetes/blob/b735a17163ac7c7c39d9888932c815260c3ceaba/pkg/kubelet/apis/config/v1beta1/defaults.go#L162

Unless you are starting your kubelets with a different --max-pods configuration, you will be limited to 110 pods per node regardless of whether the instance type can theoretically support more. The EKS AMI does not override the --max-pods setting:

https://github.com/awslabs/amazon-eks-ami/blob/17706d5e72a845d239e6647bdc7b906981d954be/files/kubelet.service#L9-L14

@jaypipes
Contributor

jaypipes commented Nov 6, 2019

My mistake, I didn't realize that the bootstrap.sh script for the AMI builder configured the maxPods configuration setting for the kubelet dynamically by default. Please ignore the above comment.

@jaypipes
Contributor

jaypipes commented Nov 6, 2019

@mogren what do you think about moving this issue to the AMI builder repo? It seems the ask here is to allow more configurability in the AMI builder's bootstrap.sh script, to modify how maxPods is set when a custom ENIConfig is in use.

@jaypipes jaypipes added feature request priority/P2 Low priority, nice to have. labels Nov 6, 2019
0xlen added a commit to 0xlen/amazon-eks-user-guide that referenced this issue Nov 25, 2019
Refer to the below discussion:
- Maximum Pods ENIConfig aware awsdocs#331: aws/amazon-vpc-cni-k8s#331
- Containers stuck in ContainerCreating after configuring CNI Custom Networking on extended CIDR awsdocs#527: aws/amazon-vpc-cni-k8s#527
@jacksontj
Contributor

I have actually run into this same issue (k8s scheduling more pods that require IPs than the CNI has) on other platforms as well. From my experience there are two ways to deal with it:

  1. A controller to terminate "stuck" pods

In this scenario you simply create a controller that watches for pods stuck in the ContainerCreating state and deletes them (a rough sketch is below). This gives the scheduler a chance to place the pod elsewhere. It works if there is sufficient capacity in the cluster, but has the downside of having to kill things that have been in that state for some time (and if there isn't space in the cluster, it will just continually kill and respawn the pod).

  2. Use custom resources

In k8s the scheduler can take into consideration any number of resources, and in this case the IPs from the CNI plugin are just that -- resources. Doing this would require that (1) the pods carry a resource request -- this can be added using a mutating admission webhook -- and (2) the nodes expose the resource count -- using device plugins. The device plugin could be added to the CNI plugin to simply expose the static max IPs for a given box. This has the advantage of proper scheduling (meaning if there are no more IPs on the box, k8s won't place a pod on the node) but has the downside of requiring all scheduled pods to include that resource (easy if done through an admission webhook).

I have used approach #2 for some other k8s cluster setups, and adding the device plugin to the CNI would be pretty straightforward.
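
For illustration, a naive shell version of approach 1 could look like the following; it is just a polling loop rather than a real controller, and it assumes kubectl access and jq on the box running it:

#!/usr/bin/env bash
# Delete pods that are currently waiting in ContainerCreating so the scheduler can
# retry them. A real controller would also check how long a pod has been stuck and
# why, instead of deleting anything that is merely still creating.
while true; do
  kubectl get pods --all-namespaces -o json |
    jq -r '.items[]
           | select(any(.status.containerStatuses[]?;
                        .state.waiting.reason == "ContainerCreating"))
           | "\(.metadata.namespace) \(.metadata.name)"' |
    while read -r ns name; do
      echo "deleting stuck pod ${ns}/${name}"
      kubectl delete pod -n "${ns}" "${name}" --wait=false
    done
  sleep 300
done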

cc @mogren

@mogren
Contributor

mogren commented Mar 11, 2020

Documentation improved in awsdocs/amazon-eks-user-guide#72

@mogren mogren closed this as completed Jun 3, 2020