Containers stuck in ContainerCreating after configuring CNI Custom Networking on extended CIDR #527
What are the aws-vpc-cni logs for those containers that are stuck in ContainerCreating?
I've been coming across a similar problem; it seems related to #525.
Hi @dennisme, thanks for looking at my issue.
I'm seeing
I'm using t3.medium instances on that cluster right now. I can confirm that there are other pods in other namespaces using IPs, but I can assure you there are sufficient ENIs for all the pods in all namespaces on the cluster, for two reasons:
But I guess the error message you highlighted is in fact why the CNI plug-in fails to assign an IP to that pod. Tomorrow I'll run some rollout-specific tests with and without the CNI custom configuration to gather more data.
Here are the results of my latest tests:
I've had a look at #525 (comment), but I'm not sure whether the workaround discussed there recently applies to my case. I don't use cluster-autoscaler yet, so I believe I can't use lifecycle hooks anyway. Does anyone have an idea?
Hi @yrotilio, sorry for the late reply. How many pods get scheduled successfully on the node before you start seeing this issue? If you use custom network configuration, you lose the first ENI, so to avoid scheduling too many pods onto the nodes, you also need to change the
Hi @mogren, thanks a lot for your contribution; it seems like you're right! After testing on a … Further testing shows that setting the … Yet my problem is not solved!
I guess I'll take the easy route and patch only one instance type with a static number for now. PS: I don't think the loss of one ENI when applying the custom network config is documented.
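To make the lost-ENI effect concrete, here is a minimal Python sketch of the max-pods arithmetic commonly used for the VPC CNI: each ENI contributes its IPv4 addresses minus its primary one, plus 2 for host-network pods such as aws-node and kube-proxy. It assumes that with custom networking the primary ENI is excluded from pod use (as described above) and that the kubelet keeps its default limit of 110 pods. The t3.medium limits (3 ENIs, 6 IPv4 addresses per ENI) are from the EC2 documentation; the function names are illustrative only.

```python
# Minimal sketch of the VPC CNI max-pods arithmetic (illustrative names).

KUBELET_DEFAULT_MAX_PODS = 110  # kubelet's default --max-pods


def max_pods(enis: int, ips_per_eni: int, custom_networking: bool = False) -> int:
    """Estimate how many pods the CNI can hand IPs to on one node.

    Each ENI keeps its primary IPv4 address, so it contributes
    (ips_per_eni - 1) pod IPs. The "+ 2" covers host-network pods
    (e.g. aws-node and kube-proxy) that need no secondary IP.
    Assumption: with custom networking the primary ENI is not used for pods.
    """
    usable_enis = enis - 1 if custom_networking else enis
    return usable_enis * (ips_per_eni - 1) + 2


def effective_max_pods(enis: int, ips_per_eni: int,
                       custom_networking: bool = False,
                       kubelet_max_pods: int = KUBELET_DEFAULT_MAX_PODS) -> int:
    # The node can never run more pods than the kubelet's own limit.
    return min(max_pods(enis, ips_per_eni, custom_networking), kubelet_max_pods)


# t3.medium: 3 ENIs, 6 IPv4 addresses per ENI
print(max_pods(3, 6))                          # 17
print(max_pods(3, 6, custom_networking=True))  # 12
```

For a t3.medium this gives 17 pods by default but only 12 with custom networking, which is the kind of static per-instance-type value mentioned above. On a much larger node (say 15 ENIs of 50 addresses each) the formula yields 737, so the kubelet's 110 default becomes the real limit unless it is raised.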
We've run into this as well. The way I've solved it for the moment is by generating a new
The max-pods formula changes if you configure CNI custom networking. For
@RaymondKYLiu please note that Kubernetes has a default maximum number of pods per node, currently hard-coded at 110. Unless you are starting your kubelets with a different
My mistake, I didn't realize that the
@mogren what do you think about moving this issue to the AMI builder repo? It seems the ask here is to allow more configurability in the AMI builder's
Refer to the discussion below:
- Maximum Pods ENIConfig aware (aws/amazon-vpc-cni-k8s#331)
- Containers stuck in ContainerCreating after configuring CNI Custom Networking on extended CIDR (aws/amazon-vpc-cni-k8s#527)
I have actually run into this same issue (k8s scheduling more pods that require IPs than the CNI has available) on other platforms as well. In my experience there are two ways to deal with it:
In this scenario you can simply create a controller that watches for pods stuck in the ContainerCreating state and deletes them. This gives the scheduler a chance to place the pod elsewhere. This works if there is sufficient capacity in the cluster, but it has the downside of killing pods that stay in that state for some time (and if there isn't space in the cluster, it will just keep killing and respawning the pod).
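A minimal sketch of the selection step of that first approach, in Python, assuming the controller works from the JSON returned by `kubectl get pods -o json`. It flags pods whose containers are waiting with reason ContainerCreating and whose age exceeds a threshold. The 5-minute threshold is a hypothetical value, pod age (from creationTimestamp) is only a proxy for time spent stuck, and the watch loop and actual deletion are left out.

```python
from datetime import datetime, timedelta, timezone

STUCK_THRESHOLD = timedelta(minutes=5)  # hypothetical value; tune per cluster


def is_stuck(pod: dict, now: datetime, threshold: timedelta = STUCK_THRESHOLD) -> bool:
    """True if any container waits in ContainerCreating and the pod is old enough."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    creating = any(
        s.get("state", {}).get("waiting", {}).get("reason") == "ContainerCreating"
        for s in statuses
    )
    if not creating:
        return False
    # creationTimestamp looks like "2019-07-01T10:00:00Z"; make it tz-aware.
    created = datetime.fromisoformat(
        pod["metadata"]["creationTimestamp"].replace("Z", "+00:00"))
    return now - created > threshold


def stuck_pods(pod_list: dict, now: datetime = None) -> list:
    """Return (namespace, name) pairs for pods considered stuck."""
    now = now or datetime.now(timezone.utc)
    return [
        (p["metadata"]["namespace"], p["metadata"]["name"])
        for p in pod_list.get("items", [])
        if is_stuck(p, now)
    ]
```

Each returned pair would then be deleted (for example with `kubectl delete pod -n <namespace> <name>`) so that its owning controller recreates it and the scheduler gets another chance to place it.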
I have used approach #2 for some other k8s cluster setups, and adding the device plugin to the CNI would be pretty straightforward. cc @mogren
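The appeal of the device-plugin approach can be illustrated with a toy Python model of the scheduler's resource-fit check. It assumes a device plugin advertises each node's pod IPs as an extended resource and that every pod requests one unit of it; the resource name `example.com/pod-ip` is hypothetical, and the first-fit placement is a simplification of the real scheduler.

```python
from dataclasses import dataclass, field

IP_RESOURCE = "example.com/pod-ip"  # hypothetical extended resource name


@dataclass
class Node:
    name: str
    allocatable: dict                     # e.g. {IP_RESOURCE: 10}
    requested: dict = field(default_factory=dict)

    def fits(self, request: dict) -> bool:
        # Like the scheduler's fit check: every requested resource must
        # still have enough unallocated capacity on this node.
        return all(
            self.allocatable.get(r, 0) - self.requested.get(r, 0) >= qty
            for r, qty in request.items()
        )

    def bind(self, request: dict) -> None:
        for r, qty in request.items():
            self.requested[r] = self.requested.get(r, 0) + qty


def schedule(nodes: list, pod_requests: list) -> list:
    """Place pods first-fit; return node names, with None meaning Pending."""
    placements = []
    for request in pod_requests:
        target = next((n for n in nodes if n.fits(request)), None)
        if target is not None:
            target.bind(request)
        placements.append(target.name if target else None)
    return placements
```

With one node advertising 10 IPs (2 usable ENIs with 5 secondary IPs each, as in the custom networking case) and 12 single-IP pods, the first 10 are placed and the last 2 stay Pending instead of being bound to the node and getting stuck in ContainerCreating.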
Documentation improved in awsdocs/amazon-eks-user-guide#72
Hi,
We have an issue with CNI custom networking and an extended CIDR after a node's first boot, when Pods are pending scheduling.
For example, with a simple nginx workload of 10 replicas, after the node's first boot we have:
For Pods stuck in ContainerCreating, the event shown is FailedCreatePodSandBox.
The only way we've found to resolve the issue is to delete the stuck Pods.
Kubernetes version : 1.11
Amazon CNI version : 1.5.0