Pods sometimes get stuck in Terminating state. #1109

Closed
Dirbaio opened this issue Sep 14, 2017 · 14 comments

@Dirbaio

Dirbaio commented Sep 14, 2017

Very rarely, when deleting a Pod, it gets stuck forever in Terminating state.

When it happens, the kubelet and the kernel get stuck in an endless loop of the following errors:

Sep 14 17:16:43 vk-prod4-node-v175-z0jf kubelet[7031]: 2017-09-14 17:16:43.037 [INFO][32369] calico.go 312: Extracted identifiers Node="vk-prod4-node-v175-z0jf" Orchestrator="k8s" Workload="nginx-ingress.nginx-ingress-3585766022-fbmh8"
Sep 14 17:16:43 vk-prod4-node-v175-z0jf kubelet[7031]: 2017-09-14 17:16:43.037 [INFO][32369] utils.go 250: Configured environment: [PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin KUBELET_EXTRA_ARGS=--cloud-provider=gce --cloud-config=/etc/kubernetes/cloud-config 
Sep 14 17:16:43 vk-prod4-node-v175-z0jf kubelet[7031]: 2FzydBawC2He5eYRgx8NxDHrF4rC6ODi9OvOAVcsV5yLXIkshS4yWBX7OO5PVIp7kpypqRTEwuQ9w]
Sep 14 17:16:43 vk-prod4-node-v175-z0jf kubelet[7031]: 2017-09-14 17:16:43.037 [INFO][32369] client.go 202: Loading config from environment
Sep 14 17:16:43 vk-prod4-node-v175-z0jf kubelet[7031]: Calico CNI releasing IP address
Sep 14 17:16:43 vk-prod4-node-v175-z0jf kubelet[7031]: 2017-09-14 17:16:43.048 [INFO][32369] utils.go 94: Using a dummy podCidr to release the IP Workload="nginx-ingress.nginx-ingress-3585766022-fbmh8" podCidr="0.0.0.0/0"
Sep 14 17:16:43 vk-prod4-node-v175-z0jf kubelet[7031]: Calico CNI deleting device in netns /proc/16673/ns/net
Sep 14 17:16:44 vk-prod4-node-v175-z0jf ntpd[1958]: Deleting interface #345 cali9613ee578f4, fe80::1479:b8ff:fe28:c627%1568#123, interface stats: received=0, sent=0, dropped=0, active_time=608310 secs
Sep 14 17:16:50 vk-prod4-node-v175-z0jf kubelet[7031]: W0914 17:16:50.432478    7031 prober.go:98] No ref for container "docker://5289e32db5286329f9551b8ee5da340735e976e745155c7a53a083234608c075" (nginx-ingress-3585766022-fbmh8_nginx-ingress(2360ebc1-93e8-11e7-9a1b-42010a84000a):ng
Sep 14 17:16:53 vk-prod4-node-v175-z0jf kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 2
Sep 14 17:17:00 vk-prod4-node-v175-z0jf kubelet[7031]: W0914 17:17:00.431273    7031 prober.go:98] No ref for container "docker://5289e32db5286329f9551b8ee5da340735e976e745155c7a53a083234608c075" (nginx-ingress-3585766022-fbmh8_nginx-ingress(2360ebc1-93e8-11e7-9a1b-42010a84000a):ng
Sep 14 17:17:03 vk-prod4-node-v175-z0jf kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 2
Sep 14 17:17:10 vk-prod4-node-v175-z0jf kubelet[7031]: W0914 17:17:10.431339    7031 prober.go:98] No ref for container "docker://5289e32db5286329f9551b8ee5da340735e976e745155c7a53a083234608c075" (nginx-ingress-3585766022-fbmh8_nginx-ingress(2360ebc1-93e8-11e7-9a1b-42010a84000a):ng
Sep 14 17:17:13 vk-prod4-node-v175-z0jf kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 2
Sep 14 17:17:20 vk-prod4-node-v175-z0jf kubelet[7031]: W0914 17:17:20.431517    7031 prober.go:98] No ref for container "docker://5289e32db5286329f9551b8ee5da340735e976e745155c7a53a083234608c075" (nginx-ingress-3585766022-fbmh8_nginx-ingress(2360ebc1-93e8-11e7-9a1b-42010a84000a):ng
Sep 14 17:17:23 vk-prod4-node-v175-z0jf kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 2
Sep 14 17:17:30 vk-prod4-node-v175-z0jf kubelet[7031]: W0914 17:17:30.431500    7031 prober.go:98] No ref for container "docker://5289e32db5286329f9551b8ee5da340735e976e745155c7a53a083234608c075" (nginx-ingress-3585766022-fbmh8_nginx-ingress(2360ebc1-93e8-11e7-9a1b-42010a84000a):ng
Sep 14 17:17:34 vk-prod4-node-v175-z0jf kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 2
Sep 14 17:17:40 vk-prod4-node-v175-z0jf kubelet[7031]: W0914 17:17:40.431483    7031 prober.go:98] No ref for container "docker://5289e32db5286329f9551b8ee5da340735e976e745155c7a53a083234608c075" (nginx-ingress-3585766022-fbmh8_nginx-ingress(2360ebc1-93e8-11e7-9a1b-42010a84000a):ng
Sep 14 17:17:44 vk-prod4-node-v175-z0jf kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 2
Sep 14 17:17:50 vk-prod4-node-v175-z0jf kubelet[7031]: W0914 17:17:50.431396    7031 prober.go:98] No ref for container "docker://5289e32db5286329f9551b8ee5da340735e976e745155c7a53a083234608c075" (nginx-ingress-3585766022-fbmh8_nginx-ingress(2360ebc1-93e8-11e7-9a1b-42010a84000a):ng
Sep 14 17:17:54 vk-prod4-node-v175-z0jf kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 2
Sep 14 17:18:00 vk-prod4-node-v175-z0jf kubelet[7031]: W0914 17:18:00.431443    7031 prober.go:98] No ref for container "docker://5289e32db5286329f9551b8ee5da340735e976e745155c7a53a083234608c075" (nginx-ingress-3585766022-fbmh8_nginx-ingress(2360ebc1-93e8-11e7-9a1b-42010a84000a):ng
Sep 14 17:18:04 vk-prod4-node-v175-z0jf kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 2
Sep 14 17:18:10 vk-prod4-node-v175-z0jf kubelet[7031]: W0914 17:18:10.438898    7031 prober.go:98] No ref for container "docker://5289e32db5286329f9551b8ee5da340735e976e745155c7a53a083234608c075" (nginx-ingress-3585766022-fbmh8_nginx-ingress(2360ebc1-93e8-11e7-9a1b-42010a84000a):ng
Sep 14 17:18:14 vk-prod4-node-v175-z0jf kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 2
Sep 14 17:18:20 vk-prod4-node-v175-z0jf kubelet[7031]: W0914 17:18:20.431436    7031 prober.go:98] No ref for container "docker://5289e32db5286329f9551b8ee5da340735e976e745155c7a53a083234608c075" (nginx-ingress-3585766022-fbmh8_nginx-ingress(2360ebc1-93e8-11e7-9a1b-42010a84000a):ng

When this happens, the pod Docker containers are gone, but the "pause" container is still there. Running docker inspect on it gives this: https://gist.github.com/Dirbaio/df30fa318327270036d3b02779dd2fa8

The caliXXX interface and the iptables rules for it are gone.

This may be related to the container being killed ungracefully -- I'm not 100% sure, but I think I've only seen this with ungraceful shutdowns.

I've been seeing this sporadically for ~6 months now.
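
For reference, a rough diagnostic sketch (assuming shell access to the affected node; jq is assumed, and the pod name is a placeholder):

# Pods with a deletionTimestamp set, i.e. stuck in Terminating
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.namespace + "/" + .metadata.name'

# On the affected node: the kernel symptom and the leftover pause container
dmesg | grep 'unregister_netdevice: waiting for'
docker ps -a | grep pause | grep <pod-name>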

Expected Behavior

Pods get deleted successfully

Current Behavior

Pods sometimes get stuck forever in Terminating state.

Possible Solution

The stuck pods don't go away until I reboot the affected node.
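
As far as I can tell, force-deleting only hides the problem: it removes the pod object from the apiserver, but the leaked network device and pause container stay on the node until it is rebooted. A minimal sketch (pod and namespace names are placeholders):

# Remove the stuck pod object without waiting for the kubelet to confirm cleanup.
# This does NOT release the leaked eth0 reference on the node.
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force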

Steps to Reproduce (for bugs)

  1. Deploy some pods
  2. Delete them (manually, or through updating a Deployment or something)
  3. Sometimes they get stuck deleting.

Your Environment

  • Calico version: quay.io/calico/node:v2.4.0 (but it also happens with earlier versions.)
  • Orchestrator version (e.g. kubernetes, mesos, rkt): k8s. Affects both 1.5.2 and 1.7.5
  • Operating System and version: Ubuntu 16.04, kernel 4.10.0-32-generic
  • Cloud: GCE
@caseydavenport
Member

Sep 14 17:16:43 vk-prod4-node-v175-z0jf kubelet[7031]: Calico CNI deleting device in netns /proc/16673/ns/net

Looks like we're hitting this teardown code and it appears to be returning successfully. I also don't see any log indicating the CNI plugin returned an error.

It's not obvious to me that this is a Calico issue from the logs above (though that doesn't mean it isn't!).

The "no ref for container" log appears to be coming from the Liveness/Readiness probing code, which seems odd and potentially unrelated, unless perhaps the kubelet is attempting to perform a check on a non-existent container for some reason?

@tobad357

tobad357 commented Oct 9, 2017

Could this be related to moby/moby#5618?

@tobad357

tobad357 commented Oct 9, 2017

I believe the kubelet used to set the bridge to promiscuous mode to avoid the race condition, but that changed with K8S 1.7.
See this fix in Flynn: https://github.com/flynn/flynn/pull/4089/files

@heschlie
Contributor

@Dirbaio @tobad357 This does indeed look to be related to moby/moby#5618. I've done a bit of digging around and it seems this has been an outstanding issue for at least a couple of years, with several attempts to fix it in the kernel; see torvalds/linux@f647821, torvalds/linux@b7c8487, and torvalds/linux@d747a7a. At the moment there doesn't seem to be a valid workaround aside from restarting the node.

I'm not sure the fix in Flynn is actually the correct one: from my digging the issue seems to be present in k8s 1.5 and 1.6 as well, and @Dirbaio also saw this in k8s 1.5.2.
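
To spell out the "restart the node" workaround above, a rough sketch (node name is a placeholder; flags per kubectl of this era):

# Keep new pods off the node, evacuate it, reboot it, then put it back in service
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-local-data --force
# on the node itself:
sudo reboot
# once the node is back up:
kubectl uncordon <node-name>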

@heschlie heschlie self-assigned this Oct 19, 2017
@tobad357

@heschlie My unqualified feeling is that there are a number of race conditions in the kernel related to device teardown; some of the fixes have probably mitigated some causes, but others are still there. For us, K8S 1.6 was rock solid even in a CI/CD environment with a lot of pod deletions, but on 1.7 we can hit it quite regularly.
We have now updated our CentOS 7.4 servers to kernel 4.13 after some promising tests on one node, and we'll see if it helps us.

@heschlie
Contributor

@Dirbaio would you be able to try out a later kernel and see if it reduces (or hopefully stops) these occurrences? As @tobad357 said, it is likely that these patches have reduced the issue over time; 4.13 looks like the latest stable kernel.

@tobad357

@heschlie We have upgraded our cluster to 4.13 and have been running for a week without issues.
I will update here in a couple of weeks with results.

@tobad357

tobad357 commented Nov 7, 2017

Sorry to say we hit the issue again, though not as frequently as in the past.
I think the only option is to wait for newer kernel versions and hope that more of the underlying issues get fixed.
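
Until then, one possible stopgap (a rough sketch, not validated in this thread) is to watch each node's kernel log and cordon the node as soon as the message shows up, so at least new pods avoid it until it can be drained and rebooted:

# Run periodically on each node (cron, or a privileged DaemonSet with a kubeconfig).
# Assumes the Kubernetes node name matches the hostname.
if dmesg | grep -q 'unregister_netdevice: waiting for'; then
    kubectl cordon "$(hostname)"
fi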

@heschlie
Contributor

Closing this issue as it does not seem to be a Calico-related problem. Feel free to re-open if you believe this to be in error.

@r0bj

r0bj commented Jul 25, 2018

@heschlie @caseydavenport It may be related to Calico after all; please take a look at this comment:
kubernetes/kubernetes#45419 (comment)

@ethercflow

@miry

miry commented May 10, 2019

@ethercflow Is the solution to this problem to patch the kernel?

@ethercflow

@miry Sorry for the late reply; we found that the patch is the solution.

@miry

miry commented May 27, 2019

@ethercflow Do you know when this patch will be merged?
