Pods sometimes get stuck in Terminating state. #1109
Comments
Looks like we're hitting this teardown code and it appears to be returning successfully. I also don't see any log indicating the CNI plugin returned an error. It's not obvious to me that this is a Calico issue from the logs above (though that doesn't mean it isn't!). The "no ref for container" log appears to be coming from the Liveness/Readiness probing code, which seems odd and potentially unrelated, unless perhaps the kubelet is attempting to perform a check on a non-existent container for some reason?
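For reference, here is a minimal sketch of what a CNI DEL invocation looks like per the CNI spec: the runtime executes the plugin binary with CNI_COMMAND=DEL in the environment and the network config on stdin, and "returned successfully" just means the plugin exited 0. The plugin path, container ID, netns path, and config below are hypothetical placeholders, not taken from this issue.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// cniDel invokes a CNI plugin's DEL command as described by the CNI spec.
// The runtime passes parameters via CNI_* environment variables and the
// network configuration JSON on stdin.
func cniDel(pluginPath, containerID, netns, netConf string) error {
	cmd := exec.Command(pluginPath)
	cmd.Env = append(os.Environ(),
		"CNI_COMMAND=DEL",
		"CNI_CONTAINERID="+containerID,
		"CNI_NETNS="+netns, // may be empty if the netns is already gone
		"CNI_IFNAME=eth0",
		"CNI_PATH=/opt/cni/bin",
	)
	cmd.Stdin = bytes.NewBufferString(netConf)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("CNI DEL failed: %v, output: %s", err, out)
	}
	return nil
}

func main() {
	// Hypothetical values for illustration only.
	conf := `{"cniVersion":"0.3.1","name":"k8s-pod-network","type":"calico"}`
	if err := cniDel("/opt/cni/bin/calico", "abc123", "/proc/1234/ns/net", conf); err != nil {
		fmt.Println(err)
	}
}
```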
This could be related to?
And I believe the kubelet used to set the bridge as promiscuous to avoid the race condition, but that changed with K8s 1.7
@Dirbaio @tobad357 This does indeed look to be related to moby/moby#5618. I've done a bit of digging around and it seems this has been an outstanding issue for at least a couple of years, with several attempts to fix it in the kernel, see torvalds/linux@f647821, torvalds/linux@b7c8487, and torvalds/linux@d747a7a. At the moment there doesn't seem to be a valid workaround aside from restarting the node. I'm not sure the fix in Flynn is actually the correct fix (or at least the issue seems to be present in k8s 1.5 and 1.6 from some of my digging), as @Dirbaio also saw this in k8s 1.5.2.
@heschlie the unqualified feeling I get is that there are a number of race conditions in the kernel related to device teardown; some of the fixes have probably mitigated some causes, but others are still there. For us, K8s 1.6 was rock solid even in a CI/CD env with a lot of pod deletions, but on 1.7 we can hit it quite regularly.
@heschlie we have upgraded our cluster to kernel 4.13 and have been running for a week without issues
Sorry to say we hit the issue again, but not as frequently as in the past
Closing this issue as it does not seem like a Calico related problem. Feel free to re-open if you believe this to be in error. |
@heschlie @caseydavenport It may be related to Calico after all; please take a look at this comment:
@ethercflow So the solution to this problem is to patch the kernel?
@miry Sorry to be so late; we found that the patch is the solution.
@ethercflow Do you know when this patch will be merged? |
Very rarely, when deleting a Pod, it gets stuck forever in Terminating state.
When it happens, the kubelet and the kernel get stuck in an endless loop of the following errors:
When this happens, the pod Docker containers are gone, but the "pause" container is still there. Running docker inspect on it gives this: https://gist.github.com/Dirbaio/df30fa318327270036d3b02779dd2fa8
The caliXXX interface and the iptables rules for it are gone.
This may be related to the fact that the container is killed ungracefully; I'm not 100% sure, but I think I've only seen this on ungraceful shutdowns.
I've been seeing this sporadically for ~6 months now.
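As an aside (not part of the original report), a quick way to spot pods wedged like this is to list pods whose DeletionTimestamp is set but which never go away. A minimal sketch using client-go, assuming a recent client-go where List takes a context and a kubeconfig at the default path:

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Assumes a kubeconfig at ~/.kube/config; adjust for your environment.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// A pod with DeletionTimestamp set is "Terminating"; one that stays
		// in this state indefinitely matches the symptom in this issue.
		if p.DeletionTimestamp != nil {
			fmt.Printf("%s/%s terminating since %s\n",
				p.Namespace, p.Name, p.DeletionTimestamp.Time)
		}
	}
}
```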
Expected Behavior
Pods get deleted successfully
Current Behavior
Pods sometimes get stuck forever in Terminating state.
Possible Solution
The stuck pods don't go away until I reboot the affected node.
Steps to Reproduce (for bugs)
Your Environment