
calico CNI command never terminates, which causes a dockershim server handler to get stuck forever #2098

Closed
r0bj opened this issue Aug 3, 2018 · 2 comments

Comments

@r0bj

r0bj commented Aug 3, 2018

To give some context, this is based on the already closed issue:
#1109
If this is redundant, feel free to close it.

Please take a look at this comment, which is crucial:
kubernetes/kubernetes#45419 (comment)

Summary:
The bug is caused by a calico CNI command that never terminates, which causes a dockershim server handler to get stuck forever. As a result, PodSandboxStatus() RPC calls for the bad pod always time out, which makes the PLEG unhealthy.

Expected Behavior

The Calico CNI command times out and terminates with either success or failure.

Current Behavior

The Calico CNI command never terminates.

This bug could probably also be fixed in CNI (or the kubelet): containernetworking/cni#568

But fixing it directly in Calico would increase the resiliency of the system.
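
For illustration, a caller-side timeout could look roughly like the sketch below. This is not the actual dockershim or libcni code; the function name, the 30-second value, and the CNI environment values are assumptions.

// Minimal sketch (not the actual dockershim or libcni code) of a caller-side
// timeout when invoking a CNI plugin binary: if the plugin does not finish
// within 30 seconds, the child process is killed and an error is returned,
// so the runtime handler is never blocked forever.
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"time"
)

func invokePluginWithTimeout(pluginPath string, stdinData []byte, env []string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// exec.CommandContext kills the plugin process when the context expires.
	cmd := exec.CommandContext(ctx, pluginPath)
	cmd.Env = env
	cmd.Stdin = bytes.NewReader(stdinData)

	var stdout bytes.Buffer
	cmd.Stdout = &stdout

	if err := cmd.Run(); err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			return nil, fmt.Errorf("CNI plugin %s timed out after 30s", pluginPath)
		}
		return nil, err
	}
	return stdout.Bytes(), nil
}

func main() {
	// Hypothetical DEL invocation; the environment variables follow the CNI spec.
	out, err := invokePluginWithTimeout(
		"/opt/cni/bin/calico",
		[]byte(`{"cniVersion":"0.3.1","name":"k8s-pod-network","type":"calico"}`),
		[]string{"CNI_COMMAND=DEL", "CNI_CONTAINERID=1a8202c0a88f", "CNI_IFNAME=eth0", "CNI_PATH=/opt/cni/bin"},
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}

Note that a timeout in the caller only unblocks the runtime; the stuck plugin process itself may keep hanging, so timing out inside the plugin as well (the Calico-side fix suggested above) is complementary.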

@caseydavenport
Member

Yeah, sounds like we should be handling this better.

Do we know which call is hanging in this specific instance?

@r0bj
Author

r0bj commented Aug 21, 2018

I've gathered some corresponding logs. In this case the issue is caused by removing the workload opengraph-7cdff79dbb-rsb4g.

kubelet:

I0821 10:33:59.709000   23712 kubelet.go:1853] SyncLoop (DELETE, "api"): "opengraph-7cdff79dbb-rsb4g_prod(13dd6d43-9fc1-11e8-b4ea-525400215c79)"
[...]
E0821 10:36:03.198396   23712 remote_runtime.go:115] StopPodSandbox "1a8202c0a88fe079be790d10b4840444a92a7ca676130b43903aec0e684fd22f" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0821 10:36:03.198473   23712 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "1a8202c0a88fe079be790d10b4840444a92a7ca676130b43903aec0e684fd22f"}
E0821 10:36:03.198523   23712 kubelet.go:1512] error killing pod: failed to "KillPodSandbox" for "13dd6d43-9fc1-11e8-b4ea-525400215c79" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0821 10:36:03.198573   23712 pod_workers.go:186] Error syncing pod 13dd6d43-9fc1-11e8-b4ea-525400215c79 ("opengraph-7cdff79dbb-rsb4g_prod(13dd6d43-9fc1-11e8-b4ea-525400215c79)"), skipping: error killing pod: failed to "KillPodSandbox" for "13dd6d43-9fc1-11e8-b4ea-525400215c79" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0821 10:36:03.344535   23712 kuberuntime_manager.go:859] PodSandboxStatus of sandbox "1a8202c0a88fe079be790d10b4840444a92a7ca676130b43903aec0e684fd22f" for pod "opengraph-7cdff79dbb-rsb4g_prod(13dd6d43-9fc1-11e8-b4ea-525400215c79)" error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0821 10:36:03.344611   23712 generic.go:241] PLEG: Ignoring events for pod opengraph-7cdff79dbb-rsb4g/prod: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Full kubelet log:
https://gist.github.com/r0bj/f698f26e0b83f674c087d3cd8f402c40

calico-node:

2018-08-21 10:34:03.313 [INFO][142] calc_graph.go 359: Local endpoint deleted id=WorkloadEndpoint(node=k8s-worker-r22, orchestrator=k8s, workload=prod/opengraph-7cdff79dbb-rsb4g, name=eth0)

Full calico-node log:
https://gist.github.com/r0bj/70301cc32ab2444c0f9127b1feaab620

dmesg:

[Tue Aug 21 11:51:05 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:51:15 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:51:25 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:51:35 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:51:45 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:51:55 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:52:05 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:52:16 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:52:26 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1

I hope it'll be useful.

tomdee added a commit to tomdee/calico-cni that referenced this issue Aug 27, 2018
If the plugin (either networking or IPAM) takes more than 30 seconds
then panic.

Fixes projectcalico/calico#2098
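
For illustration, a minimal sketch of that watchdog approach is shown below. It is not the actual calico-cni patch; the helper name and the operation it wraps are assumptions.

// Minimal sketch (not the actual calico-cni patch) of the watchdog approach
// from the commit message above: if the plugin's work has not finished within
// 30 seconds, panic so the whole process exits with a failure instead of
// hanging forever and blocking the dockershim handler.
package main

import (
	"fmt"
	"time"
)

// withTimeoutPanic runs fn and panics if it has not returned within d.
// The panic fires in the timer's goroutine, which still crashes the whole
// process; that is intentional: the plugin terminates rather than hanging.
func withTimeoutPanic(d time.Duration, name string, fn func() error) error {
	timer := time.AfterFunc(d, func() {
		panic(fmt.Sprintf("%s did not complete within %s", name, d))
	})
	defer timer.Stop()
	return fn()
}

func main() {
	// Hypothetical stand-in for the networking or IPAM work (cmdAdd/cmdDel).
	err := withTimeoutPanic(30*time.Second, "cmdDel", func() error {
		time.Sleep(100 * time.Millisecond) // placeholder for the real work
		return nil
	})
	fmt.Println("result:", err)
}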