
calico CNI command never terminates, which causes a dockershim server handler to get stuck forever #2098

Closed
r0bj opened this issue Aug 3, 2018 · 2 comments

Comments

@r0bj

r0bj commented Aug 3, 2018

To give some context, this is based on the already closed issue:
#1109
If this is redundant, feel free to close it.

Please take a look at this comment, which is crucial:
kubernetes/kubernetes#45419 (comment)

Summary:
The bug is caused by a calico CNI command that never terminates, which causes a dockershim server handler to get stuck forever. As a result, PodSandboxStatus() RPC calls for the bad pod always time out, which makes the PLEG unhealthy.

Expected Behavior

The Calico CNI command times out and terminates with either success or failure.

Current Behavior

The Calico CNI command never terminates.

This bug could probably also be fixed in CNI (or the kubelet): containernetworking/cni#568

But fixing it directly in Calico would increase the resiliency of the system.
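
For illustration, a caller-side timeout could look roughly like the sketch below. This is not the actual dockershim or libcni code; the function name, the 30-second value, and the CNI environment values are assumptions.

// Minimal sketch (not the actual dockershim or libcni code) of a caller-side
// timeout when invoking a CNI plugin binary: if the plugin does not finish
// within 30 seconds, the child process is killed and an error is returned,
// so the runtime handler is never blocked forever.
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"time"
)

func invokePluginWithTimeout(pluginPath string, stdinData []byte, env []string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// exec.CommandContext kills the plugin process when the context expires.
	cmd := exec.CommandContext(ctx, pluginPath)
	cmd.Env = env
	cmd.Stdin = bytes.NewReader(stdinData)

	var stdout bytes.Buffer
	cmd.Stdout = &stdout

	if err := cmd.Run(); err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			return nil, fmt.Errorf("CNI plugin %s timed out after 30s", pluginPath)
		}
		return nil, err
	}
	return stdout.Bytes(), nil
}

func main() {
	// Hypothetical DEL invocation; the environment variables follow the CNI spec.
	out, err := invokePluginWithTimeout(
		"/opt/cni/bin/calico",
		[]byte(`{"cniVersion":"0.3.1","name":"k8s-pod-network","type":"calico"}`),
		[]string{"CNI_COMMAND=DEL", "CNI_CONTAINERID=1a8202c0a88f", "CNI_IFNAME=eth0", "CNI_PATH=/opt/cni/bin"},
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}

Note that a timeout in the caller only unblocks the runtime; the stuck plugin process itself may keep hanging, so timing out inside the plugin as well (the Calico-side fix suggested above) is complementary.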

@caseydavenport
Member

Yeah, sounds like we should be handling this better.

Do we know which call is hanging in this specific instance?

@r0bj
Author

r0bj commented Aug 21, 2018

I've gathered some corresponding logs. In this case the issue is caused by removing the workload opengraph-7cdff79dbb-rsb4g.

kubelet:

I0821 10:33:59.709000   23712 kubelet.go:1853] SyncLoop (DELETE, "api"): "opengraph-7cdff79dbb-rsb4g_prod(13dd6d43-9fc1-11e8-b4ea-525400215c79)"
[...]
E0821 10:36:03.198396   23712 remote_runtime.go:115] StopPodSandbox "1a8202c0a88fe079be790d10b4840444a92a7ca676130b43903aec0e684fd22f" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0821 10:36:03.198473   23712 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "1a8202c0a88fe079be790d10b4840444a92a7ca676130b43903aec0e684fd22f"}
E0821 10:36:03.198523   23712 kubelet.go:1512] error killing pod: failed to "KillPodSandbox" for "13dd6d43-9fc1-11e8-b4ea-525400215c79" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0821 10:36:03.198573   23712 pod_workers.go:186] Error syncing pod 13dd6d43-9fc1-11e8-b4ea-525400215c79 ("opengraph-7cdff79dbb-rsb4g_prod(13dd6d43-9fc1-11e8-b4ea-525400215c79)"), skipping: error killing pod: failed to "KillPodSandbox" for "13dd6d43-9fc1-11e8-b4ea-525400215c79" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0821 10:36:03.344535   23712 kuberuntime_manager.go:859] PodSandboxStatus of sandbox "1a8202c0a88fe079be790d10b4840444a92a7ca676130b43903aec0e684fd22f" for pod "opengraph-7cdff79dbb-rsb4g_prod(13dd6d43-9fc1-11e8-b4ea-525400215c79)" error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0821 10:36:03.344611   23712 generic.go:241] PLEG: Ignoring events for pod opengraph-7cdff79dbb-rsb4g/prod: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Full kubelet log:
https://gist.github.com/r0bj/f698f26e0b83f674c087d3cd8f402c40

calico-node:

2018-08-21 10:34:03.313 [INFO][142] calc_graph.go 359: Local endpoint deleted id=WorkloadEndpoint(node=k8s-worker-r22, orchestrator=k8s, workload=prod/opengraph-7cdff79dbb-rsb4g, name=eth0)

Full calico-node log:
https://gist.github.com/r0bj/70301cc32ab2444c0f9127b1feaab620

dmesg:

[Tue Aug 21 11:51:05 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:51:15 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:51:25 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:51:35 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:51:45 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:51:55 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:52:05 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:52:16 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
[Tue Aug 21 11:52:26 2018] unregister_netdevice: waiting for eth0 to become free. Usage count = 1

I hope it'll be useful.

tomdee added a commit to tomdee/calico-cni that referenced this issue Aug 27, 2018
If the plugin (either networking or IPAM) takes more than 30 seconds
then panic.

Fixes projectcalico/calico#2098
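
For illustration, a minimal sketch of that watchdog approach is shown below. It is not the actual calico-cni patch; the helper name and the operation it wraps are assumptions.

// Minimal sketch (not the actual calico-cni patch) of the watchdog approach
// from the commit message above: if the plugin's work has not finished within
// 30 seconds, panic so the whole process exits with a failure instead of
// hanging forever and blocking the dockershim handler.
package main

import (
	"fmt"
	"time"
)

// withTimeoutPanic runs fn and panics if it has not returned within d.
// The panic fires in the timer's goroutine, which still crashes the whole
// process; that is intentional: the plugin terminates rather than hanging.
func withTimeoutPanic(d time.Duration, name string, fn func() error) error {
	timer := time.AfterFunc(d, func() {
		panic(fmt.Sprintf("%s did not complete within %s", name, d))
	})
	defer timer.Stop()
	return fn()
}

func main() {
	// Hypothetical stand-in for the networking or IPAM work (cmdAdd/cmdDel).
	err := withTimeoutPanic(30*time.Second, "cmdDel", func() error {
		time.Sleep(100 * time.Millisecond) // placeholder for the real work
		return nil
	})
	fmt.Println("result:", err)
}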