This repository has been archived by the owner on May 12, 2021. It is now read-only.

containerd+kata shim-v2: shim not killed, artifacts left on host #1480

Closed
egernst opened this issue Apr 3, 2019 · 26 comments

egernst (Member) commented Apr 3, 2019

Description of problem

With Kubernetes 1.14, containerd 1.2.5, and Kata artifacts from the 1.6.0 release, kubectl delete -f doesn't result in the containerd-shim-kata-v2 process being stopped, and artifacts are left on the host in /var/run/vc/sbs, /var/run/vc/vm, etc.

Example, starting with a clean containerd+k8s setup:

Install Kata 1.6 via kata-deploy:

kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/master/kata-deploy/kata-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/master/kata-deploy/kata-deploy.yaml

Set up the appropriate RuntimeClass:

kubectl apply -f https://raw.githubusercontent.com/clearlinux/cloud-native-setup/master/clr-k8s-examples/8-kata/kata-qemu-runtimeClass.yaml

In separate windows, watch for QEMU and the v2 shim, and monitor the filesystem where Kata stores state:

watch -d 'ls /var/run/vc/sbs/ /var/lib/vc/sbs /var/run/vc/vm/ '
watch -d 'ps -aef | grep -E "qem|containerd-shim-kata-v2" | grep -v grep'

Start a pod, then delete it once it has started:

kubectl apply -f https://gist.githubusercontent.com/egernst/9cbbb3d6ba566afca98be2a4e29328d6/raw/0f6e332ba1b7306ed895e5ff91b2939ca3f0bb42/two-pod.yaml
kubectl delete -f https://gist.githubusercontent.com/egernst/9cbbb3d6ba566afca98be2a4e29328d6/raw/0f6e332ba1b7306ed895e5ff91b2939ca3f0bb42/two-pod.yaml

You'll see QEMU come and go as expected, but the v2 shim keeps running and the storage stays on the filesystem.
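For a quick one-shot check after the delete (instead of the watch windows above), something like this is enough; it uses only standard tools and the state directories already mentioned:

# any surviving shim/QEMU processes for the deleted pod
ps -ef | grep -E 'qemu|containerd-shim-kata-v2' | grep -v grep

# any sandbox state left behind by Kata
ls /var/run/vc/sbs/ /var/run/vc/vm/ /var/lib/vc/sbs/ 2>/dev/null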

egernst (Member Author) commented Apr 3, 2019

Some logs from around the time of deletion:

containerd[5344]:  level=debug msg="sending request" ID=53ffff95b545f254cc36058df22f2442f584a286393b1115fc408023cf5609fd name=grpc.StatsContainerRequest req="container_id:\"6261346f506314999a6d4e94b84dd09d050d7763502827eca9ae334103bdd67a\" " source=virtcontainers subsystem=kata_agent
containerd[5344]:  level=info msg="ExecSync for "7b19ae263fc1870052876ec315542d907f8fbf73a3e7d42bf256fff6c7ac0df0" with command [/bin/sh -ec ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key get foo] and timeout 15 (s)"
containerd[5344]:  level=info msg="Exec process "141f5299e65d21a02c1e80bf1625c4cc78f4c32c29284cbd78ca705baec9632f" exits with exit code 0 and error <nil>"
containerd[5344]:  level=info msg="Finish piping "stderr" of container exec "141f5299e65d21a02c1e80bf1625c4cc78f4c32c29284cbd78ca705baec9632f""
containerd[5344]:  level=info msg="Finish piping "stdout" of container exec "141f5299e65d21a02c1e80bf1625c4cc78f4c32c29284cbd78ca705baec9632f""
containerd[5344]:  level=info msg="ExecSync for "7b19ae263fc1870052876ec315542d907f8fbf73a3e7d42bf256fff6c7ac0df0" returns with exit code 0"
containerd[5344]:  level=error msg="StopPodSandbox for "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" failed" error="failed to stop sandbox container "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" in '\x01' state: failed to kill sandbox container: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing rpc error: code = DeadlineExceeded desc = timed out connecting to unix socket ////run/vc/vm/eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f/kata.sock": unavailable"
kubelet[32262]: E0403 22:14:26.962281   32262 remote_runtime.go:132] StopPodSandbox "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" from runtime service failed: rpc error: code = Unknown desc = failed to stop sandbox container "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" in '\x01' state: failed to kill sandbox container: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing rpc error: code = DeadlineExceeded desc = timed out connecting to unix socket ////run/vc/vm/eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f/kata.sock": unavailable
kubelet[32262]: E0403 22:14:26.962348   32262 kuberuntime_manager.go:846] Failed to stop sandbox {"containerd" "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f"}
kubelet[32262]: E0403 22:14:26.962397   32262 kubelet_pods.go:1093] Failed killing the pod "two-containers-kata-guaranteed": failed to "KillPodSandbox" for "f817d5cf-565a-11e9-bdb5-000d3a6d0876" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop sandbox container \"eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f\" in '\\x01' state: failed to kill sandbox container: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing rpc error: code = DeadlineExceeded desc = timed out connecting to unix socket ////run/vc/vm/eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f/kata.sock\": unavailable"
containerd[5344]:  level=info msg="StopPodSandbox for "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f""
containerd[5344]: 2019-04-03 22:14:28.728 [INFO][27671] calico.go 442: Extracted identifiers ContainerID="eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" Node="kata-k8s-containerd" Orchestrator="k8s" WorkloadEndpoint="kata--k8s--containerd-k8s-two--containers--kata--guaranteed-eth0"
containerd[5344]: 2019-04-03 22:14:28.751 [WARNING][27671] workloadendpoint.go 70: Operation Delete is not supported on WorkloadEndpoint type
containerd[5344]: 2019-04-03 22:14:28.751 [INFO][27671] k8s.go 474: Endpoint deletion will be handled by Kubernetes deletion of the Pod. ContainerID="eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"kata--k8s--containerd-k8s-two--containers--kata--guaranteed-eth0", GenerateName:"", Namespace:"default", SelfLink:"", UID:"cc1fcdf1-565d-11e9-bdb5-000d3a6d0876", ResourceVersion:"2657", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63689926446, loc:(*time.Location)(0x1edd720)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"projectcalico.org/serviceaccount":"default", "projectcalico.org/namespace":"default", "projectcalico.org/orchestrator":"k8s"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"kata-k8s-containerd", ContainerID:"", Pod:"two-containers-kata-guaranteed", Endpoint:"eth0", IPNetworks:[]string{"10.244.0.11/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.default", "ksa.default.default"}, InterfaceName:"cali0232257877e", MAC:"", Ports:[]v3.EndpointPort(nil)}}
containerd[5344]: 2019-04-03 22:14:28.751 [INFO][27671] k8s.go 481: Releasing IP address(es) ContainerID="eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f"

@egernst egernst added the bug Incorrect behaviour label Apr 3, 2019
sboeuf commented Apr 3, 2019

@egernst could you try with Kata 1.5 or 1.4 to check if you can reproduce? Let's try to identify whether this is coming from something that changed in Kata.
If that's not the case, then I suspect we need to look into containerd. In the same way, you could try this with different versions of containerd to find the culprit.

egernst (Member Author) commented Apr 3, 2019

Already tried with containerd 1.2.4; can reproduce.

The behavior is slightly different, but still terrible, in 1.5.y and 1.4.3:

After deleting, the QEMU and v2 shim processes remain, as do the items on the filesystem.

However... after deleting (see the sketch after this list):

  • If I re-apply the same workload again, the 'dead' one is cleaned up while the new one is starting.
  • If I re-apply a different workload, the original is still left on the system, a second QEMU/v2-shim pair is started, and both the deleted and the new sandbox remain on the filesystem.
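A rough sketch of the two scenarios above, assuming the same two-pod.yaml gist from the reproduction steps and a hypothetical copy of it, two-pod-b.yaml, with different pod names:

# Scenario 1: re-apply the same workload; the stale sandbox gets cleaned up on start
kubectl apply -f two-pod.yaml
kubectl delete -f two-pod.yaml
kubectl apply -f two-pod.yaml      # leftover shim/state from the first run disappears here

# Scenario 2: apply a different workload; the original sandbox is left behind
kubectl apply -f two-pod.yaml
kubectl delete -f two-pod.yaml
kubectl apply -f two-pod-b.yaml    # hypothetical manifest with different pod names
ls /var/run/vc/sbs/                # both the deleted and the new sandbox are present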

egernst (Member Author) commented Apr 5, 2019

@bergwolf @teawater - curious if either of you can reproduce this?

egernst (Member Author) commented Apr 6, 2019

This only occurs with macvlan; it is not an issue with tc-mirror.

/cc @mcastelino
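For anyone who wants to compare the two modes, the interworking model is selected via internetworking_model in Kata's configuration.toml. A minimal sketch, assuming the default install path (kata-deploy installs its configuration under /opt/kata instead, so adjust the path accordingly):

# show the current setting (macvtap is the 1.6 default)
grep '^internetworking_model' /usr/share/defaults/kata-containers/configuration.toml

# switch to tc-mirroring; affects newly created pods only
sudo sed -i 's/^internetworking_model *= *"macvtap"/internetworking_model="tcfilter"/' \
  /usr/share/defaults/kata-containers/configuration.toml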

sboeuf commented Apr 6, 2019

@egernst this sounds like something is going wrong with the network deletion, then. Make sure we check the traces related to this; it will help narrow the issue down faster.

/cc @lifupan

@lifupan lifupan self-assigned this Apr 8, 2019
wenlxie commented Apr 8, 2019

I am using Kata 1.6 and have hit the same issue.
Many kata-v2 shim processes are left behind after running some test cases.
The network mode was macvtap; I then changed to bridged, and the issue still exists.

@egernst @sboeuf

lifupan (Member) commented Apr 8, 2019

Hi @wenlxie, which network plugin are you using with containerd? Standard CNI, or something else?

lifupan (Member) commented Apr 8, 2019

@egernst this sounds like something is going wrong with the network deletion, then. Make sure we check the traces related to this; it will help narrow the issue down faster.

/cc @lifupan

I agree with @sboeuf. It seems this issue is related to the calico network plugin; I extracted the key log lines below:

containerd[5344]: 2019-04-03 22:25:13.882 [INFO][23861] k8s.go 485: Cleaning up netns ContainerID="ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b"
containerd[5344]: Calico CNI deleting device in netns /var/run/netns/cni-6637d21d-3031-921d-6467-e0e66692a8f2
containerd[5344]:  level=error msg="StopPodSandbox for "ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b" failed" error="failed to destroy network for sandbox "ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b": failed to get IP addresses for "eth0": <nil>"
kubelet[32262]: E0403 22:25:13.918205   32262 remote_runtime.go:132] StopPodSandbox "ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for sandbox "ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b": failed to get IP addresses for "eth0": <nil>
kubelet[32262]: E0403 22:25:13.918283   32262 kuberuntime_manager.go:846] Failed to stop sandbox {"containerd" "ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b"}

From the log we can see that calico wants to destroy the network namespace by first getting the IP of the "eth0" interface. But in Kata's macvtap and bridged modes the "eth0" IP has been passed into qemu, so calico fails to get the IP, which causes containerd to fail to stop the sandbox. As a result, containerd never sends the Shutdown() command that is responsible for shutting down the shimv2 process.
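One way to confirm this on an affected node is to look at eth0 inside the pod's CNI network namespace; per the analysis above, in macvtap/bridged mode the address has been handed to the guest, so there is no IP left for calico to read (the netns name below is just the one from the log):

# list CNI-managed network namespaces on the host
ip netns list

# inspect eth0 inside the pod's netns; expect no IPv4 address in macvtap/bridged mode
ip netns exec cni-6637d21d-3031-921d-6467-e0e66692a8f2 ip addr show eth0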

wenlxie commented Apr 8, 2019

@lifupan It is our own CNI network plugin.
We have the same logic to tear down the pod's network, which gets the IP address from eth0.
We try to get it from the pod first; if the pod does not exist, we then try to get it from eth0 in the netns.
If the pod has already been deleted, the pod's network will not be torn down because the plugin cannot get the IP address and cannot remove the route rule.

egernst (Member Author) commented Apr 8, 2019

@wenlxie -- tc-mirror is the ideal configuration. Can you test with this?

mcastelino (Contributor) commented

@amshinde the issue seems to be that we may not be recovering all the interface information on release. With tc mirror we leave the host side interface and network namespace completely untouched. Hence there is no need to recover any namespace or interface programming.

We should make tc mode the default going forward as it reduces the scope of issues such as this.
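For context, tc-mirroring boils down to a pair of ingress mirred-redirect filters between the veth that CNI configured (eth0 inside the pod's netns) and the tap device backing the VM, roughly like the sketch below (run inside the pod's network namespace; device names are illustrative, not what the runtime actually creates):

# mirror traffic arriving on the CNI veth to the VM's tap device, and vice versa
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all u32 match u8 0 0 \
  action mirred egress redirect dev tap0_kata
tc qdisc add dev tap0_kata handle ffff: ingress
tc filter add dev tap0_kata parent ffff: protocol all u32 match u8 0 0 \
  action mirred egress redirect dev eth0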

amshinde (Member) commented Apr 8, 2019

@mcastelino @egernst @lifupan I recall from looking at the cri code a while back that cri-containerd caches the IP address of the eth0 interface.

Here is the issue for it:
containerd/cri#524

And this is the fix they had in place for the above issue:
https://github.com/containerd/cri/blob/b23b406fedd2009f813c36206865ae2a8c51d5bb/pkg/server/sandbox_run.go#L128
They had a workaround to cache the IP address, which they say also led to a subtle performance improvement.

I guess the above workaround is no longer in place for the shimv2 path (I haven't traced the shimv2 flow).
@lifupan If you are already taking a look at this, can you confirm?

I wonder if it is worth fixing this, to benefit from the slightly better performance with macvtap.
Yes, but I agree we should move to tc as the default model. I'll raise a PR to do so.

sboeuf commented Apr 10, 2019

@amshinde @mcastelino @egernst

Tc mirroring should be the only configuration, hence the default one.
That being said, I think this is one of the items that needs to go on the list of things to make happen in the 2.0 release.
I don't think we care enough about the very small performance difference between macvtap and tc-mirroring to keep both in the long term.
This is part of reducing Kata's complexity, but I don't see it happening before 2.0 if we care about backward compatibility (and I think we do).

egernst (Member Author) commented Apr 11, 2019

Is there a difference from user perspective, aside from improved stability?

mcastelino (Contributor) commented

Is there a difference from user perspective, aside from improved stability?

@egernst the switch should be invisible from a user POV, so this is not a breaking change. IMO we should push this early rather than waiting for 2.0.

wenlxie commented Apr 11, 2019

@egernst It seems there are no docs covering the details of these network modes.

sboeuf commented Apr 11, 2019

@WeiZhang555 we need your input here please.

bergwolf (Member) commented

@egernst Could you update your calico cni plugin and see if it still reproduces? I looked a bit at it and I think the issue is fixed by projectcalico/cni-plugin@6fef469#diff-cd575165f84eed10d10eb1460b0927b4L120 -- DelLinkByNameAddr does check the endpoint IP address, whereas DelLinkByName doesn't. Calico releases have contained it since v3.4.0 (latest is v3.6.1).

That said, I do agree with @sboeuf @mcastelino @amshinde that we should move to tc as the default, as this is a very subtle change that even calico didn't mention in the commit message.
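A quick way to check which calico release a cluster is actually running, assuming the standard calico-node DaemonSet in kube-system (adjust the names for other install methods):

# the image tag reflects the calico release in use (should be >= v3.4.0 for the fix above)
kubectl -n kube-system get daemonset calico-node \
  -o jsonpath='{.spec.template.spec.containers[0].image}'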

egernst (Member Author) commented Apr 12, 2019

I didn't get to test this yet, my apologies @bergwolf

awprice (Contributor) commented Apr 30, 2019

I believe we are running into this issue with our setup.

It tends to occur when we schedule a large number of pods on a node, 60+ pods at a time.

We have found that the processes within the pods complete, but the VM for the pod isn't cleaned up. The qemu-system processes keep running, as do the containerd-shim-kata-v2 processes. We can also see that the sandbox containers are still present in containerd.

The containerd-shim-kata-v2 processes also consume a lot of CPU when this occurs: each one pegs a CPU core at 100%, and with 40+ containerd-shim-kata-v2 processes, 40+ cores are at 100% usage.

This results in the pod CIDR IP addresses being exhausted, so new pods are unable to start on the node.

We are seeing the following log lines when this occurs:

Apr 29 03:39:38 ip-10-155-106-161.ec2.internal containerd[3100]: time="2019-04-29T03:39:38.879351593Z" level=error msg="StopPodSandbox for \"22abc9d98dfa2585d2484d931c6e931dc2ab872081f76926b72908e19cb9771a\" failed" error="failed to stop sandbox container \"22abc9d98dfa2585d2484d931c6e931dc2ab872081f76926b72908e19cb9771a\": failed to kill sandbox container: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing rpc error: code = DeadlineExceeded desc = timed out connecting to unix socket ////run/vc/vm/22abc9d98dfa2585d2484d931c6e931dc2ab872081f76926b72908e19cb9771a/kata.sock\": unavailable"

We are using the following versions:

  • Containerd 1.2.6
  • Kata 1.7.0-alpha1
  • Kubernetes 1.13.4
  • Calico 3.5.2

Let me know if you need any more information.

egernst (Member Author) commented Apr 30, 2019

@awprice This sounds like the exact issue. Can you update your configuration to utilize tcfilter instead of macvtap (update your configuration.toml for Kata)?

awprice (Contributor) commented May 1, 2019

@egernst Unfortunately we are already using tcfilter by default with 1.7.0-alpha1, and it hasn't improved the issue.

I've dug up some more logs from containerd when this occurs:

failed to stop sandbox container "fc4781d8206257248df55c4d90fad93e7f50984806b94289459cc2a0130deb64" in '\x01' state: failed to kill sandbox container: Container not ready, running or paused, impossible to signal the container: unknown"

The problem is also more prevalent when scheduling a larger number of pods at once: with 10 or so on a single node we don't see it, but scheduling 100 or more at once brings the problem out straight away.

Could it be a load/scale issue with the Kata components being overwhelmed?
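For what it's worth, a rough way to generate that kind of burst (the runtime class name is assumed to be kata-qemu, matching the RuntimeClass applied earlier in this issue; pod names and image are illustrative):

# create ~100 short-lived Kata pods at once
for i in $(seq 1 100); do
  kubectl run "kata-burst-$i" --image=busybox --restart=Never \
    --overrides='{"apiVersion":"v1","spec":{"runtimeClassName":"kata-qemu"}}' \
    -- sleep 30
done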

I've dug up the CPU graphs for a Kata node that is affected by this; it looks like the Kata shims peg the CPU at 100% once a certain number of pods have run on the node.

[Screenshot: CPU usage graph for the affected node, 2019-05-01]

I've also run strace on one of the Kata shims that is pinning a CPU at 100%:

# strace -p 6115
Process 6115 attached
futex(0x199c728, FUTEX_WAIT_PRIVATE, 0, NULL
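The futex wait above is just one blocked thread; since the shim is a Go binary, a goroutine dump shows where it is actually spinning. Note that SIGQUIT makes the Go runtime print all goroutine stacks to stderr and then exit, so only do this on a shim you are about to clean up anyway, and the output lands wherever the shim's stderr is captured (often the containerd unit's journal):

# dump goroutine stacks from a spinning shim (the process exits afterwards)
kill -QUIT 6115
journalctl -u containerd --since '5 minutes ago' | grep -A20 'goroutine' | head -100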
