This repository has been archived by the owner on May 12, 2021. It is now read-only.

containerd+kata shim-v2: shim not killed, artifacts left on host #1480

Closed
egernst opened this issue Apr 3, 2019 · 26 comments

egernst (Member) commented Apr 3, 2019

Description of problem

With Kubernetes 1.14, containerd 1.2.5, and Kata artifacts from the 1.6.0 release, kubectl delete -f doesn't result in the containerd-shim-kata-v2 process being stopped, and artifacts are left on the host in /var/run/vc/sbs, /var/run/vc/vm, etc.

Example, starting with a clean containerd+k8s setup:

Install Kata 1.6 via kata-deploy:

kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/master/kata-deploy/kata-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/master/kata-deploy/kata-deploy.yaml

Set up the appropriate RuntimeClass:

kubectl apply -f https://raw.githubusercontent.com/clearlinux/cloud-native-setup/master/clr-k8s-examples/8-kata/kata-qemu-runtimeClass.yaml

In separate windows, watch for QEMU and the v2 shim, and monitor the filesystem where Kata stores state:

watch -d 'ls /var/run/vc/sbs/ /var/lib/vc/sbs /var/run/vc/vm/ '
watch -d 'ps -aef | grep -E "qem|containerd-shim-kata-v2" | grep -v grep'

Start a pod, then delete it once it has started:

kubectl apply -f https://gist.githubusercontent.com/egernst/9cbbb3d6ba566afca98be2a4e29328d6/raw/0f6e332ba1b7306ed895e5ff91b2939ca3f0bb42/two-pod.yaml
kubectl delete -f https://gist.githubusercontent.com/egernst/9cbbb3d6ba566afca98be2a4e29328d6/raw/0f6e332ba1b7306ed895e5ff91b2939ca3f0bb42/two-pod.yaml

You'll see QEMU come and go as expected, but the v2 shim keeps running and the storage stays on the filesystem.
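For a quick one-shot check after the delete (instead of the watch windows above), something like this is enough; it uses only standard tools and the state directories already mentioned:

# any surviving shim/QEMU processes for the deleted pod
ps -ef | grep -E 'qemu|containerd-shim-kata-v2' | grep -v grep

# any sandbox state left behind by Kata
ls /var/run/vc/sbs/ /var/run/vc/vm/ /var/lib/vc/sbs/ 2>/dev/null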

egernst (Member Author) commented Apr 3, 2019

Some logs from around the time of deletion:

containerd[5344]:  level=debug msg="sending request" ID=53ffff95b545f254cc36058df22f2442f584a286393b1115fc408023cf5609fd name=grpc.StatsContainerRequest req="container_id:\"6261346f506314999a6d4e94b84dd09d050d7763502827eca9ae334103bdd67a\" " source=virtcontainers subsystem=kata_agent
containerd[5344]:  level=info msg="ExecSync for "7b19ae263fc1870052876ec315542d907f8fbf73a3e7d42bf256fff6c7ac0df0" with command [/bin/sh -ec ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key get foo] and timeout 15 (s)"
containerd[5344]:  level=info msg="Exec process "141f5299e65d21a02c1e80bf1625c4cc78f4c32c29284cbd78ca705baec9632f" exits with exit code 0 and error <nil>"
containerd[5344]:  level=info msg="Finish piping "stderr" of container exec "141f5299e65d21a02c1e80bf1625c4cc78f4c32c29284cbd78ca705baec9632f""
containerd[5344]:  level=info msg="Finish piping "stdout" of container exec "141f5299e65d21a02c1e80bf1625c4cc78f4c32c29284cbd78ca705baec9632f""
containerd[5344]:  level=info msg="ExecSync for "7b19ae263fc1870052876ec315542d907f8fbf73a3e7d42bf256fff6c7ac0df0" returns with exit code 0"
containerd[5344]:  level=error msg="StopPodSandbox for "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" failed" error="failed to stop sandbox container "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" in '\x01' state: failed to kill sandbox container: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing rpc error: code = DeadlineExceeded desc = timed out connecting to unix socket ////run/vc/vm/eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f/kata.sock": unavailable"
kubelet[32262]: E0403 22:14:26.962281   32262 remote_runtime.go:132] StopPodSandbox "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" from runtime service failed: rpc error: code = Unknown desc = failed to stop sandbox container "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" in '\x01' state: failed to kill sandbox container: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing rpc error: code = DeadlineExceeded desc = timed out connecting to unix socket ////run/vc/vm/eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f/kata.sock": unavailable
kubelet[32262]: E0403 22:14:26.962348   32262 kuberuntime_manager.go:846] Failed to stop sandbox {"containerd" "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f"}
kubelet[32262]: E0403 22:14:26.962397   32262 kubelet_pods.go:1093] Failed killing the pod "two-containers-kata-guaranteed": failed to "KillPodSandbox" for "f817d5cf-565a-11e9-bdb5-000d3a6d0876" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop sandbox container \"eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f\" in '\\x01' state: failed to kill sandbox container: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing rpc error: code = DeadlineExceeded desc = timed out connecting to unix socket ////run/vc/vm/eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f/kata.sock\": unavailable"
containerd[5344]:  level=info msg="StopPodSandbox for "eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f""
containerd[5344]: 2019-04-03 22:14:28.728 [INFO][27671] calico.go 442: Extracted identifiers ContainerID="eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" Node="kata-k8s-containerd" Orchestrator="k8s" WorkloadEndpoint="kata--k8s--containerd-k8s-two--containers--kata--guaranteed-eth0"
containerd[5344]: 2019-04-03 22:14:28.751 [WARNING][27671] workloadendpoint.go 70: Operation Delete is not supported on WorkloadEndpoint type
containerd[5344]: 2019-04-03 22:14:28.751 [INFO][27671] k8s.go 474: Endpoint deletion will be handled by Kubernetes deletion of the Pod. ContainerID="eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"kata--k8s--containerd-k8s-two--containers--kata--guaranteed-eth0", GenerateName:"", Namespace:"default", SelfLink:"", UID:"cc1fcdf1-565d-11e9-bdb5-000d3a6d0876", ResourceVersion:"2657", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63689926446, loc:(*time.Location)(0x1edd720)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"projectcalico.org/serviceaccount":"default", "projectcalico.org/namespace":"default", "projectcalico.org/orchestrator":"k8s"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"kata-k8s-containerd", ContainerID:"", Pod:"two-containers-kata-guaranteed", Endpoint:"eth0", IPNetworks:[]string{"10.244.0.11/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.default", "ksa.default.default"}, InterfaceName:"cali0232257877e", MAC:"", Ports:[]v3.EndpointPort(nil)}}
containerd[5344]: 2019-04-03 22:14:28.751 [INFO][27671] k8s.go 481: Releasing IP address(es) ContainerID="eaad9157b27429205bfbd5eab91c8a4953899e5790e6e821ad931328859e3a1f"

@egernst egernst added the bug Incorrect behaviour label Apr 3, 2019
sboeuf commented Apr 3, 2019

@egernst could you try with Kata 1.5 or 1.4 to check if you can reproduce? Let's try to identify whether this is coming from something that changed in Kata.
If that's not the case, then I suspect we need to look into containerd. In the same way, you could try this with different versions of containerd to find the culprit.

egernst (Member Author) commented Apr 3, 2019

Already tried with containerd 1.2.4; can reproduce.

The behavior is slightly different, but still terrible, in 1.5.y and 1.4.3:

After deleting, the QEMU and v2 shim processes remain, as do the items on the filesystem.

However... after deleting (see the sketch after this list):

  • If I re-apply the same workload again, the 'dead' one is cleaned up while the new one is starting.
  • If I re-apply a different workload, the original is still left on the system, a second QEMU/v2-shim pair is started, and both the deleted and the new sandbox remain on the filesystem.
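A rough sketch of the two scenarios above, assuming the same two-pod.yaml gist from the reproduction steps and a hypothetical copy of it, two-pod-b.yaml, with different pod names:

# Scenario 1: re-apply the same workload; the stale sandbox gets cleaned up on start
kubectl apply -f two-pod.yaml
kubectl delete -f two-pod.yaml
kubectl apply -f two-pod.yaml      # leftover shim/state from the first run disappears here

# Scenario 2: apply a different workload; the original sandbox is left behind
kubectl apply -f two-pod.yaml
kubectl delete -f two-pod.yaml
kubectl apply -f two-pod-b.yaml    # hypothetical manifest with different pod names
ls /var/run/vc/sbs/                # both the deleted and the new sandbox are present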

egernst (Member Author) commented Apr 5, 2019

@bergwolf @teawater - curious if either of you can reproduce this?

egernst (Member Author) commented Apr 6, 2019

This only occurs with macvlan; it is not an issue with tc-mirror.

/cc @mcastelino
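For anyone who wants to compare the two modes, the interworking model is selected via internetworking_model in Kata's configuration.toml. A minimal sketch, assuming the default install path (kata-deploy installs its configuration under /opt/kata instead, so adjust the path accordingly):

# show the current setting (macvtap is the 1.6 default)
grep '^internetworking_model' /usr/share/defaults/kata-containers/configuration.toml

# switch to tc-mirroring; affects newly created pods only
sudo sed -i 's/^internetworking_model *= *"macvtap"/internetworking_model="tcfilter"/' \
  /usr/share/defaults/kata-containers/configuration.toml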

sboeuf commented Apr 6, 2019

@egernst this sounds like something is going wrong with the network deletion, then. Make sure we check the traces related to this; it will help narrow the issue down faster.

/cc @lifupan

@lifupan lifupan self-assigned this Apr 8, 2019
wenlxie commented Apr 8, 2019

I am using Kata 1.6 and have hit the same issue.
Many kata-v2 shim processes are left behind after running some test cases.
The network mode was macvtap; I then changed to bridged, and the issue still exists.

@egernst @sboeuf

lifupan (Member) commented Apr 8, 2019

Hi @wenlxie, which network plugin are you using with containerd? Standard CNI, or something else?

lifupan (Member) commented Apr 8, 2019

@egernst this sounds like something is going wrong with the network deletion, then. Make sure we check the traces related to this; it will help narrow the issue down faster.

/cc @lifupan

I agree with @sboeuf. It seems this issue is related to the calico network plugin; I extracted the key log lines below:

containerd[5344]: 2019-04-03 22:25:13.882 [INFO][23861] k8s.go 485: Cleaning up netns ContainerID="ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b"
containerd[5344]: Calico CNI deleting device in netns /var/run/netns/cni-6637d21d-3031-921d-6467-e0e66692a8f2
containerd[5344]:  level=error msg="StopPodSandbox for "ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b" failed" error="failed to destroy network for sandbox "ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b": failed to get IP addresses for "eth0": <nil>"
kubelet[32262]: E0403 22:25:13.918205   32262 remote_runtime.go:132] StopPodSandbox "ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for sandbox "ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b": failed to get IP addresses for "eth0": <nil>
kubelet[32262]: E0403 22:25:13.918283   32262 kuberuntime_manager.go:846] Failed to stop sandbox {"containerd" "ac5a7b5ac24812c11253594e1bed082a668d2f3fed829b6d56fe33701e3b1c4b"}

From the log we can see that calico wants to destroy the network namespace by first getting the IP of the "eth0" interface. But in Kata's macvtap and bridged modes the "eth0" IP has been passed into qemu, so calico fails to get the IP, which causes containerd to fail to stop the sandbox. As a result, containerd never sends the Shutdown() command that is responsible for shutting down the shimv2 process.
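One way to confirm this on an affected node is to look at eth0 inside the pod's CNI network namespace; per the analysis above, in macvtap/bridged mode the address has been handed to the guest, so there is no IP left for calico to read (the netns name below is just the one from the log):

# list CNI-managed network namespaces on the host
ip netns list

# inspect eth0 inside the pod's netns; expect no IPv4 address in macvtap/bridged mode
ip netns exec cni-6637d21d-3031-921d-6467-e0e66692a8f2 ip addr show eth0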

wenlxie commented Apr 8, 2019

@lifupan It is our own CNI network plugin.
We have the same logic to tear down the pod's network, which gets the IP address from eth0.
We try to get it from the pod first; if the pod does not exist, we then try to get it from eth0 in the netns.
If the pod has already been deleted, the pod's network will not be torn down because the plugin cannot get the IP address and cannot remove the route rule.

egernst (Member Author) commented Apr 8, 2019

@wenlxie -- tc-mirror is the ideal configuration. Can you test with this?

mcastelino (Contributor) commented

@amshinde the issue seems to be that we may not be recovering all the interface information on release. With tc mirror we leave the host side interface and network namespace completely untouched. Hence there is no need to recover any namespace or interface programming.

We should make tc mode the default going forward as it reduces the scope of issues such as this.
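For context, tc-mirroring boils down to a pair of ingress mirred-redirect filters between the veth that CNI configured (eth0 inside the pod's netns) and the tap device backing the VM, roughly like the sketch below (run inside the pod's network namespace; device names are illustrative, not what the runtime actually creates):

# mirror traffic arriving on the CNI veth to the VM's tap device, and vice versa
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all u32 match u8 0 0 \
  action mirred egress redirect dev tap0_kata
tc qdisc add dev tap0_kata handle ffff: ingress
tc filter add dev tap0_kata parent ffff: protocol all u32 match u8 0 0 \
  action mirred egress redirect dev eth0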

amshinde (Member) commented Apr 8, 2019

@mcastelino @egernst @lifupan I recall from looking at the cri code a while back that cri-containerd caches the IP address of the eth0 interface.

Here is the issue for it:
containerd/cri#524

And this is the fix they had in place for the above issue:
https://github.com/containerd/cri/blob/b23b406fedd2009f813c36206865ae2a8c51d5bb/pkg/server/sandbox_run.go#L128
They had a workaround to cache the IP address, which they say also led to a subtle performance improvement.

I guess the above workaround is no longer in place for the shimv2 path (I haven't traced the shimv2 flow).
@lifupan If you are already taking a look at this, can you confirm?

I wonder if it is worth fixing this, to benefit from the slightly better performance with macvtap.
Yes, but I agree we should move to tc as the default model. I'll raise a PR to do so.

sboeuf commented Apr 10, 2019

@amshinde @mcastelino @egernst

Tc mirroring should be the only configuration, hence the default one.
That being said, I think this is one of the items that needs to go on the list of things to make happen in the 2.0 release.
I don't think we care enough about the very small performance difference between macvtap and tc-mirroring to keep both in the long term.
This is part of reducing Kata's complexity, but I don't see it happening before 2.0 if we care about backward compatibility (and I think we do).

egernst (Member Author) commented Apr 11, 2019

Is there a difference from user perspective, aside from improved stability?

mcastelino (Contributor) commented

Is there a difference from user perspective, aside from improved stability?

@egernst the switch should be invisible from a user POV, so this is not a breaking change. IMO we should push this early rather than waiting for 2.0.

wenlxie commented Apr 11, 2019

@egernst It seems there are no docs covering the details of these network modes.

sboeuf commented Apr 11, 2019

@WeiZhang555 we need your input here please.

bergwolf (Member) commented

@egernst Could you update your calico cni plugin and see if it still reproduces? I looked a bit at it and I think the issue is fixed by projectcalico/cni-plugin@6fef469#diff-cd575165f84eed10d10eb1460b0927b4L120 -- DelLinkByNameAddr does check the endpoint IP address, whereas DelLinkByName doesn't. Calico releases have contained it since v3.4.0 (latest is v3.6.1).

That said, I do agree with @sboeuf @mcastelino @amshinde that we should move to tc as the default, as this is a very subtle change that even calico didn't mention in the commit message.
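A quick way to check which calico release a cluster is actually running, assuming the standard calico-node DaemonSet in kube-system (adjust the names for other install methods):

# the image tag reflects the calico release in use (should be >= v3.4.0 for the fix above)
kubectl -n kube-system get daemonset calico-node \
  -o jsonpath='{.spec.template.spec.containers[0].image}'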

egernst (Member Author) commented Apr 12, 2019

I didn't get to test this yet, my apologies @bergwolf

awprice (Contributor) commented Apr 30, 2019

I believe we are running into this issue with our setup.

It tends to occur when we schedule a large number of pods on a node, 60+ pods at a time.

We have found that the processes within the pods complete, but the VM for the pod isn't cleaned up. The qemu-system processes keep running, as do the containerd-shim-kata-v2 processes. We can also see that the sandbox containers are still present in containerd.

The containerd-shim-kata-v2 processes also consume a lot of CPU when this occurs: each one pegs a CPU core at 100%, and with 40+ containerd-shim-kata-v2 processes, 40+ cores are at 100% usage.

This results in the pod CIDR IP addresses being exhausted, so new pods are unable to start on the node.

We are seeing the following log lines when this occurs:

Apr 29 03:39:38 ip-10-155-106-161.ec2.internal containerd[3100]: time="2019-04-29T03:39:38.879351593Z" level=error msg="StopPodSandbox for \"22abc9d98dfa2585d2484d931c6e931dc2ab872081f76926b72908e19cb9771a\" failed" error="failed to stop sandbox container \"22abc9d98dfa2585d2484d931c6e931dc2ab872081f76926b72908e19cb9771a\": failed to kill sandbox container: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing rpc error: code = DeadlineExceeded desc = timed out connecting to unix socket ////run/vc/vm/22abc9d98dfa2585d2484d931c6e931dc2ab872081f76926b72908e19cb9771a/kata.sock\": unavailable"

We are using the following versions:

  • Containerd 1.2.6
  • Kata 1.7.0-alpha1
  • Kubernetes 1.13.4
  • Calico 3.5.2

Let me know if you need any more information.

egernst (Member Author) commented Apr 30, 2019

@awprice This sounds like the exact issue. Can you update your configuration to utilize tcfilter instead of macvtap (update your configuration.toml for Kata)?

awprice (Contributor) commented May 1, 2019

@egernst Unfortunately we are already using tcfilter by default with 1.7.0-alpha1, and it hasn't improved the issue.

I've dug up some more logs from containerd when this occurs:

failed to stop sandbox container "fc4781d8206257248df55c4d90fad93e7f50984806b94289459cc2a0130deb64" in '\x01' state: failed to kill sandbox container: Container not ready, running or paused, impossible to signal the container: unknown"

The problem is also more prevalent when scheduling a larger number of pods at once: with 10 or so on a single node we don't see it, but scheduling 100 or more at once brings the problem out straight away.

Could it be a load/scale issue with the Kata components being overwhelmed?
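For what it's worth, a rough way to generate that kind of burst (the runtime class name is assumed to be kata-qemu, matching the RuntimeClass applied earlier in this issue; pod names and image are illustrative):

# create ~100 short-lived Kata pods at once
for i in $(seq 1 100); do
  kubectl run "kata-burst-$i" --image=busybox --restart=Never \
    --overrides='{"apiVersion":"v1","spec":{"runtimeClassName":"kata-qemu"}}' \
    -- sleep 30
done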

I've dug up the CPU graphs for a Kata node that is affected by this; it looks like the Kata shims peg the CPU at 100% once a certain number of pods have run on the node.

[Screenshot: CPU usage graph for the affected node, 2019-05-01]

I've also run strace on one of the Kata shims that is pinning a CPU at 100%:

# strace -p 6115
Process 6115 attached
futex(0x199c728, FUTEX_WAIT_PRIVATE, 0, NULL
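The futex wait above is just one blocked thread; since the shim is a Go binary, a goroutine dump shows where it is actually spinning. Note that SIGQUIT makes the Go runtime print all goroutine stacks to stderr and then exit, so only do this on a shim you are about to clean up anyway, and the output lands wherever the shim's stderr is captured (often the containerd unit's journal):

# dump goroutine stacks from a spinning shim (the process exits afterwards)
kill -QUIT 6115
journalctl -u containerd --since '5 minutes ago' | grep -A20 'goroutine' | head -100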
