containerd+kata shim-v2: shim not killed, artifacts left on host #1480
Some logs from around the time of the deletion:
@egernst could you try with Kata 1.5 or 1.4 to check if you can reproduce? Let's try to identify if this is coming from something that would have changed in Kata.
Already tried with containerd 1.2.4; I can reproduce. The behavior with 1.5.y and 1.4.3 is slightly different but still bad: the QEMU and v2 shim processes remain, as do items on the filesystem, after deleting. However... after deleting:
Log for the scenario with 1.5.y and 1.4.y: https://gist.githubusercontent.com/egernst/0deaf8542b5617b7e0210b04d789ba91/raw/fa92014abdd39740bb56dba17f447f251fedc0eb/log%2520using%25201.4
This only occurs with macvlan. Not an issue with tc-mirror. /cc @mcastelino
Hi @wenlxie, what network plugin are you using with containerd? Standard CNI, or something else?
I agree with @sboeuf. It seems this issue is related to the Calico network plugin; I extracted the key log below:
From the log we can see that Calico tries to destroy the network namespace by first getting the IP of the "eth0" interface, but in Kata's macvtap and bridge modes the "eth0" IP has been passed into QEMU, so Calico fails to get the IP, which causes containerd to fail to stop the sandbox, thus ...
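To illustrate what is being described, a rough way to check this from the host; the namespace name below is a placeholder, not something taken from the logs above:

```sh
# List pod network namespaces on the host (names are placeholders).
ip netns list

# In macvtap/bridge mode the IP normally visible on eth0 has been handed to the
# guest, so a query like this comes back without an IPv4 address for the CNI
# plugin to read back during teardown:
ip netns exec <pod-netns> ip addr show eth0
```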
@lifupan Our own CNI network plugin.
@wenlxie -- tc-mirror is the ideal configuration. Can you test with this?
@amshinde the issue seems to be that we may not be recovering all the interface information on release. With tc mirror we leave the host-side interface and network namespace completely untouched. Hence there is no need to recover any namespace or interface programming. We should make tc mode the default going forward as it reduces the scope of issues such as this.
@mcastelino @egernst @lifupan I recall that cri-containerd was caching the IP address of the ... Here is the issue for it: ... And this is the fix they had in place for the above issue: ... I guess that workaround is no longer in place for the shimv2 path (I haven't traced the shimv2 flow). I wonder if it is worth fixing this, to benefit from slightly better performance with macvtap.
@amshinde @mcastelino @egernst TC mirroring should be the only configuration, hence the default one.
Is there a difference from a user perspective, aside from improved stability?
@egernst the switch should be invisible from a user's point of view, so this is not a breaking change. IMO we should push this early rather than waiting for 2.0.
@egernst It seems there are no docs covering the details of these network modes.
@WeiZhang555 we need your input here, please.
@egernst Could you update your Calico CNI plugin and see if it still reproduces? I looked at it a bit and I think the issue is fixed by projectcalico/cni-plugin@6fef469#diff-cd575165f84eed10d10eb1460b0927b4L120 -- That said, I do agree with @sboeuf @mcastelino @amshinde that we should move to tc as the default, as this is a very subtle change that even Calico didn't mention in the commit message.
I didn't get to test this yet, my apologies @bergwolf.
I believe we are running into this issue with our setup. It tends to occur when we schedule a large number of pods on a node, 60+ pods at a time. We have found that the processes within the pods are completing, but the VM for the pod isn't being cleaned up: the qemu-system processes keep running, as does the containerd-shim-kata-v2 process, and we can also see the sandbox containers are still present in containerd. The containerd-shim-kata-v2 processes also consume a lot of CPU when this occurs; each one consumes 100% of a CPU core, and with 40+ containerd-shim-kata-v2 processes, 40+ cores sit at 100% usage. This results in the pod CIDR IP addresses being exhausted, and new pods aren't able to start on the node. We are seeing the following log lines when this occurs:
We are using the following versions:
Let me know if you need any more information.
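For others debugging the same symptoms, a sketch of the kind of host-side checks that surface the leftover shims and sandboxes; it assumes crictl and ctr on the node are configured to talk to this containerd instance:

```sh
# Shim and QEMU processes sorted by CPU usage; the pegged shims show up at the top.
ps -eo pid,pcpu,args --sort=-pcpu | grep -E "containerd-shim-kata-v2|qemu-system" | grep -v grep

# Pod sandboxes the CRI plugin still tracks.
crictl pods

# Tasks still known to containerd in the Kubernetes namespace.
ctr -n k8s.io tasks ls
```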
@awprice This sounds like the exact issue. Can you update your configuration to utilize tcfilter instead of macvtap (update your configuration.toml for Kata)?
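For reference, the setting in question is internetworking_model in Kata's configuration.toml; a rough sketch of the change, assuming the default packaged path (adjust if your configuration.toml lives elsewhere):

```sh
# Check which networking model is currently configured.
grep internetworking_model /usr/share/defaults/kata-containers/configuration.toml

# Switch from macvtap to tcfilter; back the file up first.
sudo sed -i 's/^internetworking_model=.*/internetworking_model="tcfilter"/' \
    /usr/share/defaults/kata-containers/configuration.toml
```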
@egernst Unfortunately we are already using tcfilter by default with 1.7.0-alpha1 and it hasn't improved the issue. I've dug up some more logs from containerd for when this occurs:
The problem is also prevalent when scheduling a larger number of pods at once. When scheduling 10 or so at once on a single node we don't see the problem, but scheduling 100 or more at once brings out the problem straight away. Could it be a load/scale issue, with the Kata components being overwhelmed? I've dug up the CPU graphs for a Kata node that is affected by it; it looks like the kata shim is pegging the CPU at 100% once a certain number of pods have run on the node. I've also performed an strace on the kata shims that are causing the CPU to be pinned at 100%:
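For anyone reproducing this, attaching to one of the spinning shims looks roughly like the following sketch (the pid is a placeholder; strace needs root):

```sh
# PIDs of the shims currently pinning a core.
pgrep -f containerd-shim-kata-v2

# Summarize which syscalls one of them is spinning on; interrupt with Ctrl-C
# after a few seconds to get the summary table.
sudo strace -f -c -p <pid>

# Or stream timestamped syscalls for ten seconds to look for a busy loop.
sudo timeout 10 strace -f -tt -p <pid>
```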
Description of problem
With Kubernetes 1.14, containerd 1.2.5, and Kata artifacts from the 1.6.0 release, kubectl delete -f doesn't result in the containerd-shim-kata-v2 process being stopped, and artifacts are left on the host in /var/run/vc/sbs, /var/run/vc/vm, etc.
Example: starting from a clean containerd + k8s setup, install Kata (1.6.0):
Install Kata 1.6:
Set up an appropriate RuntimeClass:
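A sketch of what this step might look like, assuming containerd 1.2.x's CRI plugin config at /etc/containerd/config.toml, no existing kata runtime entry, and a handler named kata (names and paths are illustrative):

```sh
# Register the Kata v2 shim with containerd's CRI plugin (containerd 1.2.x layout).
cat <<'EOF' | sudo tee -a /etc/containerd/config.toml
[plugins.cri.containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
EOF
sudo systemctl restart containerd

# Create a RuntimeClass pointing at that handler (k8s 1.14 uses node.k8s.io/v1beta1).
cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
EOF
```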
In separate windows, watch for qemu, the v2 shim, and monitor the filesystem where Kata stores state:
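For example (these exact watch commands are just one way to do it):

```sh
# In one terminal, watch for QEMU and shim-v2 processes.
watch -n 1 'ps -ef | grep -E "qemu-system|containerd-shim-kata-v2" | grep -v grep'

# In another, watch the on-host Kata state directories.
watch -n 1 'ls /var/run/vc/sbs /var/run/vc/vm 2>/dev/null'
```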
Start a pod, then remove it once it has started:
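A minimal example, assuming the RuntimeClass above is named kata; the pod name and image are arbitrary:

```sh
# Create a pod on the kata RuntimeClass, wait for it, then delete it.
cat <<'EOF' > kata-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kata-test
spec:
  runtimeClassName: kata
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
EOF

kubectl apply -f kata-test.yaml
kubectl wait --for=condition=Ready pod/kata-test --timeout=120s
kubectl delete -f kata-test.yaml
```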
You'll see QEMU come and go as expected, but the v2 shim keeps running and its storage stays on the filesystem.