Consider reverting #3931 (nsexec: cloned_binary: remove bindfd logic entirely) #3973
Comments
@cyphar PTAL, thanks.
@lifubang does reverting those commits fix the issue?
Yes, it works very well after reverting #3931.
I can add the contrib program to bind-mount the memfd which should eliminate this issue (see the discussion in the PR). I didn't add it because the primary concern last time was container memory usage, and I really do not want to have to keep bindfd...
I encountered a similar issue. When using the main branch, my test case failed. However, when I solely used the 'remove bindfd' commit, my test case passed. I suspect that this PR might be conflicting with other PRs on the main branch, leading to a new performance issue. I'd recommend not using the main branch and instead trying only @cyphar's PR, @lifubang.
Oh my, this is tough. Let's discuss further in #3931.
Your suggestion doesn't fix my issue.
I think it's probably not this problem. Obviously it's a design problem rather than a code bug; cyphar has confirmed it and already has a solution for it. So maybe your issue is not the same as mine. You can discuss it further in #3931, or open a new issue once you know how to reproduce it clearly.
The node is in a stuck state. Are there any logs?
Maybe you could test with @cyphar's suggestion. I'll show you the memfd code this afternoon.
I will post a PR, but the code in question is in #3931 (comment). If you use it to mount on top of the runc binary, it should eliminate this issue.
The memfd code is roughly as follows.
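A sketch along the lines of the #3931 comment (the install path and program structure here are placeholders, not the exact code): copy runc into a sealed memfd, bind-mount /proc/self/fd/<n> over the binary path, and keep the process alive so a single in-memory copy is shared instead of each runc init holding its own.

```go
// Sketch: copy the runc binary into a sealed memfd and bind-mount it over the
// original path, so every container re-execs one shared in-memory copy.
package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	const runcPath = "/usr/local/sbin/runc" // assumed install path; adjust as needed

	// Create an anonymous in-memory file that supports sealing.
	fd, err := unix.MemfdCreate("runc", unix.MFD_CLOEXEC|unix.MFD_ALLOW_SEALING)
	if err != nil {
		log.Fatalf("memfd_create: %v", err)
	}
	memfd := os.NewFile(uintptr(fd), "memfd:runc")

	// Copy the on-disk runc binary into the memfd.
	src, err := os.Open(runcPath)
	if err != nil {
		log.Fatalf("open %s: %v", runcPath, err)
	}
	if _, err := io.Copy(memfd, src); err != nil {
		log.Fatalf("copy: %v", err)
	}
	src.Close()

	// Seal the in-memory copy so it can never be modified again.
	seals := unix.F_SEAL_SEAL | unix.F_SEAL_SHRINK | unix.F_SEAL_GROW | unix.F_SEAL_WRITE
	if _, err := unix.FcntlInt(memfd.Fd(), unix.F_ADD_SEALS, seals); err != nil {
		log.Fatalf("seal: %v", err)
	}

	// Bind-mount the memfd over the runc binary path; from now on, executing
	// that path runs the sealed in-memory copy.
	source := fmt.Sprintf("/proc/self/fd/%d", int(memfd.Fd()))
	if err := unix.Mount(source, runcPath, "", unix.MS_BIND, ""); err != nil {
		log.Fatalf("bind-mount: %v", err)
	}

	// Keep the file descriptor (and therefore the memfd) pinned for the
	// lifetime of this process; unmounting the path undoes the setup.
	select {}
}
```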
I have tried it, and it works very well. I think this issue may impact some k8s/Docker Swarm clusters when a node is in an insufficient-memory state, so I suggest reverting it until we have a new solution. But if we confirm that it will not impact upstream projects, I think we can consider adding a contrib program for anyone who needs it.
The problem is that we cannot just revert it:
From where I'm standing, it seems to me that #3931 is the least bad option. Every option has some kind of cost, but at least the memfd one can be mitigated by users -- the mount-based ones affect you and you cannot disable them.
If a node cannot have 10MB in memory per container during setup, I don't think the node has enough memory to actually run the container process. Also, I don't think Docker, containerd, cri-o or Kubernetes keep processes in the created state for long.
IMHO, either current main or reverting to the state in runc 1.1 seems reasonable. Isn't using cgroup v2 and a >= 5.14 kernel a valid workaround for these issues in runc 1.1? (Asking due to this comment.) If that is the case, maybe it would be nice to have some knob to switch the behavior back to what is done in 1.1; if the extra memory is a problem, that could be a safe path to have. Also, just curious: why not hide [...]
1. Problem phenomena
- Phenomenon 1: systemd's CPU usage consistently exceeds 50% and doesn't decrease.
- Phenomenon 2: on incrementally creating 100 pods, systemd's CPU usage remains at 100%, blocking container creation.
- Phenomenon 3: in fault-injection scenarios, persistently killing the runtime 'runc' process results in failed container termination and no automatic recovery.

2. Root causes
- Persistent >50% CPU usage by systemd.
- Creating 100 pods incrementally results in 100% CPU usage by systemd and blocks container creation.
- Fault injection by killing the runc process results in container termination failure and no automatic recovery. For cgroup v1: if the second step gets blocked and the corresponding process is terminated, the entire cgroup remains frozen. This eventually leads to the runc init process entering a 'D' (uninterruptible sleep) state, making the kill operation unsuccessful and preventing automatic recovery.

3. Solutions
- Addressing the 'D' state problem where the cgroup remains frozen.
- Resolving the high CPU usage by systemd that leads to pod creation failures.
- Monitoring and recovery.
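For the cgroup v1 case, the frozen state behind the 'D' hang can at least be inspected and cleared by hand while a proper fix is discussed. A sketch; the cgroup path below is a placeholder for the container's actual freezer cgroup:

```bash
# Placeholder: substitute the container's actual freezer cgroup path.
CG="/sys/fs/cgroup/freezer/kubepods/podUID/containerID"

cat "$CG/freezer.state"             # shows FROZEN (or FREEZING) while stuck
echo THAWED > "$CG/freezer.state"   # thaw manually so the pending kill can complete
```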
@rata
Yes, this is true when there are only a few containers on a node. But there may be many pods on a node; for example, if there are 1000 pods on a node, we would need about 13000 MB (about 13 GB) of free memory if 1000 sidecar pods (which only do a curl http://podId:8080) start at the same time. @cyphar
So you mean we may not face this problem in the real world? @rata
No, not at all. I have literally no idea if this will be a problem in the real world. I just think @cyphar really wants to try this, and I think what we have in release 1.1 is not ideal, but people live with it. So either way seems reasonable to me. But I'm not deep into the details of this issue, so I trust what you decide (the revert you suggest, trying this, etc.). Please go for what you think is best :)
When executing pod exec, it ultimately triggers runc exec and initiates runc init. In version 1.1, frequently executing runc exec places significant strain on systemd, leading to runc init hanging. Therefore, I believe rolling back to version 1.1 won't address this issue. @lifubang
Or, since that one is closed, we can discuss it here. Once the contrib program @cyphar talks about is available, we can add it as a systemd service and push the distro vendors to properly package it, so the users won't have to. I was also thinking about a separate [...]. One issue we have is because [...]
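As an illustration only (the unit name, binary name, and paths below are hypothetical, since the contrib program doesn't exist yet), the systemd side could be as small as:

```ini
# /etc/systemd/system/runc-memfd-bind.service  (hypothetical name and paths)
[Unit]
Description=Keep a sealed in-memory copy of runc bind-mounted over the binary

[Service]
Type=simple
# Hypothetical contrib binary that creates the memfd and bind-mounts it.
ExecStart=/usr/local/bin/runc-memfd-bind /usr/local/sbin/runc
Restart=on-failure

[Install]
WantedBy=multi-user.target
```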
I think it's not worth doing. Maybe we can take the [...]
I'm not convinced this is a good idea. The systemd unit will only be needed "temporarily" until we have a better fix in the kernel, and then it will be dropped, right? IMHO, hacks should always be contained. When the hack starts to affect a lot of other projects, the hack is out of control and fixing it becomes more complex. If there were merit to the systemd unit besides this, I'd be OK with it. But if there isn't, I'm not a supporter.
Now I know it is a real problem in the real world. Kubernetes e2e tests fail with OOM with how things are in master now. They didn't fail at f73b05d (i.e. when idmap mounts had just been merged). See for example here how it fails with the current main branch:
So, a revert probably fixes those issues. BUT I can confirm the cloned-binary rework PR also fixes that issue!
What cgroup configuration is used on the e2e tests? Memory migration is deprecated in cgroup v1 and was never supported in cgroup v2, so I don't see how the [...] We added the bindfd stuff in response to OOMs in Kubernetes e2e tests several years ago, but I never looked into what was actually going on -- as far as I can tell, the memory usage of [...] I get that [...]
@cyphar I was wondering the same, but I don't really know the node config, and Kubernetes e2e tests are kind of painful in several ways, so I'm not eager to look into that 🙈. I didn't expect this to be counted towards the limit either, but SOME limit is being hit, clearly. If you want to dig into that, please go nuts :) The runc error is coming from this: runc/libcontainer/process_linux.go, line 380 at commit 26a98ea.
I don't know what could cause us to hit that.
Description
With runc built from the main branch:
If I create 450 containers on an 8 GB memory node, all the memory will be eaten by runc.
The node will be in a stuck state.
With runc 1.1.8, it succeeds.
I think when using memfd_create, the runc binary will be kept in memory, so if we batch-create containers, it will eat the node's memory before the containers are started. In my head, maybe containerd uses runc create for the first step when running a container.

Steps to reproduce the issue
the bash script:
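A minimal sketch of the kind of batch-create loop described above (the bundle path is an assumption; the containers are only created, not started, mirroring containerd's first step):

```bash
#!/bin/bash
# Sketch: batch-create (but do not start) containers from one prepared OCI
# bundle. BUNDLE must contain config.json and a rootfs; the path is a placeholder.
set -eu

BUNDLE=/root/bundle
COUNT=450

for i in $(seq 1 "$COUNT"); do
    runc create --bundle "$BUNDLE" "test-$i"
done

# Each created container leaves a paused `runc init` behind; with the
# memfd-based re-exec, every one of them keeps its own in-memory copy of the
# runc binary until the container is started or deleted.
runc list | wc -l
```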
Describe the results you received and expected
received:
The node is in a stuck state, because there is not enough memory left on the node.
expected:
All containers should be created normally.
What version of runc are you using?
runc version 1.1.0+dev
commit: v1.1.0-688-g74c125d8
spec: 1.1.0
go: go1.18.4
libseccomp: 2.5.1
Host OS information
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
Host kernel information
Linux iZ2ze4a8qvjqt6lt7ommf8Z 5.15.0-73-generic #80-Ubuntu SMP Mon May 15 15:18:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux