ro-bind /proc/self/exe before copying cause systemd's cpu extremely high #3925

Closed
113xiaoji opened this issue Jul 3, 2023 · 4 comments

113xiaoji commented Jul 3, 2023

Description

Upon deploying 150 running Pods on a single node, it was observed that the CPU usage of systemd consistently remained at 50%. When an attempt was made to deploy an additional 100 Pods on the same node, the CPU usage of systemd escalated to 99%, and the newly deployed Pods could not be successfully launched. It is speculated that the issue might be due to an excessive number of mount points. To investigate this hypothesis, the BCC tools mountsnoop and execsnoop were used to trace mount-related system calls and newly executed processes, respectively.

mountsnoop logs a continuous stream of mount and umount calls:

exe              200868  200868  4026531840  mount("/proc/self/exe", "/run/containerd/runc/k8s.io/cfd7ecb9972a640dd564a8f948baa888d5ef87aa58dcd7af2c5f887811d42ad6/runc.9xc3s6", "", MS_BIND, "") = 0
exe              200868  200868  4026531840  mount("", "/run/containerd/runc/k8s.io/cfd7ecb9972a640dd564a8f948baa888d5ef87aa58dcd7af2c5f887811d42ad6/runc.9xc3s6", "", MS_RDONLY|MS_REMOUNT|MS_BIND, "") = 0
exe              200868  200868  4026531840  umount("/run/containerd/runc/k8s.io/cfd7ecb9972a640dd564a8f948baa888d5ef87aa58dcd7af2c5f887811d42ad6/runc.9xc3s6", MNT_DETACH) = 0
exe              200891  200891  4026531840  mount("/proc/self/exe", "/run/containerd/runc/k8s.io/2a9dcd5bd3c46138d082778d00d776ef65f5b581ec50c38aa2ccf99da86c7ce4/runc.qS7oGO", "", MS_BIND, "") = 0
exe              200891  200891  4026531840  mount("", "/run/containerd/runc/k8s.io/2a9dcd5bd3c46138d082778d00d776ef65f5b581ec50c38aa2ccf99da86c7ce4/runc.qS7oGO", "", MS_RDONLY|MS_REMOUNT|MS_BIND, "") = 0
exe              200891  200891  4026531840  umount("/run/containerd/runc/k8s.io/2a9dcd5bd3c46138d082778d00d776ef65f5b581ec50c38aa2ccf99da86c7ce4/runc.qS7oGO", MNT_DETACH) = 0
exe              200924  200924  4026531840  mount("/proc/self/exe", "/run/containerd/runc/k8s.io/be13b81a11b0d96f9ee9425f9a9a1f5d9189f66b58c17f7e72de6644f62033c8/runc.Sgyug1", "", MS_BIND, "") = 0
exe              200924  200924  4026531840  mount("", "/run/containerd/runc/k8s.io/be13b81a11b0d96f9ee9425f9a9a1f5d9189f66b58c17f7e72de6644f62033c8/runc.Sgyug1", "", MS_RDONLY|MS_REMOUNT|MS_BIND, "") = 0
exe              200924  200924  4026531840  umount("/run/containerd/runc/k8s.io/be13b81a11b0d96f9ee9425f9a9a1f5d9189f66b58c17f7e72de6644f62033c8/runc.Sgyug1", MNT_DETACH) = 0

execsnoop output:

  200825 13648    0 /usr/local/sbin/runc --root /run/containerd/runc/k8s.io --log /run/containerd/io.containerd.runtime.v2.task/k8s.io/a346bdb949b96161afd3f113b3c566b1cf3b068a57fb491228e2e22baac8eba0/log.json --log-format json --systemd-cgroup exec --process /tmp/runc-process3080800455 --detach --pid-file /run/containerd/io.containerd.runtime.v2.task/k8s.io/a346bdb949b96161afd3f113b3c566b1cf3b068a57fb491228e2e22baac8eba0/8b447e3954 a346bdb949b96161afd3f113b3c566b1cf3b068a57fb491228e2e22baac8eba0
runc             200831 58925    0 /usr/local/sbin/runc --root /run/containerd/runc/k8s.io --log /run/containerd/io.containerd.runtime.v2.task/k8s.io/742343aa192b54db02973214712f3d8b4304c36813311d1557075f60dcc4257f/log.json --log-format json --systemd-cgroup exec --process /tmp/runc-process3028372082 --detach --pid-file /run/containerd/io.containerd.runtime.v2.task/k8s.io/742343aa192b54db02973214712f3d8b4304c36813311d1557075f60dcc4257f/4c9a6ac8cd 742343aa192b54db02973214712f3d8b4304c36813311d1557075f60dcc4257f
exe              200838 200825   0 /proc/self/exe init
runc             200843 135685   0 /usr/local/sbin/runc --root /run/containerd/runc/k8s.io --log /run/containerd/io.containerd.runtime.v2.task/k8s.io/cfd7ecb9972a640dd564a8f948baa888d5ef87aa58dcd7af2c5f887811d42ad6/log.json --log-format json --systemd-cgroup exec --process /tmp/runc-process3591463774 --detach --pid-file /run/containerd/io.containerd.runtime.v2.task/k8s.io/cfd7ecb9972a640dd564a8f948baa888d5ef87aa58dcd7af2c5f887811d42ad6/9aa0cc725b cfd7ecb9972a640dd564a8f948baa888d5ef87aa58dcd7af2c5f887811d42ad6
exe              200849 200831   0 /proc/self/exe init

We can see that a large number of mount points were created by runc init.
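The traces above show mount churn, but not how large the mount table is at any moment. As a rough complementary check, the number of entries in /proc/self/mountinfo can be counted directly. The following is a minimal sketch, not part of runc; the function name count_mount_entries is hypothetical, and it assumes one mount per newline-terminated line, which is how the kernel formats that file.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Count newline-terminated entries in a mountinfo-style file
 * (for example /proc/self/mountinfo). Hypothetical helper for
 * illustration only. Returns -1 if the file cannot be opened.
 */
long count_mount_entries(const char *path)
{
	FILE *f;
	long entries = 0;
	int c;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* Each mount occupies exactly one line, so count newlines. */
	while ((c = fgetc(f)) != EOF) {
		if (c == '\n')
			entries++;
	}

	fclose(f);
	return entries;
}
```

Calling count_mount_entries("/proc/self/mountinfo") on the host before and after deploying the Pods gives the standing size of the table. The bind mounts made by runc init are detached immediately, so they will mostly not appear in this count; their cost is more likely in how often the table changes, which is what mountsnoop shows.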

The relevant runc source code is in libcontainer/nsenter/cloned_binary.c:

static int clone_binary(void)
{
	int binfd, execfd;
	struct stat statbuf = { };
	size_t sent = 0;
	int fdtype = EFD_NONE;

	/*
	 * Before we resort to copying, let's try creating an ro-binfd in one shot
	 * by getting a handle for a read-only bind-mount of the execfd.
	 */
	execfd = try_bindfd();
	if (execfd >= 0)
		return execfd;

	/*
	 * Dammit, that didn't work -- time to copy the binary to a safe place we
	 * can seal the contents.
	 */
	execfd = make_execfd(&fdtype);
	if (execfd < 0 || fdtype == EFD_NONE)
		return -ENOTRECOVERABLE;

	binfd = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
	if (binfd < 0)
		goto error;

	if (fstat(binfd, &statbuf) < 0)
		goto error_binfd;

	while (sent < statbuf.st_size) {
		int n = sendfile(execfd, binfd, NULL, statbuf.st_size - sent);
		if (n < 0) {
			/* sendfile can fail so we fallback to a dumb user-space copy. */
			n = fd_to_fd(execfd, binfd);
			if (n < 0)
				goto error_binfd;
		}
		sent += n;
	}
	close(binfd);
	if (sent != statbuf.st_size)
		goto error;

	if (seal_execfd(&execfd, fdtype) < 0)
		goto error;

	return execfd;

error_binfd:
	close(binfd);
error:
	close(execfd);
	return -EIO;
}

A comment in try_bindfd() notes: 'We need somewhere to mount it, mounting anything over /proc/self is a BAD idea on the host -- even if we do it temporarily.'

static int try_bindfd(void)
{
	int fd, ret = -1;
	char template[PATH_MAX] = { 0 };
	char *prefix = getenv("_LIBCONTAINER_STATEDIR");

	if (!prefix || *prefix != '/')
		prefix = "/tmp";
	if (snprintf(template, sizeof(template), "%s/runc.XXXXXX", prefix) < 0)
		return ret;

	/*
	 * We need somewhere to mount it, mounting anything over /proc/self is a
	 * BAD idea on the host -- even if we do it temporarily.
	 */
	fd = mkstemp(template);
	if (fd < 0)
		return ret;
	close(fd);

	/*
	 * For obvious reasons this won't work in rootless mode because we haven't
	 * created a userns+mntns -- but getting that to work will be a bit
	 * complicated and it's only worth doing if someone actually needs it.
	 */
	ret = -EPERM;
	if (mount("/proc/self/exe", template, "", MS_BIND, "") < 0)
		goto out;
	if (mount("", template, "", MS_REMOUNT | MS_BIND | MS_RDONLY, "") < 0)
		goto out_umount;

	/* Get read-only handle that we're sure can't be made read-write. */
	ret = open(template, O_PATH | O_CLOEXEC);

out_umount:
	/*
	 * Make sure the MNT_DETACH works, otherwise we could get remounted
	 * read-write and that would be quite bad (the fd would be made read-write
	 * too, invalidating the protection).
	 */
	if (umount2(template, MNT_DETACH) < 0) {
		if (ret >= 0)
			close(ret);
		ret = -ENOTRECOVERABLE;
	}

out:
	/*
	 * We don't care about unlink errors, the worst that happens is that
	 * there's an empty file left around in STATEDIR.
	 */
	unlink(template);
	return ret;
}

The commit that introduced this behavior explains the motivation:

nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying

The usage of memfd_create(2) and other copying techniques is quite
wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
memfd_create(2) added ~10M of memory usage to the cgroup associated with
the container, which can result in some setups getting OOM'd (or just
hogging the hosts' memory when you have lots of created-but-not-started
containers sticking around).

Question 1: Is there a better way to avoid the need for a large number of frequent mount and unmount operations?

Question 2: Is the mentioned 10MB memory usage deducted from the memory quota inside the container?

Steps to reproduce the issue

Deploy more than 100 Pods at the same time on a single node with Kubernetes.

Describe the results you received and expected

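Received: the CPU usage of systemd stays at 50% or higher, and newly deployed Pods fail to start. Expected: the CPU usage of systemd is significantly lower.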

What version of runc are you using?

1.1.2

Host OS information

x86

Host kernel information

Linux master1 5.10.0-60.18.0.50.h665.eulerosv2r11.x86_64 #1 SMP Fri Dec 23 16:12:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

@kolyshkin @cyphar

@cyphar
Member

cyphar commented Jul 4, 2023

This is a duplicate of #2532 afaics.

> Question 1: Is there a better way to avoid the need for a large number of frequent mount and unmount operations?

Using memfds, or merging #3599 (which moves the bindfd stuff into a separate mount namespace). I'm personally not a fan of how complicated #3599 makes this code, which is why it hasn't been merged yet. I'm also working on some kernel patches which will eliminate the need for this entirely and the protections against this will be moved in-kernel.

> Question 2: Is the mentioned 10MB memory usage deducted from the memory quota inside the container?

Yes, this caused some test failures in Kubernetes's e2e tests. However, the issue is only temporary (when runc is first spawning the container) because the runc binary itself exits once container setup is completed. Personally, I think the whole bindfd thing was a bad idea in retrospect; we should've just told Kubernetes that we don't support containers with 5MB memory limits.

@cyphar
Member

cyphar commented Jul 4, 2023

I reviewed and merged #3599, which should fix this issue.

@cyphar cyphar closed this as completed Jul 4, 2023
@113xiaoji
Author

Very good. Thank you.

@113xiaoji
Author

> I reviewed and merged #3599, which should fix this issue.

I have tested this, and the PR did not solve my problem. From a black-box perspective, it has actually made my problem worse. My scenario involves deploying 100 pods simultaneously on a single node, with each pod having one container.

Before the changes, the high CPU usage of systemd caused many pods to fail deployment, but there was a chance that the pods would restart successfully. After this PR was merged, the CPU usage of systemd did indeed decrease, but the hang issue still persists.

Many pods still fail to deploy, and retries also continue to fail. Skipping try_bindfd and using memfd instead has proven very effective: all pods deploy successfully on the first attempt. Even when deploying 170 pods at once, the deployment still succeeds.

I believe try_bindfd should be discarded and replaced solely with the memfd code.
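For readers unfamiliar with the memfd alternative, the general technique is to copy the binary into an anonymous in-memory file and then seal it so its contents can no longer be modified. The following is a minimal sketch of that idea under stated assumptions, not runc's actual implementation: the function name copy_to_sealed_memfd and the memfd name "sealed-copy" are hypothetical, it needs Linux with memfd_create(2) available in libc, and it does not retry on EINTR.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for memfd_create() and the F_SEAL_* flags */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Copy the file at src_path into a sealed memfd and return the fd.
 * Hypothetical helper for illustration only. Returns -1 on any error.
 */
int copy_to_sealed_memfd(const char *src_path)
{
	char buf[65536];
	ssize_t n;
	int src, memfd;

	assert(src_path != NULL);

	src = open(src_path, O_RDONLY | O_CLOEXEC);
	if (src < 0)
		return -1;

	/* Sealing must be requested when the memfd is created. */
	memfd = memfd_create("sealed-copy", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (memfd < 0) {
		close(src);
		return -1;
	}

	/* Plain user-space copy; handle short writes explicitly. */
	while ((n = read(src, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (n > 0) {
			ssize_t w = write(memfd, p, n);

			if (w < 0)
				goto fail;
			p += w;
			n -= w;
		}
	}
	if (n < 0)
		goto fail;
	close(src);

	/* Forbid writes, resizing, and any further change to the seals. */
	if (fcntl(memfd, F_ADD_SEALS,
		  F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
		close(memfd);
		return -1;
	}
	return memfd;

fail:
	close(src);
	close(memfd);
	return -1;
}
```

Passing "/proc/self/exe" as src_path yields a sealed in-memory copy of the running binary. No mount or umount calls are involved, which is why this path avoids the mount churn seen above; the trade-off, as the commit message quoted earlier notes, is the extra memory charged for each copy.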

I think it's related to #3599 (comment).
@cyphar
