
too many mount/umount syscalls #2532

Closed
axot opened this issue Jul 29, 2020 · 14 comments · Fixed by #3599

Comments


axot commented Jul 29, 2020

I'm not sure whether this is the right place to discuss it. We are observing high CPU usage by the systemd init process [1]. After some digging, it appears to be caused by a large number of mount/umount syscalls [2]. Is this because runc is being executed? [3] At the same time, no new pods were being scheduled to this instance or deleted from it. What is the purpose of executing runc in this case?

Thanks!

[1] High CPU usage in the init process (perf screenshot).

[2]

/# mountsnoop
COMM             PID     TID     MNT_NS      CALL
exe              1774545 1774545 4026531840  mount("/proc/self/exe", "/var/run/docker/runtime-runc/moby/910fdb2d258cb4e4c504d1414c9c635074d64aa4554671123d4cc652c007dd2e/runc.GC2atV", "", MS_BIND, "") = 0
exe              1774545 1774545 4026531840  mount("", "/var/run/docker/runtime-runc/moby/910fdb2d258cb4e4c504d1414c9c635074d64aa4554671123d4cc652c007dd2e/runc.GC2atV", "", MS_RDONLY|MS_REMOUNT|MS_BIND, "") = 0
exe              1774545 1774545 4026531840  umount("/var/run/docker/runtime-runc/moby/910fdb2d258cb4e4c504d1414c9c635074d64aa4554671123d4cc652c007dd2e/runc.GC2atV", MNT_DETACH) = 0
exe              1774598 1774598 4026531840  mount("/proc/self/exe", "/var/run/docker/runtime-runc/moby/41d0d870275682fd5c20ada27dec21ac2302cc2d2c3064e719094870934e30fb/runc.4yGzla", "", MS_BIND, "") = 0
exe              1774598 1774598 4026531840  mount("", "/var/run/docker/runtime-runc/moby/41d0d870275682fd5c20ada27dec21ac2302cc2d2c3064e719094870934e30fb/runc.4yGzla", "", MS_RDONLY|MS_REMOUNT|MS_BIND, "") = 0
exe              1774598 1774598 4026531840  umount("/var/run/docker/runtime-runc/moby/41d0d870275682fd5c20ada27dec21ac2302cc2d2c3064e719094870934e30fb/runc.4yGzla", MNT_DETACH) = 0

[3]
The nsenter package is imported in https://github.com/opencontainers/runc/blob/master/init.go#L10, and the mount syscalls are made from https://github.com/opencontainers/runc/blob/master/libcontainer/nsenter/cloned_binary.c#L402.

Additional information:
OS: Ubuntu 18.04
GKE: 1.14


axot commented Jul 29, 2020

Could this be caused by an exec probe? (screenshots)


axot commented Jul 29, 2020

related issues:

kubernetes/kubernetes#82440
systemd/systemd#15464


cyphar commented Jul 30, 2020

Yes, if you run runc exec very frequently you will see mount and umount syscalls. These are necessary to protect against CVE-2019-5736 (see 0a8e411 and #1984 which added this logic). The solution is to create a sealed memfd and bind-mount it on top of /usr/bin/runc (runc detects this and will avoid copying if that is the case) -- but to do this you need to have some C code running at system startup.


axot commented Jul 30, 2020

We have mitigated the situation for now by removing the exec probe. Is there any improvement that could be made on the runc side? On the systemd side, they are trying to add rate limiting for sd_event.


cloud-66 commented Aug 3, 2020

Hello, I also have high CPU usage by the /lib/systemd/systemd --user and /sbin/init processes. I tried various perf commands but never got such a clear view of CPU usage. Can you tell me how to record and view a CPU usage report like that? Thank you.


axot commented Aug 4, 2020

Hello @cloud-66, I think you need the debug symbols for this; please follow the instructions at https://wiki.ubuntu.com/Debug%20Symbol%20Packages

Once that is done, run these commands:

# add debug symbol for systemd
$ apt install systemd-dbgsym

# record with a call graph, sampling PID 1 at 99 Hz
$ perf record --call-graph=dwarf -F99 -p 1

# wait a few seconds, then check the report
$ perf report --stdio -G


Nowaker commented Sep 30, 2020

Yes, if you run runc exec very frequently you will see mount and umount syscalls.

Here's my finding repasted from kubernetes/kubernetes#82440

Liveness probe period of 30 seconds, and readiness probe period of 5 seconds. However, I can see in pstree that a runc process is constantly spawned in any container that has probes defined.

It's reasonable to expect some syscall activity every exec probe period. It's also reasonable to expect a reduction in the number of syscalls per minute when the exec probe period is longer (e.g. 30 seconds instead of 5). However, this is not the case. The mere presence of an exec probe causes a lot of syscalls and increased overall CPU load, regardless of the probe period.

I suggest reading through kubernetes/kubernetes#82440 where I provided a lot of data on the problem. I don't know if it's a Kubernetes or a Runc bug.

EDIT: Thanks to @cyphar for explaining below that it's not a runc issue! ❤️


cyphar commented Sep 30, 2020

(Copying this comment from the Kubernetes issue.)

Taking a quick look, kubernetes/kubernetes#82440 (comment) is describing Kubernetes effectively spamming runc status (the output of which is the JSON blob described). I don't think it's directly related to #2532 (we don't do any mounts in runc status), but I'm also not sure why Kubernetes (or I guess containerd) is requesting the status so many times.

On the runc side of things, in theory runc events could be used instead of runc status to switch away from a polling model for getting container states, though I don't think runc events has all of the information you might need to replace runc status.


cyphar commented Sep 30, 2020

@axot

Any improve could do in runc side? In systemd, there are trying to add rate limit for sd_event.

Not really. The mount setup was actually added to avoid other issues (we used to copy the binary rather than ro-mount it, which caused container and system memory usage to increase by 10MB each time you attached or configured a container). I am working on some kernel patches to remove the need for these protections (by blocking chained opens through a read-only fd), but they'd need to be merged first and then you'd need to upgrade as well.


axot commented Sep 30, 2020

@cyphar Good to hear that!


cyphar commented Sep 30, 2020

I was about to post a program to create a bind-mount of a memfd (which would allow you to avoid this overhead) but I just figured out that you can't bind-mount memfds. Oops. 🤔

@kolyshkin

So, this is mostly not a problem in runc per se, but a problem in systemd, which re-reads the proverbial mountinfo on every mount/umount.

The ultimate fix to that belongs in the kernel, and some work is being done in that direction (see https://lkml.org/lkml/2020/3/18/569) but we're not quite there yet.

In the meantime, systemd workarounds are being worked on -- if someone is interested to look into that, start from here: systemd/systemd#15464

@kolyshkin

Sorry, I hadn't realized the systemd issue was already mentioned here (right in the issue description).

Still, reducing runc overhead would be a good thing to do.

@axot
Copy link
Author

axot commented Jul 4, 2023

Big thanks to all of you for the hard work you've put into this.
