
too many mount/umount syscalls #2532

Closed
axot opened this issue Jul 29, 2020 · 14 comments · Fixed by #3599

Comments


axot commented Jul 29, 2020

I'm not sure whether this is the right place to discuss it. We are observing high CPU usage by the systemd init process [1]. After some digging, it appears to be caused by a large number of mount/umount syscalls [2]. Is this because runc is being executed? [3] At the same time, no new pods were being scheduled to this instance or deleted from it. What is the purpose of executing runc in this case?

Thanks!

[1] High CPU usage in the init process (perf screenshot).

[2]

/# mountsnoop
COMM             PID     TID     MNT_NS      CALL
exe              1774545 1774545 4026531840  mount("/proc/self/exe", "/var/run/docker/runtime-runc/moby/910fdb2d258cb4e4c504d1414c9c635074d64aa4554671123d4cc652c007dd2e/runc.GC2atV", "", MS_BIND, "") = 0
exe              1774545 1774545 4026531840  mount("", "/var/run/docker/runtime-runc/moby/910fdb2d258cb4e4c504d1414c9c635074d64aa4554671123d4cc652c007dd2e/runc.GC2atV", "", MS_RDONLY|MS_REMOUNT|MS_BIND, "") = 0
exe              1774545 1774545 4026531840  umount("/var/run/docker/runtime-runc/moby/910fdb2d258cb4e4c504d1414c9c635074d64aa4554671123d4cc652c007dd2e/runc.GC2atV", MNT_DETACH) = 0
exe              1774598 1774598 4026531840  mount("/proc/self/exe", "/var/run/docker/runtime-runc/moby/41d0d870275682fd5c20ada27dec21ac2302cc2d2c3064e719094870934e30fb/runc.4yGzla", "", MS_BIND, "") = 0
exe              1774598 1774598 4026531840  mount("", "/var/run/docker/runtime-runc/moby/41d0d870275682fd5c20ada27dec21ac2302cc2d2c3064e719094870934e30fb/runc.4yGzla", "", MS_RDONLY|MS_REMOUNT|MS_BIND, "") = 0
exe              1774598 1774598 4026531840  umount("/var/run/docker/runtime-runc/moby/41d0d870275682fd5c20ada27dec21ac2302cc2d2c3064e719094870934e30fb/runc.4yGzla", MNT_DETACH) = 0

[3]
The nsenter package is imported in https://github.com/opencontainers/runc/blob/master/init.go#L10, and the mount syscalls are made from https://github.com/opencontainers/runc/blob/master/libcontainer/nsenter/cloned_binary.c#L402.

Additional information:
OS: Ubuntu 18.04
GKE: 1.14


axot commented Jul 29, 2020

Could this be caused by an exec probe? (screenshots)


axot commented Jul 29, 2020

related issues:

kubernetes/kubernetes#82440
systemd/systemd#15464


cyphar commented Jul 30, 2020

Yes, if you run runc exec very frequently you will see mount and umount syscalls. These are necessary to protect against CVE-2019-5736 (see 0a8e411 and #1984 which added this logic). The solution is to create a sealed memfd and bind-mount it on top of /usr/bin/runc (runc detects this and will avoid copying if that is the case) -- but to do this you need to have some C code running at system startup.


axot commented Jul 30, 2020

We have mitigated the situation for now by removing the exec probe. Is there any improvement that could be made on the runc side? On the systemd side, they are trying to add rate limiting for sd_event.


cloud-66 commented Aug 3, 2020

Hello, I also have high CPU usage by the /lib/systemd/systemd --user and /sbin/init processes. I tried various perf commands but never got such a clear view of CPU usage. Can you tell me how to record and view a CPU usage report like that? Thank you.


axot commented Aug 4, 2020

Hello @cloud-66, I think you need the debug symbols for this; please follow the instructions at https://wiki.ubuntu.com/Debug%20Symbol%20Packages

Once that is done, run these commands:

# add debug symbol for systemd
$ apt install systemd-dbgsym

# record with a call graph, sampling PID 1 at 99 Hz
$ perf record --call-graph=dwarf -F99 -p 1

# wait a few seconds, then check the report
$ perf report --stdio -G


Nowaker commented Sep 30, 2020

Yes, if you run runc exec very frequently you will see mount and umount syscalls.

Here's my finding repasted from kubernetes/kubernetes#82440

Liveness probe period of 30 seconds, and readiness probe period of 5 seconds. However, I can see in pstree that a runc process is constantly spawned in any container that has probes defined.

It's reasonable to expect some syscall activity every exec probe period. It's also reasonable to expect a reduction in the number of syscalls per minute when the exec probe period is longer (e.g. 30 seconds instead of 5). However, this is not the case. The mere presence of an exec probe causes a lot of syscalls and increased overall CPU load, regardless of the probe period.

I suggest reading through kubernetes/kubernetes#82440 where I provided a lot of data on the problem. I don't know if it's a Kubernetes or a Runc bug.

EDIT: Thanks to @cyphar for explaining below that it's not a runc issue! ❤️


cyphar commented Sep 30, 2020

(Copying this comment from the Kubernetes issue.)

Taking a quick look, kubernetes/kubernetes#82440 (comment) is describing Kubernetes effectively spamming runc status (the output of which is the JSON blob described). I don't think it's directly related to #2532 (we don't do any mounts in runc status), but I'm also not sure why Kubernetes (or I guess containerd) is requesting the status so many times.

On the runc side of things, in theory runc events could be used instead of runc status to switch away from a polling model for getting container states, though I don't think runc events has all of the information you might need to replace runc status.


cyphar commented Sep 30, 2020

@axot

Any improve could do in runc side? In systemd, there are trying to add rate limit for sd_event.

Not really. The mount setup was actually added to avoid other issues (we used to copy the binary rather than ro-mount it, which caused container and system memory usage to increase by 10MB each time you attached or configured a container). I am working on some kernel patches to remove the need for these protections (by blocking chained opens through a read-only fd), but they'd need to be merged first and then you'd need to upgrade as well.


axot commented Sep 30, 2020

@cyphar Good to hear that!


cyphar commented Sep 30, 2020

I was about to post a program to create a bind-mount of a memfd (which would allow you to avoid this overhead) but I just figured out that you can't bind-mount memfds. Oops. 🤔

@kolyshkin

So, this is mostly not a problem in runc per se, but a problem in systemd, which re-reads the proverbial mountinfo on every mount/umount.

The ultimate fix to that belongs in the kernel, and some work is being done in that direction (see https://lkml.org/lkml/2020/3/18/569) but we're not quite there yet.

In the meantime, systemd workarounds are being worked on -- if someone is interested to look into that, start from here: systemd/systemd#15464

@kolyshkin

Sorry, I hadn't realized the systemd issue was already mentioned here (right in the issue description).

Still, reducing runc overhead would be a good thing to do.

@axot
Copy link
Author

axot commented Jul 4, 2023

Big thanks to all of you for the hard work you've put into this.
