nsexec: spring cleaning #3953

cyphar · 2023-08-01T10:41:48Z

This includes several cleanups to nsexec.c:

(WIP) Move the id-mapping and bindmount source-fd logic out of nsexec.c, with all of its complicated rewriting and opening magic, and instead use open_tree() in the Go portion of runc, passing the file descriptors down to rootfs_linux.go so that it can just do a simple move_mount(2). This also allows us to easily implement handling of different id-mappings for mounts and is thus a replacement for #3943 ("Support for ID map mounts without userns").
Remove the special-casing of CLONE_NEWCGROUP that was arguably never necessary and definitely isn't necessary now.
Remove the repetitive (and error-prone) sane_kill error path handling by using prctl(PR_SET_PDEATHSIG, SIGKILL) to auto-kill our runc C code if the parent runc process dies. We clear pdeathsig once setup is complete.
(Depends on #3931, "nsexec: cloned_binary: remove bindfd logic entirely".) Move the cloned binary logic entirely to Go so that we can save time on an execve by directly executing the copy when we spawn runc init.

This includes several major steps towards the (maybe possible) goal of #3951.

adrianreber · 2023-08-04T16:08:15Z

I know that CRIU uses the new mount API if available. Maybe this is related if the kernel is too old, but @Snorch is the expert when it comes to mounts.

@Snorch do you have any ideas regarding @cyphar's question?

cyphar · 2023-08-06T02:02:01Z

.github/workflows/test.yml

+    - name: procfs mount
+      run: |
+        # Get the list of mounts to help with debugging.
+        cat /proc/self/mounts
+        # Create a procfs mount that is not masked, to ensure that container
+        # procfs mounts will succeed.
+        sudo mkdir -p /tmp/.procfs-stashed-mount
+        sudo unshare -pf mount -t proc -o subset=pid proc /tmp/.procfs-stashed-mount


I don't know why this has become necessary with this PR (the "problem" commit is the /proc/self/exe cloning change) but this solves the issue and this is a CI-weirdness issue.

@cyphar Does that mean that runc won't work as-is inside "your usual" GHA, or is it only for the sake of testing?

cyphar · 2023-08-06T12:16:24Z

@adrianreber @Snorch I figured out what the underlying issue is, the nested mount is being leaked to the host when using open_tree(2) for all bind-mounts (and is actually an existing issue with the old code, it's just that allowing for remapping by non-userns mounts causes the mount to be leaked). Basically the core issue is that the rootfs mount propagation doesn't apply this early in the runc startup process. Working on a solution...

We also really need a separate integration test for the leaking behaviour. The fact that only this criu test fails for this fairly broken behaviour (and doesn't fail on newer kernels) is quite concerning.

Snorch · 2023-08-07T04:20:25Z

Hello, @cyphar

> In particular is there a reason to think that (on older kernels) criu would struggle to deal with bind mounts created with the new mount API?

I would say there is not.

If criu lacks some of the new mount API (move_mount(..._SET_GROUP) and openat2(RESOLVE_NO_XDEV)), it switches to the "old" mount engine, which uses only the old mount API.

Long story short: we have two mount engines in the latest CRIU, "new" and "old", and there is no big difference between them in simple cases. The "new" one is more precise, though, so it can fail in some cases where "old" just silently did something wrong; at the same time, "new" supports complex cases related to propagation, so in some cases where "old" would fail or do something wrong, "new" will succeed =)

But I would not say that there is some general rule that "old" can't restore (or badly restores) mounts created with the new mount API. The only thing we can do is investigate each case.

> I figured out what the underlying issue is, the nested mount is being leaked to the host when using open_tree(2) for all bind-mounts (and is actually an existing issue with the old code, it's just that allowing for remapping by non-userns mounts causes the mount to be leaked).

I'm not sure that I fully understand it, but from context it looks like you are talking about an issue in runc that was just detected by a CRIU test.

> Basically the core issue is that the rootfs mount propagation doesn't apply this early in the runc startup process.

Just in case: the "new" mount engine of CRIU does not support the container's root mount being shared with a mount outside of the container (e.g. the one from which it was created, i.e. the --root criu option); only slave is supported. After CRIU c/r, the root mount of the container would become a separate sharing group if it was shared with the outside before.

rata · 2023-08-07T11:17:06Z

libcontainer/container_linux.go

+				if err := unix.MountSetattr(int(mountFile.Fd()), "", unix.AT_EMPTY_PATH|setattrFlags, &unix.MountAttr{
+					Propagation: uint64(propFlags &^ unix.MS_REC),
+				}); err != nil {
+					return fmt.Errorf("remap mount sources: failed to set mount propagation of %q bind-mount to 0x%x: %w", m.Source, propFlags, err)


It would be nice to say which syscall is failing too.

eiffel-fl

Thank you for it! It lets us remove a lot of code, and your implementation of ID map mounts without user ns is more flexible than mine.
I will take a deeper look later, but for now here are some questions:

eiffel-fl · 2023-08-07T11:58:51Z

libcontainer/container_linux.go

-	// NOTE: when running a container with no PID namespace and the parent process spawning the container is
-	// PID1 the pdeathsig is being delivered to the container's init process by the kernel for some reason
-	// even with the parent still running.
+	// Due to a Go stdlib bug, we need to add c.safeExeFile to the set of


I am curious, can you please share more information regarding this bug?

golang/go#61751

libcontainer/userns/usernsfd_linux.go

This is needed to make sure that our userns tests work in GitHub Actions if the host /proc is masked. Signed-off-by: Aleksa Sarai <[email protected]>

Otherwise TESTFLAGS="-run FooBar" will result in TESTFLAGS=-run being executed in the container. Signed-off-by: Aleksa Sarai <[email protected]>

This includes quite a few cleanups and improvements to the way we do synchronisation. The core behaviour is unchanged, but switching to embedding json.RawMessage into the synchronisation structure will allow us to do more complicated synchronisation operations in future patches. The feature of passing file descriptors through the synchronisation system will be used as part of the idmapped-mount and bind-mount-source features when switching that code to use the new mount API outside of nsexec.c. Signed-off-by: Aleksa Sarai <[email protected]>
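
To illustrate why embedding json.RawMessage helps, here is a minimal sketch. It assumes a message with a type tag and an opaque argument; the names syncMsg, mountSourceArg, decodeMountSource and the "mountSource" type string are hypothetical and are not runc's actual definitions. The receiver decodes the envelope first and only decodes the payload once it knows which type to expect.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// syncMsg is a hypothetical synchronisation message. The payload stays as
// raw JSON until the receiver knows, from Type, how to decode it.
type syncMsg struct {
	Type string          `json:"type"`
	Arg  json.RawMessage `json:"arg,omitempty"`
}

// mountSourceArg is a hypothetical payload for one message type.
type mountSourceArg struct {
	Source string `json:"source"`
}

// decodeMountSource parses the envelope and, if it has the expected type,
// decodes the deferred payload into a mountSourceArg.
func decodeMountSource(data []byte) (mountSourceArg, error) {
	var msg syncMsg
	var arg mountSourceArg
	if err := json.Unmarshal(data, &msg); err != nil {
		return arg, err
	}
	if msg.Type != "mountSource" {
		return arg, fmt.Errorf("unexpected sync type %q", msg.Type)
	}
	err := json.Unmarshal(msg.Arg, &arg)
	return arg, err
}

func main() {
	payload, _ := json.Marshal(mountSourceArg{Source: "/host/data"})
	wire, _ := json.Marshal(syncMsg{Type: "mountSource", Arg: payload})
	arg, err := decodeMountSource(wire)
	fmt.Println(arg.Source, err)
}
```

New message types can then carry their own payload structs without the envelope definition having to change.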

*os.File is correctly tracked by the garbage collector, and there's no need to use raw file descriptors for this code. Signed-off-by: Aleksa Sarai <[email protected]>

The kernel ignores these arguments, and passing them can lead to confusing error messages (the old source is irrelevant for MS_REMOUNT), as well as causing issues for a future patch where we switch to move_mount(2). Signed-off-by: Aleksa Sarai <[email protected]>

The original implementation of cgroupns had additional synchronisation to "ensure" that the process is in the correct cgroup before unsharing the cgroupns. This behaviour was actually never necessary, and after commit 5110bd2 ("nsenter: remove cgroupns sync mechanism") there is no synchronisation at all, meaning that CLONE_NEWCGROUP should not get any special treatment. Fixes: 5110bd2 ("nsenter: remove cgroupns sync mechanism") Fixes: df3fa11 ("Add support for cgroup namespace") Signed-off-by: Aleksa Sarai <[email protected]>

This allows us to reduce the amount of C code in runc quite substantially, as well as removing a whole execve(2) from the nsexec path because we no longer spawn "runc init" only to re-exec "runc init" after doing the clone. Signed-off-by: Aleksa Sarai <[email protected]>

In the runc state JSON we always use snake_case. This is a no-op change, but it will cause any existing container state files to be incorrectly parsed. Luckily, commit fbf183c ("Add uid and gid mappings to mounts") has never been in a runc release so we can change this before a 1.2.z release. Fixes: fbf183c ("Add uid and gid mappings to mounts") Signed-off-by: Aleksa Sarai <[email protected]>
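
The parsing hazard mentioned here can be shown with a small sketch. The struct and field names below are simplified and hypothetical, and the camelCase spelling of the old key is an assumption. Since encoding/json matches keys case-insensitively but does not treat "uidMappings" and "uid_mappings" as the same key, state written with the old key is silently dropped when read with the new tag, without any error.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// idMap is a simplified, hypothetical mapping entry.
type idMap struct {
	ContainerID int `json:"container_id"`
	HostID      int `json:"host_id"`
	Size        int `json:"size"`
}

// mountOld uses a camelCase key (the assumed pre-rename spelling).
type mountOld struct {
	UIDMappings []idMap `json:"uidMappings,omitempty"`
}

// mountNew uses the snake_case key.
type mountNew struct {
	UIDMappings []idMap `json:"uid_mappings,omitempty"`
}

// reparse serialises state with the old key and reads it back with the new
// struct, which is what happens to an existing state file after the rename.
func reparse(old mountOld) (mountNew, error) {
	var m mountNew
	data, err := json.Marshal(old)
	if err != nil {
		return m, err
	}
	err = json.Unmarshal(data, &m)
	return m, err
}

func main() {
	old := mountOld{UIDMappings: []idMap{{ContainerID: 0, HostID: 100000, Size: 65536}}}
	m, err := reparse(old)
	// The mappings are gone and no error is reported.
	fmt.Println(len(m.UIDMappings), err) // prints "0 <nil>"
}
```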

With open_tree(OPEN_TREE_CLONE), it is possible to implement both the id-mapped mounts and bind-mount source file descriptor logic entirely in Go without requiring any complicated handling from nsexec. This allows us to reduce the amount of C code we have in nsexec, as well as simplifying a whole host of places that were made more complicated with the addition of id-mapped mounts and the bind sourcefd logic. The one downside of this is that the bind sourcefd feature now depends on Linux 5.4.

In addition, we can easily add support for id-mappings that don't match the container's user namespace. The approach taken here is to use Go's officially supported mechanism for spawning a process in a user namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a different process. The most efficient way to implement this would be to do clone() in cgo directly to run a function that just does kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out this approach is too slow.

This also includes a partial fix for a bug in the handling of idmap mounts and mount propagation. In short, the issue is that because we do OPEN_TREE_CLONE on the host, the RootPropagation flag does not apply (nor do any other mount propagation flags configured in config.json) and thus recursive bind-mounts can and will be leaked to the host. Because this patch switches the feature from 9c44407 ("Open bind mount sources from the host userns") to use OPEN_TREE_CLONE, this resulted in all bind-mounts having this behaviour. The partial fix here is to try to emulate the behaviour of RootPropagation for bind-mounts from the host.

It turns out that bind-mounts inside containers were broken when using the fd-passing feature from 9c44407 ("Open bind mount sources from the host userns") because the file descriptor opens were done before we start doing any mounts in rootfs_linux.go. The solution for this is more involved and is fixed in a separate patch. At the very least, this patch doesn't worsen the mount propagation situation.

Fixes: fda12ab ("Support idmap mounts on volumes") Fixes: 9c44407 ("Open bind mount sources from the host userns") Signed-off-by: Aleksa Sarai <[email protected]>
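
Emulating a propagation setting on a detached mount means translating mount(2)-style flags into what mount_setattr(2) expects: recursion is requested through AT_RECURSIVE in the flags argument, while the propagation field only takes the propagation type. A minimal sketch of that split follows. It is not the code from this patch; splitPropagation is a hypothetical name, and the constants are defined locally with the Linux UAPI values so that the example has no dependencies.

```go
package main

import "fmt"

// Values mirror the Linux UAPI headers; defined locally for this sketch.
const (
	msRec       = 0x4000  // MS_REC
	msPrivate   = 1 << 18 // MS_PRIVATE
	msSlave     = 1 << 19 // MS_SLAVE
	msShared    = 1 << 20 // MS_SHARED
	atEmptyPath = 0x1000  // AT_EMPTY_PATH
	atRecursive = 0x8000  // AT_RECURSIVE
)

// splitPropagation (hypothetical) converts mount(2)-style propagation flags
// into the two pieces mount_setattr(2) takes: the flags argument, which
// carries the recursion request, and the propagation value without MS_REC.
func splitPropagation(propFlags uint64) (flags uint, propagation uint64) {
	// AT_EMPTY_PATH because the target is the file descriptor itself.
	flags = atEmptyPath
	if propFlags&msRec != 0 {
		flags |= atRecursive
	}
	return flags, propFlags &^ msRec
}

func main() {
	// "rslave" in mount(2) terms is MS_SLAVE|MS_REC.
	flags, prop := splitPropagation(msSlave | msRec)
	fmt.Printf("flags=0x%x propagation=0x%x\n", flags, prop) // flags=0x9000 propagation=0x80000
}
```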

With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we can now handle arbitrary mappings without issue, so remove the primary artificial limit of mappings (must use the same mapping as the container's userns) and add some tests. We still only support idmap mounts for bind-mounts because configuring mappings for other filesystems would require switching our entire mount machinery to the new mount API. The current design would easily allow for this but we would need to convert new mount options entirely to the fsopen/fsconfig/fsmount API. This can be done in the future. Signed-off-by: Aleksa Sarai <[email protected]>

Signed-off-by: Aleksa Sarai <[email protected]>

kolyshkin · 2023-08-14T23:19:29Z

I spent some time rebasing this PR and almost succeeded, except for idmap.bats changes.

One thing that helped me when applying commit 18ce9de ("iidmap: allow arbitrary idmap mounts regardless of userns configuration") was using git format-patch -1 --diff-algorithm=patience 18ce9de4e3e0c6, which produces a much more sensible diff for validate_test.go; it applied fine except for one trivial hunk (an empty line removal).

libcontainer/init_linux.go

+// syncParentSeccomp sends the fd associated with the seccomp file descriptor
+// to the parent, and wait for the parent to do pidfd_getfd() to grab a copy.
+func syncParentSeccomp(pipe *os.File, seccompFd int) error {
+	if seccompFd >= 0 {


kolyshkin · 2023-08-15T01:42:45Z

I suggest we split this one into more digestible PRs. Ideally, each medium-sized feature should be a separate PR.

I rebased the more-or-less trivial stuff from this PR to #3982, which I hope we can merge soon.

libcontainer/process_linux.go

+			// We have a copy, the child can keep working. We don't need to
+			// wait for the seccomp notify listener to get the fd before we
+			// permit the child to continue because the child will happily wait
+			// for the listener if it hits SCMP_ACT_NOTIFY.
+			if err := writeSync(p.messageSockPair.parent, procSeccompDone); err != nil {


cyphar · 2023-08-20T14:02:34Z

All of this code is now split into separate PRs #3982, #3967, #3985, #3987. Closing in favour of those...

cyphar mentioned this pull request Aug 2, 2023

contrib/fs-idmap: Minor cleanups #3954

Merged

cyphar added this to the 1.2.0 milestone Aug 2, 2023

This was referenced Aug 4, 2023

nsexec: cloned_binary: remove bindfd logic entirely #3931

Merged

syscall: nextfd handling for attr.Files shuffle will clobber files golang/go#61751

Open

cyphar mentioned this pull request Aug 4, 2023

runc: release 1.2.0-rc.1 #3963

Closed

cyphar mentioned this pull request Aug 6, 2023

Support for ID map mounts without userns #3943

Closed

cyphar commented Aug 6, 2023

rata reviewed Aug 7, 2023

eiffel-fl reviewed Aug 7, 2023

cyphar added 11 commits August 8, 2023 16:27

gha: stash away a procfs mount

38767e2

This is needed to make sure that our userns tests work in GitHub Actions if the host /proc is masked. Signed-off-by: Aleksa Sarai <[email protected]>

makefile: quote TESTFLAGS when passing to containerised make

b35718b

Otherwise TESTFLAGS="-run FooBar" will result in TESTFLAGS=-run being executed in the container. Signed-off-by: Aleksa Sarai <[email protected]>

libcontainer: seccomp: pass around *os.File for notifyfd

a61cfd1

*os.File is correctly tracked by the garbage collector, and there's no need to use raw file descriptors for this code. Signed-off-by: Aleksa Sarai <[email protected]>

squash! libcontainer: remove all mount logic from nsexec

7a3c81b

Signed-off-by: Aleksa Sarai <[email protected]>

kolyshkin reviewed Aug 15, 2023

kolyshkin mentioned this pull request Aug 15, 2023

Nsexec spring cleaning part I #3982

Merged

kolyshkin reviewed Aug 16, 2023

kolyshkin mentioned this pull request Aug 16, 2023

ci/gha: add job timeouts #3984

Merged

This was referenced Aug 16, 2023

[Proposal] Use runc-dmz to defeat CVE-2019-5736 #3983

Closed

libcontainer: remove all mount logic from nsexec #3985

Merged

nsexec: cloned binary rework #3987

Merged

cyphar closed this Aug 20, 2023

cyphar deleted the nsexec-spring-cleaning branch August 20, 2023 14:02
