nsexec: spring cleaning #3953
Commits on Aug 8, 2023
gha: stash away a procfs mount
This is needed to make sure that our userns tests work in GitHub Actions if the host /proc is masked.

Signed-off-by: Aleksa Sarai <[email protected]>
Commit 38767e2

makefile: quote TESTFLAGS when passing to containerised make
Otherwise TESTFLAGS="-run FooBar" will result in TESTFLAGS=-run being executed in the container.

Signed-off-by: Aleksa Sarai <[email protected]>
Commit b35718b

libcontainer: sync: cleanup synchronisation code
This includes quite a few cleanups and improvements to the way we do synchronisation. The core behaviour is unchanged, but switching to embedding json.RawMessage into the synchronisation structure will allow us to do more complicated synchronisation operations in future patches.

The feature for passing file descriptors through the synchronisation system will be used as part of the idmapped-mount and bind-mount-source features when switching that code to use the new mount API outside of nsexec.c.

Signed-off-by: Aleksa Sarai <[email protected]>
Commit dbdc562

libcontainer: seccomp: pass around *os.File for notifyfd
*os.File is correctly tracked by the garbage collector, and there's no need to use raw file descriptors for this code.

Signed-off-by: Aleksa Sarai <[email protected]>
Commit a61cfd1

rootfs: use empty src for MS_REMOUNT
The kernel ignores these arguments, and passing them can lead to confusing error messages (the old source is irrelevant for MS_REMOUNT), as well as causing issues for a future patch where we switch to move_mount(2).

Signed-off-by: Aleksa Sarai <[email protected]>
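
The change can be pictured with the sketch below, which clears the source whenever MS_REMOUNT is set before the arguments would be handed to mount(2). The mountArgs type and buildMountArgs helper are hypothetical, and the flag constants are local copies of the Linux values so that the example runs anywhere; real code would take them from golang.org/x/sys/unix.

```go
package main

import "fmt"

// Local copies of the Linux mount flag values (assumption: real code would
// use the constants from golang.org/x/sys/unix instead).
const (
	msRdonly  uintptr = 0x1
	msRemount uintptr = 0x20
	msBind    uintptr = 0x1000
)

// mountArgs is a hypothetical container for the arguments of mount(2).
type mountArgs struct {
	Source string
	Target string
	Flags  uintptr
}

// buildMountArgs drops the source for remounts, since it plays no role in
// an MS_REMOUNT call and only makes error messages harder to read.
func buildMountArgs(source, target string, flags uintptr) mountArgs {
	if flags&msRemount != 0 {
		source = ""
	}
	return mountArgs{Source: source, Target: target, Flags: flags}
}

func main() {
	first := buildMountArgs("/host/data", "/rootfs/data", msBind)
	again := buildMountArgs("/host/data", "/rootfs/data", msBind|msRemount|msRdonly)
	fmt.Printf("%+v\n%+v\n", first, again)
}
```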
Commit 3ceb838

nsexec: remove cgroupns special-casing
The original implementation of cgroupns had additional synchronisation to "ensure" that the process is in the correct cgroup before unsharing the cgroupns. This behaviour was actually never necessary, and after commit 5110bd2 ("nsenter: remove cgroupns sync mechanism") there is no synchronisation at all, meaning that CLONE_NEWCGROUP should not get any special treatment.

Fixes: 5110bd2 ("nsenter: remove cgroupns sync mechanism")
Fixes: df3fa11 ("Add support for cgroup namespace")
Signed-off-by: Aleksa Sarai <[email protected]>
Commit 39e4217

nsexec: migrate memfd /proc/self/exe logic to Go code
This allows us to reduce the amount of C code in runc quite substantially, as well as removing a whole execve(2) from the nsexec path because we no longer spawn "runc init" only to re-exec "runc init" after doing the clone.

Signed-off-by: Aleksa Sarai <[email protected]>
Commit d84e150

configs: fix idmapped mounts json field names
In the runc state JSON we always use snake_case. This is a no-op change, but it will cause any existing container state files to be incorrectly parsed. Luckily, commit fbf183c ("Add uid and gid mappings to mounts") has never been in a runc release so we can change this before a 1.2.z release.

Fixes: fbf183c ("Add uid and gid mappings to mounts")
Signed-off-by: Aleksa Sarai <[email protected]>
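
To show why renaming JSON fields affects existing state files, the sketch below decodes two documents into a struct with snake_case tags. The struct layout, the field names and the camelCase key used for the "old" document are assumptions made for illustration, not copied from runc.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// idMap and mountState are hypothetical types with snake_case JSON tags.
type idMap struct {
	ContainerID int64 `json:"container_id"`
	HostID      int64 `json:"host_id"`
	Size        int64 `json:"size"`
}

type mountState struct {
	Source      string  `json:"source"`
	UIDMappings []idMap `json:"uid_mappings,omitempty"`
}

// parseState decodes a state document into mountState.
func parseState(doc string) (mountState, error) {
	var m mountState
	err := json.Unmarshal([]byte(doc), &m)
	return m, err
}

func main() {
	// Key spelling assumed for a state file written before the rename.
	oldDoc := `{"source":"/src","uidMappings":[{"container_id":0,"host_id":1000,"size":1}]}`
	newDoc := `{"source":"/src","uid_mappings":[{"container_id":0,"host_id":1000,"size":1}]}`

	for _, doc := range []string{oldDoc, newDoc} {
		m, err := parseState(doc)
		if err != nil {
			panic(err)
		}
		// Unknown keys are ignored silently, so the old document simply
		// loses its mappings instead of producing an error.
		fmt.Println(len(m.UIDMappings))
	}
}
```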
Commit ffce2c0

libcontainer: remove all mount logic from nsexec
With open_tree(OPEN_TREE_CLONE), it is possible to implement both the id-mapped mounts and bind-mount source file descriptor logic entirely in Go without requiring any complicated handling from nsexec.

This allows us to reduce the amount of C code we have in nsexec, as well as simplifying a whole host of places that were made more complicated with the addition of id-mapped mounts and the bind sourcefd logic. The one downside of this is that the bind sourcefd feature now depends on Linux 5.4.

In addition, we can easily add support for id-mappings that don't match the container's user namespace. The approach taken here is to use Go's officially supported mechanism for spawning a process in a user namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a different process. The most efficient way to implement this would be to do clone() in cgo directly to run a function that just does kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out this approach is too slow.

This also includes a partial fix for a bug in the handling of idmap mounts and mount propagation. In short, the issue is that because we do OPEN_TREE_CLONE on the host, the RootPropagation flag does not apply (nor any other mount propagation flags configured in config.json) and thus recursive bind-mounts can and will be leaked to the host. Because this patch switches the feature from 9c44407 ("Open bind mount sources from the host userns") to use OPEN_TREE_CLONE, this resulted in all bind-mounts having this behaviour. The partial fix here is to try to emulate the behaviour of RootPropagation for bind-mounts from the host.

It turns out that bind-mounts inside containers were broken when using the fd-passing feature from 9c44407 ("Open bind mount sources from the host userns") because the file descriptor opens were done before we start doing any mounts in rootfs_linux.go. The solution for this is more involved and is fixed in a separate patch. At the very least, this patch doesn't worsen the mount propagation situation.

Fixes: fda12ab ("Support idmap mounts on volumes")
Fixes: 9c44407 ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai <[email protected]>
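
Because the message states that the bind sourcefd feature now depends on Linux 5.4, a caller may want to gate the feature on the running kernel. The sketch below is one possible check; kernelAtLeast is a hypothetical helper that is not part of runc, and it parses a release string such as the one reported by uname -r.

```go
package main

import "fmt"

// kernelAtLeast reports whether a release string such as "5.15.0-91-generic"
// describes a kernel at or above major.minor. It is a hypothetical helper.
func kernelAtLeast(release string, major, minor int) (bool, error) {
	var gotMajor, gotMinor int
	// Sscanf stops once the format is satisfied, so any trailing patch
	// level or distribution suffix is ignored.
	if _, err := fmt.Sscanf(release, "%d.%d", &gotMajor, &gotMinor); err != nil {
		return false, fmt.Errorf("parsing kernel release %q: %w", release, err)
	}
	if gotMajor != major {
		return gotMajor > major, nil
	}
	return gotMinor >= minor, nil
}

func main() {
	// Sample release strings; real code would read the value from uname(2).
	for _, release := range []string{"4.19.0", "5.4.0-150-generic", "6.1.0"} {
		ok, err := kernelAtLeast(release, 5, 4)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: bind sourcefd usable: %v\n", release, ok)
	}
}
```

Probing the system call directly and handling ENOSYS is another option, and it avoids misjudging kernels with backported features.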
Commit acaf423

idmap: allow arbitrary idmap mounts regardless of userns configuration
With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we can now handle arbitrary mappings without issue, so remove the primary artificial limit of mappings (must use the same mapping as the container's userns) and add some tests.

We still only support idmap mounts for bind-mounts because configuring mappings for other filesystems would require switching our entire mount machinery to the new mount API. The current design would easily allow for this but we would need to convert new mount options entirely to the fsopen/fsconfig/fsmount API. This can be done in the future.

Signed-off-by: Aleksa Sarai <[email protected]>
Commit 18ce9de

squash! libcontainer: remove all mount logic from nsexec
Signed-off-by: Aleksa Sarai <[email protected]>
Commit 7a3c81b