nsexec: spring cleaning #3953

Closed
wants to merge 11 commits into from

Commits on Aug 8, 2023

  1. gha: stash away a procfs mount

    This is needed to make sure that our userns tests work in GitHub Actions
    if the host /proc is masked.
    
    Signed-off-by: Aleksa Sarai <[email protected]>
    cyphar committed Aug 8, 2023
    38767e2
  2. makefile: quote TESTFLAGS when passing to containerised make

    Otherwise TESTFLAGS="-run FooBar" will result in TESTFLAGS=-run being
    executed in the container.
    
    Signed-off-by: Aleksa Sarai <[email protected]>
    cyphar committed Aug 8, 2023
    b35718b
  3. libcontainer: sync: cleanup synchronisation code

    This includes quite a few cleanups and improvements to the way we do
    synchronisation. The core behaviour is unchanged, but switching to
    embedding json.RawMessage into the synchronisation structure will allow
    us to do more complicated synchronisation operations in future patches.
    
    The ability to pass file descriptors through the synchronisation system
    will be used as part of the idmapped-mount and bind-mount-source
    features when switching that code to use the new mount API outside of
    nsexec.c.
    
    Signed-off-by: Aleksa Sarai <[email protected]>
    cyphar committed Aug 8, 2023
    dbdc562
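
As an illustration of why embedding json.RawMessage helps, here is a minimal sketch of a sync envelope whose payload stays undecoded until the receiver has looked at the message type. The names `syncMsg`, `procMountRequest`, `mountRequest` and the helper functions are hypothetical and are not taken from the patch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// syncType names the kind of message. The values are hypothetical.
type syncType string

const (
	procReady        syncType = "procReady"
	procMountRequest syncType = "procMountRequest"
)

// syncMsg is the envelope. Arg stays undecoded until the receiver knows,
// from Type, which concrete payload to expect.
type syncMsg struct {
	Type syncType         `json:"type"`
	Arg  *json.RawMessage `json:"arg,omitempty"`
}

// mountRequest is a hypothetical payload for procMountRequest.
type mountRequest struct {
	Source string `json:"source"`
}

// encodeSync wraps an optional payload in the envelope.
func encodeSync(t syncType, arg any) ([]byte, error) {
	msg := syncMsg{Type: t}
	if arg != nil {
		raw, err := json.Marshal(arg)
		if err != nil {
			return nil, err
		}
		rm := json.RawMessage(raw)
		msg.Arg = &rm
	}
	return json.Marshal(msg)
}

// decodeSync parses only the envelope and leaves the payload as raw JSON.
func decodeSync(data []byte) (syncMsg, error) {
	var msg syncMsg
	err := json.Unmarshal(data, &msg)
	return msg, err
}

// decodeArg decodes the deferred payload into v.
func decodeArg(msg syncMsg, v any) error {
	if msg.Arg == nil {
		return fmt.Errorf("sync message %q carries no argument", msg.Type)
	}
	return json.Unmarshal(*msg.Arg, v)
}

func main() {
	data, err := encodeSync(procMountRequest, mountRequest{Source: "/src"})
	if err != nil {
		panic(err)
	}
	msg, err := decodeSync(data)
	if err != nil {
		panic(err)
	}
	var req mountRequest
	if err := decodeArg(msg, &req); err != nil {
		panic(err)
	}
	fmt.Println(msg.Type, req.Source)
}
```

New message types can then carry their own payload structs without the envelope itself changing.
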
  4. libcontainer: seccomp: pass around *os.File for notifyfd

    *os.File is correctly tracked by the garbage collector, and there's no
    need to use raw file descriptors for this code.
    
    Signed-off-by: Aleksa Sarai <[email protected]>
    cyphar committed Aug 8, 2023
    a61cfd1
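
To show the general pattern (not the seccomp code itself), the sketch below wraps raw descriptors in *os.File as soon as they are created and passes only the *os.File around afterwards. The helper names and the use of a pipe as a stand-in for the notify fd are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

// pipeFiles creates a pipe with a raw syscall and immediately wraps both
// ends in *os.File, so that ownership of the descriptors is explicit.
func pipeFiles() (r, w *os.File, err error) {
	var fds [2]int
	if err := syscall.Pipe2(fds[:], syscall.O_CLOEXEC); err != nil {
		return nil, nil, err
	}
	return os.NewFile(uintptr(fds[0]), "notify-read"),
		os.NewFile(uintptr(fds[1]), "notify-write"), nil
}

// writeNotice takes *os.File rather than an int, so callers cannot pass a
// stale or already-closed descriptor number by accident.
func writeNotice(f *os.File, msg string) error {
	_, err := f.WriteString(msg)
	return err
}

// readNotice reads until the write end is closed.
func readNotice(f *os.File) (string, error) {
	data, err := io.ReadAll(f)
	return string(data), err
}

func main() {
	r, w, err := pipeFiles()
	if err != nil {
		panic(err)
	}
	defer r.Close()
	if err := writeNotice(w, "ready"); err != nil {
		panic(err)
	}
	w.Close()
	msg, err := readNotice(r)
	if err != nil {
		panic(err)
	}
	fmt.Println(msg)
}
```
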
  5. rootfs: use empty src for MS_REMOUNT

    The kernel ignores these arguments, and passing them can lead to
    confusing error messages (the old source is irrelevant for MS_REMOUNT),
    as well as causing issues for a future patch where we switch to
    move_mount(2).
    
    Signed-off-by: Aleksa Sarai <[email protected]>
    cyphar committed Aug 8, 2023
    3ceb838
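
A minimal sketch of a remount call with an empty source, using the standard library's syscall.Mount on Linux. The helper name `remount` and the example path are hypothetical. The call needs privileges to succeed, so the example only prints whatever error comes back.

```go
package main

import (
	"fmt"
	"syscall"
)

// remount changes the flags of an existing mount at target. The source,
// filesystem type and data are left empty because only target and flags
// matter for this kind of call.
func remount(target string, flags uintptr) error {
	return syscall.Mount("", target, "", flags|syscall.MS_REMOUNT, "")
}

func main() {
	// Hypothetical path, used only to show the call shape.
	err := remount("/nonexistent/mountpoint", syscall.MS_BIND|syscall.MS_RDONLY)
	fmt.Println("remount:", err)
}
```
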
  6. nsexec: remove cgroupns special-casing

    The original implementation of cgroupns had additional synchronisation
    to "ensure" that the process is in the correct cgroup before unsharing
    the cgroupns. This behaviour was actually never necessary, and after
    commit 5110bd2 ("nsenter: remove cgroupns sync mechanism") there is
    no synchronisation at all, meaning that CLONE_NEWCGROUP should not get
    any special treatment.
    
    Fixes: 5110bd2 ("nsenter: remove cgroupns sync mechanism")
    Fixes: df3fa11 ("Add support for cgroup namespace")
    Signed-off-by: Aleksa Sarai <[email protected]>
    cyphar committed Aug 8, 2023
    39e4217
  7. nsexec: migrate memfd /proc/self/exe logic to Go code

    This allows us to reduce the amount of C code in runc quite
    substantially, and removes a whole execve(2) from the nsexec
    path because we no longer spawn "runc init" only to re-exec "runc init"
    after doing the clone.
    
    Signed-off-by: Aleksa Sarai <[email protected]>
    cyphar committed Aug 8, 2023
    d84e150
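
The copy step of cloning the running binary can be sketched in Go as below. This is not the patch's implementation: memfd_create(2) is not exposed by the Go standard library, so the sketch substitutes an unlinked temporary file for the memfd. The function name `cloneSelfExe` is hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// cloneSelfExe copies the running binary into a temporary file in dir
// (the default temp directory if dir is empty), unlinks it, and returns
// the open handle. An unlinked temp file stands in for a memfd here.
func cloneSelfExe(dir string) (*os.File, error) {
	src, err := os.Open("/proc/self/exe")
	if err != nil {
		return nil, err
	}
	defer src.Close()

	dst, err := os.CreateTemp(dir, "exe-clone-")
	if err != nil {
		return nil, err
	}
	// Unlink right away so only the open descriptor keeps the copy alive.
	if err := os.Remove(dst.Name()); err != nil {
		dst.Close()
		return nil, err
	}
	if _, err := io.Copy(dst, src); err != nil {
		dst.Close()
		return nil, err
	}
	return dst, nil
}

func main() {
	f, err := cloneSelfExe("")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		panic(err)
	}
	fmt.Println("cloned bytes:", info.Size())
}
```

The sketch covers only the copy. Making the result executable (file mode, reopening it read-only, sealing in the memfd case) is left out.
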
  8. configs: fix idmapped mounts json field names

    In the runc state JSON we always use snake_case. This change has no
    functional effect, but existing container state files written with the
    old field names will be parsed incorrectly. Luckily, commit fbf183c
    ("Add uid and gid mappings to
    mounts") has never been in a runc release so we can change this before a
    1.2.z release.
    
    Fixes: fbf183c ("Add uid and gid mappings to mounts")
    Signed-off-by: Aleksa Sarai <[email protected]>
    cyphar committed Aug 8, 2023
    ffce2c0
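
The effect of renaming JSON field tags can be seen in a small sketch. The struct and field names below are hypothetical stand-ins for the state types; only the snake_case convention comes from the commit message. A key written under the old camelCase name no longer matches the tag and is silently dropped on decode.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// idMap and mountState are hypothetical stand-ins for the state types.
type idMap struct {
	ContainerID int64 `json:"container_id"`
	HostID      int64 `json:"host_id"`
	Size        int64 `json:"size"`
}

type mountState struct {
	Source      string  `json:"source"`
	Destination string  `json:"destination"`
	UIDMappings []idMap `json:"uid_mappings,omitempty"`
	GIDMappings []idMap `json:"gid_mappings,omitempty"`
}

// encodeMount serialises a mount using the snake_case field names.
func encodeMount(m mountState) (string, error) {
	data, err := json.Marshal(m)
	return string(data), err
}

// decodeMount parses a mount. Keys that match no tag are ignored.
func decodeMount(data string) (mountState, error) {
	var m mountState
	err := json.Unmarshal([]byte(data), &m)
	return m, err
}

func main() {
	old := `{"source":"/a","destination":"/b","uidMappings":[{"container_id":0,"host_id":1000,"size":1}]}`
	m, err := decodeMount(old)
	if err != nil {
		panic(err)
	}
	fmt.Println("mappings recovered from old-style state:", len(m.UIDMappings))
}
```
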
  9. libcontainer: remove all mount logic from nsexec

    With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
    id-mapped mounts and bind-mount source file descriptor logic entirely in
    Go without requiring any complicated handling from nsexec.
    
    This allows us to reduce the amount of C code we have in nsexec, as well
    as simplifying a whole host of places that were made more complicated
    with the addition of id-mapped mounts and the bind sourcefd logic. The
    one downside of this is that the bind sourcefd feature now depends on
    Linux 5.4.
    
    In addition, we can easily add support for id-mappings that don't match
    the container's user namespace. The approach taken here is to use Go's
    officially supported mechanism for spawning a process in a user
    namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
    different process. The most efficient way to implement this would be to
    do clone() in cgo directly to run a function that just does
    kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
    this approach is too slow.
    
    This also includes a partial fix for a bug in the handling of idmap
    mounts and mount propagation. In short, the issue is that because we do
    OPEN_TREE_CLONE on the host, the RootPropagation flag does not apply
    (nor any other mount propagation flags configured in config.json) and
    thus recursive bind-mounts can and will be leaked to the host. Because
    this patch switches the feature from 9c44407 ("Open bind mount
    sources from the host userns") to use OPEN_TREE_CLONE, this resulted in
    all bind-mounts having this behaviour. The partial fix here is to try to
    emulate the behaviour of RootPropagation for bind-mounts from the host.
    
    It turns out that bind-mounts inside containers were broken when using
    the fd-passing feature from 9c44407 ("Open bind mount sources from
    the host userns") because the file descriptors were opened before we
    started doing any mounts in rootfs_linux.go. The fix for this is more
    involved and is handled in a separate patch. At the very least, this patch
    doesn't worsen the mount propagation situation.
    
    Fixes: fda12ab ("Support idmap mounts on volumes")
    Fixes: 9c44407 ("Open bind mount sources from the host userns")
    Signed-off-by: Aleksa Sarai <[email protected]>
    cyphar committed Aug 8, 2023
    acaf423
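
The PTRACE_TRACEME trick described in this commit can be sketched with the standard library, assuming the "officially supported mechanism" is syscall.SysProcAttr with Cloneflags and ID mappings. The helper names are hypothetical. The child is stopped by ptrace as soon as its exec completes, so it never runs any code of its own. It exists only so that its user namespace can be opened through /proc.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
	"syscall"
)

// usernsAttr describes a child in a new user namespace that maps container
// ID 0 to the given host uid and gid. The child is traced and stops as soon
// as its exec completes.
func usernsAttr(hostUID, hostGID int) *syscall.SysProcAttr {
	return &syscall.SysProcAttr{
		Cloneflags:                 syscall.CLONE_NEWUSER,
		UidMappings:                []syscall.SysProcIDMap{{ContainerID: 0, HostID: hostUID, Size: 1}},
		GidMappings:                []syscall.SysProcIDMap{{ContainerID: 0, HostID: hostGID, Size: 1}},
		GidMappingsEnableSetgroups: false,
		Ptrace:                     true,
	}
}

// openUserns spawns the placeholder child, opens a handle to its user
// namespace, then kills and reaps the child.
func openUserns(attr *syscall.SysProcAttr) (*os.File, error) {
	// The syscall package documents that the thread must stay locked while
	// a child started with Ptrace set is being handled.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	cmd := exec.Command("/proc/self/exe")
	cmd.SysProcAttr = attr
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	defer func() {
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
	}()
	return os.Open(fmt.Sprintf("/proc/%d/ns/user", cmd.Process.Pid))
}

func main() {
	f, err := openUserns(usernsAttr(os.Getuid(), os.Getgid()))
	if err != nil {
		// Unprivileged user namespaces may be disabled on this system.
		fmt.Println("could not create user namespace:", err)
		return
	}
	defer f.Close()
	fmt.Println("opened", f.Name())
}
```

A handle like this is what an id-mapped mount needs as its userns reference. The open_tree(2) and mount_setattr(2) calls themselves are not shown because they are not in the Go standard library.
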
  10. idmap: allow arbitrary idmap mounts regardless of userns configuration

    With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we
    can now handle arbitrary mappings without issue, so remove the primary
    artificial limit of mappings (must use the same mapping as the
    container's userns) and add some tests.
    
    We still only support idmap mounts for bind-mounts because configuring
    mappings for other filesystems would require switching our entire mount
    machinery to the new mount API. The current design would easily allow
    for this but we would need to convert new mount options entirely to the
    fsopen/fsconfig/fsmount API. This can be done in the future.
    
    Signed-off-by: Aleksa Sarai <[email protected]>
    cyphar committed Aug 8, 2023
    18ce9de
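
The remaining restriction (mappings only on bind-mounts, with no requirement that they match the container's userns) could be expressed as a check along these lines. The types and the function `checkIDMapMount` are hypothetical and only illustrate the rule stated in the commit message. They are not the validator from the patch.

```go
package main

import (
	"errors"
	"fmt"
)

// idMap and mount are hypothetical, simplified configuration types.
type idMap struct {
	ContainerID, HostID, Size int64
}

type mount struct {
	Destination string
	Bind        bool
	UIDMappings []idMap
	GIDMappings []idMap
}

var errIDMapNeedsBind = errors.New("id-mapped mounts are only supported for bind-mounts")

// checkIDMapMount accepts any mapping on a bind-mount, whether or not it
// matches the container's own user namespace, and rejects mappings on
// other mount types.
func checkIDMapMount(m mount) error {
	if len(m.UIDMappings) == 0 && len(m.GIDMappings) == 0 {
		return nil // not an id-mapped mount
	}
	if !m.Bind {
		return errIDMapNeedsBind
	}
	return nil
}

func main() {
	mapping := []idMap{{ContainerID: 0, HostID: 100000, Size: 65536}}
	fmt.Println(checkIDMapMount(mount{Destination: "/data", Bind: true, UIDMappings: mapping, GIDMappings: mapping}))
	fmt.Println(checkIDMapMount(mount{Destination: "/tmp", Bind: false, UIDMappings: mapping, GIDMappings: mapping}))
}
```
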
  11. squash! libcontainer: remove all mount logic from nsexec

    Signed-off-by: Aleksa Sarai <[email protected]>
    cyphar committed Aug 8, 2023
    7a3c81b