Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mount: add enhanced mount functionality to support run container in userns with host network #3613

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

shidao1
Copy link

@shidao1 shidao1 commented Sep 27, 2022

in the public cloud service product, serverless container running environment has some specials.

  1. the container is often running on a separate kernel.
  2. so the runc is work on host network mode.
  3. the container has no privilege permission, and also has no cap_sys_admin
  4. many data accelerate project will use mount.fuse to provide mount point for app access data

the purpose is the container running in a new userns on host network mode.
the main process is using syscall open_tree to get fd for mount point sys, proc, mqueue beforce runc switch to new user ns
and using move_mount to mount sys, proc, mqueue after runc switch to new user ns

jiangliu and others added 5 commits September 27, 2022 12:58
Extend bootstrap message to pass mount fds for open_tree()/move_mount().

Signed-off-by: Jiang Liu <[email protected]>
Enhance mountToRootfs() to support MoveMount(), so it could be used
to support cross user namespace mounting.

Signed-off-by: Jiang Liu <[email protected]>
Introduce struct namespace_info_t to split join_namespaces() in stages,
so it could be reused later.

Signed-off-by: Jiang Liu <[email protected]>
Prepare source mount fds for move_mount() to support cross user
namespace mounting.

Signed-off-by: Jiang Liu <[email protected]>
When a user namespace is enabled for a pod/container, it may fail to
mount /proc, /sys and /dev/mqueue under certain conditions. This may
be solved by enabling cross user namespace mounting.

Signed-off-by: Jiang Liu <[email protected]>
Signed-off-by: shidao.ytt <[email protected]>
@@ -400,12 +400,25 @@ func mountToRootfs(m *configs.Mount, c *mountConfig) error {
return err
}
// Selinux kernels do not support labeling of /proc or /sys
if m.IsMove() && *c.fd >= 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unlike in C, you don't have to dereference a pointer to a struct when accessing its members.

IOW s/*c.fd/c.fd/

@@ -400,12 +400,25 @@ func mountToRootfs(m *configs.Mount, c *mountConfig) error {
return err
}
// Selinux kernels do not support labeling of /proc or /sys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit makes the comment above ^^^ misplaced. It used to explain why label.SetFileLabel is not called here.

Comment on lines +554 to +555
if _, exist := nsList[configs.NEWPID]; exist {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is some code missing here?

@kolyshkin
Copy link
Contributor

We need test cases for this.

@kolyshkin
Copy link
Contributor

Also, I'm afraid you'll have to redo this once #3599 is merged, which refactors some C code in nsenter.

@jiangliu
Copy link

Also, I'm afraid you'll have to redo this once #3599 is merged, which refactors some C code in nsenter.

Thanks for reminder, we also noticed PR #3599, so open this PR:)
We will wait for #3599 to settle down first.

@Zheaoli
Copy link
Contributor

Zheaoli commented Nov 12, 2023

I think this PR is not active for long time, may I take handle the rest work for making this PR ready to merge? cc @AkihiroSuda

@cyphar
Copy link
Member

cyphar commented Nov 16, 2023

@Zheaoli You'll need to base it on top of #3985, which reworks all of the mountfd logic. I'm not sure how easy it'll be to use the new Go-based setup to implement this though. I suspect you can do it by creating a locked goroutine that joins the container's non-userns namespaces, but the slight issue is that we cannot create a procfs mount that uses the containers pidns because procfs uses the active pidns, not the for_children one (in fact, I'm not sure this PR handles procfs correctly).

Also, you don't want to use open_tree(2) like this -- a much better way is to use fsopen and fsconfig to configure the mount without touching the filesystem, and thus having an anonymous mountfd that you can then provide to the container. (To be fair, I'm not sure if the permissions work out okay with user namespaces in that case.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants