Rootless Containers #774
For cgroups, we can skip doing any setup if cgroupsPath == "" and Resources == nil in the config.
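A minimal sketch of that check, assuming a config shape with a path and an optional resources pointer. The `Cgroup` and `Resources` types and the `needsCgroupSetup` helper are hypothetical stand-ins for illustration, not the libcontainer definitions.

```go
package main

import "fmt"

// Resources is a hypothetical stand-in for the cgroup limits in a container config.
type Resources struct {
	MemoryLimit int64
	CPUShares   uint64
}

// Cgroup is a hypothetical stand-in for the cgroup section of a container config.
type Cgroup struct {
	Path      string
	Resources *Resources
}

// needsCgroupSetup reports whether the config asks for any cgroup work.
// With an empty path and no resources there is nothing to set up.
func needsCgroupSetup(c *Cgroup) bool {
	if c == nil {
		return false
	}
	return c.Path != "" || c.Resources != nil
}

func main() {
	fmt.Println(needsCgroupSetup(&Cgroup{}))
	fmt.Println(needsCgroupSetup(&Cgroup{Path: "/demo"}))
}
```

Note that a non-nil but empty `Resources` still counts as a request here, which is stricter than comparing against default values.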
@mrunalp I'm going to go with a rootless cgroup manager so that we can expand it later (there are some upcoming kernel features that might make cgroups in rootless containers usable).
Sounds good.
return nil
}

// Used for comparison.
This is some pretty complex validation. So some cgroups are ok but others are not?
We could just fail the check if any values are set at all. No need to go check for defaults. User can create the config without setting resources or path which is simple to check.
The /actual/ check is that all cgroups have no non-default settings. Unfortunately, `specconv` adds device cgroup settings that get merged with the config the user specified -- so a simple "is this equal to the zero value" check doesn't cut it. I haven't figured out a nice way of dealing with that (`specconv` runs long before we get to this part, and we need it to run before we can do anything with the config (like figuring out if we're rootless)).
Maybe we can do our rootless check before specconv, then specconv doesn't modify the cgroup settings if we're rootless, and then we do the config checks for rootless (do we have mapping rights).
Ya, that sounds better. We should be looking for this in runc not in libcontainer.
I'm not really convinced that we shouldn't be doing any checking in `libcontainer`. The same question can be asked about `libcontainer/configs/validate` -- why do we do any config verification inside `libcontainer`? There's also a question of whether or not `libcontainer` should autodetect rootless mode or whether it should be passed as an option (you can't use rootless containers with root as far as I can tell -- and it's definitely a bad idea).
runc should populate the correct config that libcontainer gets and it should not be modified inside libcontainer. All the changes that need to be made should happen while we generate the config not after it is made.
The current state of this patchset doesn't modify the config inside `libcontainer`. The issue is that `specconv` adds device options to `[]Device` even if the user doesn't specify anything. So we can either do the verification of the cgroups in runC (which means that if someone uses libcontainer directly they probably won't immediately realise they can't set cgroup settings) or we do the verification in libcontainer and make `specconv` not generate any cgroup options in rootless mode.
I'll also move the `isRootless` checks to `RootlessValidator`.
EDIT: I've fixed this.
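To illustrate why a plain zero-value comparison fails once device rules have been merged in, here is a rough sketch of one possible check that ignores the generated device entries before comparing. It is not the fix that landed in the patchset; the `Resources` and `DeviceRule` types and `validateRootlessResources` are hypothetical names used only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// DeviceRule is a hypothetical stand-in for a device cgroup entry.
type DeviceRule struct {
	Type  string
	Allow bool
}

// Resources is a hypothetical stand-in for the cgroup resources in a config.
type Resources struct {
	MemoryLimit int64
	CPUShares   uint64
	Devices     []DeviceRule
}

// validateRootlessResources rejects any user-supplied cgroup limit but
// tolerates device rules, on the assumption that they were generated
// during spec conversion rather than requested by the user.
func validateRootlessResources(r *Resources) error {
	if r == nil {
		return nil
	}
	userSet := *r         // shallow copy so the caller's config is untouched
	userSet.Devices = nil // ignore the generated device rules
	if !reflect.DeepEqual(userSet, Resources{}) {
		return errors.New("rootless containers cannot apply cgroup resource limits")
	}
	return nil
}

func main() {
	generated := &Resources{Devices: []DeviceRule{{Type: "c", Allow: true}}}
	fmt.Println(validateRootlessResources(generated))
	fmt.Println(validateRootlessResources(&Resources{MemoryLimit: 64 << 20}))
}
```

The drawback of this approach is the one raised in the review: it cannot tell generated device rules from ones the user actually wrote, which is an argument for not generating them in rootless mode at all.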
i think the rootless cgroups manager is good, then it can be expanded when the cgroups ns is added :) just my opinion but ianam, thanks for this
also wrt the features that might not be possible or are hard for the time being, they could be disabled, and then slowly turned back on as implementations evolve, kinda like how we did userns in docker, and then slowly added more features back in wrt sharing namespaces
@jfrazelle AFAICS all of the core features work. But some of them either just require |
Nice :) On Tuesday, April 26, 2016, Aleksa Sarai [email protected] wrote:
Jessie Frazelle |
Note: as far as I can see, the only thing left to do before we can clear most of the checkpoints is for me to fix up the rootless cgroup manager so that it stores the process's actual cgroup path. Then |
Whoops wrong button. ;) |
@avagin Do you know if it's possible to run |
Looks like this is moving forward, very interesting. Thus I started testing it. Built on Fedora 24 (updated on May 2nd). Looks like mount-bind works fine without root permissions. I wasn't lucky with internet connectivity without sudo. Later on I need to check if there is a way not to be a root user within the container (our software will not be using the root account). Nice progress!
For anyone else who wants to test this I am also sharing diff between original and modified
|
Unfortunately this is not possible, due to restrictions within the kernel. Essentially this is a logical result of these two restrictions on user namespaces:
As a result, the only user that is mapped inside the container is your user (as root). You can discuss with the kernel upstream about restriction number 1, because it's the only restriction which it might be possible to fix. The second restriction is just a security issue. But at the moment, there isn't a way to do what you want.
Yeah, I tried a few things. I even added a user within a container using Docker, which I can see while running with runC, but cannot launch anything under that user. What about internet connectivity?
Unfortunately, creating bridges between a container's network namespace and the host's network namespace requires creating a virtual interface in the host's network namespace. AFAIK that requires root (but I may be wrong). One potential solution would be to not create a network namespace (this currently doesn't work due to some bindmount issues, but that can be fixed). Obviously, this means you don't get the benefits of network namespacing (such as This could be something else we could push the kernel about. Unfortunately, I'm not familiar enough with networking to be able to help with writing a kernel patch. My only kernel experience thus far has been with cgroups.
Okay. Looks like internet connectivity will arrive at some point. This is needed in my case because the majority of the data is not local (i.e. not available on some shared file system mounted via bind/slave mount to the container). In most cases it has to be streamed (network IO). I think, at this point just having internet connectivity is a good step forward. Do you know what was the reasoning for namespaces to provide root inside the container? Quick googling didn't reveal too much documentation around this.
To be honest, I just quickly read through the kernel code and I'm not sure this is a restriction imposed by the kernel. It's possible it's just how we've set up user namespaced containers to work. Currently my runC build is failing with the error:
Which tells me there's some permission issues with the |
@davidlt Best bet for networking would be to use the host network stack (i.e. don't add it to the config). |
I don't understand this passage. cgroups work for unprivileged users in the same way as other file systems. I've read the kernel code and haven't found places in the cgroup code which are protected by CAP_SYS_ADMIN.
Just device cgroups will fail On Tue, May 3, 2016 at 2:29 PM, Andrew Vagin [email protected]
Jessie Frazelle |
@avagin Sorry, I need to update the first paragraph. But if you look at your session log:
If you have to use root to enable using cgroups, it's not useful for some use cases of rootless containers. I've been working upstream to allow an unprivileged cgroup namespace to create its own subtrees, which is what is necessary to make rootless containers mostly feature-complete. But yeah, Do you know anything about whether you can use
Are you sure that this should be fixed in kernel space? Maybe we need to fix this in systemd? How does LXC handle this problem?
We announced "Unprivileged dump" in CRIU 2.0 and now we are working on "Unprivileged restore". I don't know how well it will work for rootless containers, but I think it isn't an unsolvable task.
I have been testing DMTCP (transparently checkpoints a single-host or distributed computation in user-space) and that worked in user-land for checkpointing and restoring complex applications. Thus CRIU in user-land should be technologically possible.
Previously Host{U,G}ID only gave you the root mapping, which isn't very useful if you are trying to do other things with the IDMaps. Signed-off-by: Aleksa Sarai <[email protected]>
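As a rough illustration of looking up an arbitrary container ID rather than only root, the sketch below walks a list of ID mappings and translates a container ID to its host ID. The `IDMap` type and the `hostID` helper are hypothetical names for this example and are not taken from the commit.

```go
package main

import "fmt"

// IDMap is a hypothetical stand-in for one user-namespace mapping entry:
// Size consecutive IDs starting at ContainerID map to IDs starting at HostID.
type IDMap struct {
	ContainerID int
	HostID      int
	Size        int
}

// hostID translates a container ID to the corresponding host ID.
// The second return value is false when no mapping covers the ID.
func hostID(containerID int, maps []IDMap) (int, bool) {
	for _, m := range maps {
		if containerID >= m.ContainerID && containerID < m.ContainerID+m.Size {
			return m.HostID + (containerID - m.ContainerID), true
		}
	}
	return 0, false
}

func main() {
	maps := []IDMap{
		{ContainerID: 0, HostID: 1000, Size: 1},
		{ContainerID: 1, HostID: 100000, Size: 65536},
	}
	id, ok := hostID(5, maps)
	fmt.Println(id, ok)
}
```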
If the stdio of the container is owned by a group which is not mapped in the user namespace, attempting to fchown the file descriptor will result in EINVAL. Counteract this by simply not doing an fchown if the group owner of the file descriptor has no host mapping according to the configured GIDMappings. Signed-off-by: Aleksa Sarai <[email protected]>
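The decision described here can be sketched as follows, assuming the group owner of the descriptor is already known. `IDMap`, `hostIDMapped` and `chownStdio` are hypothetical names, and the `chown` callback stands in for the real fchown call so the example stays portable.

```go
package main

import "fmt"

// IDMap is a hypothetical stand-in for one GID mapping entry.
type IDMap struct {
	ContainerID int
	HostID      int
	Size        int
}

// hostIDMapped reports whether a host ID falls inside any mapping range.
func hostIDMapped(hostID int, maps []IDMap) bool {
	for _, m := range maps {
		if hostID >= m.HostID && hostID < m.HostID+m.Size {
			return true
		}
	}
	return false
}

// chownStdio changes ownership only when the file's group has a host
// mapping; otherwise it skips the call, which would fail with EINVAL.
func chownStdio(chown func(uid, gid int) error, uid, fileGID int, gidMaps []IDMap) error {
	if !hostIDMapped(fileGID, gidMaps) {
		return nil
	}
	return chown(uid, fileGID)
}

func main() {
	gidMaps := []IDMap{{ContainerID: 0, HostID: 1000, Size: 1}}
	fake := func(uid, gid int) error {
		fmt.Println("chown", uid, gid)
		return nil
	}
	_ = chownStdio(fake, 1000, 1000, gidMaps) // mapped group: chown is attempted
	_ = chownStdio(fake, 1000, 5, gidMaps)    // unmapped group: chown is skipped
}
```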
Since this is a runC-specific feature, this belongs here rather than in opencontainers/ocitools (which is for generic OCI runtimes). In addition, we don't create a new network namespace. This is because currently if you want to set up a veth bridge you need CAP_NET_ADMIN in both network namespaces' pinned user namespace to create the necessary interfaces in each network namespace. Signed-off-by: Aleksa Sarai <[email protected]>
This is in preparation for allowing us to run the integration test suite on rootless containers. Signed-off-by: Aleksa Sarai <[email protected]>
This adds targets for rootless integration tests, as well as all of the required setup in order to get the tests to run. This includes quite a few changes, because of a lot of assumptions about things running as root within the bats scripts (which is not true when setting up rootless containers). Signed-off-by: Aleksa Sarai <[email protected]>
@hqhq Squashed and rebased. |
Should we drop groups that are unmapped?
|
@mrunalp We don't have privileges to do that. In fact, it's a security feature of the kernel to not allow unprivileged users to drop supplementary groups because of paths with modes such as |
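For unprivileged user namespaces the kernel enforces this through `/proc/<pid>/setgroups`: it has to contain "deny" before an unprivileged process may write `gid_map`. A minimal sketch of that write order, assuming a single-user mapping to container root; the `procWrite` type and `rootlessIDWrites` helper are hypothetical and only build the list of writes rather than performing them.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
)

// procWrite is a hypothetical description of one write to a /proc file.
type procWrite struct {
	Path string
	Data string
}

// rootlessIDWrites returns the writes needed to map a single host user and
// group to root inside a new user namespace, in an order the kernel accepts
// for an unprivileged writer: setgroups must be denied before gid_map.
func rootlessIDWrites(pid, hostUID, hostGID int) []procWrite {
	dir := path.Join("/proc", strconv.Itoa(pid))
	return []procWrite{
		{Path: path.Join(dir, "uid_map"), Data: fmt.Sprintf("0 %d 1\n", hostUID)},
		{Path: path.Join(dir, "setgroups"), Data: "deny"},
		{Path: path.Join(dir, "gid_map"), Data: fmt.Sprintf("0 %d 1\n", hostGID)},
	}
}

func main() {
	for _, w := range rootlessIDWrites(4242, 1000, 1000) {
		fmt.Printf("%s <- %q\n", w.Path, w.Data)
	}
}
```

Once setgroups is denied, the process inside the namespace can no longer call setgroups(2), so its supplementary groups stay as they were.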
ping @mrunalp |
🎉 |
Looks like it's party time! 11 months in development. Someone should post this on Hacker News. |
Any link to updated docs. Blog post? |
@muayyad-alsadi No doc updates, I'll follow up with those. Here's a blog post from last year and my talk at Linux.conf.au from earlier this year. |
schema: add `clean` to Makefile
This enables the support for "rootless container mode". There are certain restrictions on what non-root users can do, resulting in several runC features not being available. ~~There are no checks in place at the moment to make this clear to users.~~ I've implemented the config validation.

cgroup directory. By default, the directories are owned by `root` and have the mode `0755`. This means that we cannot set up any cgroups, or join cgroups. Therefore the new cgroup namespace doesn't fix this for us either, but hopefully we can get a patch upstream to fix this. But we should still improve cgroup handling so that we apply any cgroups we can if we have write access to the directory.

We also have to set `/proc/self/setgroups` to "deny", which means that any user-related directives won't work. You can only be "root".
If you want to use this, you have to make sure you remove the `gid=5` entry from the `/dev/pts` mount, and only map your own user in the namespace.

Here's `runc start` working in both `root` and `rootless` setup:

And here's `runc exec` working in both `root` and `rootless` setup:

TODO
- `user` directive to run as a different user.
- `runc exec` doesn't work, and we should be able to implement it. This actually complicates the code in `nsenter.c` which checks whether the container is unprivileged.
- `setgroup` needs special treatment.
- `runc exec` doesn't work with `root` running the exec and a rootless container (ironically). This is because we autodetect the `rootless` parameter on run, which isn't accurate. This can be fixed by storing the `rootless` flag in the `state.json`.
- `rootless` should be passed to the init through netlink. Currently we are doing the `rootless` check in two places and it doesn't make sense to do the check in `nsenter.c` -- we might actually have to do it with capability checks in the future.
- `runc events` doesn't work because the `rootless.Manager` doesn't appear to manage the paths properly, so it can't get any data from the cgroups.
- `runc spec --rootless`.
- `loadContainer` doesn't properly load the cgroup manager for the container, because the API forces that to happen (`libcontainer.New()` takes the cgroup manager as an argument). We can probably fix this by making it load the cgroup manager from the container state (but it might be ugly). Without this, we can't even hope to have `runc pause` and `runc resume` working.
- `libcontainer/cgroup/rootless`. Luckily we already have all of the mock stuff we need in `cgroupfs`.
- `libcontainer/config/validate/rootless`
- `libcontainer/specconv/spec_linux.go` with `Rootless == true`.
- (not necessary at the moment) `libcontainer/cgroup/rootless`
- Detaching doesn't work due to a bug in runC with `--console` and user namespaces. Console path resolution is done in host mount namespace #814 and tests: remove --console usage #883
- ~~All of the sniff tests should work.~~ The sniff tests are no more.
- ~~THIS IS CURRENTLY BLOCKED ON FIXING THE `--console` BUG.~~ The console bug has been fixed as part of Consoles, consoles, consoles. #1018.
- `runc` (so we can mess around with arguments in rootless mode).
- `ping` don't work in rootless containers (user namespace issues). validator: ensure user doesn't try to mount /sys without userns #807 fixes the validation issue.
- `container.Processes()` to join the container PID namespace, enumerate the list of PIDs and then send them over a UNIX socket (this causes the PIDs to be translated). The end result is to not require cgroups to enumerate PIDs (which actually isn't a good idea since processes may join sub-cgroups). This is necessary for `runc ps` and similar things to work properly.
- `CAP_SYS_ADMIN` in order to send a different PID. And there's also a valid question about atomicity (enumeration is not atomic, reading from `cgroup.procs` is).

Open Questions
- `runc` back to `/usr/bin`, since it's no longer an admin piece of software? This would also mean moving the `man` pages to `man1`.

What works?
- `runc checkpoint` (while potentially possible, not implemented)
- `runc create` (Console path resolution is done in host mount namespace #814)
- `runc create --console` (Console path resolution is done in host mount namespace #814)
- `runc delete`
- `runc events` (not really useful -- cgroups)
- `runc exec`
- `runc exec --console` (Console path resolution is done in host mount namespace #814)
- `runc kill`
- `runc list`
- `runc pause` (cgroups)
- `runc ps` (cgroups)
- `runc restore` (while potentially possible, not implemented)
- `runc resume` (cgroups)
- `runc run`
- `runc run -d --console` (Console path resolution is done in host mount namespace #814)
- `runc spec`
- `runc start` (`create` doesn't work -- Console path resolution is done in host mount namespace #814)
- `runc state`
- `runc update` (cgroups)

`root`:
- `runc checkpoint` (while potentially possible, not implemented)
- `runc delete`
- `runc events` (not really useful -- cgroups)
- `runc exec`
- `runc exec --console` (Console path resolution is done in host mount namespace #814)
- `runc kill`
- `runc list`
- `runc pause` (cgroups)
- `runc ps` (cgroups)
- `runc restore` (while potentially possible, not implemented)
- `runc resume` (cgroups)
- `runc run`
- `runc run -d --console` (Console path resolution is done in host mount namespace #814)
- `runc spec`
- `runc start` (`create` doesn't work -- Console path resolution is done in host mount namespace #814)
- `runc state`
- `runc update` (cgroups)
Kernel Patches
- `CLONE_NEWCGROUP` fix to allow unprivileged processes to create subtrees.

Implements #38.
Signed-off-by: Aleksa Sarai [email protected]