Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rootless Containers #774

Merged
merged 8 commits into from
Mar 27, 2017
Merged

Rootless Containers #774

merged 8 commits into from
Mar 27, 2017

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Apr 23, 2016

This enables the support for "rootless container mode". There are
certain restrictions on what non-root users can do, resulting in several
runC features not being available. There are no checks in place at
the moment to make this clear to users.
I've implemented the config
validation.

  • All cgroup operations require having write access to your current
    cgroup directory. By default, the directories are owned by root and have
    the mode 0755. This means that we cannot set up any cgroups, or join
    cgroups. Therefore new cgroup namespace doesn't fix this for us either, but
    hopefully we can get a patch upstream to fix this. But we should still
    improve cgroup handling so that we apply any cgroups we can if we have
    write access to the directory.
  • setgroups(2) cannot be used in a non-privileged user namespace setup.
    We also have to set /proc/self/setgroups to "deny".
  • We cannot map any user other than ourselves in a rootless container,
    which means that any user-related directives won't work. You can only be
    "root".

If you want to use this, you have to make sure you remove the gid=5 entry from the /dev/pts mount, and only map your own user in the namespace.

Here's runc start working in both root and rootless setup:
it works

And here's runc exec working in both root and rootless setup:
it also works

TODO

  • Provide a meaningful error message if the user specifies a configuration which maps more than one user (or doesn't map the effective user).
  • Provide a meaningful error message (don't just ignore it) if the user tries to use the user directive to run as a different user.
  • Provide a meaningful error message (don't just ignore it) if the user tries to specify any cgroup settings. Currently we can't provide cgroup settings (though it should be possible if the user just so happens to own the cgroup they currently reside in).
  • runc exec doesn't work, and we should be able to implement it. This actually complicates the code in nsenter.c which checks whether the container is unprivileged. setgroup needs special treatment.
  • runc exec doesnt' work with root running the exec and a rootless container (ironically). This is because we autodetect the rootless parameter on run, which isn't accurate. This can be fixed by storing the rootless flag in the state.json.
  • rootless should be passed to the init through netlink. Currently we are doing the rootless check in two places and it doesn't make sense to do the check in nsenter.c -- we might actually have to do it with capability checks in the future.
  • runc events doesn't work because the rootless.Manager doesn't appear to manage the paths properly, so it can't get any data from the cgroups.
  • Add runc spec --rootless.
  • loadContainer doesn't properly load the cgroup manager for the container, because the API forces that to happen (libcontainer.New() takes the cgroup manager as an argument). We can probably fix this by making it load the cgroup manager from the container state (but it might be ugly). Without this, we can't even hope to have runc pause and runc resume working.
  • ❌ Currently the cgroup setup is binary (either we use cgroups or we don't). But since cgroupv1 lets you have different permissions on different hierarchies, we should check for each subsystem that we have access to create a subtree in that cgroup. Then we can support several setups (as well as the kernel patch I've proposed). This would require many changes in config validation. In addition, we would have to store what cgroups we are using in a bitmask (like in the kernel).
    • ❌ This would require adding a bunch of tests to libcontainer/cgroup/rootless. Luckily we already all of the mock stuff we need in cgroupfs.
  • Add unit tests to:
    • libcontainer/config/validate/rootless
    • libcontainer/specconv/spec_linux.go with Rootless == true.
    • libcontainer/cgroup/rootless (not necessary at the moment)
  • Detaching doesn't work due to a bug in runC with --console and user namespaces. Console path resolution is done in host mount namespace #814 and tests: remove --console usage #883
  • Add a setup for testing the rootless containers so we can make sure there's no regressions in this set up. We can try running the whole test suite (minus cgroups and a few other things). All of the sniff tests should work. The sniff tests are no more. THIS IS CURRENTLY BLOCKED ON FIXING THE --console BUG. The console bug has been fixed as part of Consoles, consoles, consoles. #1018.
    • We have to refactor all of the tests to use a wrapper around runc (so we can mess around with arguments in rootless mode).
    • Also, we need to skip certain tests in rootless mode, since they are not supported.
  • We currently cannot use network namespaces in a rootless container setup (and still connect to the network). A fix for this is to use the host network, but that currently doesn't work in runC (Cannot create user namespaced container without network namespaces #799). It should also be noted that things like ping don't work in rootless containers (user namespace issues). validator: ensure user doesn't try to mount /sys without userns #807 fixes the validation issue.
    • Figure out why we still can't talk to the internet.
  • ❌ Switch container.Processes() to join the container PID namespace, enumerate the list of PIDs and then send them over a UNIX socket (this causes the PIDs to be translated). The end result is to not require cgroups to enumerate PIDs (which actually isn't a good idea since processes may join sub-cgroups). This is necessary for runc ps and similar things to work properly.
    • ❌ There are some unanswered questions about sending the PIDs though, because you need CAP_SYS_ADMIN in order to send a different PID. And there's also a valid question about atomicity (enumeration is not atomic, reading from cgroup.procs is).

Open Questions

What works?

Kernel Patches

Implements #38.

Signed-off-by: Aleksa Sarai [email protected]

@mrunalp
Copy link
Contributor

mrunalp commented Apr 23, 2016

For cgroups, we can skip doing any setup if cgroupsPath == "" and Resources == nil in the config.
We can either skip or introduce a NoOp cgroups manager.

@cyphar
Copy link
Member Author

cyphar commented Apr 24, 2016

@mrunalp I'm going to go with a rootless cgroup manager so we might expand it later (there are some upcoming kernel features that might make cgroups in rootless containers usable).

@mrunalp
Copy link
Contributor

mrunalp commented Apr 24, 2016

Sounds good.

Sent from my iPhone

On Apr 23, 2016, at 8:05 PM, Aleksa Sarai [email protected] wrote:

@mrunalp I'm going to go with a rootless cgroup manager so we might expand it later (there are some upcoming kernel features that might make cgroups in rootless containers usable).


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

@cyphar cyphar changed the title [WIP] Rootless Containers Rootless Containers Apr 24, 2016
return nil
}

// Used for comparison.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is some pretty complex validation. So some cgroups are ok but others are not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just fail the check if any values are set at all. No need to go check for defaults. User can create the config without setting resources or path which is simple to check.

Copy link
Member Author

@cyphar cyphar Apr 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The /actual/ check is that all cgroups have no non-default settings. Unfortunately, specconv adds device cgroup settings that get merged with the config the user specified -- so a simple "is this equal to the zero value" check doesn't cut it. I haven't figured out a nice way of dealing with that (specconv runs long before we get to this part, and we need it to run before we can do anything with the config (like figuring out if we're rootless)).

Maybe we can do our rootless check before specconv, then specconv doesn't modify the cgroup settings if we're rootless, and then we do the config checks for rootless (do we have mapping rights).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, that sounds better. We should be looking for this in runc not in libcontainer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really convinced that we shouldn't be doing any checking in libcontainer. The same question can be asked about libcontainer/configs/validate -- why do we do any config verification inside libcontainer? There's also a question of whether or not libcontainer should autodetect rootless mode or whether it should be passed as an option (you can't use rootless containers with root as far as I can tell -- and it's definitely a bad idea).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runc should populate the correct config that libcontainer gets and it should not be modified inside libcontainer. All the changes that need to be made should happen while we generate the config not after it is made.

Copy link
Member Author

@cyphar cyphar Apr 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current state of this patchset doesn't modify the config inside libcontainer. The issue is that specconv adds device options to []Device even if the user doesn't specify anything. So we can either do the verification of the cgroups in runC (which means that if someone uses libcontainer directly they probably won't immediately realise they can't set cgroup settings) or we do the verification in libcontainer and make specconv not generate any cgroup options in rootless mode.

I'll also move the isRootless checks to RootlessValidator.

EDIT: I've fixed this.

@jessfraz
Copy link
Contributor

i think the rootless cgroups manager is good, then it can be expanded when the cgroups ns is added :) just my opinion but ianam, thanks for this

@jessfraz
Copy link
Contributor

also wrt the features that might not be possible or are hard for the time being, they could be disabled, and then slowly turned back on as implementations evolve, kinda like how we did userns in docker, and then slowly added more features back in wrt sharing namespaces
it's easier to make a smaller change then iterate on it, then one huge one

@cyphar
Copy link
Member Author

cyphar commented Apr 26, 2016

@jfrazelle AFAICS all of the core features work. But some of them either just require root (criu IIRC) or currently can't be done under user namespaces (cgroups -- which I'm working on a patch for). But all of the others should still work, and in principle I want to try to get all of the features working for root operating on a rootless container.

@jessfraz
Copy link
Contributor

Nice :)

On Tuesday, April 26, 2016, Aleksa Sarai [email protected] wrote:

@jfrazelle https://github.com/jfrazelle AFAICS all of the core features
work. But some of them either just require root (criu IIRC) or currently
can't be done under user namespaces (cgroups -- which I'm working on a
patch for). But all of the others should still work, and in principle I
want to try to get all of the features working for root operating on a
rootless container.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#774 (comment)

Jessie Frazelle
4096R / D4C4 DD60 0D66 F65A 8EFC 511E 18F3 685C 0022 BFF3
pgp.mit.edu http://pgp.mit.edu/pks/lookup?op=get&search=0x18F3685C0022BFF3

@cyphar
Copy link
Member Author

cyphar commented Apr 26, 2016

Note: as far as I can see, the only thing left to do before we can clear most of the checkpoints is for me to fix up the rootless cgroup manager so that it stores the process's actual cgroup path. Then runc events should work, as well as all of the freezer code (for root). Checkpoint and restore might also work just by doing that.

@cyphar cyphar closed this Apr 26, 2016
@cyphar cyphar reopened this Apr 26, 2016
@cyphar
Copy link
Member Author

cyphar commented Apr 26, 2016

Whoops wrong button. ;)

@cyphar
Copy link
Member Author

cyphar commented Apr 29, 2016

@avagin Do you know if it's possible to run criu as an unprivileged user (specifically from the kernel side when interacting with unprivileged user namespaces)? If not, is there any intention from upstream to get that to work? And if not, should we disable all checkpoint/restore functionality for rootless containers (even if the person doing the checkpoint/restore is root -- since the restore setup might break in weird ways)?

@davidlt
Copy link

davidlt commented May 2, 2016

Looks like this is moving forward, very interesting. Thus I started testing it. Built on Fedora 24 (updated on May 2nd). Looks like mount-bind works fine without root permissions. I wasn't lucky with internet connectivity without sudo. Later on I need to check if there is a way not to be a root user within container (our software will not using root account). Nice progress!

[davidlt@pccms205 magic_dir]$ pwd
/home/davidlt/magic_dir
[davidlt@pccms205 magic_dir]$ ls
[davidlt@pccms205 magic_dir]$ touch HOST_FILE
[davidlt@pccms205 magic_dir]$ ls
HOST_FILE
[davidlt@pccms205 magic_dir]$ cat /etc/os-release | grep PRETTY_NAME >> HOST_FILE
[davidlt@pccms205 magic_dir]$ cat HOST_FILE
PRETTY_NAME="Fedora 24 (Workstation Edition)"
[davidlt@pccms205 test]$ runc --root $PWD start test_container
sh-4.2# cat /etc/os-release | grep PRETTY_NAME
PRETTY_NAME="CentOS Linux 7 (Core)"
sh-4.2# cd /some
sh-4.2# cat /etc/os-release | grep PRETTY_NAME >> HOST_FILE
sh-4.2# touch NEW_FILE
sh-4.2# exit
[davidlt@pccms205 test]$ cat ~/magic_dir/HOST_FILE 
PRETTY_NAME="Fedora 24 (Workstation Edition)"
PRETTY_NAME="CentOS Linux 7 (Core)"
[davidlt@pccms205 test]$ file ~/magic_dir/NEW_FILE 
/home/davidlt/magic_dir/NEW_FILE: empty

For anyone else who wants to test this I am also sharing diff between original and modified config.json:

--- config.json 2016-05-02 13:25:24.468181348 +0200
+++ config.json.correct 2016-05-02 13:25:14.661496040 +0200
@@ -6,7 +6,7 @@
    },
    "process": {
        "terminal": true,
-       "user": {},
+       "user": { "uid": 0, "gid": 0, "additionalGids": null },
        "args": [
            "sh"
        ],
@@ -35,6 +35,12 @@
    },
    "hostname": "runc",
    "mounts": [
+                {
+                        "destination": "/some",
+                        "type": "bind",
+                        "source": "/home/davidlt/magic_dir",
+                        "options": [ "rbind" ]
+                },
        {
            "destination": "/proc",
            "type": "proc",
@@ -60,8 +66,7 @@
                "noexec",
                "newinstance",
                "ptmxmode=0666",
-               "mode=0620",
-               "gid=5"
+               "mode=0620"
            ]
        },
        {
@@ -112,14 +117,20 @@
    ],
    "hooks": {},
    "linux": {
-       "resources": {
-           "devices": [
-               {
-                   "allow": false,
-                   "access": "rwm"
-               }
-           ]
-       },
+                "uidMappings": [
+                        {
+                                "hostID": 1000,
+                                "containerID": 0,
+                                "size": 1
+                        }
+                ],
+                "gidMappings": [
+                        {
+                                "hostID": 1000,
+                                "containerID": 0,
+                                "size": 1
+                        }
+                ],
        "namespaces": [
            {
                "type": "pid"
@@ -135,7 +146,10 @@
            },
            {
                "type": "mount"
-           }
+           },
+                        {
+                                "type": "user"
+                        }
        ],
        "maskedPaths": [
            "/proc/kcore",
@@ -152,4 +166,4 @@
            "/proc/sysrq-trigger"
        ]
    }
-}
\ No newline at end of file
+}

@cyphar
Copy link
Member Author

cyphar commented May 2, 2016

@davidlt

Later on I need to check if there is a way not to be a root user within container (our software will not using root account). Nice progress!

Unfortunately this is not possible, due to restrictions within the kernel. Essentially this is a logcal result of these two restrictions on user namespaces:

  1. All user namespaces must provide a mapping for root inside a container.
  2. Unprivileged user namespaces can only provide a mapping for one user, the user which created the namespace.

As a result, the only user that is mapped inside the container is your user (as root). You can discuss with the kernel upstream about restriction number 1, because it's the only restriction which it might be possible to fix. The second restriction is just a security issue. But at the moment, there isn't a way to do what you want.

@davidlt
Copy link

davidlt commented May 2, 2016

Yeah, I tried a few thing. I even added a user within a container using Docker, which I can see while running with runC, but cannot launch anything under that user.

What about internet connectivity?

@cyphar
Copy link
Member Author

cyphar commented May 2, 2016

Unfortunately, creating bridges between a container's network namespace and the hosts's network namespace requires creating a virtual interface in the host's network namespace. AFAIK that requires root (but I may be wrong). One potential solution would be to not create a network namespace (this currently doesn't work due to some bindmount issues, but that can be fixed). Obviously, this means you don't get the benefits of network namespacing (such as iptables rules without root).

This could be something else we could push the kernel about. Unfortunately, I'm not familiar enough with networking to be able to help with writing a kernel patch. My only kernel experience thus far has been with cgroups.

@davidlt
Copy link

davidlt commented May 2, 2016

Okay. Looks like internet connectivity will arrive at some point. This is needed in my case because majority of data is not local (i.e. not available on some shared file system mounted via bind/slave mount to the container). It most cases it has to be streamed (network IO). I think, at this point just having internet connectivity is good step forward.

Do you know what was the reasoning for namespaces to provide root inside the container? Quick googling didn't reveal too much documentation around this.

@cyphar
Copy link
Member Author

cyphar commented May 2, 2016

To be honest, I just quickly read through the kernel code and I'm not sure this is a restriction imposed by the kernel. It's possible it's just how we've set up user namespaced containers to work. Currently my runC build is failing with the error:

process_linux.go:247: getting pipe fds for pid 16189 caused "readlink /proc/16189/fd/0: permission denied"

Which tells me there's some permission issues with the /proc setup we have (where we read the pipe file descriptors over stdin -- but for some reason it looks like the process can't open its own stdin?).

@mrunalp
Copy link
Contributor

mrunalp commented May 2, 2016

@davidlt Best bet for networking would be to use the host network stack (i.e. don't add it to the config).
The way typically networking is setup requires moving network devices from host network namespace to the container's network namespace. With privileged user namespaces, the runc hooks can do that work but that isn't possible with unprivileged containers. It would need discussion with upstream as @cyphar suggested.

@avagin
Copy link
Contributor

avagin commented May 3, 2016

All cgroup operations require having CAP_SYS_ADMIN in the root user
namespace. This means that we cannot set up any cgroups, or join
cgroups. The new cgroup namespace doesn't fix this for us either, but
hopefully we can get a patch upstream to fix this.

I don't understand this passage. cgroups works for unprivileged users by the same way as other file systems. I've read the kernel code and haven't found places in cgroup code which are protected by CAP_SYS_ADMIN.

[avagin@laptop ~]$ whoami 
avagin
[avagin@laptop ~]$ sudo mkdir /sys/fs/cgroup/cpu/test
[avagin@laptop ~]$ sudo chown avagin:avagin /sys/fs/cgroup/cpu/test
[avagin@laptop ~]$ echo $$ > /sys/fs/cgroup/cpu/test/tasks 
bash: /sys/fs/cgroup/cpu/test/tasks: Permission denied
[avagin@laptop ~]$ mkdir /sys/fs/cgroup/cpu/test/sub
[avagin@laptop ~]$ echo $$ > /sys/fs/cgroup/cpu/test/sub/tasks 
[avagin@laptop ~]$ cat /sys/fs/cgroup/cpu/test/sub/cpu.shares 
1024
[avagin@laptop ~]$ echo 2014 > /sys/fs/cgroup/cpu/test/sub/cpu.shares
[avagin@laptop ~]$ echo 512 > /sys/fs/cgroup/cpu/test/sub/cpu.shares
[avagin@laptop ~]$ ls -l /sys/fs/cgroup/cpu/test/
total 0
-rw-r--r--. 1 root   root   0 May  3 14:26 cgroup.clone_children
-rw-r--r--. 1 root   root   0 May  3 14:26 cgroup.procs
-r--r--r--. 1 root   root   0 May  3 14:26 cpuacct.stat
-rw-r--r--. 1 root   root   0 May  3 14:26 cpuacct.usage
-r--r--r--. 1 root   root   0 May  3 14:26 cpuacct.usage_percpu
-rw-r--r--. 1 root   root   0 May  3 14:26 cpu.cfs_period_us
-rw-r--r--. 1 root   root   0 May  3 14:26 cpu.cfs_quota_us
-rw-r--r--. 1 root   root   0 May  3 14:26 cpu.shares
-r--r--r--. 1 root   root   0 May  3 14:26 cpu.stat
-rw-r--r--. 1 root   root   0 May  3 14:26 notify_on_release
drwxrwxr-x. 2 avagin avagin 0 May  3 14:25 sub
-rw-r--r--. 1 root   root   0 May  3 14:25 tasks
[avagin@laptop ~]$ ls -l /sys/fs/cgroup/cpu/test/sub/
total 0
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cgroup.clone_children
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cgroup.procs
-r--r--r--. 1 avagin avagin 0 May  3 14:25 cpuacct.stat
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cpuacct.usage
-r--r--r--. 1 avagin avagin 0 May  3 14:25 cpuacct.usage_percpu
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cpu.cfs_period_us
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cpu.cfs_quota_us
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cpu.shares
-r--r--r--. 1 avagin avagin 0 May  3 14:25 cpu.stat
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 notify_on_release
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 tasks

@jessfraz
Copy link
Contributor

jessfraz commented May 3, 2016

Just device cgroups will fail

On Tue, May 3, 2016 at 2:29 PM, Andrew Vagin [email protected]
wrote:

All cgroup operations require having CAP_SYS_ADMIN in the root user
namespace. This means that we cannot set up any cgroups, or join
cgroups. The new cgroup namespace doesn't fix this for us either, but
hopefully we can get a patch upstream to fix this.

I don't understand this passage. I tried and cgroups works for
unprivileged users by the same way as other file systems. I've read the
kernel code and haven't found places in cgroup code which are protected by
CAP_SYS_ADMIN.

[avagin@laptop ~]$ whoami
avagin
[avagin@laptop ~]$ sudo mkdir /sys/fs/cgroup/cpu/test
[avagin@laptop ~]$ sudo chown avagin:avagin /sys/fs/cgroup/cpu/test
[avagin@laptop ~]$ echo $$ > /sys/fs/cgroup/cpu/test/tasks
bash: /sys/fs/cgroup/cpu/test/tasks: Permission denied
[avagin@laptop ~]$ mkdir /sys/fs/cgroup/cpu/test/sub
[avagin@laptop ~]$ echo $$ > /sys/fs/cgroup/cpu/test/sub/tasks
[avagin@laptop ~]$ cat /sys/fs/cgroup/cpu/test/sub/cpu.shares
1024
[avagin@laptop ~]$ echo 2014 > /sys/fs/cgroup/cpu/test/sub/cpu.shares
[avagin@laptop ~]$ echo 512 > /sys/fs/cgroup/cpu/test/sub/cpu.shares
[avagin@laptop ~]$ ls -l /sys/fs/cgroup/cpu/test/
total 0
-rw-r--r--. 1 root root 0 May 3 14:26 cgroup.clone_children
-rw-r--r--. 1 root root 0 May 3 14:26 cgroup.procs
-r--r--r--. 1 root root 0 May 3 14:26 cpuacct.stat
-rw-r--r--. 1 root root 0 May 3 14:26 cpuacct.usage
-r--r--r--. 1 root root 0 May 3 14:26 cpuacct.usage_percpu
-rw-r--r--. 1 root root 0 May 3 14:26 cpu.cfs_period_us
-rw-r--r--. 1 root root 0 May 3 14:26 cpu.cfs_quota_us
-rw-r--r--. 1 root root 0 May 3 14:26 cpu.shares
-r--r--r--. 1 root root 0 May 3 14:26 cpu.stat
-rw-r--r--. 1 root root 0 May 3 14:26 notify_on_release
drwxrwxr-x. 2 avagin avagin 0 May 3 14:25 sub
-rw-r--r--. 1 root root 0 May 3 14:25 tasks
[avagin@laptop ~]$ ls -l /sys/fs/cgroup/cpu/test/sub/
total 0
-rw-r--r--. 1 avagin avagin 0 May 3 14:25 cgroup.clone_children
-rw-r--r--. 1 avagin avagin 0 May 3 14:25 cgroup.procs
-r--r--r--. 1 avagin avagin 0 May 3 14:25 cpuacct.stat
-rw-r--r--. 1 avagin avagin 0 May 3 14:25 cpuacct.usage
-r--r--r--. 1 avagin avagin 0 May 3 14:25 cpuacct.usage_percpu
-rw-r--r--. 1 avagin avagin 0 May 3 14:25 cpu.cfs_period_us
-rw-r--r--. 1 avagin avagin 0 May 3 14:25 cpu.cfs_quota_us
-rw-r--r--. 1 avagin avagin 0 May 3 14:25 cpu.shares
-r--r--r--. 1 avagin avagin 0 May 3 14:25 cpu.stat
-rw-r--r--. 1 avagin avagin 0 May 3 14:25 notify_on_release
-rw-r--r--. 1 avagin avagin 0 May 3 14:25 tasks


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#774 (comment)

Jessie Frazelle
4096R / D4C4 DD60 0D66 F65A 8EFC 511E 18F3 685C 0022 BFF3
pgp.mit.edu http://pgp.mit.edu/pks/lookup?op=get&search=0x18F3685C0022BFF3

@cyphar
Copy link
Member Author

cyphar commented May 3, 2016

@avagin Sorry, I need to update the first paragraph. But if you look at your session log:

$ sudo chown avagin:avagin /sys/fs/cgroup/cpu/test

If you have to use root to enable using cgroups, it's not useful for some usecases of rootless containers. I've been working upstream to allow an unprivileged cgroup namespace to create their own subtrees, which is what is necessary to make rootless containers mostly feature-complete.

But yeah, CAP_SYS_ADMIN is definitely the wrong thing and I'll fix up what I wrote.

Do you know anything about whether you can use criu as an unprivileged user? My guess is that you probably can't, since it messes around with saving and restoring kernel state.

@avagin
Copy link
Contributor

avagin commented May 3, 2016

@cyphar

If you have to use root to enable using cgroups, it's not useful for some usecases of rootless containers.

Are you sure that this should be fixed in a kernel space? Maybe we need to fix this in systemd? How does LXC handles this problem.

Do you know anything about whether you can use criu as an unprivileged user? My guess is that you probably can't, since it messes around with saving and restoring kernel state.

We announced "Unprivileged dump" in CRIU 2.0 and now we are working on "Unprivileged restore". I don't know how good it will work for root-less containers, but I think it isn't unsolvable task.

@davidlt
Copy link

davidlt commented May 4, 2016

I have been testing DMTCP (transparently checkpoints a single-host or distributed computation in user-space) and that worked in user-land for checkpointing and restoring complex applications. Thus CRIU in user-land should technologically possible.

Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.

Signed-off-by: Aleksa Sarai <[email protected]>
If the stdio of the container is owned by a group which is not mapped in
the user namespace, attempting to fchown the file descriptor will result
in EINVAL. Counteract this by simply not doing an fchown if the group
owner of the file descriptor has no host mapping according to the
configured GIDMappings.

Signed-off-by: Aleksa Sarai <[email protected]>
Since this is a runC-specific feature, this belongs here over in
opencontainers/ocitools (which is for generic OCI runtimes).

In addition, we don't create a new network namespace. This is because
currently if you want to set up a veth bridge you need CAP_NET_ADMIN in
both network namespaces' pinned user namespace to create the necessary
interfaces in each network namespace.

Signed-off-by: Aleksa Sarai <[email protected]>
This is in preperation of allowing us to run the integration test suite
on rootless containers.

Signed-off-by: Aleksa Sarai <[email protected]>
This adds targets for rootless integration tests, as well as all of the
required setup in order to get the tests to run. This includes quite a
few changes, because of a lot of assumptions about things running as
root within the bats scripts (which is not true when setting up rootless
containers).

Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar
Copy link
Member Author

cyphar commented Mar 23, 2017

@hqhq Squashed and rebased.

@hqhq
Copy link
Contributor

hqhq commented Mar 23, 2017

LGTM

Approved with PullApprove

@mrunalp
Copy link
Contributor

mrunalp commented Mar 23, 2017

Should we drop groups that are unmapped?

[mrunal@acme busybox]$ ./runc --root ~/runc/state run 1234
/ # id
uid=0(root) gid=0(root) groups=65534,0(root)

@cyphar
Copy link
Member Author

cyphar commented Mar 24, 2017

@mrunalp We don't have privileges to do that. In fact, it's a security feature of the kernel to not allow unprivileged users to drop supplementary groups because of paths with modes such as 0707. Such ACLs make it easy to blacklist a group from accessing something.

@crosbymichael
Copy link
Member

ping @mrunalp

@mrunalp
Copy link
Contributor

mrunalp commented Mar 27, 2017

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit 653207b into opencontainers:master Mar 27, 2017
@cyphar
Copy link
Member Author

cyphar commented Mar 27, 2017

🎉

@davidlt
Copy link

davidlt commented Mar 27, 2017

Looks like it's party time! 11 months in development. Someone should post this on Hacker News.

@marcosnils
Copy link
Contributor

image

:D

@muayyad-alsadi
Copy link

Any link to updated docs. Blog post?

@cyphar
Copy link
Member Author

cyphar commented Mar 27, 2017

@muayyad-alsadi No doc updates, I'll follow up with those. Here's a blog post from last year and my talk at Linux.conf.au from earlier this year.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.