
starting container process caused 'process_linux.go:245: running exec setns process for init caused "exit status 6"' #1130

Open
hkjn opened this issue Oct 20, 2016 · 29 comments

@hkjn commented Oct 20, 2016

Hi OCI folks,

We are seeing a failure to start Docker containers through runc, seemingly from this line:

This might well be a config or system issue (we're on somewhat old kernel versions because of CentOS), but the logs don't give us much to go on here.

The man page for setns defines the error codes it can return:

But if the following page can be trusted, exit status 6 should be ENXIO, which is not mentioned in the man pages:

Any suggestions for how to debug further or what to check would be appreciated, thanks in advance!

Logs

/bin/docker: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 6\\\"\"\n".

System info

# uname -a
Linux ip-10-226-24-78 3.10.0-327.28.2.el7.x86_64 #1 SMP Wed Aug 3 11:11:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

# docker info
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 1.12.2
Storage Driver: overlay
 Backing Filesystem: xfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.28.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.389 GiB
Name: ip-10-226-24-78
ID: TNS5:V674:K6Y4:CSIT:ROPR:XJMI:LDSR:KTC3:DZS7:G7RD:426H:DFRN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 127.0.0.0/8

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7566         207         453           5        6904        4230
Swap:          2047         463        1584

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               2400.082
BogoMIPS:              4800.16
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,1
@cyphar (Member) commented Oct 20, 2016

The exit status 6 is a pretty ugly hack I added that lets us figure out where inside this file the code is failing. An error from process_linux.go with "exit status 6" means that the 6th bail in that file was executed (in the version of runC you're running).

To cut a long story short, this is the code that is failing:

    /*
     * We must fork to actually enter the PID namespace, and use
     * CLONE_PARENT so that the child init can have the right parent
     * (the bootstrap process). Also so we don't need to forward the
     * child's exit code or resend its death signal.
     */
    childpid = clone_parent(env, config->cloneflags);
    if (childpid < 0)
        bail("unable to fork"); /* this is where exit status 6 comes from */

So, the big question is -- does your system support all of the namespaces that you're trying to use? What is the output of ls -la /proc/self/ns?
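For context, here is a quick way to compare what the kernel exposes against what runc may ask for (a sketch; the namespace names are the standard /proc entries, and a 3.10-era CentOS kernel typically lacks cgroup namespaces and, unless explicitly enabled, user namespaces):

```shell
# List the namespaces this kernel exposes; each entry corresponds to a
# CLONE_NEW* flag that clone_parent() may pass to clone(2).
ls -la /proc/self/ns

# A namespace missing here (e.g. "user" on many RHEL 7 kernels) means the
# matching CLONE_NEW* flag makes clone(2) fail, which surfaces as the
# "unable to fork" bail above.
for ns in ipc mnt net pid user uts; do
    [ -e "/proc/self/ns/$ns" ] && echo "$ns: supported" || echo "$ns: missing"
done
```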

@hkjn (Author) commented Oct 20, 2016

Ah, that helps explain the exit status, cheers.

What's odd here is that the failure was not consistent; sometimes the docker run command would work fine if we ran it manually, even when it failed under systemd. Later it seemed to fail with this symptom consistently.

The node degraded further and won't even let me ssh in now, so it's unfortunately hard to get more diagnostics from it. Another node, which should be identically configured, gives the following output:

# ls -la /proc/self/ns
total 0
dr-x--x--x. 2 root root 0 Oct 20 09:13 .
dr-xr-xr-x. 9 root root 0 Oct 20 09:13 ..
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 net -> net:[4026532028]
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 pid -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 uts -> uts:[4026531838]

But that node does not seem to hit the same issue as the first one; all services seem to have their containers start up fine.

I'll attach the info from /proc/self/ns from a node with this issue if it pops up again. Feel free to close this bug, or leave it open for others to chip in if they hit the same symptom (I couldn't find anything on Google searching for the symptoms myself); your call.

@cyphar (Member) commented Oct 20, 2016

@hkjn Actually, the best thing would be for you to attach an strace -f of runc when the issue occurs. Since you're using Docker this might prove difficult (and it will have very large performance effects that aren't favourable). If you can reproduce a node in that state again, please try running any runC container set up without Docker on that machine, with strace -f runc run ..., to see what breaks. Thanks.

@rajasec (Contributor) commented Oct 20, 2016

@cyphar
When I run nested runc (runc inside runc), I get the error below:
nsenter: unable to fork: Operation not permitted
container_linux.go:247: starting container process caused "process_linux.go:245: running exec setns process for init caused "exit status 6""
This may not be the right use case; I just thought I'd test it out.

@cyphar (Member) commented Oct 21, 2016

@rajasec That's because you're trying to unshare namespaces you don't have the right to unshare. You'll have to take a look at the kernel code to figure out precisely what's happening (if you're trying to run runc from inside a chroot, for example, it's not going to work).

@jaredbroad commented

+1, I have this error and don't use runC directly for anything (though it might be used inside Mono). It also happens intermittently, mostly when the machine is tight on resources or overloaded.

Any other tips for debugging the root cause if I'm not using runC directly?

@jamiethermo commented

I have this error with docker (I assume docker-runc?). Not sure how I would debug it. Give me something to type and I'll type it?

@cyphar (Member) commented Dec 14, 2016

Some information that would be useful from anyone else who comments on this issue:

  1. Are you running Docker with user namespaces enabled?
  2. Is SELinux enabled on your host and/or container?
  3. Can you use runc by itself -- outside of Docker? Read the README for information on how to start up a simple container.
  4. What kernel version / distribution are you using?

@jamiethermo commented

No user namespaces.
SELinux is enabled and permissive.
I don't have runc; I have docker-runc, which says it's 1.0.0-rc2. Is that runc?
CentOS 7.2: 3.10.0-327.36.2.el7.x86_64 #1 SMP Mon Oct 10 23:08:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

I'll have to tool around with it. I don't get a container when following the runc README. Doing something daft, I expect.

@cyphar (Member) commented Dec 14, 2016

@jamiethermo docker-runc is just what Docker calls its packaged version of runc.

You can create a container like this:

% mkdir -p bundle/rootfs
% docker create --name a_new_rootfs alpine:latest true
% docker export a_new_rootfs | tar xvfC - bundle/rootfs
% runc spec -b bundle
% runc run -b bundle container
/ # # This is inside the container now.

Does that help?

@jamiethermo commented

Ok. That works.

@cyphar (Member) commented Dec 14, 2016

Alright, it would help to know what config.json the container is being started with (under Docker). Unfortunately Docker won't save the config.json if the container creation fails. You could try doing something like this:

% cat >/tmp/dodgy-runtime.sh <<EOF
#!/bin/sh

cat config.json >>/tmp/dodgy-runtime.log
exit 1
EOF
% chmod +x /tmp/dodgy-runtime.sh
% docker daemon --add-runtime="dodgy=/tmp/dodgy-runtime.sh" --default-runtime=dodgy

Then try to start a container. It will fail, but you should be able to get the config.json from /tmp/dodgy-runtime.log. You can then modify it so that the rootfs entry is equal to the string "rootfs" and then replace bundle/config.json in my previous comment with the old file.

Then runC should fail to start. Paste the config you got here.

@jamiethermo commented

Ok. Can't do that right now. But since it seems arbitrary what runs and what fails (the same Docker image will run one minute and not the next), here's a config file that did get created. Don't know if that'll help. Will try the hack above tomorrow. Thanks!
config.json.zip

@hqhq (Contributor) commented Dec 14, 2016

For people who get "exit status x": grab the runc code you are using, then:

# cd libcontainer/nsenter
# gcc -E nsexec.c -o nsexec.i

Then you can find out which bail you hit from nsexec.i.

It's ugly though; we should improve it someday.

@cyphar (Member) commented Dec 14, 2016

@hqhq Or you can count from the start of the file (which is what I do). Vim even has a shortcut for it. But yes, the bail(...) code was a hack to get around the fact that we aren't writing our errors to the error pipe in nsexec -- the only information we get is the return code. :P
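Counting from the start of the file can also be scripted; for example (a sketch, assuming a checkout of the same runc version you are running and that the file still lives at libcontainer/nsenter/nsexec.c):

```shell
# Print the call site of the Nth bail() in nsexec.c, where N is the
# "exit status N" reported by runc (here N=6).
N=6
grep -n 'bail(' libcontainer/nsenter/nsexec.c | sed -n "${N}p"
```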

@jamiethermo commented

@cyphar Could I replace docker-runc with a bash script that saves off the config.json somewhere if it crashes? Could we make runc do that by default?

@cyphar (Member) commented Dec 15, 2016

Could I replace docker-runc with a bash script that saves off the config.json somewhere if it crashes?

You could try that. By the way, if you haven't created an upstream bug report (in Docker) please do so.

Could we make runc do that by default?

I don't want to, mainly because it'd only be helpful for debugging things in certain cases under Docker. And runC is not just used inside Docker.

@jamesongithub commented

The ECS team thinks this issue is causing their agent to disconnect at times. Referenced in aws/amazon-ecs-agent#658 (comment).

@jaredbroad commented Feb 1, 2017 via email

@jamesongithub commented

hm might have to try that

@jamesongithub commented

@cyphar is there a workaround for this, besides upgrading to Ubuntu 16?

@cyphar (Member) commented Mar 4, 2017

@jamesongithub It's likely that issues of this form are kernel issues (and since Ubuntu has interesting kernel policies, upgrading might be your only option), unless you have some very odd configurations. As I mentioned above, the error only tells us what line inside libcontainer/nsenter/nsexec.c failed (and unshare can fail for a wide variety of reasons).

@freefood89 commented

I've been having this issue with RHEL 7.3 too:
SELINUX=enforcing
SELINUXTYPE=targeted

Besides being inexperienced with things like namespaces and runc, I'm struggling to figure out what's going on because it's intermittent, as mentioned by @jamesongithub.

ls -la /proc/self/ns shows the same results as @hkjn's.

@frezbo commented Jan 10, 2018

For anyone having issues on RHEL: only enable namespace.unpriv_enable=1, not user_namespace.enable=1; having both on the kernel cmdline causes issues:

[ec2-user@ip-10-16-1-55 mycontainer]$ cat /proc/cmdline | grep "namespace.unpriv_enable=1"
BOOT_IMAGE=/boot/vmlinuz-3.10.0-693.11.6.el7.x86_64 root=UUID=de4def96-ff72-4eb9-ad5e-0847257d1866 ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 crashkernel=auto LANG=en_US.UTF-8 namespace.unpriv_enable=1
[ec2-user@ip-10-16-1-55 mycontainer]$ runc --root /tmp/runc run --no-pivot --no-new-keyring mycontainerid
/ # 
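On RHEL/CentOS, that cmdline flag can be made persistent with grubby (a sketch; verify against your own boot configuration before relying on it):

```shell
# Add namespace.unpriv_enable=1 to every installed kernel's cmdline and
# drop user_namespace.enable=1 if it was set; takes effect after reboot.
grubby --update-kernel=ALL \
       --remove-args="user_namespace.enable=1" \
       --args="namespace.unpriv_enable=1"

# After rebooting, confirm the flag is active:
grep -o 'namespace.unpriv_enable=1' /proc/cmdline
```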

@chadfurman commented Aug 10, 2018

I came here from Google for a similar error. It turns out I was trying to use the VOLUME directive in my Dockerfile like this:

VOLUME . /src

thinking I could mount the current directory from the host as a volume that way, but that's not how it works.

You have to instead do this:

VOLUME /src

followed by:

docker run -v /absolute/path/to/directory/on/host:/src <rest of your docker run command>

Note also (somewhat unrelated) that I was getting similar errors on Fedora simply because of SELinux. While I don't recommend doing the following for security reasons (see http://stopdisablingselinux.com/), it did work for me:

sudo setenforce 0
sudo systemctl restart docker
docker build -t image .
docker run image

@smileusd commented Aug 21, 2018

I hit the same problem when building and starting an image.

Sending build context to Docker daemon   220 MB
Step 1 : FROM warpdrive:tos-release-1-5
 ---> 769306738d96
Step 2 : COPY . /go/src/github.com/transwarp/warpdrive/
 ---> 07c99697b16e
Removing intermediate container 127c0e71a84b
Successfully built 07c99697b16e
/usr/bin/docker: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 6\\\"\"\n".
FATA[0301] exit status 125                              
make: *** [build] Error 1

Then I cleaned up a lot of images and containers and freed the caches, and the problem disappeared. But I don't think it's a cache problem, because the change in cache usage was tiny.

@meirwah commented Feb 14, 2019

@yipingxx commented

It is a bug in the kernel (3.10.0-327); try updating your kernel version.
