diff --git a/docs/terminals.md b/docs/terminals.md
new file mode 100644
index 00000000000..fc000e1a8b2
--- /dev/null
+++ b/docs/terminals.md
@@ -0,0 +1,314 @@
+# Terminals and Standard IO #
+
+*Note that the default configuration of `runc` (foreground, new terminal) is
+generally the best option for most users. This document exists to help explain
+what the purpose of the different modes is, and to try to steer users away from
+common mistakes and misunderstandings.*
+
+In general, most processes on Unix (and Unix-like) operating systems have 3
+standard file descriptors provided at the start, collectively referred to as
+"standard IO" (`stdio`):
+
+* `0`: standard-in (`stdin`), the input stream into the process
+* `1`: standard-out (`stdout`), the output stream from the process
+* `2`: standard-error (`stderr`), the error stream from the process
+
+When creating and running a container via `runc`, it is important to take care
+to structure the `stdio` the new container's process receives. In some ways
+containers are just regular processes, while in other ways they're an isolated
+sub-partition of your machine (in a similar sense to a VM). This means that the
+structure of IO is not as simple as with ordinary programs (which generally
+just use the file descriptors you give them).
+
+## Other File Descriptors ##
+
+Before we continue, it is important to note that processes can have more file
+descriptors than just `stdio`. By default in `runc` no other file descriptors
+will be passed to the spawned container process. If you wish to explicitly pass
+file descriptors to the container you have to use the `--preserve-fds` option.
+These ancillary file descriptors don't have any of the strange semantics
+discussed further in this document (those only apply to `stdio`) -- they are
+passed untouched by `runc`.
+
+It should be noted that `--preserve-fds` does not take individual file
+descriptors to preserve.
+Instead, it takes how many file descriptors (not
+including `stdio` or `LISTEN_FDS`) should be passed to the container. In the
+following example:
+
+```
+% runc run --preserve-fds 5
+```
+
+`runc` will pass the first `5` file descriptors (`3`, `4`, `5`, `6`, and `7` --
+assuming that `LISTEN_FDS` has not been configured) to the container.
+
+In addition to `--preserve-fds`, `LISTEN_FDS` file descriptors are passed
+automatically to allow for `systemd`-style socket activation. To extend the
+above example:
+
+```
+% LISTEN_PID=$pid_of_runc LISTEN_FDS=3 runc run --preserve-fds 5
+```
+
+`runc` will now pass the first `8` file descriptors (and it will also pass
+`LISTEN_FDS=3` and `LISTEN_PID=1` to the container). The first `3` (`3`, `4`,
+and `5`) were passed due to `LISTEN_FDS` and the other `5` (`6`, `7`, `8`, `9`,
+and `10`) were passed due to `--preserve-fds`. You should keep this in mind if
+you use `runc` directly in something like a `systemd` unit file. To disable
+this `LISTEN_FDS`-style passing just unset `LISTEN_FDS`.
+
+**Be very careful when passing file descriptors to a container process.** Due
+to some Linux kernel (mis)features, a container with access to certain types of
+file descriptors (such as `O_PATH` descriptors) outside of the container's root
+file system can use these to break out of the container's pivoted mount
+namespace. [This has resulted in CVEs in the past.][CVE-2016-9962]
+
+[CVE-2016-9962]: https://nvd.nist.gov/vuln/detail/CVE-2016-9962
+
+## Terminal Modes ##
+
+`runc` supports two distinct methods for passing `stdio` to the container's
+primary process:
+
+* [new terminal](#new-terminal) (`terminal: true`)
+* [pass-through](#pass-through) (`terminal: false`)
+
+When first using `runc` these two modes will look incredibly similar, but this
+can be quite deceptive as these different modes have quite different
+characteristics.
+
+By default, `runc spec` will create a configuration that will create a new
+terminal (`terminal: true`).
+However, if the `terminal: ...` line is not
+present in `config.json` then pass-through is the default.
+
+*In general we recommend using new terminal, because it means that tools like
+`sudo` will work inside your container. But pass-through can be useful if you
+know what you're doing, or if you're using `runc` as part of a non-interactive
+pipeline.*
+
+### New Terminal ###
+
+In new terminal mode, `runc` will create a brand-new "console" (or more
+precisely, a new pseudo-terminal using the container's namespaced
+`/dev/pts/ptmx`) for your contained process to use as its `stdio`.
+
+When you start a process in new terminal mode, `runc` will do the following:
+
+1. Create a new pseudo-terminal.
+2. Pass the slave end to the container's primary process as its `stdio`.
+3. Send the master end to a process to interact with the `stdio` for the
+   container's primary process ([details below](#runc-modes)).
+
+It should be noted that since a new pseudo-terminal is being used for
+communication with the container, some strange properties of pseudo-terminals
+might surprise you. For instance, by default, all new pseudo-terminals
+translate the byte `'\n'` to the sequence `'\r\n'` on both `stdout` and
+`stderr`. In addition there are [a whole range of `ioctls(2)` that can only
+interact with pseudo-terminal `stdio`][tty_ioctl(4)].
+
+> **NOTE**: In new terminal mode, all three `stdio` file descriptors are the
+> same underlying file. The reason for this is to match how a shell's `stdio`
+> looks to a process (as well as remove race condition issues with having to
+> deal with multiple master pseudo-terminal file descriptors). However this
+> means that it is not really possible to uniquely distinguish between `stdout`
+> and `stderr` from the caller's perspective.
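
As a rough illustration of the newline translation mentioned above, the following Python sketch opens a pseudo-terminal pair with `os.openpty()` and writes to the slave end the way a contained process would write to its `stdout`. This is not how `runc` sets up its console; the helper name `pty_roundtrip` is made up for this example, and it assumes a Linux host where new pseudo-terminals have `ONLCR` enabled by default.

```python
import os
import termios


def pty_roundtrip(payload, translate_newlines=True):
    """Write payload to a fresh pty's slave end and return what the master reads."""
    master_fd, slave_fd = os.openpty()
    try:
        if not translate_newlines:
            # Clear ONLCR in the output flags (index 1 of the attribute list)
            # so the terminal stops turning "\n" into "\r\n".
            attrs = termios.tcgetattr(slave_fd)
            attrs[1] &= ~termios.ONLCR
            termios.tcsetattr(slave_fd, termios.TCSANOW, attrs)
        os.write(slave_fd, payload)
        return os.read(master_fd, 1024)
    finally:
        os.close(slave_fd)
        os.close(master_fd)


if __name__ == "__main__":
    print(pty_roundtrip(b"hello\n"))
    print(pty_roundtrip(b"hello\n", translate_newlines=False))
```

On a typical Linux host the first call is expected to return the payload with `\r\n` at the end, and the second call to return it unchanged.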
+
+[tty_ioctl(4)]: https://linux.die.net/man/4/tty_ioctl
+
+### Pass-Through ###
+
+If you have already set up some file handles that you wish your contained
+process to use as its `stdio`, then you can ask `runc` to pass them through to
+the contained process (this is not necessarily the same as `--preserve-fds`'s
+passing of file descriptors -- [details below](#runc-modes)). As an example
+(assuming that `terminal: false` is set in `config.json`):
+
+```
+% echo input | runc run some_container > /tmp/log.out 2> /tmp/log.err
+```
+
+Here the container's various `stdio` file descriptors will be substituted with
+the following:
+
+* `stdin` will be sourced from the `echo input` pipeline.
+* `stdout` will be output into `/tmp/log.out` on the host.
+* `stderr` will be output into `/tmp/log.err` on the host.
+
+It should be noted that the actual file handles seen inside the container may
+be different [based on the mode `runc` is being used in](#runc-modes) (for
+instance, the file referenced by `1` could be `/tmp/log.out` directly or a pipe
+which `runc` is using to buffer output, based on the mode). However the net
+result will be the same in either case. In principle you could use the [new
+terminal mode](#new-terminal) in a pipeline, but the difference will become
+more clear when you are introduced to [`runc`'s detached mode](#runc-modes).
+
+## `runc` Modes ##
+
+`runc` itself runs in two modes:
+
+* [foreground](#foreground)
+* [detached](#detached)
+
+You can use either [terminal mode](#terminal-modes) with either `runc` mode.
+However, there are considerations that may indicate preference for one mode
+over another. It should be noted that while the two types of modes (terminal
+and `runc`) are conceptually independent from each other, you should be aware
+of the intricacies of which combination you are using.
+
+*In general we recommend using foreground because it's the most
+straight-forward to use, with the only downside being that you will have a
+long-running `runc` process. Detached mode is difficult to get right and
+generally requires having your own `stdio` management.*
+
+### Foreground ###
+
+The default (and most straight-forward) mode of `runc`. In this mode, your
+`runc` command remains in the foreground with the container process as a child.
+All `stdio` is buffered through the foreground `runc` process (irrespective of
+which terminal mode you are using). This is conceptually quite similar to
+running a normal process interactively in a shell (and if you are using `runc`
+in a shell interactively, this is what you should use).
+
+Because the `stdio` will be buffered in this mode, some very important
+peculiarities of this mode should be kept in mind:
+
+* With [new terminal mode](#new-terminal), the container will see a
+  pseudo-terminal as its `stdio` (as you might expect). However, the `stdio` of
+  the foreground `runc` process will remain the `stdio` that the process was
+  started with -- and `runc` will copy all `stdio` between its `stdio` and the
+  container's `stdio`. This means that while a new pseudo-terminal has been
+  created, the foreground `runc` process manages it over the lifetime of the
+  container.
+
+* With [pass-through mode](#pass-through), the foreground `runc`'s `stdio` is
+  **not** passed to the container. Instead, the container's `stdio` is a set of
+  pipes which are used to copy data between `runc`'s `stdio` and the
+  container's `stdio`. This means that the container never has direct access to
+  host file descriptors (aside from the pipes created by the container runtime,
+  but that shouldn't be an issue).
+
+The main drawback of the foreground mode of operation is that it requires a
+long-running foreground `runc` process.
+If you kill the foreground `runc`
+process then you will no longer have access to the `stdio` of the container
+(and in most cases this will result in the container dying abnormally due to
+`SIGPIPE` or some other error). By extension this means that any bug in the
+long-running foreground `runc` process (such as a memory leak) or a stray
+OOM-kill sweep could result in your container being killed **through no fault
+of the user**. In addition, there is no way in foreground mode of passing a
+file descriptor directly to the container process as its `stdio` (like
+`--preserve-fds` does).
+
+These shortcomings are obviously sub-optimal and are the reason that `runc` has
+an additional mode called "detached mode".
+
+### Detached ###
+
+In contrast to foreground mode, in detached mode there is no long-running
+foreground `runc` process once the container has started. In fact, there is no
+long-running `runc` process at all. However, this means that it is up to the
+caller to handle the `stdio` after `runc` has set it up for you. In a shell
+this means that the `runc` command will exit and control will return to the
+shell, after the container has been set up.
+
+You can run `runc` in detached mode in one of the following ways:
+
+* `runc run -d ...` which operates similarly to `runc run` but is detached.
+* `runc create` followed by `runc start` which is the standard container
+  lifecycle defined by the OCI runtime specification (`runc create` sets up the
+  container completely, waiting for `runc start` to begin execution of user
+  code).
+
+The main use-case of detached mode is for higher-level tools that want to be
+wrappers around `runc`. By running `runc` in detached mode, those tools have
+far more control over the container's `stdio` without `runc` getting in the
+way (most wrappers around `runc` like `cri-o` or `containerd` use detached mode
+for this reason).
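
The difference between the two `runc` modes when using pass-through can be imitated with an ordinary child process, with no container involved. The Python sketch below is only an analogy and not `runc`'s implementation: in the "foreground" case the child is handed a pipe that the parent drains, while in the "detached" case the child is handed the parent's own file descriptor. The names `CHILD` and `stdout_kind` are hypothetical.

```python
import subprocess
import sys
import tempfile

# Child program: report on stderr what kind of file its stdout (fd 1) is.
CHILD = (
    "import os, stat, sys\n"
    "mode = os.fstat(1).st_mode\n"
    "kind = 'fifo' if stat.S_ISFIFO(mode) else 'file' if stat.S_ISREG(mode) else 'other'\n"
    "sys.stderr.write(kind)\n"
)


def stdout_kind(direct):
    """Run the child and return the kind of file it saw as its stdout."""
    with tempfile.TemporaryFile() as log:
        if direct:
            # Detached pass-through analogue: the child receives our file
            # descriptor itself and writes to the log file directly.
            proc = subprocess.run([sys.executable, "-c", CHILD],
                                  stdout=log, stderr=subprocess.PIPE)
        else:
            # Foreground analogue: the child receives a pipe, and the parent
            # copies whatever arrives on it into the log file.
            proc = subprocess.run([sys.executable, "-c", CHILD],
                                  stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            log.write(proc.stdout)
        return proc.stderr.decode()


if __name__ == "__main__":
    print("foreground analogue:", stdout_kind(direct=False))
    print("detached analogue:", stdout_kind(direct=True))
```

In the pipe case the child depends on the parent staying alive to drain its output, which is the drawback of foreground mode described above. In the direct case the child holds a host file descriptor, which is why detached pass-through calls for more care.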
+
+Unfortunately using detached mode is a bit more complicated and requires more
+care than the foreground mode -- mainly because it is now up to the caller to
+handle the `stdio` of the container.
+
+#### Detached Pass-Through ####
+
+In detached mode, pass-through actually does what it says on the tin -- the
+`stdio` file descriptors of the `runc` process are passed through (untouched)
+to the container's `stdio`. The purpose of this option is to allow a user to
+set up `stdio` for a container themselves and then force `runc` to just use
+their pre-prepared `stdio` (without any pseudo-terminal funny business). *If
+you don't see why this would be useful, don't use this option.*
+
+**You must be incredibly careful when using detached pass-through (especially
+in a shell).** The reason for this is that by using detached pass-through you
+are passing host file descriptors to the container. In the case of a shell,
+usually your `stdio` is going to be a pseudo-terminal (on your host). A
+malicious container could take advantage of TTY-specific `ioctls` like
+`TIOCSTI` to fake input into the **host** shell (remember that in detached
+mode, control is returned to your shell and so the terminal you've given the
+container is being read by a shell prompt).
+
+There are also several other issues with running non-malicious containers in a
+shell with detached pass-through (where you pass your shell's `stdio` to the
+container):
+
+* Output from the container will be interleaved with output from your shell (in
+  a non-deterministic way), without any real way of distinguishing where a
+  particular piece of output came from.
+
+* Any input to `stdin` will be non-deterministically split and given to either
+  the container or the shell (because both are blocked on a `read(2)` of the
+  same FIFO-style file descriptor).
+
+They are all related to the fact that there is going to be a race when either
+your host or the container tries to read from (or write to) `stdio`.
+This problem is especially obvious when in a shell, where usually the terminal
+has been put into raw mode (where each individual key-press should cause
+`read(2)` to return).
+
+> **NOTE**: There is also currently a [known problem][issue-1721] where using
+> detached pass-through will result in the container hanging if the `stdout` or
+> `stderr` is a pipe (though this should be a temporary issue).
+
+[issue-1721]: https://github.com/opencontainers/runc/issues/1721
+
+#### Detached New Terminal ####
+
+When creating a new pseudo-terminal in detached mode, a fairly obvious
+problem appears -- how do we use the new terminal that `runc` created? Unlike
+in pass-through, `runc` has created a new set of file descriptors that need to
+be used by *something* in order for container communication to work.
+
+The way this problem is resolved is through the use of Unix domain sockets.
+There is a feature of Unix sockets called `SCM_RIGHTS` which allows a file
+descriptor to be sent through a Unix socket to a completely separate process
+(which can then use that file descriptor as though they opened it). When using
+`runc` in detached new terminal mode, this is how a user gets access to the
+pseudo-terminal's master file descriptor.
+
+To this end, there is a new option (which is required if you want to use `runc`
+in detached new terminal mode): `--console-socket`. This option takes the path
+to a Unix domain socket which `runc` will connect to and send the
+pseudo-terminal master file descriptor down. The general process for getting
+the pseudo-terminal master is as follows:
+
+1. Create a Unix domain socket at some path, `$socket_path`.
+2. Call `runc run` or `runc create` with the argument `--console-socket
+   $socket_path`.
+3. Using `recvmsg(2)` retrieve the file descriptor sent using `SCM_RIGHTS` by
+   `runc`.
+4. Now the manager can interact with the `stdio` of the container, using the
+   retrieved pseudo-terminal master.
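
The receiving side of the steps above might look like the following Python sketch. Since running `runc` itself is out of scope here, a hypothetical `fake_runtime` function stands in for it: it connects to the socket and sends a pseudo-terminal master using `SCM_RIGHTS`. The message payload it sends (`b"console"`) is an arbitrary placeholder rather than a statement about what `runc` sends, and `receive_console` and `demo` are likewise names invented for this example.

```python
import array
import os
import socket
import tempfile
import threading


def receive_console(listener):
    """Accept one connection and return the fd delivered via SCM_RIGHTS."""
    fds = array.array("i")
    conn, _ = listener.accept()
    with conn:
        # Ancillary buffer sized for a single file descriptor.
        _msg, ancdata, _flags, _addr = conn.recvmsg(
            4096, socket.CMSG_LEN(fds.itemsize))
    for level, kind, data in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            # Drop any trailing partial integer before decoding.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]


def fake_runtime(socket_path, master_fd):
    """Stand-in for runc (assumption): connect and send the pty master."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendmsg(
            [b"console"],  # placeholder payload
            [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
              array.array("i", [master_fd]))])


def demo():
    master_fd, slave_fd = os.openpty()
    with tempfile.TemporaryDirectory() as tmp:
        socket_path = os.path.join(tmp, "console.sock")
        # Step 1: create a Unix domain socket at some path.
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as listener:
            listener.bind(socket_path)
            listener.listen(1)
            # Step 2 (simulated): the runtime is given the socket path.
            sender = threading.Thread(
                target=fake_runtime, args=(socket_path, master_fd))
            sender.start()
            # Step 3: retrieve the descriptor with recvmsg(2).
            received_fd = receive_console(listener)
            sender.join()
    # The sender's copy goes away, leaving the receiver as the only holder.
    os.close(master_fd)
    # Step 4: "container" output written to the slave shows up on the master.
    os.write(slave_fd, b"hello from the container\n")
    data = os.read(received_fd, 1024)
    os.close(received_fd)
    os.close(slave_fd)
    return data


if __name__ == "__main__":
    print(demo())
```

The received descriptor is a new descriptor in the receiving process that refers to the same open pseudo-terminal master, so closing the sender's copy does not affect it.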
+
+After `runc` exits, the only process with a copy of the pseudo-terminal master
+file descriptor is whoever read the file descriptor from the socket.
+
+> **NOTE**: Currently `runc` doesn't support abstract socket addresses (due to
+> it not being possible to pass an `argv` with a null-byte as the first
+> character). In the future this may change, but currently you must use a valid
+> path name.
+
+In order to help users make use of detached new terminal mode, we have provided
+a [Go implementation in the `go-runc` bindings][containerd/go-runc.Socket], as
+well as [a simple client][recvtty].
+
+[containerd/go-runc.Socket]: https://godoc.org/github.com/containerd/go-runc#Socket
+[recvtty]: /contrib/cmd/recvtty