seccomp: force -ENOSYS as new default action #2737

cyphar · 2021-01-14T09:47:55Z

Having -EPERM is the default was a fairly significant mistake from a
future-proofing standpoint in that it makes any new syscall return a
non-ignorable error (from glibc's point of view). We need to correct
this now because faccessat2(2) is something glibc critically needs to
have support for, but they're blocked on container runtimes because we
return -EPERM unconditionally (leading to confusion in glibc).

Unfortunately there are several issues which stop us from having a clean
solution to this problem:

libseccomp has several limitations which require us to emulate
behaviour we want:

a. We cannot do logic based on syscall number, meaning we cannot
specify a "largest known syscall number";
b. libseccomp doesn't know in which kernel version a syscall was
added, and has no API for "minimum kernel version" so we cannot
simply ask libseccomp to generate sane -ENOSYS rules for us.
c. Additional seccomp rules for the same syscall are not treated as
distinct rules -- if rules overlap, seccomp will merge them. This
means we cannot add per-syscall -EPERM fallbacks;
d. There is no inverse operation for SCMP_CMP_MASKED_EQ;
e. libseccomp does not allow you to specify multiple rules for a
single argument, making it impossible to invert OR rules for
arguments.
The runtime-spec does not have any way of specifying:

a. The errno for the default action;
b. The minimum kernel version or "newest syscall at time of profile
creation"; nor
c. Which syscalls were intentionally excluded from the allow list
(weird syscalls that are no longer used were excluded entirely,
but Docker et al expect those syscalls to get EPERM not ENOSYS).
Certain syscalls should not return -ENOSYS (especially only for
certain argument combinations) because this could also trigger glibc
confusion. This means we have to return -EPERM for certain syscalls
but not as a global default.
There is not an obvious (and reasonable) upper limit to syscall
numbers, so we cannot create a set of rules for each syscall above
the largest syscall number in libseccomp. This means we must handle
inverse rules as described below.
Any syscall can be specified multiple times, which can make
generation of hotfix rules much harder.

As a result, we have to work around all of these things by coming up
with a heuristic to stop the bleeding. In the future we could hopefully
improve the situation in the runtime-spec and libseccomp.

The solution is to switch the default action errno to -ENOSYS, and
simply pick a Linux kernel version which is treated as the universal
minimum version (Linux 3.0 was picked since basically no widely-used
distribution ships an older kernel, and it contains enough common
syscalls which would cause confusion if -ENOSYS was returned).

We then generate -EPERM rules for each syscall which:

Existed in Linux 3.0 (read: has a syscall number smaller than
"setns" on that platform) and does not have any explicit rules in
the configuration. These syscalls get an unconditional -EPERM,
because they are assumed to have been excluded intentionally.
Have an explicit conditional rule. These syscalls get a set of
-EPERM rules which invert the logic of the conditional, to ensure
that syscalls which are only partially permitted do not return
-ENOSYS.

This is trivial for all operations except SCMP_CMP_MASKED_EQ,
which has no inverse operatin. For MASKED_EQ we generate a series of
MASKED_EQ rules except for the value which was checked in the
original rule.

There are additional complications, which are all treated as simply
"too hard" and will thus fall back to -ENOSYS:

a. Any syscall specified in more than one seccomp rule is treated as
"too hard to handle" and will be ignored. This is because they
are logical OR rules but things become far too complicated if the
action is non-trivial.
b. Syscalls which have OR rules (the same argument specified
multiple times) because it would require AND support in
libseccomp, which doesn't exist.
c. Syscalls which do not have allow rules (ACT_LOG or ACT_ALLOW),
because other actions have different logical meanings to allow
rules and it'd be too complicated to handle those as well for a
hotfix.

None of the above happens if the default action is not SCMP_RET_ERRNO.

Fixes #2151
Signed-off-by: Aleksa Sarai [email protected]

cyphar · 2021-01-14T09:49:59Z

I'm not sure if this works in all scenarios because I haven't tested this thoroughly yet, but I'm pushing this so folks can comment.

/cc @richfelker @jethrob @fweimer Does this seem reasonable-ish to you? This is more complicated than I'd like but it's a result of several legacy aspects of runc's seccomp code (one of which is #2735 -- and is probably a bug).

richfelker · 2021-01-14T16:22:12Z

@cyphar: I still don't entirely understand the need for inverse rules rather than a well-defined fallback order, but if the latter is hard, I guess they work. At least the patch does not look too large/invasive and the high-level explanation of what it's doing seems to be reasonable and in alignment with what I expected.

cyphar · 2021-01-14T23:22:49Z

@richfelker As far as I understand, libseccomp doesn't have per-syscall fallback order (or at least it's not guaranteed to do so) -- if you add an unconditional -EPERM rule for a syscall it will remove any other rules present for that syscall. I'm pretty sure that the code for generating this is imperfect so we could probably hack around it (by doing MASKED_EQ of 0 to 0) but it seems a bit like playing with fire to do something like that.

Once seccomp/libseccomp#286 is resolved we can switch away from this workaround.

cyphar · 2021-01-15T05:49:35Z

It appears that libseccomp doesn't support a single rule referencing the same argument more than once (this is enforced by db_col_rule_add but appears to be undocumented). This means that we cannot invert OR-ified rules properly (as per De-Morgan's laws, a negation of OR statements becomes an AND statement of negations).

Just to make sure, I've tried adding a basic fallback rule (even with the MASKED_EQ) trick and as far as I can tell it does override existing rules -- though strangely it only overrides conditional rules, and non-conditional rules are not overridden (my guess it's because they are treated separately internally). So my comment above about per-syscall overrides is also correct.

I'm starting to get a creeping feeling that this might've been simpler to implement in BPF without any other workarounds.

/cc @pcmoore Am I misunderstanding libseccomp, or are these real restrictions on filters created with libseccomp?

cyphar · 2021-01-15T07:18:43Z

Okay this latest version handles things pretty well. The main thing which we cannot handle is OR rules for arguments (because of the above libseccomp restriction) but this doesn't impact the default Docker rules and thus is not a major issue -- besides I would argue the current implementation is actually a bug (see #2735).

All-in-all, I've tested this with a variety of cases and it does appear to work correctly. I'm marking this as ready for review -- PTAL @opencontainers/runc-maintainers. I'd also appreciate a review from @justincormack et al.

Having -EPERM is the default was a fairly significant mistake from a future-proofing standpoint in that it makes any new syscall return a non-ignorable error (from glibc's point of view). We need to correct this now because faccessat2(2) is something glibc *critically* needs to have support for, but they're blocked on container runtimes because we return -EPERM unconditionally (leading to confusion in glibc). Unfortunately there are several issues which stop us from having a clean solution to this problem: 1. libseccomp has several limitations which require us to emulate behaviour we want: a. We cannot do logic based on syscall number, meaning we cannot specify a "largest known syscall number"; b. libseccomp doesn't know in which kernel version a syscall was added, and has no API for "minimum kernel version" so we cannot simply ask libseccomp to generate sane -ENOSYS rules for us. c. Additional seccomp rules for the same syscall are not treated as distinct rules -- if rules overlap, seccomp will merge them. This means we cannot add per-syscall -EPERM fallbacks; d. There is no inverse operation for SCMP_CMP_MASKED_EQ; e. libseccomp does not allow you to specify multiple rules for a single argument, making it impossible to invert OR rules for arguments. 2. The runtime-spec does not have any way of specifying: a. The errno for the default action; b. The minimum kernel version or "newest syscall at time of profile creation"; nor c. Which syscalls were intentionally excluded from the allow list (weird syscalls that are no longer used were excluded entirely, but Docker et al expect those syscalls to get EPERM not ENOSYS). 3. Certain syscalls should not return -ENOSYS (especially only for certain argument combinations) because this could also trigger glibc confusion. This means we have to return -EPERM for certain syscalls but not as a global default. 4. There is not an obvious (and reasonable) upper limit to syscall numbers, so we cannot create a set of rules for each syscall above the largest syscall number in libseccomp. This means we must handle inverse rules as described below. 5. Any syscall can be specified multiple times, which can make generation of hotfix rules much harder. As a result, we have to work around all of these things by coming up with a heuristic to stop the bleeding. In the future we could hopefully improve the situation in the runtime-spec and libseccomp. The solution is to switch the default action errno to -ENOSYS, and simply pick a Linux kernel version which is treated as the universal minimum version (Linux 3.0 was picked since basically no widely-used distribution ships an older kernel, and it contains enough common syscalls which would cause confusion if -ENOSYS was returned). We then generate -EPERM rules for each syscall which: 1. Existed in Linux 3.0 (read: has a syscall number smaller than "setns" on that platform) and does not have any explicit rules in the configuration. These syscalls get an unconditional -EPERM, because they are assumed to have been excluded intentionally. 2. Have an explicit *conditional* rule. These syscalls get a set of -EPERM rules which invert the logic of the conditional, to ensure that syscalls which are only partially permitted do not return -ENOSYS. This is trivial for all operations *except* SCMP_CMP_MASKED_EQ, which has no inverse operatin. For MASKED_EQ we generate a series of MASKED_EQ rules except for the value which was checked in the original rule. There are additional complications, which are all treated as simply "too hard" and will thus fall back to -ENOSYS: a. Any syscall specified in more than one seccomp rule is treated as "too hard to handle" and will be ignored. This is because they are logical OR rules but things become far too complicated if the action is non-trivial. b. Syscalls which have OR rules (the same argument specified multiple times) because it would require AND support in libseccomp, which doesn't exist. c. Syscalls which do not have allow rules (ACT_LOG or ACT_ALLOW), because other actions have different logical meanings to allow rules and it'd be too complicated to handle those as well for a hotfix. None of the above happens if the default action is not SCMP_RET_ERRNO. Signed-off-by: Aleksa Sarai <[email protected]>

cyphar · 2021-01-15T08:33:40Z

I have a test program I've been using, but I'm not sure what's the nicest way to convert it to an integration test:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

#include <sys/syscall.h>
#include <linux/openat2.h>

int openat2(int dfd, const char *path, struct open_how *how, size_t usize)
{
	int ret = syscall(SYS_openat2, dfd, path, how, usize);
	if (ret < 0)
		ret = -errno;
	return ret;
}

int pidfd_open(pid_t pid, unsigned int flags)
{
	int ret = syscall(SYS_pidfd_open, pid, flags);
	if (ret < 0)
		ret = -errno;
	return ret;
}

int _dup3(int oldfd, int newfd, int flags)
{
	int ret = dup3(oldfd, newfd, flags);
	if (ret < 0)
		ret = -errno;
	return ret;
}

int fake_syscall(void)
{
	int ret = syscall(1000);
	if (ret < 0)
		ret = -errno;
	return ret;
}

#define debug_assert(lval, rval)											\
	do {																	\
		int L = (lval), R = (rval);											\
		if (L != R) {														\
			printf("debug_assert(%s == %s) failed: %d != %d\n", #lval, #rval, L, R); \
		}																	\
	} while (0)

int main(void)
{
	// Basic rule.

	// Multiple arguments with OR rules.
	debug_assert(openat2(100, NULL, NULL, 0), -EINVAL);
	debug_assert(openat2(9001, NULL, NULL, 0), -EINVAL);
	debug_assert(openat2(0, NULL, NULL, 0), -ENOSYS); // :(
	debug_assert(openat2(0, NULL, NULL, 0), -ENOSYS); // :(

	// Multiple arguments with AND rules.
	debug_assert(pidfd_open(1, 1337), -EINVAL);
	debug_assert(pidfd_open(1, 0), -EPERM);
	debug_assert(pidfd_open(500, 1337), -EPERM);
	debug_assert(pidfd_open(500, 500), -EPERM);

	// Multiple rules for the same syscall.
	debug_assert(_dup3(0, -100, 0xFFFF), -ENOSYS);
	debug_assert(_dup3(1, -100, 0xFFFF), -EINVAL);
	debug_assert(_dup3(2, -100, 0xFFFF), -ENOSYS);
	debug_assert(_dup3(3, -100, 0xFFFF), -EINVAL);

	// Out-of-bounds syscall.
	debug_assert(fake_syscall(), -ENOSYS);
}

With the seccomp section in config.json being set to:

config.json

		"seccomp": {
			"defaultAction": "SCMP_ACT_ERRNO",
			"architectures": [
				"SCMP_ARCH_X86",
				"SCMP_ARCH_X32",
				"SCMP_ARCH_X86_64"
			],
			"syscalls": [
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"accept",
						"accept4",
						"access",
						"adjtimex",
						"alarm",
						"arch_prctl",
						"bind",
						"brk",
						"capget",
						"capset",
						"chdir",
						"chmod",
						"chown",
						"chown32",
						"clock_adjtime",
						"clock_adjtime64",
						"clock_getres",
						"clock_getres_time64",
						"clock_gettime",
						"clock_gettime64",
						"clock_nanosleep",
						"clock_nanosleep_time64",
						"close",
						"connect",
						"copy_file_range",
						"creat",
						"dup",
						"dup2",
						"epoll_create",
						"epoll_create1",
						"epoll_ctl",
						"epoll_ctl_old",
						"epoll_pwait",
						"epoll_wait",
						"epoll_wait_old",
						"eventfd",
						"eventfd2",
						"execve",
						"execveat",
						"exit",
						"exit_group",
						"faccessat",
						"faccessat2",
						"fadvise64",
						"fadvise64_64",
						"fallocate",
						"fanotify_mark",
						"fchdir",
						"fchmod",
						"fchmodat",
						"fchown",
						"fchown32",
						"fchownat",
						"fcntl",
						"fcntl64",
						"fdatasync",
						"fgetxattr",
						"flistxattr",
						"flock",
						"fork",
						"fremovexattr",
						"fsetxattr",
						"fstat",
						"fstat64",
						"fstatat64",
						"fstatfs",
						"fstatfs64",
						"fsync",
						"ftruncate",
						"ftruncate64",
						"futex",
						"futex_time64",
						"futimesat",
						"getcpu",
						"getcwd",
						"getdents",
						"getdents64",
						"getegid",
						"getegid32",
						"geteuid",
						"geteuid32",
						"getgid",
						"getgid32",
						"getgroups",
						"getgroups32",
						"getitimer",
						"getpeername",
						"getpgid",
						"getpgrp",
						"getpid",
						"getppid",
						"getpriority",
						"getrandom",
						"getresgid",
						"getresgid32",
						"getresuid",
						"getresuid32",
						"getrlimit",
						"get_robust_list",
						"getrusage",
						"getsid",
						"getsockname",
						"getsockopt",
						"get_thread_area",
						"gettid",
						"gettimeofday",
						"getuid",
						"getuid32",
						"getxattr",
						"inotify_add_watch",
						"inotify_init",
						"inotify_init1",
						"inotify_rm_watch",
						"io_cancel",
						"ioctl",
						"io_destroy",
						"io_getevents",
						"io_pgetevents",
						"io_pgetevents_time64",
						"ioprio_get",
						"ioprio_set",
						"io_setup",
						"io_submit",
						"io_uring_enter",
						"io_uring_register",
						"io_uring_setup",
						"ipc",
						"kill",
						"lchown",
						"lchown32",
						"lgetxattr",
						"link",
						"linkat",
						"listen",
						"listxattr",
						"llistxattr",
						"_llseek",
						"lremovexattr",
						"lseek",
						"lsetxattr",
						"lstat",
						"lstat64",
						"madvise",
						"membarrier",
						"memfd_create",
						"mincore",
						"mkdir",
						"mkdirat",
						"mknod",
						"mknodat",
						"mlock",
						"mlock2",
						"mlockall",
						"mmap",
						"mmap2",
						"modify_ldt",
						"mprotect",
						"mq_getsetattr",
						"mq_notify",
						"mq_open",
						"mq_timedreceive",
						"mq_timedreceive_time64",
						"mq_timedsend",
						"mq_timedsend_time64",
						"mq_unlink",
						"mremap",
						"msgctl",
						"msgget",
						"msgrcv",
						"msgsnd",
						"msync",
						"munlock",
						"munlockall",
						"munmap",
						"nanosleep",
						"newfstatat",
						"_newselect",
						"open",
						"openat",
						"pause",
						"pidfd_send_signal",
						"pipe",
						"pipe2",
						"poll",
						"ppoll",
						"ppoll_time64",
						"prctl",
						"pread64",
						"preadv",
						"preadv2",
						"prlimit64",
						"pselect6",
						"pselect6_time64",
						"pwrite64",
						"pwritev",
						"pwritev2",
						"read",
						"readahead",
						"readlink",
						"readlinkat",
						"readv",
						"recv",
						"recvfrom",
						"recvmmsg",
						"recvmmsg_time64",
						"recvmsg",
						"remap_file_pages",
						"removexattr",
						"rename",
						"renameat",
						"renameat2",
						"restart_syscall",
						"rmdir",
						"rseq",
						"rt_sigaction",
						"rt_sigpending",
						"rt_sigprocmask",
						"rt_sigqueueinfo",
						"rt_sigreturn",
						"rt_sigsuspend",
						"rt_sigtimedwait",
						"rt_sigtimedwait_time64",
						"rt_tgsigqueueinfo",
						"sched_getaffinity",
						"sched_getattr",
						"sched_getparam",
						"sched_get_priority_max",
						"sched_get_priority_min",
						"sched_getscheduler",
						"sched_rr_get_interval",
						"sched_rr_get_interval_time64",
						"sched_setaffinity",
						"sched_setattr",
						"sched_setparam",
						"sched_setscheduler",
						"sched_yield",
						"seccomp",
						"select",
						"semctl",
						"semget",
						"semop",
						"semtimedop",
						"semtimedop_time64",
						"send",
						"sendfile",
						"sendfile64",
						"sendmmsg",
						"sendmsg",
						"sendto",
						"setfsgid",
						"setfsgid32",
						"setfsuid",
						"setfsuid32",
						"setgid",
						"setgid32",
						"setgroups",
						"setgroups32",
						"setitimer",
						"setpgid",
						"setpriority",
						"setregid",
						"setregid32",
						"setresgid",
						"setresgid32",
						"setresuid",
						"setresuid32",
						"setreuid",
						"setreuid32",
						"setrlimit",
						"set_robust_list",
						"setsid",
						"setsockopt",
						"set_thread_area",
						"set_tid_address",
						"setuid",
						"setuid32",
						"setxattr",
						"shmat",
						"shmctl",
						"shmdt",
						"shmget",
						"shutdown",
						"sigaltstack",
						"signalfd",
						"signalfd4",
						"sigprocmask",
						"sigreturn",
						"socket",
						"socketcall",
						"socketpair",
						"splice",
						"stat",
						"stat64",
						"statfs",
						"statfs64",
						"statx",
						"symlink",
						"symlinkat",
						"sync",
						"sync_file_range",
						"syncfs",
						"sysinfo",
						"tee",
						"tgkill",
						"time",
						"timer_create",
						"timer_delete",
						"timer_getoverrun",
						"timer_gettime",
						"timer_gettime64",
						"timer_settime",
						"timer_settime64",
						"timerfd_create",
						"timerfd_gettime",
						"timerfd_gettime64",
						"timerfd_settime",
						"timerfd_settime64",
						"times",
						"tkill",
						"truncate",
						"truncate64",
						"ugetrlimit",
						"umask",
						"uname",
						"unlink",
						"unlinkat",
						"utime",
						"utimensat",
						"utimensat_time64",
						"utimes",
						"vfork",
						"vmsplice",
						"wait4",
						"waitid",
						"waitpid",
						"write",
						"writev"
					]
				},
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"dup3"
					],
					"args": [
						{
							"index": 0,
							"value": 1,
							"op": "SCMP_CMP_EQ"
						}
					]
				},
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"dup3"
					],
					"args": [
						{
							"index": 0,
							"value": 2,
							"op": "SCMP_CMP_GT"
						}
					]
				},
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"pidfd_open"
					],
					"args": [
						{
							"index": 0,
							"value": 1,
							"op": "SCMP_CMP_EQ"
						},
						{
							"index": 1,
							"value": 1337,
							"op": "SCMP_CMP_EQ"
						}
					]
				},
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"openat2"
					],
					"args": [
						{
							"index": 0,
							"value": 100,
							"op": "SCMP_CMP_EQ"
						},
						{
							"index": 0,
							"value": 9001,
							"op": "SCMP_CMP_EQ"
						}
					]
				},
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"clone"
					],
					"args": [
						{
							"index": 0,
							"value": 2114060288,
							"op": "SCMP_CMP_MASKED_EQ"
						}
					]
				}
			]
		}

giuseppe · 2021-01-15T09:18:40Z

it seems to work well with the default Podman seccomp profile, that has some weird handling for socket: https://github.com/containers/common/blob/master/pkg/seccomp/seccomp.json#L791-L897

AkihiroSuda · 2021-01-15T09:25:13Z

I have a test program I've been using, but I'm not sure what's the nicest way to convert it to an integration test:

I think we can just compile and run the C code from bats

cyphar · 2021-01-15T09:39:13Z

@giuseppe That example is actually ignored entirely by the workaround code -- you will get -ENOSYS as a fallback because we can't generate a proper rule-set for the fallback. Hopefully that doesn't break anything 🤞.

cyphar · 2021-01-15T11:22:00Z

I hit my first -ENOSYS issue -- renameat2 in glibc hides -ENOSYS and will return a fake -EINVAL if you have non-zero flag arguments. I've worked around it but boy do I really hope there aren't many more regressions like this...

giuseppe · 2021-01-15T11:23:48Z

@giuseppe That example is actually ignored entirely by the workaround code -- you will get -ENOSYS as a fallback because we can't generate a proper rule-set for the fallback. Hopefully that doesn't break anything crossed_fingers.

I see. I guess it works because socket doesn't rely on the default return value in this case, and it will be allowed by the next rule.

Differently, both personality and clone get -ENOSYS by default, but I think these syscalls are more difficult to cause issues

cyphar · 2021-01-15T11:27:14Z

clone shouldn't be getting ENOSYS -- while the podman/docker profile has multiple references to it, only one rule is actually generated (at least according to my tests). clone should still be getting EPERM but yeah the personality configuration appears to trigger ENOSYS too...

fweimer-rh · 2021-01-15T11:28:29Z

I hit my first -ENOSYS issue -- renameat2 in glibc hides -ENOSYS and will return a fake -EINVAL if you have non-zero flag arguments. I've worked around it but boy do I really hope there aren't many more regressions like this...

mlock2 does this as well. The idea is to mirror what the kernel would do if it encounters an unknown/unsupported flag. If the system call does not have a flag argument, then all flags are unsupported.

I don't see how this could be a problem. Could you please elaborate?

You can make a direct system call using syscall (SYS_renameat2, …) to avoid this, but SYS_renameat2 might not be defined if it's missing from the kernel headers (likewise the renameat2 function is only available since glibc 2.28).

cyphar · 2021-01-15T11:31:08Z

@fweimer-rh The (potential) issue is that programs which used to get -EPERM when a seccomp profile was violated (such as "no clone(CLONE_NEWUSER) for containers") may now start getting strange errors like -EINVAL or -ENOENT depending on the glibc fallback code. While it is still an error, it is possible that userspace programs have fallbacks for -EPERM but not for -EINVAL (or -ENOSYS). runc has fallback code that uses os.IsPermission which won't catch -EINVAL nor -ENOENT.

I don't think there is a nice solution for this outside of making libseccomp be able to handle this whole problem more nicely, but it is something that does worry me a fair bit...

fweimer-rh · 2021-01-15T11:42:36Z

@cyphar The EINVAL behavior was specifically motivated by the existing kernel behavior. clone(2) says this:

EINVAL

CLONE_NEWUSER was specified in the flags mask, but the kernel was not configured with the CONFIG_USER_NS option.

My mental model is that seccomp is essentially deconfiguring otherwise supported kernel interfaces, so returning EINVAL for filtered flags is not wrong per se.

(Note that glibc doesn't do anything like that with clone3 currently, so this is hypothetical at this point, but clone(2) is fairly well-documented, so it's a good example.)

cyphar · 2021-01-15T13:02:08Z

@fweimer-rh

My mental model is that seccomp is essentially deconfiguring otherwise supported kernel interfaces, so returning EINVAL for filtered flags is not wrong per se.

I appreciate this is a bit of a philosophical discussion, but my mental model is that we are acting like an ad-hoc MAC or LSM-like system which is explicitly restricting features. This isn't true for unknown syscalls, but it is definitely true for cases like clone(CLONE_NEWUSER).

pcmoore · 2021-01-15T14:21:06Z

It appears that libseccomp doesn't support a single rule referencing the same argument more than once (this is enforced by db_col_rule_add but appears to be undocumented).

My hope was that the seccomp_rule_add(3) manpage made this clear (excerpt below), suggestions on how to improve the documentation are always welcome.

When adding syscall argument comparisons to the filter it is important to remember that while it is possible to have multiple comparisons in a single rule, you can only compare each argument once in a single rule. In other words, you can not have multiple comparisons of the 3rd syscall argument in a single rule.

https://www.mankier.com/3/seccomp_rule_add

cyphar · 2021-01-15T14:49:55Z

@pcmoore It seems that text is only available in man '3+1' seccomp_rule_add on my distribution, not man 3 secomp_rule_add. So it is documented, just not in the version of the documentation I was reading. 😬

cyphar · 2021-01-15T14:57:15Z

Seems like the --debug test is flaky, it's failed about 30% of the time I've re-run CI on this PR...

cyphar · 2021-01-15T15:14:42Z

Hmm, now the test is failing on CentOS 7 with no warnings or any output other than runc run exiting with a status of 218. Oh and it looks like the GHA integration tests aren't catching test failures under Vagrant? 😬

pcmoore · 2021-01-15T15:36:35Z

@pcmoore It seems that text is only available in man '3+1' seccomp_rule_add on my distribution, not man 3 secomp_rule_add. So it is documented, just not in the version of the documentation I was reading.

Sometimes it's easy to miss things in the docs too; I've done it more times than I care to remember.

The man 3+1 <blah> is new to me, and a quick search through man's manpage doesn't show anything jumping out. Odd. Out of curiosity, what distro are you using?

justincormack · 2021-01-15T15:50:39Z

This seems totally the wrong place to do this, runc is (sigh) just doing what it is told and if you have to second guess its behaviour too thats even worse. There are only a few seccomp policies in the world, just fix them.

richfelker · 2021-01-15T16:37:12Z

It appears that libseccomp doesn't support a single rule referencing the same argument more than once (this is enforced by db_col_rule_add but appears to be undocumented).

My hope was that the seccomp_rule_add(3) manpage made this clear (excerpt below), suggestions on how to improve the documentation are always welcome.

I don't think this is a documentation problem. The underlying problem is that the DSL/libseccomp API for this is just utterly misdesigned in a way that's anti-intuitive, harder to use than, and less powerful than, raw BPF. Is there no way it can be fixed to be somewhat less bad while maintaining compatibility with important caller expectations? If not, can the necessary fixups just be done on the resulting BPF rather than via the deficient libseccomp API?

pcmoore · 2021-01-15T16:54:07Z

I don't think this is a documentation problem. The underlying problem is that the DSL/libseccomp API for this is just utterly misdesigned in a way that's anti-intuitive, harder to use than, and less powerful than, raw BPF. Is there no way it can be fixed to be somewhat less bad while maintaining compatibility with important caller expectations? If not, can the necessary fixups just be done on the resulting BPF rather than via the deficient libseccomp API?

That libseccomp library is just awful isn't it? As I understand it the libseccomp maintainers are open to accepting patches to improve the library; which is good, because they are clearly a bunch of idiots who don't know what they are doing.

Alternatively you could simply write your own seccomp filters in raw BPF, that's clearly what a real programmer would do in this case. The libseccomp maintainers openly say that libseccomp is not for everyone or every use case; they say the goal is to lower the bar for the common seccomp use cases in hopes that more applications/developers will make use of seccomp. However, what they should really be saying is that libseccomp is always worse than writing raw BPF and only idiots should use it.

cyphar · 2021-01-16T00:04:24Z

@pcmoore

The man 3+1 is new to me, and a quick search through man's manpage doesn't show anything jumping out. Odd. Out of curiosity, what distro are you using?

openSUSE. The seccomp_rule_add(3+1) man page is an entirely separate doc and has several paragraphs that are missing from seccomp_rule_add(3).

@justincormack

This seems totally the wrong place to do this, runc is (sigh) just doing what it is told and if you have to second guess its behaviour too thats even worse. There are only a few seccomp policies in the world, just fix them.

So switch the default errno to -ENOSYS and then add explicit -EPERM rules? The issue is that the current implementation of this is a workaround (I would hope it would only exist for a few months at most) -- the long-term goal would be that seccomp profiles indicate what their minimum kernel version is, so that the -EPERM/-ENOSYS split is not handled by the profile author explicitly.

The other (potentially simpler) solution is to temporarily implement this in BPF ourselves until libseccomp has the minimum kernel feature we need (there are some other features that would be nice like supporting multiple AND rules for the same argument, but they aren't mandatory).

You also can't specify the return value in the default action, so we will need to change runc somehow.

Vagrantfile.centos7

tests/integration/seccomp.bats

tests/integration/testdata/.gitignore

tests/integration/testdata/seccomp_syscall_test1.c

tests/integration/seccomp.bats

cyphar · 2021-01-16T13:33:33Z

@richfelker

If not, can the necessary fixups just be done on the resulting BPF rather than via the deficient libseccomp API?

Leaving aside the way you phrased your point, I just tried my hand at this and it's at least somewhat non-trivial to get right. Because libseccomp doesn't have per-syscall fallback paths everything goes through a final reta instruction at the end (except for per-syscall SECCOMP_RET_ERRNO rules maybe?) -- the result is that we would need to figure out how to disentangle the different failures when patching. We would also need to do this in an architecture-specific way and we'd still have issues with holes in the syscall table.

All of this is far more complicated than just making the default -ENOSYS and returning -EPERM for syscalls present in Linux 3.0 but not mentioned in the filter. For the cases where we are returning -ENOSYS but we previously returned -EPERM, this PR does handle more of those cases but not all of them -- and ideally this hotfix should be as simple as possible without breaking containers.

So for now it may just be simplest to do the most basic -ENOSYS switch with trivial -EPERM handling (return -EPERM for any syscall not mentioned in the profile at all but was present in Linux 3.0). I think that this PR will lead to less breakages than the most trivial -EPERM handling, but maybe we should see what breaks before we commit to more complicated filter modifications.

After that I can work on cooking up our own BPF generation that handles these cases in the way we would like them to be handled. It is a bit sad we'll have to reinvent most of libseccomp's architecture handling, but we will still use the libseccomp syscall name lookup and architecture information to avoid having to handle all of that ourselves too... And we will need kernel version information anyway (which I will work on implementing in libseccomp).

I do think long-term we could switch back to libseccomp but there are a few other limitations we would ideally like to have fixed in libseccomp (the multiple-rule-for-a-single-argument restriction would be great to remove -- separately from this whole issue).

The slightly positive thing is that we can use the golang.org/x/net/bpf emulation code to do unit testing of our generated programs.

cyphar · 2021-01-17T10:20:12Z

#2746 is a hotfix variant of this PR that only does -EPERM injection for the "easy" case. I am working on the custom BPF generation separately...

Signed-off-by: Aleksa Sarai <[email protected]>

giuseppe · 2021-01-17T15:04:31Z

You also can't specify the return value in the default action, so we will need to change runc somehow.

do you think it makes sense to update the runtime specs now to allow overriding the default action?

cyphar · 2021-01-17T21:52:48Z

@giuseppe I assume you meant the default errno for the default action? That's probably not a bad idea to do anyway, but that's not actually enough to solve this problem -- ideally the ENOSYS behaviour should be auto-generated by the container runtime because it'd be too complicated for users of runc to generate it (not to mention the whole thing is a footgun).

So we will also need to add either a "minimal kernel version" field to the spec or otherwise define the -ENOSYS behaviour we need.

There are also several aspects of runc's seccomp implementation that aren't really in line with the runtime-spec -- such as #2735 (though the runtime-spec's definition of seccomp rules is woefully vague so you could argue that it's compliant purely because it's so ill-defined). We need to fix those too.

giuseppe · 2021-01-18T07:37:24Z

@giuseppe I assume you meant the default errno for the default action? That's probably not a bad idea to do anyway, but that's not actually enough to solve this problem -- ideally the ENOSYS behaviour should be auto-generated by the container runtime because it'd be too complicated for users of runc to generate it (not to mention the whole thing is a footgun).

yes, I was referring to the errno for the default action, but perhaps it is not necessary with the approach in #2746? Would exposing lastSyscall (and default to setns if it is not specified) be better?

cyphar · 2021-01-18T09:01:52Z

@giuseppe A better solution is something more like seccomp/libseccomp#11 -- you specify the minimum kernel version of the profile -- any syscalls newer than that kernel version get -ENOSYS rather than -EPERM. In principle you don't really need lastSyscall (or even a "minimum kernel version") because that information is already present in the profile (the newest syscall in the profile is the latest one) -- but in order for this to be robust we need kernel version information in libseccomp (syscalls aren't always added sequentially to the syscall table).

But I would prefer we don't touch the runtime-spec until libseccomp has the features we need to make a decision about what the right API should be. The current seccomp API was pretty much wholesale copied from the internal runc implementation, which is quite worrying because the runc implementation has weird footguns like #2735 and #1847. I don't want a(nother) runc hotfix to be codified in the spec.

fweimer-rh · 2021-01-18T10:30:00Z

What is the impact of the Linux 3.0 cut-off point? I'm asking because modern glibc assumes at least kernel 3.2 (main differences are prlimit64 and accept4 on some architectures, I think). If I understand things correctly, it means that getrlimit fails with EPERM with the 3.2 limit, and with ENOSYS with the 3.0 limit because the implementation calls prlimit64 internally, unless there is a rule for prlimit64 in the policy. In theory, getrlimit and pr64limit with a zero PID and a null pointer for the new limit should be treated identically. But maybe this has to be punted to policy authors.

A complete different concern is who maintains the name-to-syscall mapping. @cyphar You said you wanted to get this from libseccomp. However, downstream, at least we don't treat libseccomp as a package that we need to keep updating continuously for each system call that is added to the upstream (!) kernel. Maybe this is something that we need to change downstream, and I will have to start a discussion about that.

cyphar · 2021-01-18T13:02:35Z

@fweimer-rh Well this is intended as a hotfix, if you think 3.2 is a better cutoff point (the fact it's the minimum version required for glibc is probably the best reason to pick it) we can use that instead. Especially if you think a Linux 3.0 cut-off will break glibc installations. But we don't plan to have this workaround stick around for a very long time -- which is why I've hacked up #2746 as an even simpler implementation of this PR so that we can have something merged.

The longer-term goal is to get rid of the hard-coded hotfix and instead either:

Get policy authors to specify what kernel version the profile was written for --effectively indicating what set of syscalls had been audited when creating the profile. Any syscall that existed in the specified kernel version will get the default action (-EPERM) but any newer syscalls (syscalls which did not exist in the syscall table in that kernel version) will get -ENOSYS to be future-compatible. This is the most "correct" solution but requires users to specify the kernel version.
Auto-detect what kernel version the profile was written for by taking the newest syscall referenced in the profile and using its syscall version as the target kernel version. This requires no additional configuration, but will give the wrong results in some cases.

I think the way we may implement this eventually is to have (1) be the preferred method but if no kernel version is specified we fall back to (2). The reason we need to do this on a kernel version level rather than syscall number level is that the syscall table can have holes in it, and thus any simplistic "just check if the syscall number is larger than X" checks can be wrong -- this happened recently with close_range -- it was added after openat2 despite openat2 having a larger syscall number.

However this long-term solution requires features in libseccomp (or some other source) which don't currently exist -- see below.

A complete different concern is who maintains the name-to-syscall mapping. @cyphar You said you wanted to get this from libseccomp.

We already use libseccomp for this purpose (for two reasons -- there is no real kernel API for it, and we only care about syscall names in the context of seccomp). If we change sources of this information that'd be fine so long as the mapping didn't change for old syscalls.

I agree that this is less than ideal because it requires libseccomp to be updated fairly regularly, but right now these are the tools we have at our disposal (and in practice it hasn't been a huge problem -- the libseccomp maintainers are pretty reactive to new syscall additions). However it should also be noted that for libseccomp's usecase we also need to have the following information which probably won't exist in a hypothetical kernel API:

Syscall numbers for different architectures because libseccomp supports generating a filter that works on more than one architecture. libseccomp already has this information.
In what kernel version a syscall first appeared. This is not currently provided by libseccomp, but is something they are working on (RFE: support "maximum kernel version" seccomp/libseccomp#11) and will be necessary for the long-term solution I described above.

cyphar · 2021-01-18T14:29:52Z

I've thought of a nicer interim solution to this problem -- inspired by @richfelker, except we prepend our -ENOSYS handling -- which I'll implement in this PR tomorrow.

But we still need a better long-term solution.

cyphar · 2021-01-19T06:35:09Z

#2750 is a cleaner (though still not ideal) solution to this problem. Closing in favour of that implementation.

cyphar mentioned this pull request Jan 14, 2021

seccomp filter should return ENOSYS for unknown syscalls #2151

Closed

AkihiroSuda added the impact/changelog label Jan 14, 2021

AkihiroSuda added this to the 1.0.0-rc93 milestone Jan 14, 2021

cyphar marked this pull request as ready for review January 15, 2021 07:18

cyphar requested review from AkihiroSuda and a team January 15, 2021 08:11