[RFC] Add support for encrypted images #2297

Draft
wants to merge 279 commits into
base: criu-dev

rst0git (Member) commented on Nov 7, 2023

This pull request extends CRIU with support for encrypted images. A new CLI option, -e|--encrypt, enables this functionality with the dump command.

The implementation is based on the existing integration with GnuTLS, using ChaCha20-Poly1305 for protobuf and raw images, and AES-XTS for memory pages. The symmetric keys used for encryption are randomly generated, encrypted with a public key loaded from an X.509 certificate, and stored in cipher.img. During restore, if cipher.img exists, CRIU will load the corresponding private key from a PEM file and decrypt the symmetric keys.

Usage example:

criu dump -e ...
criu restore ...

The following figure shows the results of a performance evaluation, where CRIUsec includes the changes in this pull request, CRIU is used without encryption as a baseline, and GnuPG, OpenSSL, and age are alternative solutions used with a post-dump action script for comparison.

osctobe and others added 30 commits June 14, 2023 17:38
Add a sanity check for THP_DISABLE. This discovered a broken commit
in Google's kernel tree.

Signed-off-by: Michał Mirosław <[email protected]>
Apparently Skylake uses init-optimization when saving FPU state, and ptrace()
returns XSTATE_BV[0] = 0 meaning FPU was not used by a task (in init state).
Since CRIU restore uses sigreturn to restore registers, FPU state is always
restored. Fill the state with default values on dump to make restore happy.

Signed-off-by: Michał Mirosław <[email protected]>
This commit revises the error handling in the fdspy test. Previously,
a failure case could have been incorrectly reported as successful because
of a specific check `pass != 0`, leading to potential false positives
when `check_pipe_ends()` returned `-1` due to a read/write pipe error.

To improve this, we've adjusted the error handling to return `0` in case
of any error. As such, the final success condition remains unchanged. This
approach will help accurately differentiate between successful and failed
cases, ensuring the output "All OK" is printed for success, and "Something
went WRONG" for any failure.

Fixes: 5364ca3 ("compel/test: Fix warn_unused_result")

Signed-off-by: Haorong Lu <[email protected]>
Use $TMPDIR for tests_root as the host's /tmp might not have enough
features or space.

Signed-off-by: Michał Mirosław <[email protected]>
Extend ability to limit time taken to all CRIU invocations.

Signed-off-by: Michał Mirosław <[email protected]>
We don't want the test framework to change its behaviour depending on
whether we run a single test or multiple tests in a run. When we shard the
test suite, some shards can end up with a single test to run, which would
unexpectedly change the test output format.

Signed-off-by: Michał Mirosław <[email protected]>
Allow splitting the test suite into predictable sets to parallelize runs on
multiple machines or VMs.

Signed-off-by: Michał Mirosław <[email protected]>
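
One way to obtain "predictable sets" is to map each test name to a shard with a hash that is stable across machines. The sketch below only illustrates that idea; it is not the ZDTM implementation, and the function names and the test names in the usage note are hypothetical.

```python
import zlib


def shard_of(test_name: str, num_shards: int) -> int:
    """Map a test name to a shard index deterministically."""
    # crc32 is stable across runs and machines, unlike Python's built-in
    # hash(), which is randomized per process for strings.
    return zlib.crc32(test_name.encode("utf-8")) % num_shards


def select_shard(tests, shard_index: int, num_shards: int):
    """Return the tests that belong to one shard, in sorted order."""
    return [t for t in sorted(tests) if shard_of(t, num_shards) == shard_index]
```

With this scheme every machine can compute its own subset, for example `select_shard(all_tests, 1, 3)`, without coordinating with the others. A shard may still end up with a single test, or with none.
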
Make it clear that the option numbers are indexes not the option
identifiers ("names"). Also show the value change that prompted test
failure.

Signed-off-by: Michał Mirosław <[email protected]>
Make it possible to skip the network lock, so that use cases that break
connections anyway can work without iptables/nftables being present.

Signed-off-by: Michał Mirosław <[email protected]>
The fail() macro provides a new line character at the end of the
message. This patch fixes the following lint check that currently
fails in CI:

$ git --no-pager grep -E '^\s*\<(pr_perror|fail)\>.*\\n"'
test/zdtm/static/thp_disable.c:		fail("prctl(GET_THP_DISABLE) returned unexpected value: %d != 1\n", ret);
test/zdtm/static/thp_disable.c:		fail("Flags changed %lx -> %lx\n", orig_flags, new_flags);
test/zdtm/static/thp_disable.c:		fail("Madvs changed %lx -> %lx\n", orig_madv, new_madv);
test/zdtm/static/thp_disable.c:		fail("post-migration prctl(GET_THP_DISABLE) returned unexpected value: %d != 1\n", ret);
test/zdtm/static/thp_disable.c:		fail("Flags changed %lx -> %lx\n", orig_flags, new_flags);
test/zdtm/static/thp_disable.c:		fail("Madvs changed %lx -> %lx\n", orig_madv, new_madv);

Fixes: checkpoint-restore#2193

Signed-off-by: Radostin Stoyanov <[email protected]>
During dump, CRIU stores the structs representing sockets in a statically sized
hashmap of size 32. We have some (admittedly crazy) tasks that use tens of
thousands of sockets, and seem to spend most of the dump time iterating over
the linked lists of the map.

16K is chosen arbitrarily, so that it reduces the lengths of the chains to a
few elements on average, while not introducing significant memory overhead.

From: Radosław Burny <[email protected]>
Signed-off-by: Michał Mirosław <[email protected]>
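
The effect of the bucket count on chain length can be estimated with simple arithmetic. The sketch below assumes 50,000 sockets (a made-up figure standing in for "tens of thousands") and a hash that spreads entries evenly.

```python
def average_chain_length(num_items: int, num_buckets: int) -> float:
    """Average number of entries per bucket for an evenly spread hash."""
    return num_items / num_buckets


# 50,000 sockets is an assumed workload, not a number from the commit.
for buckets in (32, 16 * 1024):
    print(buckets, average_chain_length(50_000, buckets))
# 32 buckets    -> 1562.5 entries per chain
# 16384 buckets -> about 3.05 entries per chain
```
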
Try IPv6 if IPv4 sockets are not supported.

Signed-off-by: Michał Mirosław <[email protected]>
The test for HAS_MEMFD is empty and not used. Remove it.

Fixes: 5ee1ac1 ("criu: remove FEATURE_TEST_MEMFD")
Change-Id: I43b8f0cfd50ce9bdf93dafb647377318df1deae8
Signed-off-by: Michał Mirosław <[email protected]>
`make` without `-s` option will normally show the commands executed. In
the case of detecting build environment features current makefile will
cause detected features to be seen as 'echo #define' commands, but not
detected ones will be silent. Change it so that all tried features can
be seen (outside of make's silent mode) regardless of detection result.

Signed-off-by: Michał Mirosław <[email protected]>
$LDFLAGS can contain `-Ldir`s that are required by '-lib's in $LIBS.
Reverse the order so that `-L` options take effect.

Signed-off-by: Michał Mirosław <[email protected]>
Make $(AR) used also for libzdtmtst build.

Signed-off-by: Michał Mirosław <[email protected]>
When trying to build CRIU with libbsd enabled, the compilation fails due
to a duplicate definition of the __aligned macro. Other such definitions
are already wrapped with #ifndef. Make the __aligned definition consistent
with them, which also makes it easier to use libbsd features in the future.

Signed-off-by: Michał Mirosław <[email protected]>
nla_get_s32() was added to libnl 3.2.7 in 2015. Remove CRIU's definition
as it breaks build when statically linking the binary.

From: Uros Prestor <[email protected]>
Signed-off-by: Michał Mirosław <[email protected]>
Container runtimes commonly use CRIU with RPC. However, this prevents
the use of action-scripts set in a CRIU configuration file due to the
explicit scripts mode introduced with the following commit:

  ac78f13
  actions: Introduce explicit scripts mode

This patch enables container checkpoint/restore with action-scripts
specified via configuration file.

Signed-off-by: Radostin Stoyanov <[email protected]>
A new 'query-ext-files' action for `criu dump` is sent after
freezing the process tree. This allows deferring the gathering
of the external file list until the process tree is in a stable
state and avoids races with processes creating and deleting
files.

Change-Id: Iae32149dc3992dea086f513ada52cf6863beaa1f
Signed-off-by: Michał Mirosław <[email protected]>
Google's RPC client process is in a different pidns and has more privileges --
CRIU can't open its /proc/<pid>/fd/<fd>.  For images_dir_fd to be useful here
it would need to refer to a passed or CRIU's fd.

From: Michał Cłapiński <[email protected]>
Change-Id: Icbfb5af6844b21939a15f6fbb5b02264c12341b1
Signed-off-by: Michał Mirosław <[email protected]>
If the error is ignored it is not important enough - make it a warning
instead.

From: Mian Luo <[email protected]>
Change-Id: If2641c3d4e0a4d57fdf04e4570c49be55f526535
Signed-off-by: Michał Mirosław <[email protected]>
kerndat_nsid() is not used outside kerndat.c. Make it static.

Change-Id: I52e518ecb7c627cc1866e373411b2be3f71a2c9d
Signed-off-by: Michał Mirosław <[email protected]>
If neither netns nor connections are dumped, nsid support is not used.
Don't fail the run up front: if the support turns out to be needed, the
dumping process will fail later.

Change-Id: I39a086756f6d520c73bb6b21eaf6d9fb49a18879
Signed-off-by: Michał Mirosław <[email protected]>
bsach64 and others added 26 commits June 7, 2024 22:26
Move PYTHON_EXTERNALLY_MANAGED and PIP_BREAK_SYSTEM_PACKAGES
into Makefile.install to avoid code duplication. In addition, add
PIPFLAGS variable to enable specifying pip options during installation.
This is particularly useful for packaging, where it is common for `pip install`
to run in an environment with pre-installed dependencies and without internet
access. In such environment, we need to specify the following options:

    --no-build-isolation --no-index --no-deps

Signed-off-by: Radostin Stoyanov <[email protected]>
The current link opens a page with the following text:

    The MediaWiki FAQ can be found at:
    https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ

Signed-off-by: Radostin Stoyanov <[email protected]>
…and run the plugin finalizer later in the dump sequence

Restore rseq_cs state before calling RESUME_DEVICES_LATE as the CUDA plugin will
temporarily unfreeze a thread during the plugin hook to assist with device
restore

Run the plugin finalizer later in the dump sequence since the finalizer is used
by the CUDA plugin to handle some process cleanup

Signed-off-by: Jesus Ramos <[email protected]>
…DEVICES to be used during pstree collection

PAUSE_DEVICES is called before a process is frozen and is used by the CUDA
plugin to place the process in a state that's ready to be checkpointed and
quiesce any pending work

CHECKPOINT_DEVICES is called after all processes in the tree have been frozen
and PAUSE'd and performs the actual checkpointing operation for CUDA
applications

Signed-off-by: Jesus Ramos <[email protected]>
Add support for the NVIDIA cuda-checkpoint utility. This requires an
r555 or higher driver along with the cuda-checkpoint binary.

Signed-off-by: Jesus Ramos <[email protected]>
Commit fc683cb ("compel: shstk: save CET state when CPU supports it")
started using PTRACE_ARCH_PRCTL to query shadow stack status. While
PTRACE_ARCH_PRCTL has existed in the kernel for a long time, it was only
added to glibc in version 2.27. Amazon Linux 2 (AL2) has glibc 2.26,
which does not have this definition. As a result, build on AL2 fails
with the below error:

    compel/arch/x86/src/lib/infect.c: In function ‘get_task_xsave’:
    compel/arch/x86/src/lib/infect.c:276:14: error: ‘PTRACE_ARCH_PRCTL’ undeclared (first use in this function)
    276 |   if (ptrace(PTRACE_ARCH_PRCTL, pid, (unsigned long)&features, ARCH_SHSTK_STATUS)) {
        |              ^~~~~~~~~~~~~~~~~

While the definition is present on the system via the kernel headers (in
asm/ptrace-abi.h) which can be reached by including linux/ptrace.h, the
comment in compel/include/uapi/ptrace.h says:

    We'd want to include both sys/ptrace.h and linux/ptrace.h, hoping
    that most definitions come from either one or another. Alas, on
    Alpine/musl both files declare struct ptrace_peeksiginfo_args, so
    there is no way they can be used together. Let's rely on libc one.

Since including linux/ptrace.h is not an option, define
PTRACE_ARCH_PRCTL if it doesn't already exist. An interesting point to
note is that in sys/ptrace.h, PTRACE_ARCH_PRCTL is an enum value so the
preprocessor doesn't know about it. PT_ARCH_PRCTL is the preprocessor
symbol that matches the value of PTRACE_ARCH_PRCTL. So look for
PT_ARCH_PRCTL to decide if PTRACE_ARCH_PRCTL is available or not.

Another interesting point to note is that AL2 ships with GCC 7 by
default, which does not support the -mshstk option, causing other build
failures. Luckily, it also ships GCC 10 which does have the option.
Using GCC 10 lets the build succeed.

Fixes: fc683cb ("compel: shstk: save CET state when CPU supports it")
Signed-off-by: Pratyush Yadav <[email protected]>
Duplicate the string in irmap_scan_path_add; otherwise it will be freed
before the next configuration input is parsed.

[ avagin: handle errors of xstrdup ]

Signed-off-by: Liu Hua <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Sometimes, due to sigblockmask inheritance, cgroupd can inherit a blocked
SIGTERM. That leads to cgroupd ignoring SIGTERM from stop_cgroupd(), and
CRIU gets stuck waiting for a cgroupd that never stops.

I see this happening in lxc-checkpoint, also saw this in OpenVZ jenkins
on cgroup_inotify00 test.

Signed-off-by: Pavel Tikhomirov <[email protected]>
Before this fix, it could return MAP_FAILED which is ((void *) -1).

Signed-off-by: Andrei Vagin <[email protected]>
It was added in v5.3-rc1~211^2~4^2~10.

Fixes checkpoint-restore#2390

Signed-off-by: Andrei Vagin <[email protected]>
The CI tests with CentOS 7 have been disabled and removed [1,2].
This patch removes the obsolete Makefile targets for these tests.

[1] checkpoint-restore@24bc083
[2] checkpoint-restore@f8466ca

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch extends CRIU dump with support for encryption of images
using ChaCha20-Poly1305 authenticated-encryption in combination with
X.509 certificates.

The '--encrypt' option can be used with the dump/pre-dump commands to
enable this functionality. When this option has been specified during
dump, the GnuTLS library will be used to load a public key from X.509
certificate, and to generate a 256-bit random `token`. The token's
value is then encrypted with the public key and the corresponding
ciphertext is saved in `cipher.img`. During restore, if cipher.img
exists in the images directory, the GnuTLS library will be used to
load a private key from a corresponding PEM file to decrypt the token
value.

The token value is used with ChaCha20-Poly1305 to encrypt/decrypt
all other CRIU images. The 256-bit token is used in combination with
a 96-bit `nonce` and a 128-bit `tag` to protect data confidentiality
and provide message authentication for each data entry.

Example:
	criu dump --encrypt ...
	criu restore ...

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch extends ZDTM to run `criu dump` with the `--encrypt`
option to test the encryption functionality of CRIU images.

Signed-off-by: Radostin Stoyanov <[email protected]>
'opts' is defined in cr_options.h. This header will be included in a
subsequent patch. We rename the local variable 'opts' to 'bpfmap_opts'
to avoid variable shadowing.

Signed-off-by: Radostin Stoyanov <[email protected]>
We calculate the total memory size needed for both keys and values and
allocate a single contiguous memory region using a single mmap call.
In a subsequent patch, this change would enable encrypting the combined
memory region using a single pair of ChaCha20-Poly1305 tag and nonce.

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch extends dump_one_bpfmap_data() with support for encryption.

Signed-off-by: Radostin Stoyanov <[email protected]>
During checkpoint, the contents of ghost images and pipe data are
splice()-ed between file descriptors. To enable encryption for this data
we introduce `tls_encrypt_file_data()` and `tls_decrypt_file_data()`.
These functions read data from input file descriptor, perform
encryption/decryption of the data, and write it to the corresponding
output file descriptor.

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch extends CRIT with the ability to decode encrypted images.
When `cipher.img` is present, crit will load the corresponding private
key (from /etc/pki/criu/private/key.pem), decrypt the cipher token and
use it to decrypt the protobuf entries in the image that is being
decoded.

Signed-off-by: Radostin Stoyanov <[email protected]>
cr_system() and cr_system_userns() are used to run external executables
such as tar, ip, and iptables. These external tools are used to create
image files in 3rd party format (i.e., raw images). In order to encrypt
the output of these tools, and to decrypt their input, we replace the
corresponding input/output file descriptor with a pipe, and perform
encryption/decryption of the data.

Signed-off-by: Radostin Stoyanov <[email protected]>
We use the AES-XTS block cipher to encrypt memory pages as it is
designed to encrypt blocks of data with fixed-size (e.g. memory pages),
allows the use of hardware acceleration available in modern CPUs, and
uses a single initialization vector (IV), instead of per-page nonce,
to ensure that encrypting the same plaintext with the same key results
in different ciphertexts.

In particular, XTS uses two 256-bit AES keys. One key is used to
perform block encryption, and the other is used to encrypt a so-called
"tweak value". The encrypted tweak value is further modified (with a
Galois polynomial function) and XOR-ed with both the plaintext and
ciphertext of each block. This method ensures that encrypting multiple
blocks with identical data will produce different ciphertext.
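
In standard XTS (IEEE 1619), the "Galois polynomial function" is a multiplication of the 128-bit encrypted tweak by the element x in GF(2^128), applied once per 16-byte block. The sketch below shows only that tweak-update step in pure Python, assuming the standard's little-endian byte order. AES itself is omitted, and the function name is hypothetical, not taken from CRIU or GnuTLS.

```python
def xts_next_tweak(tweak: bytes) -> bytes:
    """Multiply a 16-byte XTS tweak by x in GF(2^128) (little-endian)."""
    if len(tweak) != 16:
        raise ValueError("XTS tweak must be 16 bytes")
    value = int.from_bytes(tweak, "little") << 1
    if value >> 128:
        # Reduce modulo x^128 + x^7 + x^2 + x + 1: drop the carried-out bit
        # and fold it back in as 0x87.
        value = (value & ((1 << 128) - 1)) ^ 0x87
    return value.to_bytes(16, "little")
```

Because each block in a data unit is combined with a different tweak, two blocks with identical plaintext are XOR-ed with different masks before and after the AES step, which is why their ciphertexts differ.
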

Since CRIU restores memory pages in the restorer context, this PIE code
cannot be linked with libraries such as GnuTLS to perform decryption.
Instead, we introduce a helper process to decrypt memory pages data.
The restorer context communicates with this helper process using PIPEs.
It sends the function arguments to be used by preadv() and receives back
its return value. The decrypted data is transferred to the target
address space with process_vm_writev.

Suggested-by: Daiki Ueno <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
The AES-XTS cipher does not provide integrity verification.
In this patch we add a verification mechanism based on the
HMAC-SHA-256 algorithm.

In order to support iterative checkpointing and memory deduplication
with encrypted memory, and to avoid storing an HMAC for each memory page,
we compute the XOR of the HMAC values of all memory pages and store this
value in cipher.img.

The XOR computation also allows us to address the problem that memory
pages are read during restore in a different order than they are written
during checkpoint. In addition, to ensure that memory pages are restored
in the correct order, we include the PID and VMA address associated with
each page in the HMAC computation.

The following example illustrates the HMAC value computation:

	H_n = HMAC(PID + VMA + MEMORY + KEY)
	hmac_value = H_1 ^ H_2 ^ ... ^ H_n

- PID: PID associated with the memory page
- VMA: virtual memory address associated with the memory page
- MEMORY: contents of the memory page
- KEY: secret key
- H_n: HMAC value of the n-th memory page
- hmac_value: value stored in cipher.img during checkpoint, and used
              for integrity verification during restore

Signed-off-by: Radostin Stoyanov <[email protected]>
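
The XOR-of-HMACs scheme described above can be sketched with Python's standard library. This is an illustration under assumptions, not the code from this patch: the secret is used as the HMAC key, the PID and address are packed in a fixed-width little-endian encoding, and the function names are hypothetical.

```python
import hashlib
import hmac
import struct


def page_hmac(key: bytes, pid: int, vma: int, page: bytes) -> bytes:
    """HMAC-SHA-256 over the PID, the page address, and the page contents."""
    # Fixed-width encoding of PID and address is an assumption of this sketch.
    message = struct.pack("<IQ", pid, vma) + page
    return hmac.new(key, message, hashlib.sha256).digest()


def combined_hmac(key: bytes, pages) -> bytes:
    """XOR the per-page HMAC values of (pid, vma, page) tuples together."""
    acc = bytes(hashlib.sha256().digest_size)
    for pid, vma, page in pages:
        digest = page_hmac(key, pid, vma, page)
        acc = bytes(a ^ b for a, b in zip(acc, digest))
    return acc
```

Since XOR is commutative, `combined_hmac` gives the same value regardless of the order in which pages are processed, while swapping the contents of two pages changes the result because the address is part of each per-page HMAC.
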
Measure the time for data encryption and decryption with
stream and block ciphers.

Signed-off-by: Radostin Stoyanov <[email protected]>
This script, similar to ssh-keygen and certtool, makes it easier
to generate and install certificate and key to enable encryption
support with CRIU.

Signed-off-by: Radostin Stoyanov <[email protected]>
if args.type == 'rsa':
    generate_rsa_key(args.bits)
elif args.type == 'ec':
    generate_ec_key()

Check failure

Code scanning / CodeQL

Wrong number of arguments in a call Error

Call to function generate_ec_key with too few arguments; should be no fewer than 1.
* restorer can wait() for it when the restore stage is done.
*/
ta->helpers = (pid_t *)rst_mem_align_cpos(RM_PRIVATE);
child = rst_mem_alloc(sizeof(*child), RM_PRIVATE);

Check failure

Code scanning / CodeQL

Inconsistent nullness check Error

The result of this call to rst_mem_alloc is not checked for null, but 91% of calls to rst_mem_alloc check for null.
Labels
no-auto-close Don't auto-close as a stale issue