[RFC] Dump/restore syscall user dispatch via ptrace #2423

svetly-todorov · 2024-06-24T14:21:20Z

Following up on issue #2037.

Use the PTRACE_GET/SET_SYSCALL_USER_DISPATCH_CONFIG ptrace requests upstreamed in kernel 6.4 to preserve syscall user dispatch (SUD) settings for a target process.

The majority of the implementation mirrors that of the seccomp dump/restore. The most significant difference is that, unlike seccomp, there's no flag for disabling SUD while a target is being ptraced. Therefore restoring it in the parasite blob is dangerous, because remote syscalls can generate a SIGSYS. Instead, restoration happens right before the final PTRACE_DETACH, and not in the parasite code.

Newer Intel CPUs (Sapphire Rapids) have a much larger xsave area than before. Looking at older CPUs I see 2440 bytes.

    # cpuid -1 -l 0xd -s 0
    ...
    bytes required by XSAVE/XRSTOR area = 0x00000988 (2440)

On newer CPUs (Sapphire Rapids) it grows to 11008 bytes.

    # cpuid -1 -l 0xd -s 0
    ...
    bytes required by XSAVE/XRSTOR area = 0x00002b00 (11008)

This increases the xsave area from one page to four pages. Without this patch the fpu03 test fails, with this patch it works again.

Signed-off-by: Adrian Reber <[email protected]>

Using the fact that we know criu_pid and criu is a parent of restored process we can create pidfile with pid on caller pidns level. We need to move mount namespace creation to child so that criu-ns can see caller pidns proc. Signed-off-by: Pavel Tikhomirov <[email protected]>

By default, the file name 'amdgpu_plugin.txt' is used also as the name for the corresponding man page (`man amdgpu_plugin`). However, when this man page is installed system-wide it would be more appropriate to have a prefix 'criu-' (e.g., `man criu-amdgpu-plugin`). Signed-off-by: Radostin Stoyanov <[email protected]>

These changes remove and update the changes introduced in 7177938 in favor of the Python version in CI. os.waitstatus_to_exitcode() function appeared in Python 3.9 Related to: checkpoint-restore#1909 Signed-off-by: Dhanuka Warusadura <[email protected]>

This is a patch proposed by Thomas here: https://lore.kernel.org/all/87ilczc7d9.ffs@tglx/ It removes (created id > desired id) "sanity" check and adds proper checking that ids start at zero and increment by one each time when we create/delete a posix timer. First purpose of it is to fix infinite looping in create_posix_timers on old pre 3.11 kernels. Second purpose is to allow kernel interface of creating posix timers with desired id change from iterating with predictable next id to just setting next id directly. And at the same time removing predictable next id so that criu with this patch would not get to infinite loop in create_posix_timers if this happens. Thanks a lot to Thomas! Signed-off-by: Pavel Tikhomirov <[email protected]>

…tions This does cgroup namespace creation separately from joining task cgroups. This makes the code more logical, because creating cgroup namespace also involves joining cgroups but these cgroups can be different to task's cgroups as they are cgroup namespace roots (cgns_prefix), and mixing all of them together may lead to misunderstanding. Another positive thing is that we consolidate !item->parent checks in one place in restore_task_with_children. Signed-off-by: Valeriy Vdovin <[email protected]> Signed-off-by: Pavel Tikhomirov <[email protected]>

4.15-based kernels don't allow F_*SEAL for memfds created with MFD_HUGETLB. Since seals are not possible in this case, fake F_GETSEALS result as if it was queried for a non-sealing-enabled memfd. Signed-off-by: Michał Mirosław <[email protected]>

The original commit added saving THP_DISABLED flag value, but missed restoring it. There is restoring code, but used only when --lazy_pages mode is enabled. Restore the prctl flag always. While at it, rename the `has_thp_enabled` -> `!thp_disabled` for consistency. Fixes: bbbd597 (2017-06-28 "mem: add dump state of THP_DISABLED prctl") Signed-off-by: Michał Mirosław <[email protected]>

During restore, CRIU prints "Enqueue page-read" messages for each page-read request [1]. However, this message does not provide useful information, and it increases both the performance overhead during restore and the size of the log file.

    $ ./zdtm.py run -t zdtm/static/maps06 -f h -k always
    $ grep 'Enqueue page-read' dump/zdtm/static/maps06/56/1/restore.log | wc -l
    20493

This commit replaces these log messages with a single message that shows the number of enqueued page-read requests.

    $ grep 'enqueued' dump/zdtm/static/maps06/56/1/restore.log
    (00.061449) 56: nr_enqueued: 20493

[1] checkpoint-restore@91388fc

Signed-off-by: Radostin Stoyanov <[email protected]>

1) In dump_tcp_conn_state, if the return value from libsoccr_save is >= 0, we check that the sizeof(struct libsoccr_sk_data) returned from libsoccr_save equals the sizeof(struct libsoccr_sk_data) seen in dump_tcp_conn_state (probably to check that we use the right library version). If the sizes differ we go to err_r, which just returns ret. That can theoretically be 0 (if the size in the library is zero), which would lead dump_one_tcp to treat this as success though it is an obvious error.

2) If dump_opt or open_image fails, we don't explicitly set ret and rely on the sizeof(struct libsoccr_sk_data) previously stored in ret not being 0. I don't really like it; it makes the code too hard to read.

3) We have a lot of err_* labels which do exactly the same thing. There is no point in having all of them, and it is better to name a label after what it really does.

So let's refactor error handling to avoid these inconsistencies.

Signed-off-by: Pavel Tikhomirov <[email protected]>
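The hazard in point 1 can be shown with a minimal sketch. Everything named here is hypothetical: `struct sk_data` stands in for struct libsoccr_sk_data and `fake_save()` for libsoccr_save, so this is not the actual CRIU code, only the pattern of returning an explicit error instead of a leftover size value.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct libsoccr_sk_data. */
struct sk_data {
	unsigned int state;
	unsigned int seq;
};

/* Hypothetical stand-in for libsoccr_save(): fills *data and returns the
 * struct size the "library" was built with, or a negative value on error. */
static int fake_save(struct sk_data *data, int lib_struct_size)
{
	data->state = 1;
	data->seq = 42;
	return lib_struct_size;
}

/* Returns 0 on success and -1 on any failure. */
int dump_conn_state(int lib_struct_size)
{
	struct sk_data data;
	int ret;

	ret = fake_save(&data, lib_struct_size);
	if (ret < 0) {
		fprintf(stderr, "save failed: %d\n", ret);
		return -1;
	}

	if ((size_t)ret != sizeof(data)) {
		/* Never hand the raw size back: a size of 0 would read as success. */
		fprintf(stderr, "struct size mismatch: %d != %zu\n", ret, sizeof(data));
		return -1;
	}

	return 0;
}
```

With explicit return values, a reported size of 0 is rejected like any other mismatch, and no caller has to reason about what an earlier assignment left in `ret`.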

images/core.proto

rst0git · 2024-06-27T19:23:29Z

@svetly-todorov Thank you for the pull request! Would you be able to add tests for this feature?

rst0git · 2024-06-27T19:24:10Z

compel/src/lib/ptrace.c

+
+	/* Setup SUD-disable struct */
+	memset(&disable, 0, sizeof(disable));
+	disable.mode = PR_SYS_DISPATCH_OFF;


#10 2.352 compel/src/lib/ptrace.c:54:17: error: ‘PR_SYS_DISPATCH_OFF’ undeclared (first use in this function)
#10 2.352    54 |  disable.mode = PR_SYS_DISPATCH_OFF;
#10 2.352       |                 ^~~~~~~~~~~~~~~~~~~

This is probably a case of kernel headers not always being present. I'll go back and explicitly add the #ifdefs to sud.h.

While I'm at it, I'll probably also add some wrappers around the entire SUD architecture, since this only works with newer versions of the kernel, etc.
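A minimal sketch of what such fallback definitions could look like. The numeric values are taken from the upstream kernel UAPI headers as I understand them and should be checked against the target kernel tree. The struct name `sud_config_compat` is hypothetical and only mirrors the layout of the kernel's struct ptrace_sud_config; this is not the PR's actual sud.h.

```c
#include <stdint.h>
#include <sys/prctl.h>
#include <sys/ptrace.h>

/* Fallbacks for build hosts whose headers predate syscall user dispatch. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* ptrace requests added in Linux 6.4. */
#ifndef PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG
#define PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG 0x4210
#endif
#ifndef PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG
#define PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG 0x4211
#endif

/* Private copy of the config layout (hypothetical name), so the code builds
 * even when the system headers do not provide struct ptrace_sud_config. */
struct sud_config_compat {
	uint64_t mode;
	uint64_t selector;
	uint64_t offset;
	uint64_t len;
};

/* Catch accidental layout drift at compile time. */
_Static_assert(sizeof(struct sud_config_compat) == 32,
	       "sud_config_compat must be four 64-bit fields");
```

Guarding each constant separately keeps the header usable on hosts that ship only some of the definitions. Whether the running kernel supports the requests still has to be detected at run time.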

criu/include/sud.h

rst0git · 2024-06-27T19:25:40Z

criu/sud.c

+    struct sys_dispatch_entry *entry;
+    sud_config_t config;
+
+	entry = sud_lookup(tid_real, true, false);


The indentation is also mixed here.

svetly-todorov · 2024-06-28T00:54:47Z

@rst0git

I'm working on a rebase that adds some bugfixes and addresses the whitespace issues.

We (@gmprice and I) did write a pair of tests for this. Where should they be included?

rst0git · 2024-06-28T10:12:36Z

> We (@gmprice and I) did write a pair of tests for this. Where should they be included?

Ideally, the tests should be in a separate commit and it would be great if it is possible to add them in ZDTM. For example, the following commits have been used to add ZDTM tests: 96a30f0, d490218, 516b369.

avagin · 2024-07-02T02:44:08Z

criu/sud.c

+#undef LOG_PREFIX
+#define LOG_PREFIX "sys-dispatch: "
+
+static struct rb_root sud_tid_rb_root = RB_ROOT;


pls add more info in the commit message to describe what is going on here.

avagin · 2024-07-02T18:04:44Z

> The majority of the implementation mirrors that of the seccomp dump/restore. The most significant difference is that, unlike seccomp, there's no flag for disabling SUD while a target is being ptraced. Therefore restoring it in the parasite blob is dangerous, because remote syscalls can generate a SIGSYS. Instead, restoration happens right before the final PTRACE_DETACH, and not in the parasite code.

CRIU injects a parasite blob while a process is being dumped. How do you handle where to place a parasite blob?

svetly-todorov · 2024-07-02T19:48:53Z

> > The majority of the implementation mirrors that of the seccomp dump/restore. The most significant difference is that, unlike seccomp, there's no flag for disabling SUD while a target is being ptraced. Therefore restoring it in the parasite blob is dangerous, because remote syscalls can generate a SIGSYS. Instead, restoration happens right before the final PTRACE_DETACH, and not in the parasite code.
>
> CRIU injects a parasite blob while a process is being dumped. How do you handle where to place a parasite blob?

@avagin

We don't worry about the parasite blob during dump because we collect and disable SUD during the compel_wait_task that precedes the infection. It's been a while since I wrote this code, but if I'm remembering correctly, collection of SUD entries happens in this callback.

This callback triggers here in compel_wait_task.

The callback is necessary for SUD, but not necessary for seccomp, because ptrace can disable seccomp without overwriting its settings. So for seccomp, you can SUSPEND_SECCOMP during compel_wait_task and then read its settings later. But for SUD, you can only read its settings while it is active; ptrace only lets you enable/disable it without a graceful suspend.
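The ordering described above (read the settings while SUD is still active, then switch it off before infection) can be sketched as follows. The `sud_ops` indirection, the function names, and the in-memory fake tracee are all hypothetical and exist only so the ordering can be exercised without a real tracee; a real backend would issue the two ptrace requests instead. This is not the PR's code.

```c
#include <stdint.h>
#include <string.h>

#define SUD_MODE_OFF 0 /* same value as PR_SYS_DISPATCH_OFF */
#define SUD_MODE_ON 1  /* same value as PR_SYS_DISPATCH_ON */

/* Four 64-bit fields, laid out like the kernel's struct ptrace_sud_config. */
struct sud_config {
	uint64_t mode;
	uint64_t selector;
	uint64_t offset;
	uint64_t len;
};

/* Hypothetical indirection over the GET/SET ptrace requests. */
struct sud_ops {
	int (*get)(int pid, struct sud_config *cfg);
	int (*set)(int pid, const struct sud_config *cfg);
};

/*
 * Save the tracee's SUD settings into *saved, then disable SUD.
 * Returns 0 on success, -1 on error.
 */
int sud_collect_and_disable(const struct sud_ops *ops, int pid,
			    struct sud_config *saved)
{
	struct sud_config off;

	/* Step 1: read first, because disabling overwrites the settings. */
	if (ops->get(pid, saved) < 0)
		return -1;
	if (saved->mode == SUD_MODE_OFF)
		return 0; /* nothing to suspend */

	/* Step 2: all-zero config with mode OFF (assumption: the kernel
	 * expects the other fields to be zero when turning SUD off). */
	memset(&off, 0, sizeof(off));
	off.mode = SUD_MODE_OFF;
	return ops->set(pid, &off) < 0 ? -1 : 0;
}

/* In-memory stand-in for a tracee, for illustration only. */
static struct sud_config fake_tracee;

static int fake_get(int pid, struct sud_config *cfg)
{
	(void)pid;
	*cfg = fake_tracee;
	return 0;
}

static int fake_set(int pid, const struct sud_config *cfg)
{
	(void)pid;
	fake_tracee = *cfg;
	return 0;
}
```

Reversing the two steps would lose the original selector, offset, and length, which is why seccomp-style "suspend now, read later" does not carry over to SUD.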

Use the PTRACE_GET/SET_SYSCALL_USER_DISPATCH_CONFIG flags to interface with a tracee's SUD settings. This is scaffolding for infect and seize, which will use ptrace to collect and suspend the SUD status of a seized task. Signed-off-by: Svetly Todorov <[email protected]> Signed-off-by: Gregory Price <[email protected]>

Use the new ptrace wrappers to get the SUD mode of the seized process. If activated, use ptrace_suspend_sud to disable it before proceeding with infection. Signed-off-by: Svetly Todorov <[email protected]> Signed-off-by: Gregory Price <[email protected]>

Introduce function signatures for SUD image generation, dump, and restore. These mirror the functions in seccomp.h/.c. Signed-off-by: Svetly Todorov <[email protected]> Signed-off-by: Gregory Price <[email protected]>

Add the necessary protobuf descriptions, image descriptions, and magic number for saving SUD data. Signed-off-by: Svetly Todorov <[email protected]> Signed-off-by: Gregory Price <[email protected]>

Add hooks for saving SUD status upon infection. Signed-off-by: Svetly Todorov <[email protected]> Signed-off-by: Gregory Price <[email protected]>

Add hooks for dumping SUD image to disk during cr-dump.c. Signed-off-by: Svetly Todorov <[email protected]> Signed-off-by: Gregory Price <[email protected]>

Add hooks for restoring SUD settings throughout cr-restore.c. The implementation is a little unorthodox. Unlike seccomp, SUD isn't suspended while a task is under ptrace. So the parasite code cannot restore the SUD settings because SIGSYS may be triggered when the restorer blob is unmapped. Instead, we opt to reopen the per-core data right before PTRACE_DETACH, and restore the SUD settings then. This way we don't risk triggering SIGSYS via a remote syscall. Signed-off-by: Svetly Todorov <[email protected]> Signed-off-by: Gregory Price <[email protected]>
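A rough sketch of the "restore right before PTRACE_DETACH" ordering, assuming the request takes the struct size in the addr argument and a pointer to the config in data, as documented for Linux 6.4 and later. The function and struct names are hypothetical, and this is not the code from the series.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

#ifndef PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG
#define PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG 0x4210
#endif

/* Four 64-bit fields, laid out like the kernel's struct ptrace_sud_config. */
struct sud_config {
	uint64_t mode;
	uint64_t selector;
	uint64_t offset;
	uint64_t len;
};

/*
 * Re-enable SUD as the very last step, then release the task.
 * Returns 0 on success, -1 if either ptrace call fails.
 */
int sud_restore_and_detach(pid_t pid, const struct sud_config *cfg)
{
	/* No remote syscall may run after this point: with SUD active, a
	 * syscall issued on the task's behalf could raise SIGSYS. */
	if (cfg->mode != 0 &&
	    ptrace(PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG, pid,
		   (void *)sizeof(*cfg), (void *)cfg) < 0)
		return -1;

	if (ptrace(PTRACE_DETACH, pid, NULL, NULL) < 0)
		return -1;

	return 0;
}
```

On failure the caller still owns the tracee and can decide whether to kill it or abort the restore. Calling the function on a pid that is not being traced simply returns -1.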

Signed-off-by: Svetly Todorov <[email protected]>

svetly-todorov · 2024-07-03T18:39:29Z

I have pushed up a kind of v2 for this RFC containing some fixes and the beginnings of a test case. CRIU itself should build with the new changes. But the test does not work quite yet. I have to drop development on this for now, because of other work obligations. I may circle back to this in the future. Forgive me!

github-actions · 2024-08-22T00:47:10Z

A friendly reminder that this PR had no activity for 30 days.

github-actions · 2024-09-22T00:08:55Z

A friendly reminder that this PR had no activity for 30 days.

prakritigoyal19 and others added 30 commits June 11, 2023 23:30

Add flog to CRIU

823db1b

Changes made through this commit: - Include a copy of flog as a separate tree. - Modify the makefile to add and compile flog code. Signed-off-by: prakritigoyal19 <[email protected]>

flog: Missing varargs init or cleanup (VARARGS)

1444b72

CID 302713 (checkpoint-restore#1 of 1): Missing varargs init or cleanup (VARARGS) va_end was not called for argptr. Signed-off-by: Adrian Reber <[email protected]>

Run 'make indent' on 'flog/'

cc193dc

Separate commit for easier criu-dev <-> master transfer. Acked-by: Mike Rapoport <[email protected]> Signed-off-by: Adrian Reber <[email protected]>

flog: typo: mmaped -> mmapped

1fa21a0

It is mmapped, not mmaped. Same applies for mmap I guess. Found by codespell, except it wants to change it to mapped, which will make it less specific. Signed-off-by: Kir Kolyshkin <[email protected]>

flog: fix some codespell warnings

7eaf7a3

Brought to you by codespell -w (using codespell v2.1.0). [v2: use "make indent" on the result] Signed-off-by: Kir Kolyshkin <[email protected]>

limit the field width of 'scanf'

aed3f34

Fixes: checkpoint-restore#2121 Signed-off-by: Pengda Yang <[email protected]>

sk-inet: Add IP TOS socket option

d6860d0

The TOS (type of service) field in the IP header allows you to specify the priority of the socket data. Signed-off-by: Suraj Shirvankar <[email protected]>

zdtm: Add tests for ip tos restore

0dd4668

Signed-off-by: Suraj Shirvankar <[email protected]>

dump: increase fcntl call failure judgment

13eb876

The pipe_size type is unsigned int, so when the fcntl call fails and returns -1, the value wraps around to a large positive number instead of being detected as an error. Signed-off-by: zhoujie <[email protected]>
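A small sketch of the kind of check this describes, assuming the pipe size is queried with fcntl(F_GETPIPE_SZ). The helper name `get_pipe_size` is hypothetical and this is not the code from the commit. The point is to keep the fcntl result in a signed int until it has been checked, because storing -1 directly in an unsigned int yields UINT_MAX.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ 1032 /* Linux value; normally exposed via _GNU_SOURCE */
#endif

/* Store the pipe capacity in *size; return 0 on success, -1 on failure. */
int get_pipe_size(int fd, unsigned int *size)
{
	/* Keep the result signed so that -1 stays recognizable as an error. */
	int ret = fcntl(fd, F_GETPIPE_SZ);

	if (ret < 0) {
		perror("fcntl(F_GETPIPE_SZ)");
		return -1;
	}

	*size = (unsigned int)ret;
	return 0;
}
```

A caller passes one end of a pipe and only uses `*size` when the function returns 0; an invalid descriptor results in -1 rather than a bogus multi-gigabyte size.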

ci: fix new codespell errors

1d4c5ed

Signed-off-by: Adrian Reber <[email protected]>

scripts: make newer versions of shellcheck happy

d98c533

Signed-off-by: Adrian Reber <[email protected]>

lib/c: add empty_ns interfaces to libcriu

f57bda4

crun wants to set empty_ns and this interface is missing from the library. This adds it to libcriu. Signed-off-by: Adrian Reber <[email protected]>

criu-ns: Add --criu-binary argument to run_criu()

f308272

--criu-binary argument provides a way to supply the CRIU binary location to run_criu(). Related to: checkpoint-restore#1909 Signed-off-by: Dhanuka Warusadura <[email protected]>

criu-ns: Add tests for criu-ns script

8094df8

These changes add test implementations for criu-ns script. Fixes: checkpoint-restore#1909 Signed-off-by: Dhanuka Warusadura <[email protected]>

criu-ns: Install Python pathlib module in CentOS 7

f0e9358

These changes fix the `ImportError: No module named pathlib` error when executing criu-ns tests located at criu/test/others/criu-ns Signed-off-by: Dhanuka Warusadura <[email protected]>

criu-ns: Update shebang line to python

9130fef

The CentOS 7 CI environment uses Python 2. To execute the criu-ns script on CentOS 7, the current shebang line must be changed to python. This reverses the changes made in a15a63f. Signed-off-by: Dhanuka Warusadura <[email protected]>

action-scripts: Add pre-stream hook

104a828

This hook allows starting the image streamer process from an action script. Signed-off-by: Radostin Stoyanov <[email protected]>

Fix mount(cgroup2) for older kernels.

11288c9

Linux 4.15 doesn't like empty string for cgroup2 mount options. Pass NULL then to satisfy the kernel check. Log the options for easier debugging. Signed-off-by: Michał Mirosław <[email protected]>

Log if prctl(SET_THP_DISABLE) doesn't work as expected.

7ca6856

If prctl(SET_THP_DISABLE) is not used due to bad semantics, log it for easier debugging. Signed-off-by: Michał Mirosław <[email protected]>

zdtm: thp_disable: Output a single failure message

d3a33ca

While at it, don't carry over stale errno to the fail() message. Signed-off-by: Michał Mirosław <[email protected]>

zdtm: thp_disable: Verify prctl(THP_DISABLE) migration

6006cb6

Signed-off-by: Michał Mirosław <[email protected]>

zdtm: thp_disable: Verify MADV_NOHUGEPAGE before migration

c75c017

Add a sanity check for THP_DISABLE. This discovered a broken commit in Google's kernel tree. Signed-off-by: Michał Mirosław <[email protected]>

rst0git and others added 2 commits April 16, 2024 15:05

rst0git reviewed Jun 27, 2024

rst0git reviewed Jun 27, 2024

rst0git reviewed Jun 27, 2024

avagin reviewed Jul 2, 2024

svetly-todorov added 8 commits July 2, 2024 13:20

sud.h: header for SUD functions

3158f3b

Introduce function signatures for SUD image generation, dump, and restore. These mirror the functions in seccomp.h/.c. Signed-off-by: Svetly Todorov <[email protected]> Signed-off-by: Gregory Price <[email protected]>

images: add scaffolding for SUD data

3cfb466

Add the necessary protobuf descriptions, image descriptions, and magic number for saving SUD data. Signed-off-by: Svetly Todorov <[email protected]> Signed-off-by: Gregory Price <[email protected]>

seize.c/sud.c: build SUD image during seize

35d14bf

Add hooks for saving SUD status upon infection. Signed-off-by: Svetly Todorov <[email protected]> Signed-off-by: Gregory Price <[email protected]>

cr-dump.c: dump SUD settings to image

d39b8cc

Add hooks for dumping SUD image to disk during cr-dump.c. Signed-off-by: Svetly Todorov <[email protected]> Signed-off-by: Gregory Price <[email protected]>

sud.c: start test case for exercising syscall dispatch C/R

c4f248f

Signed-off-by: Svetly Todorov <[email protected]>

svetly-todorov force-pushed the _sud_support_rfc branch from d4b0cc1 to c4f248f Compare July 3, 2024 18:37

mihalicyn self-requested a review July 22, 2024 11:31

github-actions bot added the stale-pr label Aug 22, 2024

rst0git removed the stale-pr label Aug 22, 2024

github-actions bot added the stale-pr label Sep 22, 2024

avagin force-pushed the criu-dev branch from 266f150 to f5dec05 Compare October 31, 2024 06:45

github-actions bot removed the stale-pr label Nov 1, 2024
