Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Dump/restore syscall user dispatch via ptrace #2423

Open
wants to merge 247 commits into
base: criu-dev
Choose a base branch
from

Conversation

svetly-todorov
Copy link

Following up on issue #2037.

Use the PTRACE_GET/SET_SYSCALL_USER_DISPATCH flags upstreamed in kernel 6.4 to preserve SUD settings for a target process.

The majority of the implementation mirrors that of the seccomp dump/restore. The most significant difference is that, unlike seccomp, there's no flag for disabling SUD while a target is being ptraced. Therefore restoring it in the parasite blob is dangerous, because remote syscalls can generate a SIGSYS. Instead, restoration happens right before the final PTRACE_DETACH, and not in the parasite code.

prakritigoyal19 and others added 30 commits June 11, 2023 23:30
Change made through this commit:
- Include copy of flog as a seperate tree.
- Modify the makefile to add and compile flog code.

Signed-off-by: prakritigoyal19 <[email protected]>
CID 302713 (checkpoint-restore#1 of 1): Missing varargs init or cleanup (VARARGS)
 va_end was not called for argptr.

Signed-off-by: Adrian Reber <[email protected]>
Separate commit for easier criu-dev <-> master transfer.

Acked-by: Mike Rapoport <[email protected]>
Signed-off-by: Adrian Reber <[email protected]>
It is mapped, not maped. Same applies for mmap I guess.

Found by codespell, except it wants to change it to mapped,
which will make it less specific.

Signed-off-by: Kir Kolyshkin <[email protected]>
Brought to you by

    codespell -w

(using codespell v2.1.0).

[v2: use "make indent" on the result]

Signed-off-by: Kir Kolyshkin <[email protected]>
The TOS(type of service) field in the ip header allows you specify the
priority of the socket data.

Signed-off-by: Suraj Shirvankar <[email protected]>
The pipe_size type is unsigned int, when the fcntl call fails and
return -1, it will cause a negative rollover problem.

Signed-off-by: zhoujie <[email protected]>
Newer Intel CPUs (Sapphire Rapids) have a much larger xsave area than
before. Looking at older CPUs I see 2440 bytes.

    # cpuid -1 -l 0xd -s 0
    ...
        bytes required by XSAVE/XRSTOR area     = 0x00000988 (2440)

On newer CPUs (Sapphire Rapids) it grows to 11008 bytes.

    # cpuid -1 -l 0xd -s 0
    ...
        bytes required by XSAVE/XRSTOR area     = 0x00002b00 (11008)

This increase the xsave area from one page to four pages.

Without this patch the fpu03 test fails, with this patch it works again.

Signed-off-by: Adrian Reber <[email protected]>
Using the fact that we know criu_pid and criu is a parent of restored
process we can create pidfile with pid on caller pidns level.

We need to move mount namespace creation to child so that criu-ns can
see caller pidns proc.

Signed-off-by: Pavel Tikhomirov <[email protected]>
By default, the file name 'amdgpu_plugin.txt' is used also as the name
for the corresponding man page (`man amdgpu_plugin`). However, when
this man page is installed system-wide it would be more appropriate
to have a prefix 'criu-' (e.g., `man criu-amdgpu-plugin`).

Signed-off-by: Radostin Stoyanov <[email protected]>
crun wants to set empty_ns and this interface is missing from the
library. This adds it to libcriu.

Signed-off-by: Adrian Reber <[email protected]>
--criu-binary argument provides a way to supply the CRIU binary
location to run_criu().

Related to: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes remove and update the changes introduced in
7177938 in favor of the
Python version in CI.

os.waitstatus_to_exitcode() function appeared in Python 3.9

Related to: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes add test implementations for criu-ns script.

Fixes: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <[email protected]>
These changes fix the `ImportError: No module named pathlib`
error when executing criu-ns tests located at criu/test/others/criu-ns

Signed-off-by: Dhanuka Warusadura <[email protected]>
CentOS 7 CI environment uses Python 2. To execute criu-ns
script in CentOS 7 changing the current shebang line to
python is required.

This reverse the changes made in a15a63f

Signed-off-by: Dhanuka Warusadura <[email protected]>
This is a patch proposed by Thomas here:
https://lore.kernel.org/all/87ilczc7d9.ffs@tglx/

It removes (created id > desired id) "sanity" check and adds proper
checking that ids start at zero and increment by one each time when we
create/delete a posix timer.

First purpose of it is to fix infinite looping in create_posix_timers on
old pre 3.11 kernels.

Second purpose is to allow kernel interface of creating posix timers
with desired id change from iterating with predictable next id to just
setting next id directly. And at the same time removing predictable next
id so that criu with this patch would not get to infinite loop in
create_posix_timers if this happens.

Thanks a lot to Thomas!

Signed-off-by: Pavel Tikhomirov <[email protected]>
This hook allows to start image streamer process from an action script.

Signed-off-by: Radostin Stoyanov <[email protected]>
…tions

This does cgroup namespace creation separately from joining task
cgroups. This makes the code more logical, because creating cgroup
namespace also involves joining cgroups but these cgroups can be
different to task's cgroups as they are cgroup namespace roots
(cgns_prefix), and mixing all of them together may lead to
misunderstanding.

Another positive thing is that we consolidate !item->parent checks in
one place in restore_task_with_children.

Signed-off-by: Valeriy Vdovin <[email protected]>
Signed-off-by: Pavel Tikhomirov <[email protected]>
4.15-based kernels don't allow F_*SEAL for memfds created with MFD_HUGETLB.
Since seals are not possible in this case, fake F_GETSEALS result as if it
was queried for a non-sealing-enabled memfd.

Signed-off-by: Michał Mirosław <[email protected]>
Linux 4.15 doesn't like empty string for cgroup2 mount options.
Pass NULL then to satisfy the kernel check. Log the options for
easier debugging.

Signed-off-by: Michał Mirosław <[email protected]>
The original commit added saving THP_DISABLED flag value, but missed
restoring it. There is restoring code, but used only when --lazy_pages
mode is enabled. Restore the prctl flag always. While at it, rename the
`has_thp_enabled` -> `!thp_disabled` for consistency.

Fixes: bbbd597 (2017-06-28 "mem: add dump state of THP_DISABLED prctl")
Signed-off-by: Michał Mirosław <[email protected]>
If prctl(SET_THP_DISABLE) is not used due to bad semantics, log it
for easier debugging.

Signed-off-by: Michał Mirosław <[email protected]>
While at it, don't carry over stale errno to the fail() message.

Signed-off-by: Michał Mirosław <[email protected]>
Add a sanity check for THP_DISABLE. This discovered a broken commit
in Google's kernel tree.

Signed-off-by: Michał Mirosław <[email protected]>
rst0git and others added 2 commits April 16, 2024 15:05
During restore, CRIU prints "Enqueue page-read" messages for
each page-read request [1]. However, this message does not
provide useful information, increases performance overhead
during restore and the size of log file.

$ ./zdtm.py run -t zdtm/static/maps06 -f h -k always
$ grep 'Enqueue page-read' dump/zdtm/static/maps06/56/1/restore.log | wc -l
20493

This commit replaces these log messages with a single message
that shows the number of enqueued page-read requests.

$ grep 'enqueued' dump/zdtm/static/maps06/56/1/restore.log
(00.061449)     56: nr_enqueued:   20493

[1] checkpoint-restore@91388fc

Signed-off-by: Radostin Stoyanov <[email protected]>
1) In dump_tcp_conn_state, if return from libsoccr_save is >=0, we check
that sizeof(struct libsoccr_sk_data) returned from libsoccr_save is
equal to sizeof(struct libsoccr_sk_data) we see in dump_tcp_conn_state
(probably to check if we use the right library version). And if sizes
are different we go to err_r, which just returns ret, which can
teoretically be 0 (if size in library is zero) and that would lead
dump_one_tcp treat this as success though it is obvious error.

2) In case of dump_opt or open_image fails we don't explicitly set ret
and rely that sizeof(struct libsoccr_sk_data) previously set to ret is
not 0, I don't really like it, it makes reading code too complex.

3) We have a lot of err_* labels which do exactly the same thing, there
is no point in having all of them, also it is better to choose the name
of the label based on what it really does.

So let's refactor error handling to avoid these inconsistencies.

Signed-off-by: Pavel Tikhomirov <[email protected]>
@rst0git
Copy link
Member

rst0git commented Jun 27, 2024

@svetly-todorov Thank you for the pull request! Would you be able to add tests for this feature?


/* Setup SUD-disable struct */
memset(&disable, 0, sizeof(disable));
disable.mode = PR_SYS_DISPATCH_OFF;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#10 2.352 compel/src/lib/ptrace.c:54:17: error: ‘PR_SYS_DISPATCH_OFF’ undeclared (first use in this function)
#10 2.352    54 |  disable.mode = PR_SYS_DISPATCH_OFF;
#10 2.352       |                 ^~~~~~~~~~~~~~~~~~~

Copy link
Author

@svetly-todorov svetly-todorov Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a case of kernel headers not always being present. I'll go back and explicitly add the #ifdefs to sud.h.

While I'm at it, I'll probably also add some wrappers around the entire SUD architecture, since this only works with newer versions of the kernel, etc.

criu/sud.c Outdated
struct sys_dispatch_entry *entry;
sud_config_t config;

entry = sud_lookup(tid_real, true, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentation is also mixed here.

@svetly-todorov
Copy link
Author

svetly-todorov commented Jun 28, 2024

@rst0git

I'm working on a rebase that adds some bugfixes and addresses the whitespace issues.

We (@gmprice and I) did write a pair of tests for this. Where should they be included?

@rst0git
Copy link
Member

rst0git commented Jun 28, 2024

We (@gmprice and I) did write a pair of tests for this. Where should they be included?

Ideally, the tests should be in a separate commit and it would be great if it is possible to add them in ZDTM. For example, the following commits have been used to add ZDTM tests: 96a30f0, d490218, 516b369.

#undef LOG_PREFIX
#define LOG_PREFIX "sys-dispatch: "

static struct rb_root sud_tid_rb_root = RB_ROOT;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add more info in the commit message to describe what is going on here.

@avagin
Copy link
Member

avagin commented Jul 2, 2024

The majority of the implementation mirrors that of the seccomp dump/restore. The most significant difference is that, unlike seccomp, there's no flag for disabling SUD while a target is being ptraced. Therefore restoring it in the parasite blob is dangerous, because remote syscalls can generate a SIGSYS. Instead, restoration happens right before the final PTRACE_DETACH, and not in the parasite code.

CRIU injects a parasite blob while a process is being dumped. How do you handle where to place a parasite blob?

@svetly-todorov
Copy link
Author

svetly-todorov commented Jul 2, 2024

The majority of the implementation mirrors that of the seccomp dump/restore. The most significant difference is that, unlike seccomp, there's no flag for disabling SUD while a target is being ptraced. Therefore restoring it in the parasite blob is dangerous, because remote syscalls can generate a SIGSYS. Instead, restoration happens right before the final PTRACE_DETACH, and not in the parasite code.

CRIU injects a parasite blob while a process is being dumped. How do you handle where to place a parasite blob?

@avagin

We don't worry about the parasite blob during dump because we collect and disable SUD during the compel_wait_task that precedes the infection. It's been a while since I wrote this code, but if I'm remembering correctly, collection of SUD entries happens in this callback.

This callback triggers here in compel_wait_task.

The callback is necessary for SUD, but not necessary for seccomp, because ptrace can disable seccomp without overwriting its settings. So for seccomp, you can SUSPEND_SECCOMP during compel_wait_task and then read its settings later. But for SUD, you can only read its settings while it is active; ptrace only lets you enable/disable it without a graceful suspend.

Use the PTRACE_GET/SET_SYSCALL_USER_DISPATCH_CONFIG flags to interface with a tracee's SUD settings.

This is scaffolding for infect and seize, which will use ptrace to collect and suspend the SUD status of a seized task.

Signed-off-by: Svetly Todorov <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Use the new ptrace wrappers to get the SUD mode of the seized process. If activated, use ptrace_suspend_sud to disable it before proceeding with infection.

Signed-off-by: Svetly Todorov <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Introduce function signatures for SUD image generation, dump, and restore. These mirror the functions in seccomp.h/.c.

Signed-off-by: Svetly Todorov <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Add the necessary protobuf descriptions, image descriptions, and magic number for saving SUD data.

Signed-off-by: Svetly Todorov <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Add hooks for saving SUD status upon infection.

Signed-off-by: Svetly Todorov <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Add hooks for dumping SUD image to disk during cr-dump.c.

Signed-off-by: Svetly Todorov <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Add hooks for restoring SUD settings throughout cr-restore.c.

The implementation is a little unorthodox. Unlike seccomp, SUD isn't suspended while a task is under ptrace. So the parasite code cannot restore the SUD settings because SIGSYS may be triggered when the restorer blob is unmapped.
Instead, we opt to reopen the per-core data right before PTRACE_DETACH, and restore the SUD settings then. This way we don't risk triggering SIGSYS via a remote syscall.

Signed-off-by: Svetly Todorov <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
@svetly-todorov
Copy link
Author

svetly-todorov commented Jul 3, 2024

I have pushed up a kind of v2 for this RFC containing some fixes and the beginnings of a test case. CRIU itself should build with the new changes. But the test does not work quite yet. I have to drop development on this for now, because of other work obligations. I may circle back to this in the future. Forgive me!

@mihalicyn mihalicyn self-requested a review July 22, 2024 11:31
Copy link

A friendly reminder that this PR had no activity for 30 days.

Copy link

A friendly reminder that this PR had no activity for 30 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.