
[fedora] not ok 16 checkpoint --lazy-pages and restore #2924

Closed · kolyshkin opened this issue Apr 28, 2021 · 8 comments · Fixed by #3546

kolyshkin commented Apr 28, 2021

This is from a recent CI run (https://github.com/opencontainers/runc/pull/2923/checks?check_run_id=2452887046):

not ok 16 checkpoint --lazy-pages and restore
# (from function `fail' in file tests/integration/helpers.bash, line 274,
#  from function `runc_restore_with_pipes' in file tests/integration/checkpoint.bats, line 94,
#  in test file tests/integration/checkpoint.bats, line 225)
#   `runc_restore_with_pipes ./image-dir test_busybox_restore --lazy-pages' failed
# runc spec (status=0):
# 
# uffd-noncoop is supported
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev",
#   "id": "test_busybox",
#   "pid": 27036,
#   "status": "running",
#   "bundle": "/tmp/bats-run-24758/runc.cXN8Qe/bundle",
#   "rootfs": "/tmp/bats-run-24758/runc.cXN8Qe/bundle/rootfs",
#   "created": "2021-04-27T23:34:48.770811211Z",
#   "owner": ""
# }
# __runc restore test_busybox_restore failed (status: 1)
# time="2021-04-27T23:34:49Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: image-dir/restore.log"
# CRIU log errors (if any):
# ./image-dir/restore.log-(00.079105) 27107 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.079116) 27107 was trapped
# ./image-dir/restore.log-(00.079118) `- Expecting exit
# ./image-dir/restore.log-(00.079126) 27107 was trapped
# ./image-dir/restore.log-(00.079129) 27107 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.079136) Error (compel/src/lib/infect.c:1513): Task 27107 is in unexpected state: f7f
# ./image-dir/restore.log:(00.079742) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.079748) Error (criu/cr-restore.c:2416): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.079750) Error (criu/cr-restore.c:2454): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.085728) Error (criu/mount.c:3396): mnt: Can't remove the directory /tmp/.criu.mntns.eawsHK: No such file or directory
# ./image-dir/restore.log:(00.085742) Error (criu/cr-restore.c:2483): Restoring FAILED.
# --
# ./image-dir/restore.log-(00.079105) 27107 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.079116) 27107 was trapped
# ./image-dir/restore.log-(00.079118) `- Expecting exit
# ./image-dir/restore.log-(00.079126) 27107 was trapped
# ./image-dir/restore.log-(00.079129) 27107 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.079136) Error (compel/src/lib/infect.c:1513): Task 27107 is in unexpected state: f7f
# ./image-dir/restore.log:(00.079742) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.079748) Error (criu/cr-restore.c:2416): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.079750) Error (criu/cr-restore.c:2454): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.085728) Error (criu/mount.c:3396): mnt: Can't remove the directory /tmp/.criu.mntns.eawsHK: No such file or directory
# ./image-dir/restore.log:(00.085742) Error (criu/cr-restore.c:2483): Restoring FAILED.
# runc restore failed

Might be related to #2760

kolyshkin commented:

@adrianreber @avagin @rppt PTAL 🙏🏻

rppt commented Apr 28, 2021

I think it is related to #2760
Do we have lazy-pages.log?

kolyshkin commented:

> Do we have lazy-pages.log?

When restore fails, we do this:

echo "CRIU log errors (if any):"
grep -B 5 Error "$workdir"/*.log ./image-dir/*.log || true

which, together with the above log, tells us there were no Errors in lazy-pages.log.

Let me know if adding something like this would help in figuring it out:

echo "Lazy pages log (if available):"
cat ./image-dir/lazy-pages.log || true

kolyshkin commented:

I think the problem is that even though the container name is different, the container is restored into the same cgroup (and systemd may be killing that cgroup):

[kir@kir-rhat runc-tst-cpt]$ sudo ../runc/runc checkpoint xx666
[kir@kir-rhat runc-tst-cpt]$ sudo ../runc/runc restore -d --console-socket tty.sock xx666_restored
INFO[0000] container: &{id:xx666_restored root:/run/runc/xx666_restored config:0xc00033a000 cgroupManager:0xc0003421e0 intelRdtManager:<nil> initPath:/proc/self/exe initArgs:[../runc/runc init] initProcess:<nil> initProcessStartTime:0 criuPath:criu newuidmapPath: newgidmapPath: m:{state:0 sema:0} criuVersion:0 state:0xc000298b10 created:{wall:0 ext:0 loc:<nil>} fifo:<nil>} 
INFO[0000] container.config: {NoPivotRoot:false ParentDeathSignal:0 Rootfs:/home/kir/go/src/github.com/opencontainers/runc-tst-cpt/rootfs Umask:<nil> Readonlyfs:true RootPropagation:0 Mounts:[0xc00033c000 0xc00033c0b0 0xc00033c160 0xc00033c210 0xc00033c2c0 0xc00033c370 0xc00033c420] Devices:[0x55b5cf03cdc0 0x55b5cf03ce20 0x55b5cf03ce80 0x55b5cf03cee0 0x55b5cf03cf40 0x55b5cf03cfa0] MountLabel: Hostname:runc Namespaces:[{Type:NEWPID Path:} {Type:NEWNET Path:} {Type:NEWIPC Path:} {Type:NEWUTS Path:} {Type:NEWNS Path:} {Type:NEWCGROUP Path:}] Capabilities:0xc000292800 Networks:[0xc00033c4d0] Routes:[] Cgroups:0xc0002e16c0 AppArmorProfile: ProcessLabel: Rlimits:[] OomScoreAdj:<nil> UidMappings:[] GidMappings:[] MaskPaths:[/proc/acpi /proc/asound /proc/kcore /proc/keys /proc/latency_stats /proc/timer_list /proc/timer_stats /proc/sched_debug /sys/firmware /proc/scsi] ReadonlyPaths:[/proc/bus /proc/fs /proc/irq /proc/sys /proc/sysrq-trigger] Sysctl:map[] Seccomp:<nil> NoNewPrivileges:true Hooks:map[] Version:1.0.2-dev Labels:[bundle=/home/kir/go/src/github.com/opencontainers/runc-tst-cpt] NoNewKeyring:false IntelRdt:<nil> RootlessEUID:false RootlessCgroups:false} 
INFO[0000] container.config.cgroups: {Name:xx666_restored Parent: Path: ScopePrefix: Paths:map[] Resources:0xc000282300 SystemdProps:[]} 
[kir@kir-rhat runc-tst-cpt]$ sudo ../runc/runc list
ID               PID         STATUS      BUNDLE                                                    CREATED                          OWNER
xx40_restored    3612232     running     /home/kir/go/src/github.com/opencontainers/runc-tst-cpt   2021-04-28T18:43:55.588709153Z   root
xx666_restored   3637234     running     /home/kir/go/src/github.com/opencontainers/runc-tst-cpt   2021-04-28T20:03:04.521477937Z   root
xx767            3627364     running     /home/kir/go/src/github.com/opencontainers/runc-tst       2021-04-28T19:46:10.002493083Z   root
[kir@kir-rhat runc-tst-cpt]$ cat /proc/3637234/cgroup 
0::/user.slice/user-1000.slice/[email protected]/xx666
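
To make the collision explicit, here is a quick check along the lines of the session above (a sketch; jq is assumed to be available, it is not part of the test suite):

# The restored container's cgroup path ends in the ORIGINAL name "xx666",
# not "xx666_restored" -- i.e. the restore reused the old cgroup.
pid=$(sudo ../runc/runc state xx666_restored | jq -r .pid)
grep -q '/xx666$' "/proc/$pid/cgroup" && echo "reused the original cgroup"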

rppt commented Apr 29, 2021

> Let me know if adding something like this would help in figuring it out:
>
> echo "Lazy pages log (if available):"
> cat ./image-dir/lazy-pages.log || true

We'll have more information, but if you suspect it's an issue with cgroups and systemd, this information might be less relevant.

kolyshkin commented:

Writing this down before I forget. The issue is that criu is unable to restore into a different cgroup, so there is a race between runc+criu restoring the container into the same cgroup and systemd removing the original container's cgroup.
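
In other words, the suspected sequence is roughly this (a sketch; the container name is hypothetical, flags as in the session above):

# 1. checkpoint stops the original container; its cgroup becomes empty
sudo runc checkpoint --image-path ./image-dir xx666
# 2. systemd may now garbage-collect the empty .../xx666 cgroup
# 3. restore places the new container into the SAME .../xx666 path
#    (criu restores cgroup membership as checkpointed), racing with
#    the removal in step 2
sudo runc restore --image-path ./image-dir xx666_restored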

kolyshkin commented:

Still the case with Fedora 35 and criu 3.16. From https://cirrus-ci.com/task/6279543066460160:

not ok 24 checkpoint --lazy-pages and restore
# (from function `fail' in file tests/integration/helpers.bash, line 338,
#  from function `runc_restore_with_pipes' in file tests/integration/checkpoint.bats, line 94,
#  in test file tests/integration/checkpoint.bats, line 256)
#   `runc_restore_with_pipes ./image-dir test_busybox_restore --lazy-pages' failed
# runc spec (status=0):
# 
# uffd-noncoop is supported
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev",
#   "id": "test_busybox",
#   "pid": 26121,
#   "status": "running",
#   "bundle": "/tmp/bats-run-5wCTTi/runc.g8U5xW/bundle",
#   "rootfs": "/tmp/bats-run-5wCTTi/runc.g8U5xW/bundle/rootfs",
#   "created": "2021-11-12T08:51:09.980655157Z",
#   "owner": ""
# }
# __runc restore test_busybox_restore failed (status: 1)
# time="2021-11-12T08:51:10Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: image-dir/restore.log"
# CRIU log errors (if any):
# ./image-dir/restore.log-(00.058769) 26191 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.058809) 26191 was trapped
# ./image-dir/restore.log-(00.058812) `- Expecting exit
# ./image-dir/restore.log-(00.058853) 26191 was trapped
# ./image-dir/restore.log-(00.058857) 26191 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.058898) Error (compel/src/lib/infect.c:1543): Task 26191 is in unexpected state: f7f
# ./image-dir/restore.log:(00.058909) Error (compel/src/lib/infect.c:1549): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.058911) Error (criu/cr-restore.c:2383): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.058912) Error (criu/cr-restore.c:2420): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.062714) Error (criu/mount.c:3385): mnt: Can't remove the directory /tmp/.criu.mntns.4PLAEH: No such file or directory
# ./image-dir/restore.log:(00.062720) Error (criu/cr-restore.c:2447): Restoring FAILED.
# --
# ./image-dir/restore.log-(00.058769) 26191 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.058809) 26191 was trapped
# ./image-dir/restore.log-(00.058812) `- Expecting exit
# ./image-dir/restore.log-(00.058853) 26191 was trapped
# ./image-dir/restore.log-(00.058857) 26191 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.058898) Error (compel/src/lib/infect.c:1543): Task 26191 is in unexpected state: f7f
# ./image-dir/restore.log:(00.058909) Error (compel/src/lib/infect.c:1549): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.058911) Error (criu/cr-restore.c:2383): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.058912) Error (criu/cr-restore.c:2420): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.062714) Error (criu/mount.c:3385): mnt: Can't remove the directory /tmp/.criu.mntns.4PLAEH: No such file or directory
# ./image-dir/restore.log:(00.062720) Error (criu/cr-restore.c:2447): Restoring FAILED.
# runc restore failed

kolyshkin commented:

One more, from https://cirrus-ci.com/task/6333831285309440:

not ok 24 checkpoint --lazy-pages and restore
# (from function `fail' in file tests/integration/helpers.bash, line 338,
#  from function `runc_restore_with_pipes' in file tests/integration/checkpoint.bats, line 94,
#  in test file tests/integration/checkpoint.bats, line 256)
#   `runc_restore_with_pipes ./image-dir test_busybox_restore --lazy-pages' failed
# runc spec (status=0):
# 
# uffd-noncoop is supported
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev",
#   "id": "test_busybox",
#   "pid": 26141,
#   "status": "running",
#   "bundle": "/tmp/bats-run-8MpAhj/runc.6IeS56/bundle",
#   "rootfs": "/tmp/bats-run-8MpAhj/runc.6IeS56/bundle/rootfs",
#   "created": "2021-11-18T02:30:42.446111966Z",
#   "owner": ""
# }
# __runc restore test_busybox_restore failed (status: 1)
# time="2021-11-18T02:30:42Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: image-dir/restore.log"
# CRIU log errors (if any):
# ./image-dir/restore.log-(00.042718) 26211 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.042759) 26211 was trapped
# ./image-dir/restore.log-(00.042761) `- Expecting exit
# ./image-dir/restore.log-(00.042803) 26211 was trapped
# ./image-dir/restore.log-(00.042806) 26211 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.042852) Error (compel/src/lib/infect.c:1543): Task 26211 is in unexpected state: f7f
# ./image-dir/restore.log:(00.042863) Error (compel/src/lib/infect.c:1549): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.042865) Error (criu/cr-restore.c:2383): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.042866) Error (criu/cr-restore.c:2420): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.042986) Error (criu/mount.c:3385): mnt: Can't remove the directory /tmp/.criu.mntns.pjhQtV: No such file or directory
# ./image-dir/restore.log:(00.042990) Error (criu/cr-restore.c:2447): Restoring FAILED.
# --
# ./image-dir/restore.log-(00.042718) 26211 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.042759) 26211 was trapped
# ./image-dir/restore.log-(00.042761) `- Expecting exit
# ./image-dir/restore.log-(00.042803) 26211 was trapped
# ./image-dir/restore.log-(00.042806) 26211 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.042852) Error (compel/src/lib/infect.c:1543): Task 26211 is in unexpected state: f7f
# ./image-dir/restore.log:(00.042863) Error (compel/src/lib/infect.c:1549): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.042865) Error (criu/cr-restore.c:2383): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.042866) Error (criu/cr-restore.c:2420): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.042986) Error (criu/mount.c:3385): mnt: Can't remove the directory /tmp/.criu.mntns.pjhQtV: No such file or directory
# ./image-dir/restore.log:(00.042990) Error (criu/cr-restore.c:2447): Restoring FAILED.
# runc restore failed

kolyshkin added a commit to kolyshkin/runc that referenced this issue Aug 4, 2022
When doing a lazy checkpoint/restore, we should not restore into the
same cgroup; otherwise there is a race which results in occasional
killing of the restored container (GH opencontainers#2760, opencontainers#2924).

The fix is to use --manage-cgroup-mode=ignore, which allows restoring
into a different cgroup.

Note that since cgroupsPath is not set in config.json, the cgroup is
derived from the container name, so calling set_cgroups_path is not
needed.

For the previous (unsuccessful) attempt to fix this, as well as detailed
(and apparently correct) analysis, see commit 36fe3cc.

Signed-off-by: Kir Kolyshkin <[email protected]>
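
For illustration, the restore side of the fix looks roughly like this (a sketch, not the exact test change; runc spells the flag --manage-cgroups-mode, and the paths follow the test output above):

# Restore under a new name while telling CRIU not to manage cgroups,
# so the restored container gets a fresh cgroup derived from the new
# name instead of the original (possibly already removed) one.
runc restore --image-path ./image-dir --work-path ./image-dir \
    --manage-cgroups-mode=ignore test_busybox_restore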
kolyshkin added a commit to kolyshkin/runc that referenced this issue Sep 2, 2022
kolyshkin added a commit to kolyshkin/runc that referenced this issue Oct 27, 2022
kolyshkin added a commit to kolyshkin/runc that referenced this issue Nov 2, 2022
kolyshkin added a commit to kolyshkin/runc that referenced this issue Dec 15, 2022