
[fedora] not ok 16 checkpoint --lazy-pages and restore #2924

Closed · kolyshkin opened this issue Apr 28, 2021 · 8 comments · Fixed by #3546

kolyshkin commented Apr 28, 2021

This is from a recent CI run (https://github.com/opencontainers/runc/pull/2923/checks?check_run_id=2452887046):

not ok 16 checkpoint --lazy-pages and restore
# (from function `fail' in file tests/integration/helpers.bash, line 274,
#  from function `runc_restore_with_pipes' in file tests/integration/checkpoint.bats, line 94,
#  in test file tests/integration/checkpoint.bats, line 225)
#   `runc_restore_with_pipes ./image-dir test_busybox_restore --lazy-pages' failed
# runc spec (status=0):
# 
# uffd-noncoop is supported
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev",
#   "id": "test_busybox",
#   "pid": 27036,
#   "status": "running",
#   "bundle": "/tmp/bats-run-24758/runc.cXN8Qe/bundle",
#   "rootfs": "/tmp/bats-run-24758/runc.cXN8Qe/bundle/rootfs",
#   "created": "2021-04-27T23:34:48.770811211Z",
#   "owner": ""
# }
# __runc restore test_busybox_restore failed (status: 1)
# time="2021-04-27T23:34:49Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: image-dir/restore.log"
# CRIU log errors (if any):
# ./image-dir/restore.log-(00.079105) 27107 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.079116) 27107 was trapped
# ./image-dir/restore.log-(00.079118) `- Expecting exit
# ./image-dir/restore.log-(00.079126) 27107 was trapped
# ./image-dir/restore.log-(00.079129) 27107 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.079136) Error (compel/src/lib/infect.c:1513): Task 27107 is in unexpected state: f7f
# ./image-dir/restore.log:(00.079742) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.079748) Error (criu/cr-restore.c:2416): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.079750) Error (criu/cr-restore.c:2454): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.085728) Error (criu/mount.c:3396): mnt: Can't remove the directory /tmp/.criu.mntns.eawsHK: No such file or directory
# ./image-dir/restore.log:(00.085742) Error (criu/cr-restore.c:2483): Restoring FAILED.
# --
# ./image-dir/restore.log-(00.079105) 27107 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.079116) 27107 was trapped
# ./image-dir/restore.log-(00.079118) `- Expecting exit
# ./image-dir/restore.log-(00.079126) 27107 was trapped
# ./image-dir/restore.log-(00.079129) 27107 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.079136) Error (compel/src/lib/infect.c:1513): Task 27107 is in unexpected state: f7f
# ./image-dir/restore.log:(00.079742) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.079748) Error (criu/cr-restore.c:2416): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.079750) Error (criu/cr-restore.c:2454): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.085728) Error (criu/mount.c:3396): mnt: Can't remove the directory /tmp/.criu.mntns.eawsHK: No such file or directory
# ./image-dir/restore.log:(00.085742) Error (criu/cr-restore.c:2483): Restoring FAILED.
# runc restore failed

Might be related to #2760

kolyshkin commented:

@adrianreber @avagin @rppt PTAL 🙏🏻

rppt commented Apr 28, 2021

I think it is related to #2760
Do we have lazy-pages.log?

kolyshkin commented:

> Do we have lazy-pages.log?

When restore fails, we do this:

echo "CRIU log errors (if any):"
grep -B 5 Error "$workdir"/*.log ./image-dir/*.log || true

which, together with the above log, tells us there were no Errors in lazy-pages.log.

Let me know if adding something like this would help in figuring it out:

echo "Lazy pages log (if available):"
cat ./image-dir/lazy-pages.log || true

kolyshkin commented:

I think the problem is that even though the container name is different, the container is restored into the same cgroup (and systemd may be killing that cgroup):

[kir@kir-rhat runc-tst-cpt]$ sudo ../runc/runc checkpoint xx666
[kir@kir-rhat runc-tst-cpt]$ sudo ../runc/runc restore -d --console-socket tty.sock xx666_restored
INFO[0000] container: &{id:xx666_restored root:/run/runc/xx666_restored config:0xc00033a000 cgroupManager:0xc0003421e0 intelRdtManager:<nil> initPath:/proc/self/exe initArgs:[../runc/runc init] initProcess:<nil> initProcessStartTime:0 criuPath:criu newuidmapPath: newgidmapPath: m:{state:0 sema:0} criuVersion:0 state:0xc000298b10 created:{wall:0 ext:0 loc:<nil>} fifo:<nil>} 
INFO[0000] container.config: {NoPivotRoot:false ParentDeathSignal:0 Rootfs:/home/kir/go/src/github.com/opencontainers/runc-tst-cpt/rootfs Umask:<nil> Readonlyfs:true RootPropagation:0 Mounts:[0xc00033c000 0xc00033c0b0 0xc00033c160 0xc00033c210 0xc00033c2c0 0xc00033c370 0xc00033c420] Devices:[0x55b5cf03cdc0 0x55b5cf03ce20 0x55b5cf03ce80 0x55b5cf03cee0 0x55b5cf03cf40 0x55b5cf03cfa0] MountLabel: Hostname:runc Namespaces:[{Type:NEWPID Path:} {Type:NEWNET Path:} {Type:NEWIPC Path:} {Type:NEWUTS Path:} {Type:NEWNS Path:} {Type:NEWCGROUP Path:}] Capabilities:0xc000292800 Networks:[0xc00033c4d0] Routes:[] Cgroups:0xc0002e16c0 AppArmorProfile: ProcessLabel: Rlimits:[] OomScoreAdj:<nil> UidMappings:[] GidMappings:[] MaskPaths:[/proc/acpi /proc/asound /proc/kcore /proc/keys /proc/latency_stats /proc/timer_list /proc/timer_stats /proc/sched_debug /sys/firmware /proc/scsi] ReadonlyPaths:[/proc/bus /proc/fs /proc/irq /proc/sys /proc/sysrq-trigger] Sysctl:map[] Seccomp:<nil> NoNewPrivileges:true Hooks:map[] Version:1.0.2-dev Labels:[bundle=/home/kir/go/src/github.com/opencontainers/runc-tst-cpt] NoNewKeyring:false IntelRdt:<nil> RootlessEUID:false RootlessCgroups:false} 
INFO[0000] container.config.cgroups: {Name:xx666_restored Parent: Path: ScopePrefix: Paths:map[] Resources:0xc000282300 SystemdProps:[]} 
[kir@kir-rhat runc-tst-cpt]$ sudo ../runc/runc list
ID               PID         STATUS      BUNDLE                                                    CREATED                          OWNER
xx40_restored    3612232     running     /home/kir/go/src/github.com/opencontainers/runc-tst-cpt   2021-04-28T18:43:55.588709153Z   root
xx666_restored   3637234     running     /home/kir/go/src/github.com/opencontainers/runc-tst-cpt   2021-04-28T20:03:04.521477937Z   root
xx767            3627364     running     /home/kir/go/src/github.com/opencontainers/runc-tst       2021-04-28T19:46:10.002493083Z   root
[kir@kir-rhat runc-tst-cpt]$ cat /proc/3637234/cgroup 
0::/user.slice/user-1000.slice/[email protected]/xx666
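
To make the collision explicit, here is a quick check along the lines of the session above (a sketch; jq is assumed to be available, it is not part of the test suite):

# The restored container's cgroup path ends in the ORIGINAL name "xx666",
# not "xx666_restored" -- i.e. the restore reused the old cgroup.
pid=$(sudo ../runc/runc state xx666_restored | jq -r .pid)
grep -q '/xx666$' "/proc/$pid/cgroup" && echo "reused the original cgroup"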

rppt commented Apr 29, 2021

> Let me know if adding something like this would help in figuring it out:
>
> echo "Lazy pages log (if available):"
> cat ./image-dir/lazy-pages.log || true

We'll have more information, but if you suspect it's an issue with cgroups and systemd, this information might be less relevant.

kolyshkin commented:

Writing this down before I forget. The issue is that criu is unable to restore into a different cgroup, so there is a race between runc+criu restoring the container into the same cgroup and systemd removing the original container's cgroup.
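
In other words, the suspected sequence is roughly this (a sketch; the container name is hypothetical, flags as in the session above):

# 1. checkpoint stops the original container; its cgroup becomes empty
sudo runc checkpoint --image-path ./image-dir xx666
# 2. systemd may now garbage-collect the empty .../xx666 cgroup
# 3. restore places the new container into the SAME .../xx666 path
#    (criu restores cgroup membership as checkpointed), racing with
#    the removal in step 2
sudo runc restore --image-path ./image-dir xx666_restored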

kolyshkin commented:

Still the case with Fedora 35 and criu 3.16. From https://cirrus-ci.com/task/6279543066460160:

not ok 24 checkpoint --lazy-pages and restore
# (from function `fail' in file tests/integration/helpers.bash, line 338,
#  from function `runc_restore_with_pipes' in file tests/integration/checkpoint.bats, line 94,
#  in test file tests/integration/checkpoint.bats, line 256)
#   `runc_restore_with_pipes ./image-dir test_busybox_restore --lazy-pages' failed
# runc spec (status=0):
# 
# uffd-noncoop is supported
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev",
#   "id": "test_busybox",
#   "pid": 26121,
#   "status": "running",
#   "bundle": "/tmp/bats-run-5wCTTi/runc.g8U5xW/bundle",
#   "rootfs": "/tmp/bats-run-5wCTTi/runc.g8U5xW/bundle/rootfs",
#   "created": "2021-11-12T08:51:09.980655157Z",
#   "owner": ""
# }
# __runc restore test_busybox_restore failed (status: 1)
# time="2021-11-12T08:51:10Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: image-dir/restore.log"
# CRIU log errors (if any):
# ./image-dir/restore.log-(00.058769) 26191 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.058809) 26191 was trapped
# ./image-dir/restore.log-(00.058812) `- Expecting exit
# ./image-dir/restore.log-(00.058853) 26191 was trapped
# ./image-dir/restore.log-(00.058857) 26191 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.058898) Error (compel/src/lib/infect.c:1543): Task 26191 is in unexpected state: f7f
# ./image-dir/restore.log:(00.058909) Error (compel/src/lib/infect.c:1549): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.058911) Error (criu/cr-restore.c:2383): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.058912) Error (criu/cr-restore.c:2420): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.062714) Error (criu/mount.c:3385): mnt: Can't remove the directory /tmp/.criu.mntns.4PLAEH: No such file or directory
# ./image-dir/restore.log:(00.062720) Error (criu/cr-restore.c:2447): Restoring FAILED.
# --
# ./image-dir/restore.log-(00.058769) 26191 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.058809) 26191 was trapped
# ./image-dir/restore.log-(00.058812) `- Expecting exit
# ./image-dir/restore.log-(00.058853) 26191 was trapped
# ./image-dir/restore.log-(00.058857) 26191 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.058898) Error (compel/src/lib/infect.c:1543): Task 26191 is in unexpected state: f7f
# ./image-dir/restore.log:(00.058909) Error (compel/src/lib/infect.c:1549): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.058911) Error (criu/cr-restore.c:2383): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.058912) Error (criu/cr-restore.c:2420): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.062714) Error (criu/mount.c:3385): mnt: Can't remove the directory /tmp/.criu.mntns.4PLAEH: No such file or directory
# ./image-dir/restore.log:(00.062720) Error (criu/cr-restore.c:2447): Restoring FAILED.
# runc restore failed

kolyshkin commented:

One more, from https://cirrus-ci.com/task/6333831285309440:

not ok 24 checkpoint --lazy-pages and restore
# (from function `fail' in file tests/integration/helpers.bash, line 338,
#  from function `runc_restore_with_pipes' in file tests/integration/checkpoint.bats, line 94,
#  in test file tests/integration/checkpoint.bats, line 256)
#   `runc_restore_with_pipes ./image-dir test_busybox_restore --lazy-pages' failed
# runc spec (status=0):
# 
# uffd-noncoop is supported
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev",
#   "id": "test_busybox",
#   "pid": 26141,
#   "status": "running",
#   "bundle": "/tmp/bats-run-8MpAhj/runc.6IeS56/bundle",
#   "rootfs": "/tmp/bats-run-8MpAhj/runc.6IeS56/bundle/rootfs",
#   "created": "2021-11-18T02:30:42.446111966Z",
#   "owner": ""
# }
# __runc restore test_busybox_restore failed (status: 1)
# time="2021-11-18T02:30:42Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: image-dir/restore.log"
# CRIU log errors (if any):
# ./image-dir/restore.log-(00.042718) 26211 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.042759) 26211 was trapped
# ./image-dir/restore.log-(00.042761) `- Expecting exit
# ./image-dir/restore.log-(00.042803) 26211 was trapped
# ./image-dir/restore.log-(00.042806) 26211 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.042852) Error (compel/src/lib/infect.c:1543): Task 26211 is in unexpected state: f7f
# ./image-dir/restore.log:(00.042863) Error (compel/src/lib/infect.c:1549): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.042865) Error (criu/cr-restore.c:2383): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.042866) Error (criu/cr-restore.c:2420): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.042986) Error (criu/mount.c:3385): mnt: Can't remove the directory /tmp/.criu.mntns.pjhQtV: No such file or directory
# ./image-dir/restore.log:(00.042990) Error (criu/cr-restore.c:2447): Restoring FAILED.
# --
# ./image-dir/restore.log-(00.042718) 26211 (native) is going to execute the syscall 11, required is 15
# ./image-dir/restore.log-(00.042759) 26211 was trapped
# ./image-dir/restore.log-(00.042761) `- Expecting exit
# ./image-dir/restore.log-(00.042803) 26211 was trapped
# ./image-dir/restore.log-(00.042806) 26211 (native) is going to execute the syscall 18446744073709551615, required is 15
# ./image-dir/restore.log:(00.042852) Error (compel/src/lib/infect.c:1543): Task 26211 is in unexpected state: f7f
# ./image-dir/restore.log:(00.042863) Error (compel/src/lib/infect.c:1549): Task stopped with 15: Terminated
# ./image-dir/restore.log:(00.042865) Error (criu/cr-restore.c:2383): Can't stop all tasks on rt_sigreturn
# ./image-dir/restore.log:(00.042866) Error (criu/cr-restore.c:2420): Killing processes because of failure on restore.
# ./image-dir/restore.log-The Network was unlocked so some data or a connection may have been lost.
# ./image-dir/restore.log:(00.042986) Error (criu/mount.c:3385): mnt: Can't remove the directory /tmp/.criu.mntns.pjhQtV: No such file or directory
# ./image-dir/restore.log:(00.042990) Error (criu/cr-restore.c:2447): Restoring FAILED.
# runc restore failed

kolyshkin added a commit to kolyshkin/runc that referenced this issue Aug 4, 2022
When doing a lazy checkpoint/restore, we should not restore into the
same cgroup; otherwise there is a race which results in occasional
killing of the restored container (GH opencontainers#2760, opencontainers#2924).

The fix is to use --manage-cgroup-mode=ignore, which allows restoring
into a different cgroup.

Note that since cgroupsPath is not set in config.json, the cgroup is
derived from the container name, so calling set_cgroups_path is not
needed.

For the previous (unsuccessful) attempt to fix this, as well as detailed
(and apparently correct) analysis, see commit 36fe3cc.

Signed-off-by: Kir Kolyshkin <[email protected]>
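
For illustration, the restore side of the fix looks roughly like this (a sketch, not the exact test change; runc spells the flag --manage-cgroups-mode, and the paths follow the test output above):

# Restore under a new name while telling CRIU not to manage cgroups,
# so the restored container gets a fresh cgroup derived from the new
# name instead of the original (possibly already removed) one.
runc restore --image-path ./image-dir --work-path ./image-dir \
    --manage-cgroups-mode=ignore test_busybox_restore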
kolyshkin added a commit to kolyshkin/runc that referenced this issue Sep 2, 2022
kolyshkin added a commit to kolyshkin/runc that referenced this issue Oct 27, 2022
kolyshkin added a commit to kolyshkin/runc that referenced this issue Nov 2, 2022
kolyshkin added a commit to kolyshkin/runc that referenced this issue Dec 15, 2022