
[centos7] not ok 15 checkpoint --lazy-pages and restore #2760

Closed
kolyshkin opened this issue Jan 22, 2021 · 10 comments · Fixed by #3546

Comments

@kolyshkin
Contributor

kolyshkin commented Jan 22, 2021

After accidentally enabling checkpoint/restore tests on CentOS 7 (in a15ebe5, PR #2757), I found that
the lazy-pages test sometimes fails without providing any decent error:

not ok 15 checkpoint --lazy-pages and restore
# (in test file tests/integration/checkpoint.bats, line 193)
#   `[ "$out" = "0000000 000000 0000001" ]' failed
# runc spec (status=0):
# 
# uffd-noncoop is supported
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev",
#   "id": "test_busybox",
#   "pid": 9727,
#   "status": "running",
#   "bundle": "/tmp/bats-run-7505/busyboxtest",
#   "rootfs": "/tmp/bats-run-7505/busyboxtest/rootfs",
#   "created": "2021-01-22T01:12:07.588049056Z",
#   "owner": ""
# }
# time="2021-01-22T01:12:09Z" level=error msg="container is not destroyed"
# time="2021-01-22T01:12:09Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: work-dir/dump.log"

The command that fails is:

        __runc --criu "$CRIU" checkpoint --lazy-pages --page-server 0.0.0.0:${port} --status-fd ${lazy_w} --work-path ./work-dir --image-path ./image-dir test_busybox &
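For context, the --status-fd synchronization works like this: once the lazy page server is up and listening, CRIU writes a single \0 byte to the given file descriptor, and the test reads that byte to know checkpointing has reached the serving stage. Below is a minimal standalone sketch of that handshake, assuming CRIU's documented --status-fd behavior; the fifo path, port, and timeout are illustrative, not taken from the test:

    mkfifo /tmp/criu-status
    exec {st}<>/tmp/criu-status   # read-write open, so neither end blocks
    __runc --criu "$CRIU" checkpoint --lazy-pages \
        --page-server 0.0.0.0:27277 --status-fd "$st" \
        --work-path ./work-dir --image-path ./image-dir test_busybox &
    # wait (with a timeout) for the \0 readiness byte from CRIU
    timeout 10 dd if=/proc/self/fd/$st bs=1 count=1 >/dev/null 2>&1 \
        && echo "lazy page server is ready"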
@kolyshkin kolyshkin changed the title runc checkpoint --lazy-pages fails on CentOS 7 with no clear diagnostics runc checkpoint --lazy-pages occasionally fails on CentOS 7 with no clear diagnostics Feb 3, 2021
@kolyshkin
Contributor Author

TODO: make it run multiple times, show full log, cross my fingers and hope it will fail.

@kolyshkin
Contributor Author

Got another failure (https://github.com/opencontainers/runc/pull/2087/checks?check_run_id=1841731496), this time with slightly different output:

not ok 15 checkpoint --lazy-pages and restore
# (in test file tests/integration/checkpoint.bats, line 193)
#   `[ "$out" = "0000000 000000 0000001" ]' failed
# runc spec (status=0):
# 
# uffd-noncoop is supported
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev".
#   "id": "test_busybox",
#   "pid": 23602,
#   "status": "running",
#   "bundle": "/tmp/busyboxtest",
#   "rootfs": "/tmp/busyboxtest/rootfs",
#   "created": "2021-02-05T20:42:22.817417601Z",
#   "owner": ""
# }
# time="2021-02-05T20:42:25Z" level=error msg="read unixpacket @->@: EOF"

@kolyshkin kolyshkin changed the title runc checkpoint --lazy-pages occasionally fails on CentOS 7 with no clear diagnostics runc checkpoint --lazy-pages occasionally fails Feb 10, 2021
@kolyshkin kolyshkin changed the title runc checkpoint --lazy-pages occasionally fails runc checkpoint --lazy-pages occasionally fails on CentOS 7 with no clear diagnostics Feb 10, 2021
@kolyshkin
Contributor Author

It is now failing with

time="2021-02-11T19:42:06Z" level=error msg="read unixpacket @->@: EOF"

@kolyshkin
Contributor Author

Another failure in #2768 with the same diagnostics; for details see https://github.com/opencontainers/runc/pull/2768/checks?check_run_id=1975173190

@kolyshkin
Contributor Author

Another failure, this time from #2835. From https://github.com/opencontainers/runc/pull/2835/checks?check_run_id=2060556750:

not ok 16 checkpoint --lazy-pages and restore
# (in test file tests/integration/checkpoint.bats, line 191)
#   `[ "$out" = "0000000 000000 0000001" ]' failed
# runc spec (status=0):
# 
# uffd-noncoop is supported
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev",
#   "id": "test_busybox",
#   "pid": 22056,
#   "status": "running",
#   "bundle": "/tmp/bats-run-20266/runc.MHiJRh/bundle",
#   "rootfs": "/tmp/bats-run-20266/runc.MHiJRh/bundle/rootfs",
#   "created": "2021-03-08T21:57:51.397628015Z",
#   "owner": ""
# }
# time="2021-03-08T21:58:13Z" level=error msg="criu failed: type NOTIFY errno 3\nlog file: work-dir/dump.log"

@kolyshkin
Contributor Author

OK, we got a full log now.

The error comes from this code:

@test "checkpoint --lazy-pages and restore" {
# check if lazy-pages is supported
if ! "${CRIU}" check --feature uffd-noncoop; then
skip "this criu does not support lazy migration"
fi
setup_pipes
runc_run_with_pipes test_busybox
# checkpoint the running container
mkdir image-dir
mkdir work-dir
# For lazy migration we need to know when CRIU is ready to serve
# the memory pages via TCP.
exec {pipe}<> <(:)
# shellcheck disable=SC2094
exec {lazy_r}</proc/self/fd/$pipe {lazy_w}>/proc/self/fd/$pipe
exec {pipe}>&-
# TCP port for lazy migration
port=27277
__runc --criu "$CRIU" checkpoint --lazy-pages --page-server 0.0.0.0:${port} --status-fd ${lazy_w} --work-path ./work-dir --image-path ./image-dir test_busybox &
cpt_pid=$!
# wait for lazy page server to be ready
out=$(timeout 2 dd if=/proc/self/fd/${lazy_r} bs=1 count=1 2>/dev/null | od)
exec {lazy_w}>&-
# shellcheck disable=SC2116,SC2086
out=$(echo $out) # rm newlines
# show log in case something is wrong before we fail
cat ./work-dir/dump.log
# expecting \0 which od prints as
[ "$out" = "0000000 000000 0000001" ]

The log (from https://github.com/opencontainers/runc/pull/2852/checks?check_run_id=2122290178) is here:

not ok 16 checkpoint --lazy-pages and restore
# (in test file tests/integration/checkpoint.bats, line 191)
#   `[ "$out" = "0000000 000000 0000001" ]' failed
# runc spec (status=0):
# 
# uffd-noncoop is supported
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev",
#   "id": "test_busybox",
#   "pid": 9452,
#   "status": "running",
#   "bundle": "/tmp/bats-run-7677/runc.OBEIVy/bundle",
#   "rootfs": "/tmp/bats-run-7677/runc.OBEIVy/bundle/rootfs",
#   "created": "2021-03-16T14:24:55.509384475Z",
#   "owner": ""
# }
# (00.000035) Version: 3.15 (gitid 0)
# (00.000061) Running on localhost.localdomain Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64
# (00.000065) Would overwrite RPC settings with values from /etc/criu/runc.conf
# (00.000086) Loaded kdat cache from /run/criu/criu.kdat
# (00.000152) ========================================
# (00.000157) Dumping processes (pid: 9452)
# (00.000160) ========================================
# (00.000169) rlimit: RLIMIT_NOFILE unlimited for self
# (00.000176) Running pre-dump scripts
# (00.000179) 	RPC
# c *:* m
# c 1:3 rwm
# c 1:5 rwm
# c 1:7 rwm
# c 1:8 rwm
# c 1:9 rwm
# c 5:0 rwm
# c 5:2 rwm
# c 10:200 rwm
# c 136:* rwm from /proc/self/fd/25/test_busybox/devices.list
# (00.941387) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/cgroup.clone_children
# (00.941401) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/notify_on_release
# (00.941418) cg: Dumping value  from /proc/self/fd/25/test_busybox/cgroup.procs
# (00.941432) cg: Dumping value  from /proc/self/fd/25/test_busybox/tasks
# (01.000305) cg: adding cgroup /proc/self/fd/25/test_busybox
# (01.000345) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/cgroup.clone_children
# (01.000360) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/notify_on_release
# (01.000390) cg: Dumping value  from /proc/self/fd/25/test_busybox/cgroup.procs
# (01.000404) cg: Dumping value  from /proc/self/fd/25/test_busybox/tasks
# (01.030145) cg: adding cgroup /proc/self/fd/25/test_busybox
# (01.030184) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/cgroup.clone_children
# (01.030199) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/notify_on_release
# (01.030216) cg: Dumping value  from /proc/self/fd/25/test_busybox/cgroup.procs
# (01.030230) cg: Dumping value  from /proc/self/fd/25/test_busybox/tasks
# (01.068097) cg: adding cgroup /proc/self/fd/25/test_busybox
# (01.068135) cg: Dumping value 9223372036854771712 from /proc/self/fd/25/test_busybox/memory.limit_in_bytes
# (01.068150) cg: Dumping value 9223372036854771712 from /proc/self/fd/25/test_busybox/memory.memsw.limit_in_bytes
# (01.068163) cg: Dumping value 30 from /proc/self/fd/25/test_busybox/memory.swappiness
# (01.068175) cg: Dumping value 9223372036854771712 from /proc/self/fd/25/test_busybox/memory.soft_limit_in_bytes
# (01.068188) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/memory.move_charge_at_immigrate
# (01.068222) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/memory.oom_control
# (01.068236) cg: Dumping value 1 from /proc/self/fd/25/test_busybox/memory.use_hierarchy
# (01.068249) cg: Dumping value 9223372036854771712 from /proc/self/fd/25/test_busybox/memory.kmem.limit_in_bytes
# (01.068262) cg: Dumping value 9223372036854771712 from /proc/self/fd/25/test_busybox/memory.kmem.tcp.limit_in_bytes
# (01.068275) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/cgroup.clone_children
# (01.068288) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/notify_on_release
# (01.068304) cg: Dumping value  from /proc/self/fd/25/test_busybox/cgroup.procs
# (01.068317) cg: Dumping value  from /proc/self/fd/25/test_busybox/tasks
# (01.100200) cg: adding cgroup /proc/self/fd/25/user.slice/user-1000.slice/session-10.scope/test_busybox
# (01.100239) cg: Dumping value 0 from /proc/self/fd/25/user.slice/user-1000.slice/session-10.scope/test_busybox/cgroup.clone_children
# (01.100255) cg: Dumping value 1 from /proc/self/fd/25/user.slice/user-1000.slice/session-10.scope/test_busybox/notify_on_release
# (01.100273) cg: Dumping value  from /proc/self/fd/25/user.slice/user-1000.slice/session-10.scope/test_busybox/cgroup.procs
# (01.100288) cg: Dumping value  from /proc/self/fd/25/user.slice/user-1000.slice/session-10.scope/test_busybox/tasks
# (01.110880) cg: adding cgroup /proc/self/fd/25/test_busybox
# (01.110920) cg: Dumping value lo 0
# eth0 0 from /proc/self/fd/25/test_busybox/net_prio.ifpriomap
# (01.110937) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/cgroup.clone_children
# (01.110950) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/notify_on_release
# (01.110966) cg: Dumping value  from /proc/self/fd/25/test_busybox/cgroup.procs
# (01.110980) cg: Dumping value  from /proc/self/fd/25/test_busybox/tasks
# (01.110994) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/net_cls.classid
# (01.111007) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/cgroup.clone_children
# (01.111020) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/notify_on_release
# (01.111033) cg: Dumping value  from /proc/self/fd/25/test_busybox/cgroup.procs
# (01.111045) cg: Dumping value  from /proc/self/fd/25/test_busybox/tasks
# (04.293129) cg: adding cgroup /proc/self/fd/25/test_busybox
# (04.293162) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/cgroup.clone_children
# (04.293177) cg: Dumping value 0 from /proc/self/fd/25/test_busybox/notify_on_release
# (04.293193) cg: Dumping value  from /proc/self/fd/25/test_busybox/cgroup.procs
# (04.293206) cg: Dumping value  from /proc/self/fd/25/test_busybox/tasks
# time="2021-03-16T14:24:59Z" level=error msg="container is not destroyed"
# time="2021-03-16T14:24:59Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: work-dir/dump.log"
ok 17 checkpoint and restore in external network namespace

Alas, it's still unclear what is going on :(

//cc @adrianreber @avagin

@kolyshkin kolyshkin changed the title runc checkpoint --lazy-pages occasionally fails on CentOS 7 with no clear diagnostics [centos7] not ok 15 checkpoint --lazy-pages and restore Apr 6, 2021
@kolyshkin
Contributor Author

One more failure, from this run: https://github.com/opencontainers/runc/pull/2836/checks?check_run_id=2299143195

not ok 17 checkpoint --lazy-pages and restore
# (in test file tests/integration/checkpoint.bats, line 208)
#   `[ "$out" = "0000000 000000 0000001" ]' failed
# runc spec (status=0):
# 
# uffd-noncoop is supported
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev",
#   "id": "test_busybox",
#   "pid": 9752,
#   "status": "running",
#   "bundle": "/tmp/bats-run-7725/runc.ZdTTis/bundle",
#   "rootfs": "/tmp/bats-run-7725/runc.ZdTTis/bundle/rootfs",
#   "created": "2021-04-08T17:26:00.579771818Z",
#   "owner": ""
# }
# time="2021-04-08T17:26:02Z" level=error msg="read unixpacket @->@: EOF"

At this point I am inclined to disable this test for CentOS 7 :-(

@adrianreber
Contributor

> At this point I am inclined to disable this test for CentOS 7 :-(

I agree. As far as I know, there are no users of runc's lazy migration support at any higher level, and CentOS 7 is really old by now. If we were to get CRIU bugs on CentOS 7, I would also recommend upgrading to something newer.

@kolyshkin
Contributor Author

kolyshkin commented Apr 28, 2021

Alas, we also have this test failing on Fedora; the errors look very similar, PTAL: #2760 #2924

kolyshkin added a commit to kolyshkin/runc that referenced this issue Aug 4, 2022
When doing a lazy checkpoint/restore, we should not restore into the
same cgroup, otherwise there is a race which results in the occasional
killing of the restored container (GH opencontainers#2760, opencontainers#2924).

The fix is to use --manage-cgroup-mode=ignore, which allows restoring
into a different cgroup.

Note that since cgroupsPath is not set in config.json, the cgroup is
derived from the container name, so calling set_cgroups_path is not
needed.

For the previous (unsuccessful) attempt to fix this, as well as a detailed
(and apparently correct) analysis, see commit 36fe3cc.

Signed-off-by: Kir Kolyshkin <[email protected]>
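For illustration, a restore using the mode described above might look like the sketch below. This is an assumption based on the commit message rather than a quote from the fix; in runc the flag is spelled --manage-cgroups-mode, and the restored container name here is made up:

    runc --criu "$CRIU" restore --lazy-pages --manage-cgroups-mode=ignore \
        --work-path ./work-dir --image-path ./image-dir test_busybox_restore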