[fedora] not ok 16 checkpoint --lazy-pages and restore #2924
@adrianreber @avagin @rppt PTAL 🙏🏻
I think it is related to #2760
When restore fails, we do this (runc/tests/integration/checkpoint.bats, lines 92 to 93 in c2c35ae):

which, together with the above log, tells us there were no … Let me know if adding something like this:

```
echo "Lazy pages log (if available):"
cat ./image-dir/lazy-pages.log || true
```

will help in figuring this out?
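A minimal sketch of why the `|| true` matters in a bats test: bats treats any failing command as a test failure, so dumping an optional log file must not propagate `cat`'s error when the file is absent (the `./image-dir/lazy-pages.log` path is taken from the snippet above):

```shell
# Dump the lazy-pages log if it exists, without failing the test run
# when it does not (the file is absent if lazy-pages never started).
echo "Lazy pages log (if available):"
cat ./image-dir/lazy-pages.log 2>/dev/null || true
status=$?   # "|| true" guarantees this is 0 even when the file is missing
echo "exit status: $status"
```
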
I think that the problem is, even if the container name is different, it is restored into the same cgroup (and maybe systemd is killing that cgroup).

```
[kir@kir-rhat runc-tst-cpt]$ sudo ../runc/runc checkpoint xx666
[kir@kir-rhat runc-tst-cpt]$ sudo ../runc/runc restore -d --console-socket tty.sock xx666_restored
INFO[0000] container: &{id:xx666_restored root:/run/runc/xx666_restored config:0xc00033a000 cgroupManager:0xc0003421e0 intelRdtManager:<nil> initPath:/proc/self/exe initArgs:[../runc/runc init] initProcess:<nil> initProcessStartTime:0 criuPath:criu newuidmapPath: newgidmapPath: m:{state:0 sema:0} criuVersion:0 state:0xc000298b10 created:{wall:0 ext:0 loc:<nil>} fifo:<nil>}
INFO[0000] container.config: {NoPivotRoot:false ParentDeathSignal:0 Rootfs:/home/kir/go/src/github.com/opencontainers/runc-tst-cpt/rootfs Umask:<nil> Readonlyfs:true RootPropagation:0 Mounts:[0xc00033c000 0xc00033c0b0 0xc00033c160 0xc00033c210 0xc00033c2c0 0xc00033c370 0xc00033c420] Devices:[0x55b5cf03cdc0 0x55b5cf03ce20 0x55b5cf03ce80 0x55b5cf03cee0 0x55b5cf03cf40 0x55b5cf03cfa0] MountLabel: Hostname:runc Namespaces:[{Type:NEWPID Path:} {Type:NEWNET Path:} {Type:NEWIPC Path:} {Type:NEWUTS Path:} {Type:NEWNS Path:} {Type:NEWCGROUP Path:}] Capabilities:0xc000292800 Networks:[0xc00033c4d0] Routes:[] Cgroups:0xc0002e16c0 AppArmorProfile: ProcessLabel: Rlimits:[] OomScoreAdj:<nil> UidMappings:[] GidMappings:[] MaskPaths:[/proc/acpi /proc/asound /proc/kcore /proc/keys /proc/latency_stats /proc/timer_list /proc/timer_stats /proc/sched_debug /sys/firmware /proc/scsi] ReadonlyPaths:[/proc/bus /proc/fs /proc/irq /proc/sys /proc/sysrq-trigger] Sysctl:map[] Seccomp:<nil> NoNewPrivileges:true Hooks:map[] Version:1.0.2-dev Labels:[bundle=/home/kir/go/src/github.com/opencontainers/runc-tst-cpt] NoNewKeyring:false IntelRdt:<nil> RootlessEUID:false RootlessCgroups:false}
INFO[0000] container.config.cgroups: {Name:xx666_restored Parent: Path: ScopePrefix: Paths:map[] Resources:0xc000282300 SystemdProps:[]}
[kir@kir-rhat runc-tst-cpt]$ sudo ../runc/runc list
ID              PID       STATUS    BUNDLE                                                    CREATED                          OWNER
xx40_restored   3612232   running   /home/kir/go/src/github.com/opencontainers/runc-tst-cpt   2021-04-28T18:43:55.588709153Z   root
xx666_restored  3637234   running   /home/kir/go/src/github.com/opencontainers/runc-tst-cpt   2021-04-28T20:03:04.521477937Z   root
xx767           3627364   running   /home/kir/go/src/github.com/opencontainers/runc-tst       2021-04-28T19:46:10.002493083Z   root
[kir@kir-rhat runc-tst-cpt]$ cat /proc/3637234/cgroup
0::/user.slice/user-1000.slice/user@1000.service/xx666
```
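The transcript above can be checked mechanically: the cgroup v2 entry in `/proc/<pid>/cgroup` has the form `0::<path>`, and the last path component is the cgroup name. A small sketch (using the literal line observed above rather than a live process) shows that the restored container sits in the original container's cgroup:

```shell
# Extract the cgroup path from the /proc/<pid>/cgroup v2 line seen above.
line='0::/user.slice/user-1000.slice/user@1000.service/xx666'
path=${line#0::}             # strip the "0::" cgroup v2 prefix
echo "restored container is in: $path"
last=$(basename "$path")
echo "last component: $last"  # "xx666", the original name, not xx666_restored
```
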
We'll have more information, but if you suspect it's an issue with cgroups and systemd, this information might be less relevant.
Writing this down before I forget. The issue is that criu is unable to restore into a different cgroup, and thus there is a race between runc+criu restoring the container into the same cgroup and systemd removing the original container's cgroup.
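The race can be illustrated with a toy model that touches no real cgroups: a temporary directory stands in for the original container's cgroup, one side removes it (playing systemd), and the other side then tries to place a "process" into it (playing runc+criu). The timings are chosen so the removal deterministically wins:

```shell
# Toy model of the race described above; no real cgroups are involved.
dir=$(mktemp -d)
( sleep 0.1; rm -rf "$dir" ) &   # "systemd" deleting the old cgroup
sleep 0.3                        # "criu" arrives after the removal
if echo 12345 > "$dir/cgroup.procs" 2>/dev/null; then
  result="restored"
else
  result="restore lost the race"
fi
wait
echo "$result"
```

In the real bug the timings are not controlled, which is why the restored container is only killed occasionally.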
Still the case with Fedora 35 and criu 3.16. From https://cirrus-ci.com/task/6279543066460160:

One more; from https://cirrus-ci.com/task/6333831285309440
When doing a lazy checkpoint/restore, we should not restore into the same cgroup, otherwise there is a race which results in occasional killing of the restored container (GH opencontainers#2760, opencontainers#2924).

The fix is to use --manage-cgroup-mode=ignore, which allows restoring into a different cgroup. Note that since cgroupsPath is not set in config.json, the cgroup is derived from the container name, so calling set_cgroups_path is not needed.

For the previous (unsuccessful) attempt to fix this, as well as a detailed (and apparently correct) analysis, see commit 36fe3cc.

Signed-off-by: Kir Kolyshkin <[email protected]>
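A hedged sketch of what the fixed test invocation looks like: restore under a new name, telling criu to ignore cgroup management so it does not try to put the restored tasks back into the original (possibly dying) cgroup. The `runc` binary is stubbed here so the example runs anywhere; in runc's CLI the flag is spelled `--manage-cgroups-mode` (plural), and the page-server address is illustrative:

```shell
# Stub standing in for the real runc binary; it just echoes its arguments.
runc() { echo "runc $*"; }
# Lazy checkpoint (flags as in the "checkpoint --lazy-pages" test name).
runc checkpoint --lazy-pages --page-server localhost:27277 xx666
# Restore under a new name, with criu's cgroup management disabled.
cmd=$(runc restore --manage-cgroups-mode=ignore -d xx666_restored)
echo "$cmd"
```

Because cgroupsPath is unset in config.json, the new name alone already yields a different cgroup; the flag merely stops criu from reusing the old one.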
This is from a recent CI run (https://github.com/opencontainers/runc/pull/2923/checks?check_run_id=2452887046).

Might be related to #2760.