centos-stream-9 CI is failing for the main branch #3760

Closed
AkihiroSuda opened this issue Mar 8, 2023 · 12 comments · Fixed by #3782

@AkihiroSuda
Member

https://cirrus-ci.com/task/5246167815028736 (6d0261c)

ssh -tt localhost "make -C /home/runc localintegration RUNC_USE_SYSTEMD=yes"
make: Entering directory '/home/runc'
go build -trimpath "-buildmode=pie"  -tags "seccomp urfave_cli_no_docs" -ldflags "-X main.gitCommit=6d0261c1 -X main.version=1.1.0+dev " -o runc .
go build -trimpath "-buildmode=pie"  -tags "seccomp urfave_cli_no_docs" -ldflags "-X main.gitCommit=6d0261c1 -X main.version=1.1.0+dev " -o contrib/cmd/recvtty/recvtty ./contrib/cmd/recvtty
go build -trimpath "-buildmode=pie"  -tags "seccomp urfave_cli_no_docs" -ldflags "-X main.gitCommit=6d0261c1 -X main.version=1.1.0+dev " -o contrib/cmd/sd-helper/sd-helper ./contrib/cmd/sd-helper
go build -trimpath "-buildmode=pie"  -tags "seccomp urfave_cli_no_docs" -ldflags "-X main.gitCommit=6d0261c1 -X main.version=1.1.0+dev " -o contrib/cmd/seccompagent/seccompagent ./contrib/cmd/seccompagent
bats -t tests/integration
1..181
ok 1 runc run no capability
ok 2 runc run with unknown capability
...
ok 97 runc pause and resume
ok 98 runc pause and resume with nonexist container
not ok 99 ps
# (in test file tests/integration/ps.bats, line 27)
#   `[[ "${lines[1]}" == *"$(id -un 2>/dev/null)"*[0-9]* ]]' failed
# runc spec (status=0):
# 
# runc run -d --console-socket /tmp/bats-run-78374/runc.VKglS3/tty/sock test_busybox (status=0):
# 
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.1.0-rc.1",
#   "id": "test_busybox",
#   "pid": 90743,
#   "status": "running",
#   "bundle": "/tmp/bats-run-78374/runc.VKglS3/bundle",
#   "rootfs": "/tmp/bats-run-78374/runc.VKglS3/bundle/rootfs",
#   "created": "2023-03-08T14:43:11.320425119Z",
#   "owner": ""
# }
# runc ps test_busybox (status=0):
# UID          PID    PPID  C STIME TTY          TIME CMD
# /tmp/bats-run-78374/bats.90669.src: line 27: lines[1]: unbound variable
not ok 100 ps -f json
# (in test file tests/integration/ps.bats, line 43)
#   `[[ ${lines[0]} =~ [0-9]+ ]]' failed
# runc spec (status=0):
# 
# runc run -d --console-socket /tmp/bats-run-78374/runc.myy6Ir/tty/sock test_busybox (status=0):
# 
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.1.0-rc.1",
#   "id": "test_busybox",
#   "pid": 90840,
#   "status": "running",
#   "bundle": "/tmp/bats-run-78374/runc.myy6Ir/bundle",
#   "rootfs": "/tmp/bats-run-78374/runc.myy6Ir/bundle/rootfs",
#   "created": "2023-03-08T14:43:11.547583557Z",
#   "owner": ""
# }
# runc ps -f json test_busybox (status=0):
# null
not ok 101 ps -e -x
# (in test file tests/integration/ps.bats, line 60)
#   `[[ "${lines[1]}" =~ [0-9]+ ]]' failed
# runc spec (status=0):
# 
# runc run -d --console-socket /tmp/bats-run-78374/runc.p19TSy/tty/sock test_busybox (status=0):
# 
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.1.0-rc.1",
#   "id": "test_busybox",
#   "pid": 90938,
#   "status": "running",
#   "bundle": "/tmp/bats-run-78374/runc.p19TSy/bundle",
#   "rootfs": "/tmp/bats-run-78374/runc.p19TSy/bundle/rootfs",
#   "created": "2023-03-08T14:43:11.762275245Z",
#   "owner": ""
# }
# runc ps test_busybox -e -x (status=0):
#     PID TTY      STAT   TIME COMMAND
# /tmp/bats-run-78374/bats.90669.src: line 60: lines[1]: unbound variable
ok 102 ps after the container stopped
ok 103 global --root
..
ok 166 update cpu quota with no previous period/quota set
ok 167 update cpu period in a pod cgroup with pod limit set # skip test requires cgroups_v1
not ok 168 update cgroup cpu.idle
# (from function `check_cgroup_value' in file tests/integration/helpers.bash, line 265,
#  in test file tests/integration/update.bats, line 460)
#   `check_cgroup_value "cpu.idle" "1"' failed
# runc spec (status=0):
# 
# runc run -d --console-socket /tmp/bats-run-78374/runc.dtIP36/tty/sock test_update (status=0):
# 
# current 0 !? 0
# runc update -r - test_update (status=0):
# 
# current 1 !? 1
# runc update -r - test_update (status=0):
# 
# current 0 !? 0
# runc update -r - test_update (status=0):
# 
# current 1 !? 1
# runc update --cpu-idle 1 test_update (status=0):
# 
# current 1 !? 1
# runc update --cpu-idle 0 test_update (status=0):
# 
# current 0 !? 0
# runc update --cpu-idle 1 test_update (status=0):
# 
# current 1 !? 1
# runc update --cpu-period 10000 test_update (status=0):
# 
# current 0 !? 1
ok 169 update cgroup v2 resources via unified map
ok 170 update cpuset parameters via resources.CPU
ok 171 update cpuset parameters via v2 unified map
ok 172 update cpuset cpus range via v2 unified map # skip test requires more_than_8_core
ok 173 update rt period and runtime # skip test requires cgroups_v1
ok 174 update devices [minimal transition rules]
ok 175 update paused container
ok 176 update memory vs CheckBeforeUpdate
ok 177 userns with simple mount
ok 178 userns with 2 inaccessible mounts
ok 179 userns with inaccessible mount + exec
ok 180 userns with bind mount before a cgroupfs mount # skip test requires cgroups_v1
ok 181 runc version
make: *** [Makefile:121: localintegration] Error 1
make: Leaving directory '/home/runc'
Connection to localhost closed.
Exit status: 2

@AkihiroSuda
Member Author

CI was passing for #3757, but its merge commit into the main branch is failing 🤔

@kolyshkin
Contributor

Tried updating from Go 1.19 to Go 1.20.2 on a hunch -- didn't help (see #3761).

Looking into it.

kolyshkin added a commit to kolyshkin/runc that referenced this issue Mar 8, 2023
Apparently the reason is some issue in systemd v252-6.el9,
and upgrading to the latest release (v252-8.el9 as of now)
fixes the issue.

Fixes: opencontainers#3760

Signed-off-by: Kir Kolyshkin <[email protected]>

@kolyshkin
Contributor

OK, the issue is with systemd. Sometimes it works normally:

[root@localhost runc-tst]# ./runc --systemd-cgroup run -d 444
[root@localhost runc-tst]# journalctl -b0 | grep 444
Mar 10 00:49:14 localhost systemd[1]: Started libcontainer container 444.
[root@localhost runc-tst]# ./runc list
ID             PID         STATUS      BUNDLE           CREATED                          OWNER
1234           0           stopped     /home/runc-tst   2023-03-10T00:44:52.22112029Z    root
444            102882      running     /home/runc-tst   2023-03-10T00:49:14.558960612Z   root
test_busybox   102815      running     /home/runc-tst   2023-03-10T00:46:59.893227615Z   root
[root@localhost runc-tst]# cat /proc/102882/cgroup 
0::/system.slice/runc-444.scope
[root@localhost runc-tst]# jq '.cgroup_paths' < /run/runc/test_busybox/state.json 
{
  "": "/sys/fs/cgroup/system.slice/runc-test_busybox.scope"
}

And sometimes systemd seems to ignore the request to create a scope, so the container process is not moved to the proper cgroup:

[root@localhost runc-tst]# jq '.cgroup_paths' < /run/runc/444/state.json 
{
  "": "/sys/fs/cgroup/system.slice/runc-444.scope"
}
[root@localhost runc-tst]# cat /proc/102815/cgroup 
0::/system.slice/google-startup-scripts.service
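
The manual comparison above (the cgroup path recorded in state.json versus the cgroup the kernel reports in /proc/PID/cgroup) could be sketched as a small helper. This is only an illustration, not part of runc or its test suite: the function names are hypothetical, and it assumes cgroup v2 mounted at /sys/fs/cgroup, as in the output above.

```shell
# expected_cgroup STATE_PATH
# Strip the cgroup2 mount point (assumed to be /sys/fs/cgroup) from the
# path stored under .cgroup_paths[""] in state.json.
expected_cgroup() {
    printf '%s\n' "${1#/sys/fs/cgroup}"
}

# actual_cgroup PROC_CGROUP_TEXT
# Extract the path from the unified-hierarchy line ("0::/path") of the
# contents of /proc/PID/cgroup.
actual_cgroup() {
    printf '%s\n' "$1" | sed -n 's/^0:://p' | head -n 1
}

# in_expected_cgroup STATE_PATH PROC_CGROUP_TEXT
# Succeed only when the recorded and the actual cgroup agree.
in_expected_cgroup() {
    [ "$(expected_cgroup "$1")" = "$(actual_cgroup "$2")" ]
}

# Possible use on a live system (container id and PID are placeholders):
#   in_expected_cgroup \
#       "$(jq -r '.cgroup_paths[""]' < /run/runc/ID/state.json)" \
#       "$(cat /proc/PID/cgroup)" || echo "container is in the wrong cgroup"
```

With the values shown above, a scope path of runc-test_busybox.scope and a process cgroup of google-startup-scripts.service would make the check fail, which matches the misbehaving case.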

@kolyshkin
Contributor

OK, I was not able to finish this investigation this week. Apparently, setting cgroupsPath in config.json helps, but I'm not sure why the defaults are not good enough (they always worked before).
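
For reference, the config.json setting mentioned here corresponds to the OCI runtime spec field linux.cgroupsPath, which runc's systemd cgroup driver reads in the form slice:prefix:name (giving a unit named prefix-name.scope). The sketch below shows one way such a value could be set with jq. The helper name and the example values are hypothetical, and the thread does not say which value was actually tried.

```shell
# set_cgroups_path CONFIG SLICE PREFIX NAME
# Print CONFIG (an OCI config.json) with linux.cgroupsPath set to
# "SLICE:PREFIX:NAME". The input file is not modified.
set_cgroups_path() {
    jq --arg p "$2:$3:$4" '.linux.cgroupsPath = $p' "$1"
}

# Possible use inside a bundle directory (names are placeholders):
#   set_cgroups_path config.json system.slice runc my_unique_name > config.json.new
#   mv config.json.new config.json
```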

For now, I removed the "required" setting for the centos-stream-9 test so we can still merge some PRs.

Will continue working on this in a week.

@kolyshkin
Contributor

centos-stream-9 also fails for the release-1.1 branch.

@kolyshkin
Contributor

So, what happens is:

  • many tests use the same container name (test_busybox);
  • for some reason, systemd does not remove the unit files after the container is OOM-killed in the events oom test;
  • once that happens, any other container with the same name may fail (which is what happens in the ps test).

Here's an excerpt from the test with added debug output (systemctl show and ls /run/systemd/transient/runc-test_busybox.scope.d):

ok 60 events --stats
ok 61 events --interval default
ok 62 events --interval 1s
ok 63 events --interval 100ms
not ok 64 events oom
# (from function `fail' in file tests/integration/helpers.bash, line 345,
#  from function `teardown_bundle' in file tests/integration/helpers.bash, line 618,
#  from function `teardown' in test file tests/integration/events.bats, line 10)
#   `teardown_bundle' failed
# runc spec (status=0):
#
# runc run -d --console-socket /tmp/bats-run-sW55no/runc.DMoDU9/tty/sock test_busybox (status=0):
#
# Warning: The unit file, source configuration file or drop-ins of runc-test_busybox.scope changed on disk. Run 'systemctl daemon-reload' to reload units.
# × runc-test_busybox.scope - libcontainer container test_busybox
#      Loaded: loaded (/run/systemd/transient/runc-test_busybox.scope; transient)
#   Transient: yes
#     Drop-In: /run/systemd/transient/runc-test_busybox.scope.d
#              └─50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-MemoryMax.conf
#      Active: failed (Result: oom-kill) since Wed 2023-03-22 03:03:21 UTC; 385ms ago
#    Duration: 5.718s
#         CPU: 696ms
#
# Mar 22 03:03:15 localhost systemd[1]: Started libcontainer container test_busybox.
# Mar 22 03:03:21 localhost systemd[1]: runc-test_busybox.scope: A process of this unit has been killed by the OOM killer.
# Mar 22 03:03:21 localhost systemd[1]: runc-test_busybox.scope: Killing process 243972 (sh) with signal SIGKILL.
# Mar 22 03:03:21 localhost systemd[1]: runc-test_busybox.scope: Killing process 244014 (sh) with signal SIGKILL.
# Mar 22 03:03:21 localhost systemd[1]: runc-test_busybox.scope: Killing process 244020 (dd) with signal SIGKILL.
# Mar 22 03:03:21 localhost systemd[1]: runc-test_busybox.scope: Failed with result 'oom-kill'.
# 50-DeviceAllow.conf
# 50-DevicePolicy.conf
# 50-MemoryMax.conf
# should not have test_busybox left
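
The leftover state visible in this excerpt (drop-in files still present under /run/systemd/transient after the container is gone) could be detected with a helper along these lines. It is a sketch only: the function name is hypothetical, the runc-NAME.scope.d naming is taken from the output above, and the optional second argument exists just so the check can be pointed at another directory.

```shell
# leftover_dropins NAME [TRANSIENT_DIR]
# List drop-in files left behind for the transient unit runc-NAME.scope.
# Returns 0 and prints the file names if any exist, 1 otherwise.
# TRANSIENT_DIR defaults to /run/systemd/transient.
leftover_dropins() {
    dropin_dir="${2:-/run/systemd/transient}/runc-$1.scope.d"
    [ -d "$dropin_dir" ] || return 1
    found=$(ls -A "$dropin_dir")
    [ -n "$found" ] || return 1
    printf '%s\n' "$found"
}

# Possible use before starting a container that reuses a name:
#   if leftover_dropins test_busybox; then
#       echo "stale transient unit files for test_busybox" >&2
#   fi
```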

@kolyshkin
Contributor

IOW, this smells like a systemd bug in CentOS Stream 9, because runc delete is called by the test and it makes a "stopUnit" call to systemd (which is somehow being ignored).

The next step is to create a repro and file a bug against systemd.

@kolyshkin
Contributor

Ughm. There's a major bug in how runc handles creation of a systemd unit. I will file a bug.

#3780

@kolyshkin
Contributor

Fixed by #3782

...and #3788
