
When systemd is busy, runc init will hang and cannot exit, eventually ending up in the D state. #3904

Open
113xiaoji opened this issue Jun 15, 2023 · 6 comments

Comments

@113xiaoji

113xiaoji commented Jun 15, 2023

Description

In a scenario where a single node concurrently deploys more than 100 containers, the call chain is kubelet -> containerd -> containerd-shim-runc-v2 -> runc create -> runc init -> dbus -> systemd -> cgroup. However, systemd is single-threaded, and when it is busy it continuously occupies one core. As a result, the runc main process cannot get a response in a timely manner and, with no timeout mechanism, ends up deadlocked. The related flame graphs are shown below:

(flame graph screenshots: the main goroutine blocked on a channel receive, and a ReadMsgUnix goroutine waiting for a reply)

As shown, the main goroutine is blocked waiting for a message on a channel, while another goroutine, blocked in ReadMsgUnix, never receives a response, so the main goroutine blocks indefinitely.

The primary issue is that runc init, acting as a client, has no timeout mechanism. A potential solution would be to introduce one, enabling runc to actively disconnect in abnormal scenarios such as systemd being overly busy. We could give the startUnit function in libcontainer/cgroups/systemd/common.go a context with a timeout: currently startUnit uses context.TODO(), which never expires, and replacing it with a context created by context.WithTimeout(context.Background(), 30*time.Second) (with the returned cancel function deferred rather than discarded) would ensure that runc does not wait indefinitely on an unresponsive component.
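
A minimal sketch of that change, assuming startUnit keeps roughly its current shape (the 30-second deadline is an arbitrary choice, and startUnitWithTimeout is a hypothetical name used only for illustration):

import (
	"context"
	"time"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

// Sketch only: bound the StartTransientUnitContext call with a deadline so
// that a busy systemd cannot block runc init forever.
func startUnitWithTimeout(c *systemdDbus.Conn, unitName string,
	properties []systemdDbus.Property, statusChan chan string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel() // release the timer even when the call returns early

	_, err := c.StartTransientUnitContext(ctx, unitName, "replace", properties, statusChan)
	return err
}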

Steps to reproduce the issue

  1. A single node deploys more than 100 pods at the same time with k8s.

Describe the results you received and expected

  1. Received: runc init hangs and eventually ends up in the D (uninterruptible sleep) state.
  2. Expected: the process does not end up in the D state.

What version of runc are you using?

1.1.2

Host OS information

x86

Host kernel information

Linux master1 5.10.0-60.18.0.50.h665.eulerosv2r11.x86_64 #1 SMP Fri Dec 23 16:12:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

@113xiaoji 113xiaoji changed the title from "When the system is busy, runc init will hang and cannot exit, eventually ending up in D state." to "When systemd is busy, runc init will hang and cannot exit, eventually ending up in the D state." Jun 15, 2023
@kolyshkin
Contributor

runc init is there for a reason -- you can use runc create and runc start separately, and in between these two invocations runc init is the process that keeps the created container alive.

OTOH it makes sense to pass a context with a timeout to c.StartTransientUnitContext.

Can you check if something like this fixes your issue? The patch is against the runc main branch.

diff --git a/libcontainer/cgroups/systemd/common.go b/libcontainer/cgroups/systemd/common.go
index c577a22b..c970e941 100644
--- a/libcontainer/cgroups/systemd/common.go
+++ b/libcontainer/cgroups/systemd/common.go
@@ -130,7 +130,9 @@ func startUnit(cm *dbusConnManager, unitName string, properties []systemdDbus.Pr
 
 retry:
        err := cm.retryOnDisconnect(func(c *systemdDbus.Conn) error {
-               _, err := c.StartTransientUnitContext(context.TODO(), unitName, "replace", properties, statusChan)
+               ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+               defer cancel()
+               _, err := c.StartTransientUnitContext(ctx, unitName, "replace", properties, statusChan)
                return err
        })
        if err != nil {

@113xiaoji
Author

113xiaoji commented Jun 27, 2023

@kolyshkin
After studying the source code, I found that with cgroup v2 the Set method no longer needs to freeze and thaw. However, even on cgroup v2, some code paths still freeze and thaw: for instance, signalAllProcesses does so regardless of whether cgroup v1 or v2 is in use. I am considering switching to cgroup v2 and dropping all freeze/thaw logic. The motivation is an extreme scenario: after the runc task freezes runc init, if the runc task itself hangs, there is no process left to thaw runc init. runc init then ends up in the D state and, since it is reparented to systemd, becomes an orphan process. If, on cgroup v2, I stop executing all freeze and thaw logic except when a frozen state is explicitly requested on the command line, would that cause any issues?

libcontainer/cgroups/systemd/v2.go

func (m *UnifiedManager) Set(r *configs.Resources) error {
	if r == nil {
		return nil
	}
	properties, err := genV2ResourcesProperties(m.fsMgr.Path(""), r, m.dbus)
	if err != nil {
		return err
	}

	if err := setUnitProperties(m.dbus, getUnitName(m.cgroups), properties...); err != nil {
		return fmt.Errorf("unable to set unit properties: %w", err)
	}

	return m.fsMgr.Set(r)
}

libcontainer/init_linux.go

// signalAllProcesses freezes then iterates over all the processes inside the
// manager's cgroups sending the signal s to them.
// If s is SIGKILL then it will wait for each process to exit.
// For all other signals it will check if the process is ready to report its
// exit status and only if it is will a wait be performed.
func signalAllProcesses(m cgroups.Manager, s os.Signal) error {
	var procs []*os.Process
	if err := m.Freeze(configs.Frozen); err != nil {
		logrus.Warn(err)
	}
	pids, err := m.GetAllPids()
	if err != nil {
		if err := m.Freeze(configs.Thawed); err != nil {
			logrus.Warn(err)
		}
		return err
	}
	for _, pid := range pids {
		p, err := os.FindProcess(pid)
		if err != nil {
			logrus.Warn(err)
			continue
		}
		procs = append(procs, p)
		if err := p.Signal(s); err != nil {
			logrus.Warn(err)
		}
	}
	if err := m.Freeze(configs.Thawed); err != nil {
		logrus.Warn(err)
	}

	subreaper, err := system.GetSubreaper()
	if err != nil {
		// The error here means that PR_GET_CHILD_SUBREAPER is not
		// supported because this code might run on a kernel older
		// than 3.4. We don't want to throw an error in that case,
		// and we simplify things, considering there is no subreaper
		// set.
		subreaper = 0
	}

	for _, p := range procs {
		if s != unix.SIGKILL {
			if ok, err := isWaitable(p.Pid); err != nil {
				if !errors.Is(err, unix.ECHILD) {
					logrus.Warn("signalAllProcesses: ", p.Pid, err)
				}
				continue
			} else if !ok {
				// Not ready to report so don't wait
				continue
			}
		}

		// In case a subreaper has been setup, this code must not
		// wait for the process. Otherwise, we cannot be sure the
		// current process will be reaped by the subreaper, while
		// the subreaper might be waiting for this process in order
		// to retrieve its exit code.
		if subreaper == 0 {
			if _, err := p.Wait(); err != nil {
				if !errors.Is(err, unix.ECHILD) {
					logrus.Warn("wait: ", err)
				}
			}
		}
	}
	return nil
}

In the signalAllProcesses method, it is unclear why the freeze and thaw operations are needed. In most scenarios, calls to m.GetAllPids() and p.Signal(s) do not require freezing and thawing. Why does this particular method need to freeze the cgroup first?

@kolyshkin
Contributor

@113xiaoji I'm sorry, but in this issue we were discussing the runc hang caused by a non-responding systemd, and now it's about freezing. Maybe you added the comment to the wrong issue and in fact meant #3803? Please clarify.

As for signalAllProcesses, its logic (and the logic of its callers) was changed in #3825.

@kolyshkin kolyshkin reopened this Jun 27, 2023
@113xiaoji
Author

113xiaoji commented Jun 28, 2023

@kolyshkin Thank you for your response. In this issue we are focusing on the runc hang caused by systemd not responding. I believe that all APIs that interact with systemd, such as startUnit, stopUnit, resetFailedUnit, getUnitTypeProperty, and setUnitProperties, should get a timeout mechanism; otherwise, when systemd is busy, we cannot predict at exactly which step runc will hang. In our scenario a large number of Pods are being deployed, and since systemd has to watch /proc/self/mountinfo and resynchronize on every change, a large number of mount points makes systemd extremely busy. A related issue is #2532.
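
As a rough sketch, assuming a hypothetical helper named withDbusTimeout (the helper name and the 30-second default are illustrative, not existing runc code), each of those call sites in libcontainer/cgroups/systemd/common.go could be wrapped like this:

import (
	"context"
	"time"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

// withDbusTimeout is a hypothetical helper: it runs a single systemd D-Bus
// call with a bounded context so that a busy systemd cannot block runc forever.
func withDbusTimeout(op func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	return op(ctx)
}

// Example use inside startUnit; stopUnit, resetFailedUnit, getUnitTypeProperty
// and setUnitProperties would be wrapped the same way.
func startUnitSketch(cm *dbusConnManager, unitName string,
	properties []systemdDbus.Property, statusChan chan string) error {
	return cm.retryOnDisconnect(func(c *systemdDbus.Conn) error {
		return withDbusTimeout(func(ctx context.Context) error {
			_, err := c.StartTransientUnitContext(ctx, unitName, "replace", properties, statusChan)
			return err
		})
	})
}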

@113xiaoji
Author

I will discuss the issue of runc init becoming an orphan process in #3803 and #3825.

@cyphar
Member

cyphar commented Jul 5, 2023

I wouldn't mind adding timeouts, but note that this will not make containers start without issue -- it will make containers fail to start in cases where we need to talk to systemd to start them. In the case of Kubernetes, I would expect Kubernetes to retry starting the container, which probably would not help with the system load issue. Note also that the original issue that caused the systemd busy-ness has been fixed (#2532).
