
Pebble service is not restarted on on-check-failure if the OS process was killed #149

Closed
marceloneppel opened this issue Sep 20, 2022 · 17 comments · Fixed by #275

@marceloneppel commented Sep 20, 2022

Issue

If the OS process behind a Pebble service is killed (for example with pkill --signal SIGKILL -f process_name) and that service has a health check that instructs Pebble to restart it on a check failure (the on-check-failure field from the Pebble layer spec), Pebble tries to restart the process but fails with a no such process error. The number of check failures keeps growing, and Pebble doesn't fall back to, for example, simply starting the service, even though there is no old OS process left to stop before starting the service again.

There are more details in an example repo that I used to test this behaviour. It includes a longer description and logs comparing what happens with Pebble versus another tool (supervisor in this case) when starting the service and later killing the OS process.

Is there a way to solve this (i.e. make the service restart correctly via the Pebble checks) using other fields from the Pebble layer? I found a workaround that I implemented in my charm code, but I would like to confirm whether this is a bug or not.
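
For reference, here is a minimal sketch of the kind of layer involved, written as the Python dict a charm would pass to Pebble. The check name, period, threshold and HTTP endpoint are hypothetical; the field names follow the Pebble layer spec, and the command is the one from the logs further down.

# Sketch only: a service plus a health check whose failure should restart it.
layer = {
    "summary": "patroni layer",
    "services": {
        "patroni": {
            "override": "replace",
            "command": "/usr/bin/python3 /usr/local/bin/patroni /home/postgres/patroni.yml",
            "startup": "enabled",
            # Restart the service whenever the named check fails.
            "on-check-failure": {"patroni-ready": "restart"},
        },
    },
    "checks": {
        "patroni-ready": {  # hypothetical check name
            "override": "replace",
            "period": "10s",
            "threshold": 3,
            "http": {"url": "http://localhost:8008/health"},  # hypothetical endpoint
        },
    },
}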

@benhoyt commented Sep 21, 2022

Hi @marceloneppel, thanks for the report. This is a strange one. The health check shouldn't actually be needed here, as the default behaviour for on-failure is restart. So as soon as SIGKILL is sent to the process you should see something like this in the Pebble log:

2022-09-21T22:57:05.707Z [pebble] DEBUG Reaper received SIGCHLD.
2022-09-21T22:57:05.707Z [pebble] DEBUG Reaped PID 113973 which exited with code 137.
2022-09-21T22:57:05.707Z [pebble] DEBUG Service "svc1" exited with code 137.
2022-09-21T22:57:05.707Z [pebble] Service "patroni" stopped unexpectedly with code 137
2022-09-21T22:57:05.707Z [pebble] Service "patroni" on-failure action is "restart", waiting ~500ms before restart (backoff 1)

Note that the DEBUG lines will only be visible if you start Pebble with the PEBBLE_DEBUG=1 environment variable set, but the others should show up in any case. The fact that this isn't showing up in your logs indicates that maybe the /usr/local/bin/patroni script is doing something funky -- maybe it's forking or going into daemon mode and confusing Pebble's child reaper?

Can you try a couple of things for me?

  • I presume you're using the latest master version of Pebble?
  • Turn on debug logs with PEBBLE_DEBUG=1 when running Pebble.
  • Try your test case again with a simpler script than patroni, for example:
import http.server
import socketserver

PORT = 8081

Handler = http.server.SimpleHTTPRequestHandler

with socketserver.TCPServer(("", PORT), Handler) as httpd:
    print("serving at port", PORT)
    httpd.serve_forever()

And let me know whether you still see the same behaviour then.

If not, it's something Patroni-specific, and we can dig into your Patroni repro case a bit deeper.

@marceloneppel commented Sep 21, 2022

Hi @benhoyt! Thanks for all this information.

I checked the things you asked about:

  • I am currently using the latest master version of Pebble.
  • When enabling debug logs I got the following logs (Patroni was not restarted):
2022-09-21T23:47:11.098Z [pebble] DEBUG socket "/home/postgres/pebble/.pebble.socket" was not activated; listening
2022-09-21T23:47:11.098Z [pebble] DEBUG socket "/home/postgres/pebble/.pebble.socket.untrusted" was not activated; listening
2022-09-21T23:47:11.098Z [pebble] DEBUG Reaper started, waiting for SIGCHLD.
2022-09-21T23:47:11.098Z [pebble] Started daemon.
2022-09-21T23:47:11.098Z [pebble] DEBUG activation done in 5ms
2022-09-21T23:47:11.099Z [pebble] DEBUG Configuring check manager (stopping 0, starting 0)
2022-09-21T23:47:11.102Z [pebble] DEBUG pid=1;uid=0;socket=/home/postgres/pebble/.pebble.socket; POST /v1/services 3.346246ms 202
2022-09-21T23:47:11.102Z [pebble] POST /v1/services 3.346246ms 202
2022-09-21T23:47:11.102Z [pebble] DEBUG Running task 1 on Do: Start service "patroni"
2022-09-21T23:47:11.102Z [pebble] Started default services with change 1.
2022-09-21T23:47:11.104Z [pebble] Service "patroni" starting: /usr/bin/python3 /usr/local/bin/patroni /home/postgres/patroni.yml
2022-09-21T23:47:11.105Z [pebble] DEBUG Service "patroni" started with PID 18
2022-09-21T23:47:11.105Z [pebble] DEBUG Service "patroni" transitioning to state "starting"
2022-09-21T23:47:12.106Z [pebble] DEBUG Service "patroni" transitioning to state "running"
2022-09-21T23:47:41.105Z [pebble] DEBUG Service "patroni" backoff reset elapsed, resetting backoff state (was 0: 0s)
2022-09-21T23:48:01.146Z [pebble] DEBUG Reaper received SIGCHLD.
2022-09-21T23:48:01.146Z [pebble] DEBUG Reaped PID 18 which exited with code 137.
  • And when trying with the simpler script you shared, I noticed some backoff states, but the service was restarted (logs here).

So, in summary, it is something specific to Patroni. I am going to gather some more information and maybe create another simple script that behaves similarly to Patroni (as you suggested, perhaps running as a daemon or forking), so we can better understand what is happening in that case.

@benhoyt commented Sep 22, 2022

Were there any other logs after that last line? From following the code through: after DEBUG Reaped PID 18 ... I would have expected to see DEBUG Service "patroni" exited with code 137. and other logs.

@marceloneppel commented Sep 22, 2022

Were there any other logs after that last line? From following the code through: after DEBUG Reaped PID 18 ... I would have expected to see DEBUG Service "patroni" exited with code 137. and other logs.

No, unfortunately there were no more logs. I waited for ten minutes and checked the logs again. The last line in the logs is DEBUG Reaped PID 18 which exited with code 137.

@marceloneppel commented Sep 22, 2022

@benhoyt, I managed to create a simple script that shows the same behaviour as Patroni: when it is killed, Pebble doesn't restart it.

It is a script that starts another process using subprocess:

import subprocess
import time


subprocess.Popen(["/usr/bin/python3", "/home/app/script2.py"])

# Just a loop to keep the program running.
while True:
    time.sleep(30)

The other script that is started (script2.py) can be anything (in my case I used the TCP server code that you shared with me).

After killing the first script's process, I get the following logs:

2022-09-22T22:49:39.020Z [pebble] DEBUG socket "/home/app/pebble/.pebble.socket" was not activated; listening
2022-09-22T22:49:39.020Z [pebble] DEBUG Reaper started, waiting for SIGCHLD.
2022-09-22T22:49:39.020Z [pebble] DEBUG socket "/home/app/pebble/.pebble.socket.untrusted" was not activated; listening
2022-09-22T22:49:39.020Z [pebble] Started daemon.
2022-09-22T22:49:39.020Z [pebble] DEBUG activation done in 6ms
2022-09-22T22:49:39.021Z [pebble] DEBUG Configuring check manager (stopping 0, starting 0)
2022-09-22T22:49:39.024Z [pebble] DEBUG Running task 1 on Do: Start service "app"
2022-09-22T22:49:39.024Z [pebble] DEBUG pid=1;uid=0;socket=/home/app/pebble/.pebble.socket; POST /v1/services 2.660573ms 202
2022-09-22T22:49:39.024Z [pebble] POST /v1/services 2.660573ms 202
2022-09-22T22:49:39.024Z [pebble] Started default services with change 1.
2022-09-22T22:49:39.025Z [pebble] Service "app" starting: /usr/bin/python3 /home/app/script1.py
2022-09-22T22:49:39.026Z [pebble] DEBUG Service "app" started with PID 15
2022-09-22T22:49:39.026Z [pebble] DEBUG Service "app" transitioning to state "starting"
2022-09-22T22:49:40.027Z [pebble] DEBUG Service "app" transitioning to state "running"
2022-09-22T22:50:09.026Z [pebble] DEBUG Service "app" backoff reset elapsed, resetting backoff state (was 0: 0s)
2022-09-22T22:50:14.557Z [pebble] DEBUG Reaper received SIGCHLD.
2022-09-22T22:50:14.557Z [pebble] DEBUG Reaped PID 15 which exited with code 137.

You can check all the files here.

@benhoyt commented Sep 28, 2022

Thanks for the minimal repro -- I can repro this easily locally now. It will probably happen for any service that itself starts up a child process. What's happening is Pebble's "process reaper" detects that the parent PID exited, and then it calls Go's os/exec's cmd.Wait() to clean up ... but that waits for all the child processes (I guess the process group) to terminate.

So if you just send SIGKILL to the parent process, Pebble won't detect that it died ... until all the child processes die or are killed too.

You note that it's different from supervisord (I haven't tested systemd). I could be wrong, but this seems to me a feature, not a bug: it wouldn't be ideal to be restarting the parent process when there are running child processes still around. In the case of Patroni, it would mean Pebble would try to restart Patroni, which would restart a bunch of Postgres processes ... but Postgres is already running.

Is the right "fix" here for you to kill the process group instead of just the Patroni process? You can do this with kill -9 -54321, where 54321 is the process group ID, which in Linux is the PID of the parent process. I'm not sure why you're killing the process -- maybe you could give a bit more context -- but can you kill the process group instead, or do a more graceful stop or even use pebble stop?
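
For completeness, a rough Python equivalent of killing the whole group (assuming you already know the parent's PID; the PID here is just a placeholder):

import os
import signal

parent_pid = 54321  # placeholder: PID of the parent service process

# On Linux a process started as a group leader has PGID == its own PID, so
# signalling the group (the negative-PID form of `kill -9 -54321`) reaches
# the children as well as the parent.
os.killpg(os.getpgid(parent_pid), signal.SIGKILL)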

Note that Pebble itself uses SIGTERM/SIGKILL on the process group when it executes a stop operation.

@benhoyt commented Sep 28, 2022

I've discussed this with @hpidcock, and he noted the systemd docs for KillMode and ExecStop -- it's a bit tricky to piece together, but they imply that in the case where the parent service is SIGKILL'd, systemd will send the SIGTERM...SIGKILL sequence to the remaining processes in the process group. This is when KillMode is set to control-group, the default, and what Pebble is trying to emulate.

I've verified this with a little test systemd service:

[Unit]
Description=Testy service

[Service]
# script.py spawns script2.py as a child process
ExecStart=/usr/bin/python3 /home/ben/w/pebble/script.py
Type=simple

When you send SIGTERM or SIGKILL to the parent process and kill it, systemd sends SIGTERM to the child process too.

So between supervisor and systemd, that definitely seems like the right default behaviour. We'll need to fix this -- updating the service manager code to kick off the terminating state when it detects that child processes are still running.

In the meantime, I'd still love to hear more context about your use case here.

@marceloneppel

Hi Ben! Thanks again for all this information about Pebble. It is great for me to understand some more aspects of it.

I haven't tried killing the Patroni process group, but I can test that here. Also, it's nice that this could be tested on systemd to validate the behaviour.

Sorry, I forgot to give more details about my use case. It isn't related to a feature I need to implement; it's a shakedown test that we need to add to the different database charmed operators we are developing on the Data Platform team.

In my case, I need to test that the PostgreSQL charmed operator recovers correctly from situations such as the database process being killed, or frozen and later unfrozen, and other situations that cause some disruption to the service (for that I am currently using SIGKILL, SIGSTOP, SIGCONT and other signals, on both the Patroni process - the parent process - and the PostgreSQL process - the child process). The final goal is to verify that a cluster of multiple units keeps working and the failing unit recovers itself (becoming part of the cluster again).
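
As an illustration only, the freeze/unfreeze part of such a test could be as simple as this, with a placeholder PID:

import os
import signal
import time

target_pid = 12345  # placeholder: PID of the Patroni or PostgreSQL process

os.kill(target_pid, signal.SIGSTOP)  # "freeze" the process
time.sleep(30)                       # give the cluster time to notice
os.kill(target_pid, signal.SIGCONT)  # "unfreeze" it and check that it recovers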

So, in summary, the situation I am trying to test is a kind of chaos scenario, to ensure that the charm (and Pebble) manages the workload correctly.

@taurus-forever

Hi @benhoyt and @hpidcock, nice to meet you!

In the meantime, I'd still love to hear more context about your use case.

Let me summarize our use case:

  • we are developing a charm for Patroni->PostgreSQL. Pebble is used to start and manage Patroni.
  • to ensure high quality for the charm, we are preparing a "test", and the "issue" above affects that "test" only.
  • the test mimics some users who like to kill random parts. We want to be sure the system will always recover to a normal state.

As you confirmed, the current Pebble behavior does not fully replicate the default systemd/supervisor logic.
Can you share an estimate of the possible fix date/release version (for internal planning)?

Meanwhile, @marceloneppel is testing the workaround of sending a KILL signal to the whole process group instead of a single process.
Still, to sleep well, we need to be sure that Pebble will handle the situation where Patroni is killed (by a user/OOM/...). Unfortunately, we cannot prevent users from killing only the "parent" process.

Thank you in advance!

@benhoyt commented Oct 11, 2022

Thanks for the additional context @taurus-forever. I do hope to fix this at some point; however, for the next (few?) weeks we're finishing/fixing/testing Juju 3.0, so it'll almost certainly be after that.

@marceloneppel commented Oct 12, 2022

Hi @benhoyt and @taurus-forever!

Thanks for the updates. I tested killing the process group, but it still doesn't make Pebble restart the service. I investigated why and found that the command (pg_ctl) that Patroni uses to start the child process (PostgreSQL) forks and calls setsid, so the child process ends up in a different process group. Sorry, I missed that information when I created the first minimal repro.

With the three scripts below, we can check that the child process ends up in a different process group:

script1.py

import subprocess
import time


subprocess.call(["/usr/bin/python3", "/home/app/script2.py"])

# Just a loop to keep the program running.
while True:
    time.sleep(30)

script2.py

import os
import subprocess

pid = os.fork()

# Pid equal to 0 represents the created child process.
if pid == 0:
    os.setsid()
    subprocess.Popen(["/usr/bin/python3", "/home/app/script3.py"])

script3.py

import http.server
import socketserver

PORT = 8081

Handler = http.server.SimpleHTTPRequestHandler

with socketserver.TCPServer(("", PORT), Handler) as httpd:
    print("serving at port", PORT)
    httpd.serve_forever()

Logs:

root@app-0:/# ps -e -o pid,ppid,pgid,cmd
    PID    PPID    PGID CMD
      1       0       1 /usr/bin/pebble run
     14       1      14 /usr/bin/python3 /home/app/script1.py
     17       1      16 /usr/bin/python3 /home/app/script3.py
     19       0      19 bash
     30      19      30 ps -e -o pid,ppid,pgid,cmd

So it seems to be the case you mentioned at the beginning: Patroni is forking (and also moving the new process to a new process group).

From what I checked, supervisord restarts the service, but it keeps the original script3.py process running (which may be a problem in some cases).

@benhoyt commented Mar 23, 2023

Just for the record, some further reading on this subject that I came across today: "The Unix process API is unreliable and unsafe" and the author's supervise project. Also the Hacker News comments and another related article, plus this comment from someone who works on the Bazel build system and hit the exact same issue.

@benhoyt commented Jul 12, 2023

For the record, we plan to address this by using cgroups as part of work later in this cycle.

benhoyt added the 23.10 label Jul 12, 2023
@marceloneppel

Thanks for the update @benhoyt!

benhoyt added a commit to benhoyt/pebble that referenced this issue Aug 15, 2023
Add WaitDelay to ensure cmd.Wait() returns in a reasonable timeframe if
the goroutines that cmd.Start() uses to copy Stdin/Stdout/Stderr are
blocked when copying due to a sub-subprocess holding onto them. Read
more details in these issues:

- golang/go#23019
- golang/go#50436

This isn't the original intent of kill-delay, but it seems reasonable
to reuse it in this context.

Fixes canonical#149
@benhoyt commented Aug 15, 2023

@marceloneppel Okay, I've come in a weird circle on this one. I had actually misdiagnosed the original problem with the Patroni issue you present above, where Patroni wasn't restarting properly even after a KILL signal was sent to it.

I thought this problem was the issue with Pebble not tracking the process tree of processes that daemonise and change their process group (which is still an issue, but is not causing this problem). But after actually diving into your patroni-restart-test repro in more detail, I found that Pebble wasn't marking the Patroni process as having exited at all, due to being blocked on cmd.Wait(). Patroni starts sub-processes and "forwards" stdin/out/err, so the copy goroutines block. Thankfully Go 1.20 introduced WaitDelay to allow you to easily work around this exact problem, and I've added this in #275.
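
Here is a rough Python analogue of the blocking behaviour (not Pebble code; /bin/sh and sleep are just stand-ins): the grandchild keeps the write end of the stdout pipe open, so reading the pipe to EOF hangs even after the direct child has been killed and reaped.

import signal
import subprocess
import time

# The child backgrounds a long-lived grandchild that inherits the same stdout
# pipe, then execs another command in its own PID -- loosely mimicking a
# service that forwards stdio to its sub-processes.
child = subprocess.Popen(
    ["/bin/sh", "-c", "sleep 300 & exec sleep 60"],
    stdout=subprocess.PIPE,
)
time.sleep(0.5)
child.send_signal(signal.SIGKILL)

print("exit:", child.wait())  # returns promptly: the child itself is reaped

# Draining stdout, however, blocks until the backgrounded "sleep 300" exits,
# because it still holds the pipe's write end. Go's cmd.Wait() hits the same
# shape of problem while waiting for its stdout/stderr copy goroutines, which
# is what WaitDelay puts a time limit on.
print(child.stdout.read())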

In my tests with Patroni that fixes the problem nicely: when you send KILL to the Patroni process, it uses WaitDelay to wait up to 5 seconds and then exits (even though the os/exec copying goroutines are stuck). This causes Pebble to restart Patroni, which sees that Postgres is already running, and keeps humming along nicely.

The reason I came in a weird circle with this is because I started down the track of "we need to use cgroups to solve this" too early. I even had a rough plan to do so, but it turns out using cgroups in containers is not possible or not trivial in many cases, so that came to a head, and I started looking at other ways to solve the original problem, hence circling back around. I should have noticed this much earlier, sorry!

We may still improve Pebble's process tree handling, but I believe this takes the pressure off, and we may be able to get away without doing that.

In any case, my one-liner WaitDelay fix seems to solve the Patroni issue and similar issues. But I'd appreciate your review and testing with Patroni and your charms that were originally affected. It might be tricky to test using my Pebble branch, but I can give you instructions on how to do that if you want (a replace directive in Juju's go.mod and then building Juju manually).

@marceloneppel

Hi @benhoyt! No worries and thank you for the effort on this! I'm gonna check it with the charm this week to confirm whether it worked and keep you posted.

Thanks again!

benhoyt added a commit that referenced this issue Aug 18, 2023
…/out/err (#275)

Use os.exec's Cmd.WaitDelay to ensure cmd.Wait returns in a reasonable
timeframe if the goroutines that cmd.Start() uses to copy stdin/out/err
are blocked when copying due to a sub-subprocess holding onto them.
Read more details about the issue in
golang/go#23019 and the proposed solution
(that was added in Go 1.20) in
golang/go#50436.

This solves issue #149, where Patroni
wasn't restarting properly even after a `KILL` signal was sent to it.
I had originally mis-diagnosed this problem as an issue with Pebble
not tracking the process tree of processes that daemonise and change
their process group (which is still an issue, but is not causing this
problem). The Patroni process wasn't being marked as finished at all
due to being blocked on the `cmd.Wait()`. Patroni starts
sub-processes and "forwards" stdin/out/err, so the copy goroutines
block. Thankfully Go 1.20 introduced `WaitDelay` to allow you to
easily work around this exact problem.

The fix itself is this one-liner (#275):

s.cmd.WaitDelay = s.killDelay() * 9 / 10  // 90% of kill-delay

This will really only be a problem for services, but we make the same
change for exec and exec health checks as it won't hurt there
either.

Also, as a drive-by, this PR also canonicalises some log messages: our
style is to start with an uppercase letter (for logs, not errors) and
to use "Cannot X" rather than "Error Xing".

Fixes #149.
@marceloneppel

Thanks so much @benhoyt! I double checked with the charm and the issue was fixed.

To test, I used the following replace directive on Juju's go.mod file:

replace github.com/canonical/pebble => github.com/canonical/pebble v1.2.1-0.20230818120001-fd34da17bd83
