arbiter: don't log if handling SIGCHLD #3064

Open

wants to merge 1 commit into master

Conversation

hydrargyrum

Logging while handling a signal is bad practice.
See https://docs.python.org/3/library/logging.html#thread-safety

Fixes #2816
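
For illustration, a hedged sketch of the general idea the PR title describes, assuming a much simplified Arbiter; this is not the actual diff, and the flag name _in_signal_handler is hypothetical:

import os

class Arbiter:
    def __init__(self, log):
        self.log = log
        self._in_signal_handler = False

    def handle_chld(self, sig, frame):
        # Remember that we are inside the signal handler while reaping.
        self._in_signal_handler = True
        try:
            self.reap_workers()
        finally:
            self._in_signal_handler = False

    def reap_workers(self):
        while True:
            try:
                pid, status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                return
            if pid == 0:
                return
            if os.WIFSIGNALED(status) and not self._in_signal_handler:
                # Skipped when called from the signal handler, so the handler
                # never touches the logging machinery.
                self.log.warning("worker %s was killed by signal %s",
                                 pid, os.WTERMSIG(status))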

@benoitc
Owner

benoitc commented Sep 5, 2023

Thanks, but this is done on purpose, as the comment says. It's not common. I prefer to keep it like this. This place is thread-safe as well.

@benoitc benoitc closed this Sep 5, 2023
@hydrargyrum
Author

hydrargyrum commented Sep 5, 2023

@benoitc what is done on purpose? Raising exceptions and breaking the logging system? Please see #2816, which many people are encountering.

@benoitc
Owner

benoitc commented Sep 5, 2023

This code should be thread-safe. I suspect the error happens earlier. We first need to understand why this code is re-entrant here.

Let's keep it open since it actually addresses the cause. But I would insist that we need to understand why this is happening now.

@benoitc benoitc reopened this Sep 5, 2023
@benoitc
Owner

benoitc commented Sep 5, 2023

Linking to #3050. I think we should revisit that change, which introduced more logging.

@hydrargyrum if you happen to be around, I'm on IRC.

@benoitc benoitc self-assigned this Sep 5, 2023
@ja0x

ja0x commented Oct 10, 2023

I wanted to share a bit of info I found, since I have encountered this bug as well. The issue isn't with thread safety but rather with re-entrancy within the logging module and the underlying BufferedWriter (from the io module).

The issue occurs (usually on a slow disk) when the SIGCHLD handler interrupts the main process while it is writing its logging buffer. The handler (which runs in the same thread) then tries to write to that buffer but cannot, because the buffer does not allow re-entrant calls, so it raises an exception.

If the logger also writes stderr to that same stream or file, this sets off a cascade of re-entrant call errors, eventually resulting in a recursion error.

It is a hard bug to reproduce, because the main thread has to interrupt itself exactly while it is writing the buffer.

Ref:
https://docs.python.org/3/library/io.html#reentrancy,
https://github.com/python/cpython/blob/e24f9ae7035cfb4c11c4a5c2820121fc22463f8b/Modules/_io/bufferedio.c#L295
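
To see the failure mode concretely, here is a minimal, self-contained sketch (POSIX only, and not taken from gunicorn) of the re-entrancy described above; the race is timing-dependent, so it may take many iterations, or a slow disk, to actually trigger RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>:

import logging
import os
import signal
import sys

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
log = logging.getLogger("arbiter")

def handle_chld(signum, frame):
    # Reap any exited children, then log. The log call is the unsafe part,
    # because it may re-enter the buffered stderr stream that the main loop
    # is already writing to.
    try:
        while os.waitpid(-1, os.WNOHANG)[0] > 0:
            pass
    except ChildProcessError:
        pass
    log.warning("worker exited")

signal.signal(signal.SIGCHLD, handle_chld)

for _ in range(100_000):
    if os.fork() == 0:       # child exits immediately, delivering SIGCHLD
        os._exit(1)
    # If SIGCHLD is delivered while this write is flushing stderr's
    # BufferedWriter, the handler's write re-enters the same buffer and the
    # io module raises RuntimeError instead of deadlocking.
    log.info("main loop writing to stderr")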

@TomiBelan
Contributor

@ja0x is correct. Python programs MUST NOT print or log anything inside a signal handler. https://stackoverflow.com/q/45680378 is another link that explains the root cause of this problem.

This gunicorn bug has a long and sad history.

Dec 2017: Issue #1663 was reported: In some cases (for example OOM) workers die without any error logged, or even crashloop.
Dec 2019: Issue #2215 was reported: Essentially the same as #1663, but with a reproducible example, and focused on missing logging.
Apr 2020: PR #2315 (ee1e5c1) was created by @tilgovi. It logs an error in the SIGCHLD handler if a worker exits with a nonzero code, and adds more logging to arbiter halt (that part is okay).
Sep 2020: PR #2419 (28df992, ca36d41) was created by @mildebrandt trying to fix issue #1663. It logs an error in the SIGCHLD handler if a worker exits with a nonzero code or is killed by a signal.
Dec 2020: PR #2475 (b695b49) was created by @aberres trying to fix issue #2215, and immediately merged by @benoitc. It logs a warning in the SIGCHLD handler if a worker is killed by a signal.
Apr 2021: gunicorn 20.1.0 was released containing PR #2475. Soon, issue #2564 was reported: Users started experiencing crashes with RuntimeError: reentrant call.
Oct 2021: PR #2667 was created by @BoasKluiving to revert PR #2475 in order to fix issue #2564.
Dec 2021: PR #2475 was reverted by @benoitc in commit 76f8da2 in order to fix issue #2564.
Jan 2022: PR #2667 was closed by @javabrett because he noticed that PR #2475 was reverted in the meantime.
Jun 2022: Issue #2816 was reported by @InfernalAzazel. AFAICT, it's actually just an exact duplicate of issue #2564, because the fix in commit 76f8da2 wasn't released.
Dec 2022: PR #2908 was created by myself. It logs a warning if a worker is killed by a signal, but the log is written later in the main loop, not directly in the SIGCHLD handler. I did not receive any response in the PR.
May 2023: PR #2315 and PR #2419 were merged by @benoitc, reintroducing the RuntimeError: reentrant call bug. Issue #2564 was closed (incorrectly).
Jul 2023: gunicorn 21.0.0 was released containing PRs #2315 and #2419.
Aug 2023: Issue #3050 was reported by @feliixx, who wants the log to be a warning, not an error.
Sep 2023: PR #3064 (this PR) was created by @hydrargyrum, trying to fix issue #2816 by skipping logging in the SIGCHLD handler. (FWIW: you missed one self.log.error call, and nobody except handle_chld calls reap_workers anyway.)

If I may be frank, I feel very disappointed and frustrated by the gunicorn project. It is a cruel irony that my PR #2908, which attempts to handle this correctly, was ignored, while PRs #2315 and #2419 were merged, causing a regression.

But I'll try to swallow my bitter feelings and offer constructive advice.

My advice to @benoitc is to do this:

  1. git revert -m 1 1fd1c82195 (revert both commits of #2419, "Add additional logs when worker exits abnormally")
  2. git revert ee1e5c1928 (revert the bad half of #2315, "More verbose exit logging", and keep the good half)
  3. Review and (if I may be so bold) merge my PR #2908, "Re-add and improve the warning for workers terminated due to a signal"
  4. Reject PR #3064, "arbiter: don't log if handling SIGCHLD" (this PR)

Doing that fixes #2564 properly (at least I think so), fixes #3050, doesn't regress #2215, and removes the need for PR #3064 (this PR).

Hope that helps.

@tilgovi
Collaborator

tilgovi commented Oct 31, 2023

This is all extremely helpful and politely offered advice. I'll try to find time to push this along. I appreciate all your effort here and the exceptional patience in helping to sort this out.

Most of the maintainers are volunteers and we don't always find the sustained time to reconstruct all of this context, so gathering it in one place is immensely helpful.

@aberres
Contributor

aberres commented Nov 1, 2023

@TomiBelan Thank you for the write-up. Ever since my failed approach three years ago, I have been silently hoping to see the issue fixed for good :)

@tilgovi
Collaborator

tilgovi commented Dec 28, 2023

I've rebased #2314, which moves reaping to the main thread. That should move the logging out of the signal handler. Any reviews or testing of that would be appreciated. I've assigned myself the related issues to follow up on any other changes afterwards.
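
For context, a hedged sketch of that general approach (not the actual gunicorn code, and much simpler than it): the SIGCHLD handler does nothing but wake the main loop through a pipe, and all reaping and logging happen in the main loop, where the logging machinery can be used safely.

import os
import select
import signal

wakeup_r, wakeup_w = os.pipe()
os.set_blocking(wakeup_w, False)

def handle_chld(signum, frame):
    # Async-signal-safe body: a single non-blocking pipe write, nothing else.
    try:
        os.write(wakeup_w, b"\x00")
    except OSError:
        pass

signal.signal(signal.SIGCHLD, handle_chld)

def reap_and_log(log):
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return
        if pid == 0:
            return
        if os.WIFSIGNALED(status):
            # Safe here: we are in the main loop, not in the signal handler.
            log.warning("worker %s terminated by signal %s",
                        pid, os.WTERMSIG(status))

def main_loop(log):
    while True:
        ready, _, _ = select.select([wakeup_r], [], [], 1.0)
        if ready:
            os.read(wakeup_r, 4096)   # drain the wakeup bytes
        reap_and_log(log)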

@tilgovi tilgovi self-assigned this Dec 28, 2023
@tilgovi tilgovi added this to the 22.0 milestone Dec 29, 2023
@piskvorky

piskvorky commented Feb 2, 2024

Also bitten by this. I debugged and googled my way to this PR – I have nothing to add to @ja0x's and @TomiBelan's excellent analyses, so just a +1. I experience the same issue with re-entrant logging inside Arbiter.reap_workers() during SIGCHLD handling: https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L242

To make matters worse, in my case gunicorn is auto-launched and monitored from supervisord, running a Flask app. After the unhandled exception due to RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>, the server port was not released correctly, and the newly restarted gunicorn arbiter process (launched by supervisor) keeps printing this forever:

[2024-01-25 06:50:00 +0000] [384109] [INFO] Starting gunicorn 21.2.0
[2024-01-25 06:50:00 +0000] [384109] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2024-01-25 06:50:00 +0000] [384109] [ERROR] Retrying in 1 second.
[2024-01-25 06:50:01 +0000] [384109] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2024-01-25 06:50:01 +0000] [384109] [ERROR] Retrying in 1 second.
…

I.e. the server is somehow bricked after the [ERROR] Unhandled exception in main loop.

On a positive note, I've been using gunicorn for years and this is the first major issue. Kudos and huge thanks to the maintainers!

@sylt

sylt commented Feb 2, 2024

Without really checking which PRs were already open, I posted a PR yesterday (#3148) with an attempt at solving various signal-handling issues in gunicorn that I had the misfortune to encounter. I couldn't see that any of the open proposals had used the same approach to solve the problem, so I thought people could have a look before I eventually close it.

@randomir

@TomiBelan, thanks for the excellent summary. I just started digging myself when I discovered it.

In my case it's gunicorn (sometimes) dying on SIGHUP during a log rotate.

To the maintainers: please note the seriousness of this bug, even if it's intermittent. And the fact that it was re-introduced in the same release in which it was fixed (21.0.0) means it has been open for 3+ years. 😞
