Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully shutdown child processes #393

Merged
merged 8 commits into from
Mar 4, 2021

Conversation

cjolowicz
Copy link
Collaborator

@cjolowicz cjolowicz commented Mar 2, 2021

Fixes #139

This PR changes Nox's behavior when receiving a keyboard interrupt (CTRL-C, SIGINT) while executing a command with session.run or session.install.

The current behavior is to terminate the child immediately (SIGTERM on Posix, TerminateProcess on Windows) and to reraise the KeyboardInterrupt, ultimately exiting with failure.

The new behavior is to shut down the child gracefully and only raise if the child did not exit with success. The shutdown sequence consists of these four steps:

  1. Check if the child has terminated, with a 300ms timeout. If it did, skip to the last step. Note that the OS delivers the signal to the child as well.
  2. Send SIGTERM and check again, with a 200ms timeout. If it did, skip to the last step.
  3. Send SIGKILL and wait without timeout.
  4. If the exit status of the child process is non-zero, reraise the KeyboardInterrupt. Otherwise, we consider the keyboard interrupt handled and return normally.

A few points of note:

  • We should probably move logger.error("Interrupted...") from command.run to popen.popen, so the user gets immediate feedback before the shutdown sequence takes course. We'd need to pass logger to popen.popen, though. What do you think?

  • The timeouts correspond to tox's defaults for its interrupt_timeout and terminate_timeout settings. For commands that need longer to shut down, it may be a good idea to make this configurable in Nox as well. These could be keyword arguments for session.run. I don't think this needs to be addressed by this PR though, right? (Interestingly, tox4 no longer seems to have these configurable.)

  • tox does the shutdown slightly differently. It first checks if the child has terminated (with a timeout of zero), and sends a duplicate SIGINT if it hasn't. I believe the reason is that the shutdown protocol is also used in cases where there was no initial SIGINT. The rest of the shutdown sequence is the same. tox always reraises the interrupt, even if the child exited with success. (tox4 essentially behaves like tox3, but busy-waits and skips step 3 on Windows where Popen.kill is an alias for Popen.terminate.)

  • The use case with pytest --pdb is not covered here; Nox still insists on terminating the child. When I tried this, I did get to see pytest's traceback before Nox killed it, which is an improvement. One way to support this use case would be a command-line option to disable sending additional signals to the child. I don't think this needs to be addressed by this PR either. tox also does not support this currently (configuration option disabling sucicide on pytest subprocess tox-dev/tox#1791).

  • When output is captured and the child handles the interrupt, we silently discard anything Popen.communicate buffered for us before it was interrupted. However, the only documented way to capture output from session.run is the silent flag, and in that case it doesn't matter.

The u"" literal triggers an internal error with black 19.3b0. While strictly
unrelated to this feature branch, we fix it here because the internal error only
reproduces on this branch.

  ❯ nox -s lint

  nox > Running session lint
  nox > Creating virtual environment (virtualenv) using python3.8 in .nox/lint
  nox > python -m pip install flake8==3.7.8 black==19.3b0 isort==4.3.21 mypy==0.720
  nox > mypy --disallow-untyped-defs --warn-unused-ignores --ignore-missing-imports nox
  nox > black --check nox tests noxfile.py setup.py
  error: cannot format …/nox/tests/test_command.py: INTERNAL ERROR: Black
  produced code that is not equivalent to the source. Please report a bug on
  https://github.com/ambv/black/issues. This diff might be helpful: /tmp/blk_d4e49p90.log
  All done! 💥 💔 💥
  38 files would be left unchanged, 1 file would fail to reformat.
  nox > Command black --check nox tests noxfile.py setup.py failed with exit code 123
  nox > Session lint failed.

  ❯ cat /tmp/blk_d4e49p90.log

  --- src
  +++ dst
  @@ -2092,18 +2092,18 @@
                    value=
                      Dict(
                        keys=
                          Constant(
                            kind=
  -                            'u',  # str
  +                            None,  # NoneType
                            value=
                              'SIGIL',  # str
                          )  # /Constant
                        values=
                          Constant(
                            kind=
  -                            'u',  # str
  +                            None,  # NoneType
                            value=
                              '123',  # str
                          )  # /Constant
                      )  # /Dict
                  )  # /keyword
@cjolowicz
Copy link
Collaborator Author

The Windows CI fails due to an unexpected KeyboardInterrupt in the test test_interrupt_handled, where the interrupt should have been handled by the child process. The exception was raised when invoking WaitForSingleObject, in the context of Popen.communicate.

I have been able to reproduce this on a Windows system. In fact, both of the test_interrupt_* tests fail some of the time due to KeyboardInterrupt being raised in the parent process at seemingly random places: apart from subprocess.py (when invoking WaitForSingleObject), also at various locations in typing.py and pytest.

A possible explanation for this is an SO answer by @eryksun:

Python's os.kill wraps two unrelated APIs on Windows. It calls GenerateConsoleCtrlEvent when the sig parameter is CTRL_C_EVENT or CTRL_BREAK_EVENT. In this case the pid parameter is a process group ID. [...]

The test calls os.kill indirectly via Popen.send_signal from the our monkeypatched Popen.communicate, passing signal.CTRL_C_EVENT.

So this means that we send the ^C to ourselves as well, because the parent process is in the same process group as the child. At first, this sounds like we might be able to fix this by passing subprocess.CREATE_NEW_PROCESS_GROUP when creating the child process on Windows. However, reading on in the SO answer, this would also require calling SetConsoleCtrlHandler(NULL, FALSE) in the child, which we can't. My conclusion so far is that we should skip these tests on Windows.

My next step was to paste the Python programs from the tests into a Noxfile, and trigger an actual ^C from the console. There was no spurious KeyboardInterrupt in any of these runs. The results were quite interesting, though:

  • The session without signal handlers behaves as expected: The process immediately terminates, and the session fails.
  • The session with a signal handler exiting with zero also behaves as expected: The process immediately terminates, and the session succeeds.
  • The session ignoring SIGINT, on the other hand, ignores not only the ^C, but also the TerminateProcess sent from Nox. The process completes its sleep, and the session succeeds.
  • The session ignoring both SIGINT and SIGTERM behaves the same.

This another loose end that I haven't had time yet to investigate.

@cjolowicz
Copy link
Collaborator Author

833d44f fixes the spurious interrupts in the test on Windows, leveraging the SO post quoted above. Unfortunately, even though all Nox sessions succeed in CI now, Nox mysteriously exits with 1. Haven't been able to reproduce this locally, on my test machine Nox exits with success.

@theacodes
Copy link
Collaborator

This looks good to me so far, let me know if you get stuck on anything or need some help deciding on something.

@cjolowicz
Copy link
Collaborator Author

cjolowicz commented Mar 3, 2021

72f7cd2 fixes the spurious interrupts in the Windows CI. While Nox itself exited with success, its parent process (the one running the CI step) exited with failure. This happened because as a console process, the parent also received the keyboard interrupt triggered by the test. The fix was to simply not send a real interrupt when we're on Windows CI. This means we have to skip the test where the child handles the interrupt (test_interrupt_handled). For the other test (test_interrupt_raises), it does not matter. Outside of CI, both tests pass on Windows.

So this finally ready for review.

Sidenote:

The session ignoring SIGINT, on the other hand, ignores not only the ^C, but
also the TerminateProcess sent from Nox. The process completes its sleep, and
the session succeeds.

This turned out to be not entirely true; what happened was that Nox never called
TerminateProcess. It looks like the KeyboardInterrupt only materialized
after the primary Popen.communicate returned, and the child process had
completed. (CPython sets a Windows event in its signal handler to wake up the
main thread on SIGINT. Somehow that did not happen here.)

Taking a step back, there is no reason Nox has to guarantee termination of a
process on ^C if that process ignores ^C. It seems just fine to provide a best
effort approach that works on POSIX, escalating to SIGTERM and SIGKILL.

@eryksun
Copy link

eryksun commented Mar 3, 2021

I'm not familiar with your specific needs, but here's some general advice.

If you need to test the effect of sending CTRL_C_EVENT (i.e. cancel) to a target process that doesn't manually enable Ctrl+C, then you can't spawn it in a new process group as the target of the console control event. You need to send the event to group 0, i.e. all processes that are attached to the current console session. However, doing that is unreliable for an arbitrary console session, since you don't know what other processes are attached and what will be the effect of sending them Ctrl+C.

You can spawn the test in a new console session, allocated with the process creation flag CREATE_NEW_CONSOLE (or with CREATE_NO_WINDOW if the process works correctly in a windowless console session). Report the result back to the main test process over stdout or some other IPC channel. The test process in the new session can implement the test as follows:

  • enable Ctrl+C to ensure it's inherited as such
  • spawn the target process to be tested, without any of the following creation flags:
    • CREATE_NEW_PROCESS_GROUP
    • CREATE_NEW_CONSOLE
    • CREATE_NO_WINDOW
    • DETACHED_PROCESS
  • wait for the target process to be ready
  • disable Ctrl+C
  • call os.kill(0, CTRL_C_EVENT) or GenerateConsoleCtrlEvent(CTRL_C_EVENT, 0)
  • wait for the target process to exit

ctypes wrapper functions:

import signal
import ctypes

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

def enable_ctrl_c():
    if not kernel32.SetConsoleCtrlHandler(None, False):
        raise ctypes.WinError(ctypes.get_last_error())

def disable_ctrl_c():
    if not kernel32.SetConsoleCtrlHandler(None, True):
        raise ctypes.WinError(ctypes.get_last_error())

def send_ctrl_event(pgid=0, ctrl_event=signal.CTRL_C_EVENT):
    if not kernel32.GenerateConsoleCtrlEvent(ctrl_event, pgid):
        raise ctypes.WinError(ctypes.get_last_error())

@cjolowicz
Copy link
Collaborator Author

Thank you @eryksun for taking the time to comment here. Your comment, as well as your posts on SO and bpo, were tremendously helpful!

dd9f3c2 implements the solution proposed above: On Windows, the keyboard interrupt tests are run in a subprocess that is not attached to the current console window. This means that all tests now run on Windows as well, both locally and in CI.

Copy link
Collaborator

@theacodes theacodes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM!

skip_on_windows_console = pytest.mark.skipif(
platform.system() == "Windows" and "DETACHED_FROM_CONSOLE" not in os.environ,
reason="Do not run this test attached to a Windows console.",
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wording here is possibly misleading. CREATE_NO_WINDOW and CREATE_NEW_CONSOLE run the process attached to a new console session. The creation flag DETACHED_PROCESS, on the other hand, tells the system to spawn a process detached from any console session. A truly detached process is definitely not wanted here since the test needs to generate and receive a console control event.

FYI, in Windows 8.1+, a console session is a set of processes that are connected to an instance of conhost.exe by way of the ConDrv kernel device. The session window is optional and may be excluded because of CREATE_NO_WINDOW or, in Windows 10, because the console host is running in headless conpty mode (pseudoconsole). For example, each tab in Windows Terminal uses a headless conpty session, with an open-source build of "conhost.exe" that's named "openconsole.exe".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing that out. d6b0e59 changes the naming, hopefully for the better 😉

@cjolowicz
Copy link
Collaborator Author

@theacodes Thanks for the approval! I pushed two more commits:

  • d6b0e59 improves naming in the tests, following @eryksun 's comment above
  • 3df8312 ensures that the subprocess always uses the same pytest as the main process (even when it's not on PATH)

@theacodes
Copy link
Collaborator

Cool, still looks good to me. Feel free to merge when you and @eryksun are happy. :)

@theacodes
Copy link
Collaborator

Oh, I noticed you don't have merge permissions. Feel free to ping me when you're ready to merge (and if you want merge rights, let me know!)

@cjolowicz
Copy link
Collaborator Author

Feel free to ping me when you're ready to merge

This is good to go from my side.

(and if you want merge rights, let me know!)

Thanks for offering this, I'd be honored :)

@theacodes
Copy link
Collaborator

Thanks for offering this, I'd be honored :)

Done. I'll let you merge this :) 🎆

This PR is eligible for $50 from our open collective. Feel free to submit an expense for that if you'd like.

@cjolowicz cjolowicz merged commit 7ad30a0 into wntrblm:main Mar 4, 2021
@cjolowicz cjolowicz deleted the graceful-shutdown branch March 4, 2021 19:04
@theacodes
Copy link
Collaborator

theacodes commented Mar 4, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

CTRL-C always results in an error
3 participants