test_concurrent_futures.test_shutdown.test_processes_terminate() hangs randomly when run multiple times #125451

Closed · vstinner opened this issue Oct 14, 2024 · 11 comments
Labels: tests (Tests in the Lib/test dir)

@vstinner (Member) commented Oct 14, 2024

Example on Fedora 40:

$ ./python -m test test_concurrent_futures.test_shutdown -m test_processes_terminate -v -R 3:3 --timeout=15
(...)
OK
.test_processes_terminate (test.test_concurrent_futures.test_shutdown.ProcessPoolForkProcessPoolShutdownTest.test_processes_terminate) ... /home/vstinner/python/main/Lib/multiprocessing/popen_fork.py:67: DeprecationWarning: This process (pid=290121) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
0.02s ok
test_processes_terminate (test.test_concurrent_futures.test_shutdown.ProcessPoolForkserverProcessPoolShutdownTest.test_processes_terminate) ... 0.22s ok
test_processes_terminate (test.test_concurrent_futures.test_shutdown.ProcessPoolSpawnProcessPoolShutdownTest.test_processes_terminate) ... Timeout (0:00:15)!
Thread 0x00007f955fe006c0 (most recent call first):
  File "/home/vstinner/python/main/Lib/concurrent/futures/process.py", line 182 in _on_queue_feeder_error
  File "/home/vstinner/python/main/Lib/multiprocessing/queues.py", line 290 in _feed
  File "/home/vstinner/python/main/Lib/threading.py", line 992 in run
  File "/home/vstinner/python/main/Lib/threading.py", line 1041 in _bootstrap_inner
  File "/home/vstinner/python/main/Lib/threading.py", line 1012 in _bootstrap

Thread 0x00007f9564c006c0 (most recent call first):
  File "/home/vstinner/python/main/Lib/threading.py", line 1092 in join
  File "/home/vstinner/python/main/Lib/multiprocessing/queues.py", line 217 in _finalize_join
  File "/home/vstinner/python/main/Lib/multiprocessing/util.py", line 216 in __call__
  File "/home/vstinner/python/main/Lib/multiprocessing/queues.py", line 149 in join_thread
  File "/home/vstinner/python/main/Lib/concurrent/futures/process.py", line 566 in _join_executor_internals
  File "/home/vstinner/python/main/Lib/concurrent/futures/process.py", line 557 in join_executor_internals
  File "/home/vstinner/python/main/Lib/concurrent/futures/process.py", line 380 in run
  File "/home/vstinner/python/main/Lib/threading.py", line 1041 in _bootstrap_inner
  File "/home/vstinner/python/main/Lib/threading.py", line 1012 in _bootstrap

Thread 0x00007f9573c83740 (most recent call first):
  File "/home/vstinner/python/main/Lib/threading.py", line 1092 in join
  File "/home/vstinner/python/main/Lib/concurrent/futures/process.py", line 854 in shutdown
  File "/home/vstinner/python/main/Lib/test/test_concurrent_futures/test_shutdown.py", line 274 in test_processes_terminate
  File "/home/vstinner/python/main/Lib/unittest/case.py", line 606 in _callTestMethod
  File "/home/vstinner/python/main/Lib/unittest/case.py", line 660 in run
  File "/home/vstinner/python/main/Lib/unittest/case.py", line 716 in __call__
  File "/home/vstinner/python/main/Lib/unittest/suite.py", line 122 in run
  File "/home/vstinner/python/main/Lib/unittest/suite.py", line 84 in __call__
  File "/home/vstinner/python/main/Lib/unittest/suite.py", line 122 in run
  File "/home/vstinner/python/main/Lib/unittest/suite.py", line 84 in __call__
  File "/home/vstinner/python/main/Lib/unittest/runner.py", line 240 in run
  File "/home/vstinner/python/main/Lib/test/libregrtest/single.py", line 57 in _run_suite
  File "/home/vstinner/python/main/Lib/test/libregrtest/single.py", line 37 in run_unittest
  File "/home/vstinner/python/main/Lib/test/libregrtest/single.py", line 135 in test_func
  File "/home/vstinner/python/main/Lib/test/libregrtest/refleak.py", line 132 in runtest_refleak
  File "/home/vstinner/python/main/Lib/test/libregrtest/single.py", line 87 in regrtest_runner
  File "/home/vstinner/python/main/Lib/test/libregrtest/single.py", line 138 in _load_run_test
  File "/home/vstinner/python/main/Lib/test/libregrtest/single.py", line 181 in _runtest_env_changed_exc
  File "/home/vstinner/python/main/Lib/test/libregrtest/single.py", line 281 in _runtest
  File "/home/vstinner/python/main/Lib/test/libregrtest/single.py", line 310 in run_single_test
  File "/home/vstinner/python/main/Lib/test/libregrtest/main.py", line 363 in run_test
  File "/home/vstinner/python/main/Lib/test/libregrtest/main.py", line 397 in run_tests_sequentially
  File "/home/vstinner/python/main/Lib/test/libregrtest/main.py", line 541 in _run_tests
  File "/home/vstinner/python/main/Lib/test/libregrtest/main.py", line 576 in run_tests
  File "/home/vstinner/python/main/Lib/test/libregrtest/main.py", line 739 in main
  File "/home/vstinner/python/main/Lib/test/libregrtest/main.py", line 747 in main
  File "/home/vstinner/python/main/Lib/test/__main__.py", line 2 in <module>
  File "/home/vstinner/python/main/Lib/runpy.py", line 88 in _run_code
  File "/home/vstinner/python/main/Lib/runpy.py", line 198 in _run_module_as_main

Example on AMD64 Fedora Stable Refleaks PR buildbot: https://buildbot.python.org/#/builders/474/builds/1716

@vstinner (Member, Author) commented:

It's a recent regression introduced by commit fca5529:

commit fca552993da32044165223eec2297b6aaaac60ad
Author: Kumar Aditya <[email protected]>
Date:   Tue Oct 8 20:17:32 2024 +0530

    gh-117721: use PyMutex in `_thread.lock` (#125110)

cc @kumaraditya303

@kumaraditya303 (Contributor) commented:

I can't reproduce this locally. Is any other buildbot failing on this?

@colesbury (Contributor) commented:

I'm able to reproduce this locally. I'll take a look as well.

@vstinner (Member, Author) commented:

> I can't reproduce this locally

Did you try on Linux? If yes, what are your compiler and glibc versions?

I can reproduce the issue on Fedora 40: gcc (GCC) 14.2.1 and glibc 2.39-22.

@kumaraditya303 (Contributor) commented:

> Did you try on Linux? If yes, what are your compiler and glibc versions?

== CPython 3.14.0a0 (heads/main:67f6e08147b, Oct 14 2024, 13:17:49) [Clang 18.1.8 (Fedora 18.1.8-1.fc40)]
== Linux-6.5.0-1025-azure-x86_64-with-glibc2.39 little-endian

vstinner changed the title from "test_concurrent_futures.test_shutdown.test_processes_terminate() hangs when run multiple times" to "test_concurrent_futures.test_shutdown.test_processes_terminate() hangs randomly when run multiple times" on Oct 14, 2024
@vstinner (Member, Author) commented:

Ah, the hang is random. To make it more likely, add more iterations. Example:

./python -m test test_concurrent_futures.test_shutdown -m test_processes_terminate -v -R 3:30

@colesbury (Contributor) commented Oct 14, 2024

I'm still investigating this, but I don't think this test works as intended:

class ProcessPoolShutdownTest(ExecutorShutdownTest):
    def test_processes_terminate(self):
        def acquire_lock(lock):
            lock.acquire()

You can't pickle a local (nested) function, so none of the child processes ever call acquire_lock. They all error out immediately, but we never check the result of executor.submit(), so the error is silently ignored.
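
A minimal sketch of that point (not the test's code; the manager lock is only a picklable stand-in so that the nested function is the sole failure): checking the submitted future would surface the pickling error that is otherwise swallowed.

import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def main():
    # Nested function, as in the test: it cannot be pickled, so it never
    # reaches a worker process.
    def acquire_lock(lock):
        lock.acquire()

    lock = multiprocessing.Manager().Lock()  # proxy lock, picklable
    with ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(acquire_lock, lock)
        try:
            # result() re-raises the pickling failure; without this check
            # the error is silently dropped, as in the test.
            future.result(timeout=30)
        except Exception as exc:
            print(f"submit failed: {exc!r}")


if __name__ == "__main__":
    main()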

picnixz added the tests (Tests in the Lib/test dir) label Oct 14, 2024
@colesbury (Contributor) commented:

My current theory is that the deadlock was pre-existing and is now more likely to occur due to the slightly different performance characteristics of _thread.Lock. Here is the deadlock I see:

The main process has four threads: the main thread, faulthandler, _ExecutorManagerThread, and Queue._feed. The deadlock is between the _ExecutorManagerThread and Queue._feed once the main thread calls self.executor.shutdown().

  1. _ExecutorManagerThread holds self.shutdown_lock and is waiting for the Queue._feed thread. Here's the stacktrace

  2. Queue._feed is waiting on the self.shutdown_lock. Here's the stacktrace

I'm still not sure about the relative timings needed to trigger this.
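
For illustration only, and not the CPython code: a minimal sketch of that lock-ordering pattern under hypothetical names. One thread holds a shared lock while joining a second thread, and the second thread cannot finish until it acquires that same lock. The join timeout exists only so the demo terminates.

import threading
import time

shutdown_lock = threading.Lock()


def feeder():
    # Stands in for the Queue._feed thread: it must take the shared
    # lock before it can finish.
    time.sleep(0.1)
    with shutdown_lock:
        pass


feeder_thread = threading.Thread(target=feeder)
feeder_thread.start()

# Stands in for _ExecutorManagerThread: it holds the shared lock and
# joins the feeder thread while still holding it.
with shutdown_lock:
    feeder_thread.join(timeout=2)
    print("feeder still alive while we hold the lock:", feeder_thread.is_alive())

feeder_thread.join()
print("feeder finished once the lock was released")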

@colesbury (Contributor) commented:

I can reproduce this deadlock on older versions (prior to fca5529) by inserting very short sleep statements at certain places:

https://gist.github.com/colesbury/a7f2a9465d33fe1fc5b99206bf5a4c93
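
The gist has the exact injection points. Purely to illustrate the general technique, the hypothetical helper below (not taken from the gist) wraps a suspected call site with a tiny delay, which widens the race window enough to make such a hang reproducible on demand.

import functools
import time


def add_delay(fn, delay=0.005):
    # Hypothetical helper: pause briefly before calling fn, so the
    # "losing" thread interleaving becomes the likely one.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        time.sleep(delay)
        return fn(*args, **kwargs)
    return wrapper


# Example use (hypothetical): slow down the wakeup path before running
# the test, e.g.
#   from concurrent.futures import process
#   process._ThreadWakeup.wakeup = add_delay(process._ThreadWakeup.wakeup)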

@gpshead (Member) commented Oct 14, 2024

> You can't pickle a local (nested) function, so none of the child processes ever call acquire_lock. They all error out immediately, but we never check the result of executor.submit(), so the error is silently ignored.

Agreed. A subtle thing, easily overlooked without that error check. Regardless of that, I'm not surprised that this is really a pre-existing bug, given the code.

Historical context: the locking in this test appears to have come from 1ac6e37 for #19453 in April 2020, but we didn't notice that in review. (I'm not blaming anyone here; I just like to search for historical context to better understand the code.)

colesbury added a commit to colesbury/cpython that referenced this issue Oct 15, 2024
There was a deadlock when `ProcessPoolExecutor` shuts down at the same
time that a queueing thread handles an error when processing a task.
Don't use `_shutdown_lock` to protect the `_ThreadWakeup` pipes -- use
an internal lock instead. This fixes the ordering deadlock where the
`ExecutorManagerThread` holds the `_shutdown_lock` and joins the
queueing thread, while the queueing thread is attempting to acquire the
`_shutdown_lock` while closing the `_ThreadWakeup`.
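
A rough sketch of the shape of that fix, with simplified, hypothetical names rather than the actual CPython patch: the wakeup object guards its own pipe with an internal lock, so signalling or closing it no longer requires the shared `_shutdown_lock`.

import os
import threading


class ThreadWakeupSketch:
    # Illustrative only: a self-locking wakeup pipe. Callers never need
    # an external (shared) lock to signal or close it, which removes the
    # lock-ordering cycle described in the commit message above.

    def __init__(self):
        self._lock = threading.Lock()  # internal lock, not a shared one
        self._read_fd, self._write_fd = os.pipe()
        self._closed = False

    def wakeup(self):
        with self._lock:
            if not self._closed:
                os.write(self._write_fd, b"x")

    def close(self):
        with self._lock:
            if not self._closed:
                self._closed = True
                os.close(self._write_fd)
                os.close(self._read_fd)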
vstinner added a commit to vstinner/cpython that referenced this issue Oct 15, 2024
The test hangs randomly. It tries to serialize a local lock and a local
function, which is not possible.
vstinner added a commit to vstinner/cpython that referenced this issue Oct 15, 2024
The test hangs randomly. It tries to serialize a local lock and a local
function, which is not possible.
colesbury added a commit to colesbury/cpython that referenced this issue Oct 15, 2024
vstinner added a commit that referenced this issue Oct 15, 2024
The test hangs randomly. It tries to serialize a local lock and a local
function, which is not possible.
colesbury added a commit to colesbury/cpython that referenced this issue Oct 15, 2024
colesbury added a commit that referenced this issue Oct 16, 2024
There was a deadlock when `ProcessPoolExecutor` shuts down at the same
time that a queueing thread handles an error processing a task.

Don't use `_shutdown_lock` to protect the `_ThreadWakeup` pipes -- use
an internal lock instead. This fixes the ordering deadlock where the
`ExecutorManagerThread` holds the `_shutdown_lock` and joins the
queueing thread, while the queueing thread is attempting to acquire the
`_shutdown_lock` while closing the `_ThreadWakeup`.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 16, 2024
…GH-125492)

(cherry picked from commit 760872e)

Co-authored-by: Sam Gross <[email protected]>
colesbury added a commit to colesbury/cpython that referenced this issue Oct 16, 2024
…ythonGH-125492)

(cherry picked from commit 760872e)

Co-authored-by: Sam Gross <[email protected]>
colesbury added a commit that referenced this issue Oct 16, 2024
…5492) (GH-125598)

(cherry picked from commit 760872e)

Co-authored-by: Sam Gross <[email protected]>
colesbury added a commit that referenced this issue Oct 16, 2024
…5492) (#125599)

(cherry picked from commit 760872e)
@colesbury (Contributor) commented:

I think this is fixed now.
