-
-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test__xxsubinterpreters is Occasionally Crashing #104341
Comments
I've been able to reproduce the crash locally with |
|
Also when running |
I think I've got this figured out. It looks like the interpreter is getting finalized before the thread is completely done in Based on the above, there are two points at which it should have waited:
I'm going to figure out why things didn't wait properly. I suspect it looks something like this:
The race lies between steps (8) and (12). If (8) doesn't finish before (12) happens then we'd see the crash we're seeing. What I don't understand is:
To be honest, I still haven't figured out what we are actually doing in (2) such that it affects FYI, this doesn't happen when the GIL is shared because (9) is blocked by the GIL being held still in the sub-thread. |
This may also be related to (or a variant of) gh-96387. |
Hmm, looks like I need to add |
FYI, the two fixes I have up for review don't solve the crashes yet, but I'm getting closer. |
Currently here's where things are at:
Here's a detailed breakdown of execution: (expand)
|
@pitrou, any ideas on this? |
Might this be a consequence of thread scheduling? When the sub-thread releases the GIL mutex in That doesn't seem right, but I'm not very experienced with how the pthread APIs and OS thread scheduling works. If it is right, what's the best way to make sure the sub-thread gets scheduled right away? |
It has occurred very often recently. We got the hypothesis test (Ubuntu) failure on the 6/10 most recent commits on main. |
I cannot reproduce them on my machine (WSL Ubuntu 22.04, AMD 7950X, gcc-11 and gcc-9) I also rent a AMD VPS (Ubuntu 22.04, gcc-11), still cannot reproduce with either
or
|
…ion (gh-104437) With the move to a per-interpreter GIL, this check slipped through the cracks.
* main: pythonGH-104510: Fix refleaks in `_io` base types (python#104516) pythongh-104539: Fix indentation error in logging.config.rst (python#104545) pythongh-104050: Don't star-import 'types' in Argument Clinic (python#104543) pythongh-104050: Add basic typing to CConverter in clinic.py (python#104538) pythongh-64595: Fix write file logic in Argument Clinic (python#104507) pythongh-104523: Inline minimal PGO rules (python#104524) pythongh-103861: Fix Zip64 extensions not being properly applied in some cases (python#103863) pythongh-69152: add method get_proxy_response_headers to HTTPConnection class (python#104248) pythongh-103763: Implement PEP 695 (python#103764) pythongh-104461: Run tkinter test_configure_screen on X11 only (pythonGH-104462) pythongh-104469: Convert _testcapi/watchers.c to use Argument Clinic (python#104503) pythongh-104482: Fix error handling bugs in ast.c (python#104483) pythongh-104341: Adjust tstate_must_exit() to Respect Interpreter Finalization (pythongh-104437) pythonGH-102613: Fix recursion error from `pathlib.Path.glob()` (pythonGH-104373)
My fix in gh-104560 should take care of this. |
The fix in gh-104672 should take care of the buildbot failures. I'll probably need to update the docs for |
Having a separate lock means Thread.join() doesn't need to wait for the thread to be cleaned up first. It can wait for the thread's Python target to finish running. This gives us some flexibility in how we clean up threads. (This is a minor cleanup as part of a fix for gh-104341.)
…ck for Each Thread (pythongh-104754)" This reverts commit 097b783.
After a conversation with @markshannon, I had an epiphany about how to solve this minimally: gh-105109. |
It seems each interpreter could maintain its current number of threads? It would be decremented at the very end of |
Yeah, that makes sense. |
Hmm, daemon threads would be a problem still, no? |
We should just disallow daemon threads from Subinterpreters. (Silently or loudly, no opinion) |
We already do by default for interpreters created with |
…urrent Thread (gh-105109) This avoids the problematic race in drop_gil() by skipping the FORCE_SWITCHING code there for finalizing threads. (The idea for this approach came out of discussions with @markshannon.)
… the Current Thread (pythongh-105109) This avoids the problematic race in drop_gil() by skipping the FORCE_SWITCHING code there for finalizing threads. (The idea for this approach came out of discussions with @markshannon.) (cherry picked from commit 3698fda) Co-authored-by: Eric Snow <[email protected]>
…g the Current Thread (gh-105109) (gh-105209) This avoids the problematic race in drop_gil() by skipping the FORCE_SWITCHING code there for finalizing threads. (The idea for this approach came out of discussions with @markshannon.) (cherry picked from commit 3698fda) Co-authored-by: Eric Snow [email protected]
I've merged my fix to main and 3.12. We should be good to go now. FYI, I do plan on circling back to the underlying problem in the 3.13 timeframe. |
FYI, there are other crashes I know about but they are much less common and have a different root cause. Regarding the additional work I mentioned, I'll open a new issue once I get around to it. |
This appears to have started once gh-104210 was merged.
"Fatal Python error: drop_gil: PyMUTEX_LOCK(gil->switch_mutex) failed"
It may be specific to
test.test__xxsubinterpreters.RunStringTests.test_create_thread
.See:
Linked PRs
The text was updated successfully, but these errors were encountered: