bpo-33613, test_semaphore_tracker_sigint: fix race condition #7850
Fail `test_semaphore_tracker_sigint` if no warnings are expected and one is received. Fix race condition when the child receives SIGINT before it can register signal handlers for it. The race condition occurs when the parent calls `_semaphore_tracker.ensure_running()` (which in turn spawns the semaphore_tracker using `_posixsubprocess.fork_exec`), the child registers the signal handlers and the parent tries to kill the child. What seems to happen is that in some slow systems, the parent sends the signal to kill the child before the child protects against the signal. There is no reliable and portable solution for the parent to wait until the child has registered the signal handlers before sending the signal to kill the child, so a `sleep` is introduced between the spawning of the child and the parent sending the signal to give the child time to register the handlers.
Regarding the unrelated change of the warning check, FWIW I find the existing code (without the change in this PR) clearer.
@taleinat I am happy to undo that change. The problem is that the test was silently failing. I think that, in order to check that the test works as intended without needing to run the suite with -Wall, we need to modify the test.
4b8f76f
to
9408236
Compare
Lib/test/_test_multiprocessing.py
Outdated
time.sleep(0.5) # give it time to die
old_stderr = sys.stderr
r, w = os.pipe()
try:
Notice that we cannot use `test.support.captured_stderr()` as it does not support `fileno()`.
Lib/test/_test_multiprocessing.py
Outdated
old_stderr = sys.stderr
r, w = os.pipe()
try:
    sys.stderr = open(w, "bw")
This line should be before the `try:`.
Done in d075473
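A minimal sketch of the pattern raised in the two review comments above, assuming a POSIX pipe is an acceptable stand-in for `sys.stderr`. The helper name `capture_stderr_via_pipe` is hypothetical and this is not the code from the PR. It only illustrates two points: the replacement stream has a real file descriptor (so `fileno()` works), and `sys.stderr` is replaced before entering `try:` so the `finally:` clause only runs once there is something to restore.

```python
import os
import sys


def capture_stderr_via_pipe(func):
    """Run func() with sys.stderr redirected to a pipe; return what it wrote.

    Hypothetical helper. Only suitable for small outputs, because nothing
    drains the pipe while func() is running.
    """
    old_stderr = sys.stderr
    r, w = os.pipe()
    # Replace sys.stderr *before* the try: if open() raises, sys.stderr
    # has not been touched and the finally block is never entered.
    sys.stderr = open(w, "w")
    try:
        func()
    finally:
        sys.stderr.close()       # flushes and closes the write end
        sys.stderr = old_stderr  # always restore the original stream
    with open(r, "r") as pipe:
        return pipe.read()       # EOF is reached because w is closed
```

Text mode is used here for readability; the snippet under review opened the write end in binary mode.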
Lib/test/_test_multiprocessing.py
Outdated
# information.
_semaphore_tracker._send("PING", "")
with open(r, "rb") as pipe:
    data = pipe.readline()
Are we okay with this potentially hanging indefinitely if something goes wrong?
Commit d075473 implements a way to fail if reading takes too long.
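For readers following the hang concern, here is one way a bounded read could look. This is a sketch under assumptions, not the code in d075473: the function name `read_line_with_timeout` is hypothetical, and it relies on `select.select()` accepting pipe descriptors, which holds on POSIX systems.

```python
import os
import select
import time


def read_line_with_timeout(fd, timeout):
    """Read bytes from fd up to and including b"\n", or raise TimeoutError.

    Hypothetical helper: reads one byte at a time so that nothing past the
    newline is consumed from the pipe.
    """
    deadline = time.monotonic() + timeout
    data = b""
    while not data.endswith(b"\n"):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("no complete line within %.2fs" % timeout)
        ready, _, _ = select.select([fd], [], [], remaining)
        if not ready:
            raise TimeoutError("no complete line within %.2fs" % timeout)
        chunk = os.read(fd, 1)
        if not chunk:  # EOF: the writer closed its end
            break
        data += chunk
    return data
```

A test could then fail with a clear error instead of blocking forever when the expected reply never arrives.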
… the same process
…runs in the same process
@pitrou Could you take a look at this? Here is a summary of all the changes to make reviewing this easier:
|
From a high-level POV:
Is that important? If we have very slow buildbots, we could skip the test on them. Testing this piece of functionality on all buildbots is not critical.
Assuming we validate this approach, there are still problems with the implementation.
 r, w = os.pipe()
 try:
     fds_to_pass.append(r)
     # process will out live us, so no need to wait on pid
     exe = spawn.get_executable()
     args = [exe] + util._args_from_interpreter_flags()
-    args += ['-c', cmd % r]
+    args += ['-c', cmd.format(r, sys.stderr.fileno())]
This won't work if `sys.stderr` isn't an actual file:
>>> io.StringIO().fileno()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
io.UnsupportedOperation: fileno
I was relying on the fact that the tracker seems to work under the assumption that `sys.stderr` has a file descriptor associated:
https://github.com/python/cpython/blob/master/Lib/multiprocessing/semaphore_tracker.py#L60
Relevant lines:
fds_to_pass = []
try:
    fds_to_pass.append(sys.stderr.fileno())
except Exception:
    pass
@pitrou Am I missing something?
Well, the `except Exception` should be clear, no? :-)
Oops, my bad. :)
Let me investigate what options we have. Do you have a preferred approach on how to handle this?
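One possible option, sketched here purely as an illustration of the point made in this thread: mirror the existing `try`/`except Exception` guard and treat "no usable descriptor" as a normal case. The helper name `stderr_fd_or_none` is hypothetical and does not appear in the PR.

```python
import io
import sys


def stderr_fd_or_none(stream=None):
    """Return the stream's file descriptor, or None if it has none.

    Hypothetical helper. In-memory streams such as io.StringIO raise
    io.UnsupportedOperation from fileno(), and sys.stderr can even be
    None, so every failure is mapped to None.
    """
    if stream is None:
        stream = sys.stderr
    try:
        return stream.fileno()
    except Exception:
        return None
```

A caller would then pass the descriptor to the child only when the result is not `None`.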
@@ -128,6 +128,8 @@ def main(fd):
                 cache.add(name)
             elif cmd == b'UNREGISTER':
                 cache.remove(name)
+            elif cmd == b'PING':
+                os.write(fd_write, b"PONG\n")
`os.write` may not write all bytes. You probably want to wrap `fd_write` in a buffered writer as is done for `fd` above.
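To illustrate the partial-write concern: `os.write()` returns the number of bytes actually written, which can be less than the length of the buffer. The sketch below shows two common remedies. Both function names are hypothetical and neither is taken from the PR.

```python
import os


def write_all(fd, data):
    """Keep calling os.write() until every byte of data has been written."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)  # may be a short write
        view = view[written:]


def write_all_buffered(fd, data):
    """Same goal, using a buffered writer wrapped around the descriptor."""
    # closefd=False leaves fd open once the wrapper is closed.
    with open(fd, "wb", closefd=False) as writer:
        writer.write(data)  # flushed completely when the with block exits
```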
When you're done making the requested changes, leave the comment: |
…esting runs in the same process
e16e027
to
9460da4
Compare
That was one of the original problems in the issue. It does not happen only on the slowest buildbots, but these are a reliable place to test for the race condition to happen. In my humble opinion, the existence of the race makes the tests more unreliable and we should fix that, but if you think there is a better approach or a different compromise, I am more than happy to go for that.
@pitrou wrote: "I'm a bit uneasy with this. What if spawnv_passfds takes a very long time for some reason? The user will try to stop it using ^C... and nothing will happen." You are right that SIGINT is blocked while spawnv_passfds() is running, but spawnv_passfds() should be quick in the parent: to oversimplify, it just calls fork(), which should be very quick (it's not an O(n) operation, thanks to copy-on-write). I prefer to see this race condition fixed. IMHO the time window where CTRL+C is blocked is small enough to be acceptable. Note: signals are not lost or ignored; they are simply handled once spawnv_passfds() completes and the signals are unblocked. I suggest backporting the change to the 2.7, 3.6 and 3.7 branches to make our buildbots more reliable.
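The claim that blocked signals are deferred rather than lost can be checked in a single process. The sketch below is an illustration only (the function name `sigint_delivery_order` is hypothetical). It assumes a POSIX platform where `signal.pthread_sigmask`, `signal.pthread_kill` and `signal.sigpending` are available, and that it runs in the main thread, since Python only allows installing signal handlers there.

```python
import signal
import threading


def sigint_delivery_order():
    """Record what happens to a SIGINT sent while SIGINT is blocked."""
    events = []
    old_handler = signal.signal(
        signal.SIGINT, lambda signum, frame: events.append("handled"))
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        # Send SIGINT to this very thread while the signal is blocked.
        signal.pthread_kill(threading.get_ident(), signal.SIGINT)
        if signal.SIGINT in signal.sigpending():
            events.append("pending")
        events.append("still blocked")
    finally:
        # Unblocking lets the pending signal through; the Python-level
        # handler runs only from this point on.
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGINT})
    events.append("unblocked")
    signal.signal(signal.SIGINT, old_handler)
    return events
```

The handler never fires inside the blocked region, which is why a CTRL+C pressed during that window takes effect only after the mask is lifted.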
@pitrou: @pablogsal updated his PR, and it now LGTM. Would you mind taking a second look? Note: this PR has a long history since @pablogsal chose to rewrite his PR with a different approach ("ping" then pthread_sigmask) rather than creating a new PR. IMHO pthread_sigmask is simpler and more reliable than the "ping" idea. There is no need to establish (and then close, reliably) a new communication channel with pthread_sigmask.
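A rough sketch of the `pthread_sigmask` idea described above, using `subprocess` rather than the multiprocessing internals. Everything here is an assumption made for illustration: the names `CHILD_CODE` and `spawn_protected_child` are hypothetical, and the code is not the merged change. It relies on the POSIX rule that a child inherits its parent's signal mask across fork and exec.

```python
import os
import signal
import subprocess
import sys

# Hypothetical child program: ignore SIGINT first, only then unblock it.
CHILD_CODE = """
import signal, time
signal.signal(signal.SIGINT, signal.SIG_IGN)
signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGINT})
time.sleep(0.2)
"""


def spawn_protected_child():
    """Spawn the child with SIGINT blocked so it cannot be hit too early."""
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        # The child starts with SIGINT blocked because the mask is inherited.
        return subprocess.Popen([sys.executable, "-c", CHILD_CODE])
    finally:
        # A SIGINT that reached the parent in the meantime is handled here.
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGINT})
```

With this ordering the parent can send SIGINT immediately after spawning, without a `sleep` and without an extra pipe: the signal either stays pending until the child has chosen to ignore it, or arrives after that point.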
Fair enough. I think this is good to go.
@pitrou: Please replace |
Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.6. |
Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 2.7. |
Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7. |
Sorry, @pablogsal and @pitrou, I could not cleanly backport this to |
GH-9055 is a backport of this pull request to the 3.7 branch. |
…H-7850) Fail `test_semaphore_tracker_sigint` if no warnings are expected and one is received. Fix race condition when the child receives SIGINT before it can register signal handlers for it. The race condition occurs when the parent calls `_semaphore_tracker.ensure_running()` (which in turn spawns the semaphore_tracker using `_posixsubprocess.fork_exec`), the child registers the signal handlers and the parent tries to kill the child. What seem to happen is that in some slow systems, the parent sends the signal to kill the child before the child protects against the signal. (cherry picked from commit ec74d18) Co-authored-by: Pablo Galindo <[email protected]>
Sorry, @pablogsal and @pitrou, I could not cleanly backport this to |
Please don't backport this. This isn't fixing a user-visible bug.
It hurts on the buildbots, and so it makes our life harder (Pablo and me who watch the random failures on buildbots).
Right, but we shouldn't modify library code to heal the buildbots (and risk introducing regressions).
If you don't want to backport the code, I suggest removing or skipping the test in the 3.6 and 3.7 branches. I don't want to have a known race condition in our test suite.
I'm ok with skipping the tests on the buildbots, as long as it's not skipped unconditionally.
Fail `test_semaphore_tracker_sigint` if no warnings are expected and one is received.
Fix race condition when the child receives SIGINT before it can register signal handlers for it.
The race condition occurs when the parent calls `_semaphore_tracker.ensure_running()` (which in turn spawns the semaphore_tracker using `_posixsubprocess.fork_exec`), the child registers the signal handlers and the parent tries to kill the child. What seems to happen is that in some slow systems, the parent sends the signal to kill the child before the child protects against the signal.
There is no reliable and portable solution for the parent to wait until the child has registered the signal handlers before sending the signal to kill the child, so a `sleep` is introduced between the spawning of the child and the parent sending the signal to give the child time to register the handlers.
https://bugs.python.org/issue33613