bpo-33613, test_semaphore_tracker_sigint: fix race condition #7850

pablogsal · 2018-06-21T21:40:14Z

Fail test_semaphore_tracker_sigint if no warnings are expected and
one is received.

Fix race condition when the child receives SIGINT
before it can register signal handlers for it.

The race condition occurs when the parent calls
_semaphore_tracker.ensure_running() (which in turn spawns the
semaphore_tracker using _posixsubprocess.fork_exec), the child
registers the signal handlers and the parent tries to kill the child.
What seems to happen is that on some slow systems the parent sends the
signal to kill the child before the child protects against the signal.

There is no reliable and portable way for the parent to wait until
the child has registered the signal handlers before sending the signal,
so a sleep is introduced between spawning the child and sending the
signal, to give the child time to register the handlers.

https://bugs.python.org/issue33613


taleinat · 2018-06-24T07:05:18Z

Regarding the unrelated change of the warning check, FWIW I find the existing code (without the change in this PR) clearer.

pablogsal · 2018-06-24T12:50:02Z

@taleinat I am happy to undo that change. The problem is that the test was silently failing when SIGINT was delivered, because the test does not check that no warnings are raised. Note that the presence of a warning means that the process died, and therefore that SIGINT did kill the process, which is precisely what the test is meant to verify does not happen.

I think that, in order to check that the test works as intended without running the suite with -Wall, we need to modify the test.
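A minimal sketch of the kind of check described in this comment, assuming the code under test can be wrapped in a callable. `collect_warnings`, `quiet` and `noisy` are hypothetical names used for illustration and are not taken from this PR.

```python
import warnings


def collect_warnings(func):
    """Run func() and return the list of warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones the default filters would hide.
        warnings.simplefilter("always")
        func()
    return caught


def quiet():
    """Stand-in for a run in which the tracker survives the signal."""


def noisy():
    # Stand-in for the warning emitted when a dead tracker is restarted.
    warnings.warn("tracker died and was restarted", UserWarning)
```

With a helper like this, a test that expects the tracker to survive can assert that the returned list is empty, so an unexpected warning fails the test even without a `-W` option.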

pablogsal · 2018-06-24T20:34:22Z

Lib/test/_test_multiprocessing.py

+            time.sleep(0.5)  # give it time to die
+        old_stderr = sys.stderr
+        r, w = os.pipe()
+        try:


Notice that we cannot use test.support.captured_stderr() as it does not support fileno().


taleinat · 2018-06-30T13:29:09Z

Lib/test/_test_multiprocessing.py

+        old_stderr = sys.stderr
+        r, w = os.pipe()
+        try:
+            sys.stderr = open(w, "bw")


This line should be before the try:.

Done in d075473

taleinat · 2018-06-30T13:29:37Z

Lib/test/_test_multiprocessing.py

+            # information.
+            _semaphore_tracker._send("PING", "")
+            with open(r, "rb") as pipe:
+                data = pipe.readline()


Are we okay with this potentially hanging indefinitely if something goes wrong?

Commit d075473 implements a way to fail if reading takes too long.
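For illustration only, one way to bound a read from a pipe is to wait on the descriptor with `select` before each read. This is a sketch under the assumption of a POSIX pipe; it is not necessarily what commit d075473 does, and `read_line_with_timeout` is a hypothetical helper.

```python
import os
import select
import time


def read_line_with_timeout(fd, timeout):
    """Read one newline-terminated line from fd or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    chunks = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("no complete line received in time")
        # Wait until fd is readable, but never longer than the time left.
        ready, _, _ = select.select([fd], [], [], remaining)
        if not ready:
            raise TimeoutError("no complete line received in time")
        chunk = os.read(fd, 1)
        if not chunk:  # EOF: the writer closed its end of the pipe
            break
        chunks.append(chunk)
        if chunk == b"\n":
            break
    return b"".join(chunks)
```

Reading one byte at a time is slow but keeps the sketch short and avoids consuming data past the newline.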


pablogsal · 2018-07-24T12:09:11Z

@pitrou Could you take a look at this?

Here is a summary of all the changes to make reviewing this easier:

Updated the tracker to answer a PING command so we know when it is alive.
The test function check_semaphore_tracker_death restarts the tracker every time to avoid interference between test runs and to allow parallel testing.
The test function check_semaphore_tracker_death now checks for warnings, as these indicate that the tracker died and was restarted (this was one of the main errors and the reason tests were passing without the -W error option).
Fixed the race condition in the tests: the parent was killing the child before the child could protect itself against the signals. I was able to reproduce the problem on one of the slowest buildbots (gcc112.fsffrance.org and gcc110.fsffrance.org) and I can confirm that this PR fixes the problem.
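The PING idea from the first item can be pictured with a toy command dispatcher. This is a simplified model, not the tracker's code: `handle_line`, the `reply` callback and the `COMMAND:name` line layout are assumptions made for the example.

```python
def handle_line(line, cache, reply):
    """Process one command line of the (assumed) form b"COMMAND:name\n"."""
    cmd, _, name = line.strip().partition(b":")
    if cmd == b"REGISTER":
        cache.add(name)
    elif cmd == b"UNREGISTER":
        cache.discard(name)
    elif cmd == b"PING":
        # Answering proves the process is alive and reading its pipe.
        reply(b"PONG\n")
    else:
        raise RuntimeError("unrecognized command %r" % cmd)
```

A caller that receives `PONG` knows the child has reached its command loop, which is the property the test needed before sending a signal.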

pitrou · 2018-07-24T12:19:53Z

From a high-level POV:

> Fixed the race condition in the tests: the parent was killing the child before the child could protect itself against the signals. I was able to reproduce the problem on one of the slowest buildbots (gcc112.fsffrance.org and gcc110.fsffrance.org) and I can confirm that this PR fixes the problem.

Is that important? If we have very slow buildbots, we could skip the test on them. Testing this piece of functionality on all buildbots is not critical.

pitrou

Assuming we validate this approach, there are still problems with the implementation.

pitrou · 2018-07-24T12:17:17Z

Lib/multiprocessing/semaphore_tracker.py

            r, w = os.pipe()
            try:
                fds_to_pass.append(r)
                # process will out live us, so no need to wait on pid
                exe = spawn.get_executable()
                args = [exe] + util._args_from_interpreter_flags()
-                args += ['-c', cmd % r]
+                args += ['-c', cmd.format(r, sys.stderr.fileno())]


This won't work if sys.stderr isn't an actual file:

>>> io.StringIO().fileno()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
io.UnsupportedOperation: fileno

I was relying on the fact that the tracker seems to work under the assumption that sys.stderr has a file descriptor associated with it:

https://github.com/python/cpython/blob/master/Lib/multiprocessing/semaphore_tracker.py#L60

Relevant lines:

fds_to_pass = []
try:
    fds_to_pass.append(sys.stderr.fileno())
except Exception:
    pass

@pitrou Am I missing something?

Well, the except Exception should be clear, no? :-)

Oops, my bad. :)

Let me investigate what options we have. Do you have a preferred approach for handling this?
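The point of this exchange can be shown with a small sketch of the guarded `fileno()` pattern quoted in the thread. `inheritable_stderr_fds` is a hypothetical wrapper written for this example, and the stream is passed in explicitly so the behaviour is easy to check.

```python
import io


def inheritable_stderr_fds(stream):
    """Return [fd] if stream is backed by a file descriptor, else []."""
    fds = []
    try:
        fds.append(stream.fileno())
    except Exception:
        # In-memory streams such as io.StringIO have no descriptor,
        # so there is nothing to pass on to the child.
        pass
    return fds


print(inheritable_stderr_fds(io.StringIO()))  # prints []
```

Code that calls `sys.stderr.fileno()` without such a guard would raise as soon as `sys.stderr` is replaced by an in-memory stream.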

pitrou · 2018-07-24T12:18:11Z

Lib/multiprocessing/semaphore_tracker.py

@@ -128,6 +128,8 @@ def main(fd):
                        cache.add(name)
                    elif cmd == b'UNREGISTER':
                        cache.remove(name)
+                    elif cmd == b'PING':
+                        os.write(fd_write, b"PONG\n")


os.write may not write all bytes. You probably want to wrap fd_write in a buffered writer as is done for fd above.
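A sketch of the two usual remedies for short writes, assuming a plain POSIX file descriptor. `write_all` and `send_pong_buffered` are hypothetical helpers, not code from this PR.

```python
import os


def write_all(fd, data):
    """Call os.write() repeatedly until every byte of data is written."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)  # may write fewer bytes than requested
        view = view[written:]


def send_pong_buffered(fd):
    """Send the reply through a buffered writer, which retries internally."""
    # closefd=False leaves fd open when the wrapper is closed.
    with open(fd, "wb", closefd=False) as writer:
        writer.write(b"PONG\n")
```

The buffered variant matches the reviewer's suggestion more closely; the explicit loop shows what the buffering layer does on the caller's behalf.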

bedevere-bot · 2018-07-24T12:20:28Z

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.


pablogsal · 2018-07-24T12:34:50Z

> Is that important? If we have very slow buildbots, we could skip the test on them. Testing this piece of functionality on all buildbots is not critical.

That was one of the original problems in the issue. It does not happen only on the slowest buildbots, but these are a reliable place to reproduce the race condition. In my humble opinion, the existence of the race makes the tests more unreliable and we should fix that, but if you think there is a better approach or a different compromise, I am more than happy to go for that.

vstinner · 2018-09-03T20:49:57Z

@pitrou wrote: "I'm a bit uneasy with this. What if spawnv_passfds takes a very long time for some reason? The user will try to stop it using ^C... and nothing will happen."

You are right that SIGINT is blocked while spawnv_passfds() is running, but spawnv_passfds() should be quick in the parent: to oversimplify, it just calls fork(), which should be very quick (it's not an O(n) operation, thanks to copy-on-write). I prefer to see this race condition fixed. IMHO the time window during which CTRL+C is blocked is short enough to be acceptable.

Note: signals are not lost or ignored, it's just that signals are only handled once spawnv_passfds() completes (once the signals are unblocked).
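The deferral described in this note can be observed directly. The following is a sketch that assumes a POSIX system and that it runs in the main thread; `sigint_is_deferred_while_blocked` is a hypothetical name, and the signal is sent to the current thread only to keep the demonstration deterministic.

```python
import signal
import threading


def sigint_is_deferred_while_blocked():
    """Return (pending, handled_while_blocked, handled_after_unblock)."""
    received = []
    old_handler = signal.signal(
        signal.SIGINT, lambda signum, frame: received.append(signum))
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        # Simulate Ctrl+C arriving while the signal is blocked.
        signal.pthread_kill(threading.get_ident(), signal.SIGINT)
        pending = signal.SIGINT in signal.sigpending()
        handled_while_blocked = bool(received)
    finally:
        # Restoring the previous mask lets the pending signal through.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    handled_after_unblock = bool(received)
    signal.signal(signal.SIGINT, old_handler)
    return pending, handled_while_blocked, handled_after_unblock
```

If SIGINT was not already blocked by the caller, the signal shows up as pending while blocked and the handler runs only once the mask is restored.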

I suggest backporting the change to the 2.7, 3.6 and 3.7 branches to make our buildbots more reliable.

vstinner · 2018-09-03T20:55:35Z

@pitrou: @pablogsal updated his PR, and it now LGTM. Would you mind taking a second look?

Note: this PR has a long history since @pablogsal chose to rewrite his PR with a different approach ("ping" then pthread_sigmask) rather than creating a new PR.

IMHO pthread_sigmask is simpler and more reliable than the "ping" idea. With pthread_sigmask there is no need to establish (and then reliably close) a new communication channel.
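A minimal sketch of the pthread_sigmask approach, assuming a POSIX system and using `subprocess.Popen` as a stand-in for the tracker's own spawning code. `spawn_with_sigint_blocked`, `CHILD_CODE` and `child_starts_with_sigint_blocked` are hypothetical names and this is not the merged implementation.

```python
import signal
import subprocess
import sys

# The child reports whether SIGINT is blocked in its inherited signal mask.
CHILD_CODE = (
    "import signal\n"
    "blocked = signal.pthread_sigmask(signal.SIG_BLOCK, [])\n"
    "print(signal.SIGINT in blocked)\n"
)


def spawn_with_sigint_blocked(args):
    """Start a child process while SIGINT is blocked in the caller."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        # The signal mask is inherited across fork() and exec().
        return subprocess.Popen(args, stdout=subprocess.PIPE)
    finally:
        # The parent becomes interruptible again right after the spawn.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


def child_starts_with_sigint_blocked():
    proc = spawn_with_sigint_blocked([sys.executable, "-c", CHILD_CODE])
    out, _ = proc.communicate()
    return out.strip() == b"True"
```

Because the child starts with SIGINT already blocked, there is no window in which an early signal can kill it; the child can install its handlers and then unblock the signal at its own pace.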

pitrou · 2018-09-04T08:14:48Z

Fair enough. I think this is good to go.

bedevere-bot · 2018-09-04T08:53:57Z

@pitrou: Please replace # with GH- in the commit message next time. Thanks!

miss-islington · 2018-09-04T08:55:59Z

Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.6.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

miss-islington · 2018-09-04T08:55:59Z

Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 2.7.
🐍🍒⛏🤖

miss-islington · 2018-09-04T08:55:59Z

Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

miss-islington · 2018-09-04T08:56:09Z

Sorry, @pablogsal and @pitrou, I could not cleanly backport this to 2.7 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker ec74d187f50a8a48f94eb37023300583fbd644cc 2.7

bedevere-bot · 2018-09-04T08:56:19Z

GH-9055 is a backport of this pull request to the 3.7 branch.


miss-islington · 2018-09-04T08:56:21Z

Sorry, @pablogsal and @pitrou, I could not cleanly backport this to 3.6 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker ec74d187f50a8a48f94eb37023300583fbd644cc 3.6

pitrou · 2018-09-04T08:57:02Z

Please don't backport this. This isn't fixing a user-visible bug.

vstinner · 2018-09-04T08:57:51Z

> Please don't backport this. This isn't fixing a user-visible bug.

It hurts on the buildbots, and so it makes our life harder (Pablo and me who watch the random failures on buildbots).

pitrou · 2018-09-04T08:59:00Z

Right, but we shouldn't modify library code to heal the buildbots (and risk introducing regressions).

vstinner · 2018-09-04T08:59:31Z

If you don't want to backport the code, I suggest removing or skipping the test in the 3.6 and 3.7 branches. I don't want to have a known race condition in our test suite.

pitrou · 2018-09-04T09:00:35Z

I'm ok with skipping the tests on the buildbots, as long as it's not skipped unconditionally.


pablogsal requested a review from vstinner June 21, 2018 21:40

the-knights-who-say-ni added the CLA signed label Jun 21, 2018

bedevere-bot added the awaiting merge label Jun 21, 2018

pablogsal force-pushed the bpo33613 branch from 008c17c to 5f1df08 Compare June 21, 2018 22:34

pablogsal force-pushed the bpo33613 branch 3 times, most recently from 4b8f76f to 9408236 Compare June 24, 2018 18:35

Implement PONG command in the semaphore tracker

4f9babb

pablogsal force-pushed the bpo33613 branch from 9408236 to 4f9babb Compare June 24, 2018 20:01


Ensure the tracker is killed to allow multiple testing runs in the same process

26d81a3

pablogsal force-pushed the bpo33613 branch from 1215a11 to 26d81a3 Compare June 24, 2018 20:35

taleinat reviewed Jun 30, 2018


pablogsal force-pushed the bpo33613 branch from d075473 to b315e8e Compare July 6, 2018 23:58

fixup! Ensure the tracker is killed to allow multiple testing runs in the same process

a7d5c59

pablogsal force-pushed the bpo33613 branch from b315e8e to a7d5c59 Compare July 7, 2018 20:17

fixup! fixup! Ensure the tracker is killed to allow multiple testing runs in the same process

a735d49

pitrou requested changes Jul 24, 2018


bedevere-bot added awaiting changes and removed awaiting merge labels Jul 24, 2018

fixup! fixup! fixup! Ensure the tracker is killed to allow multiple testing runs in the same process

9460da4

pablogsal force-pushed the bpo33613 branch 2 times, most recently from e16e027 to 9460da4 Compare July 24, 2018 12:29

Remove extraneous NEWS file

c0a088c

pitrou approved these changes Sep 4, 2018


pitrou merged commit ec74d18 into python:master Sep 4, 2018

bedevere-bot removed the awaiting merge label Sep 4, 2018

vstinner added needs backport to 3.6 labels Sep 4, 2018

miss-islington assigned pitrou Sep 4, 2018

bedevere-bot removed the needs backport to 3.7 label Sep 4, 2018

pitrou removed needs backport to 2.7 labels Sep 4, 2018

serhiy-storchaka reviewed Oct 6, 2018


pablogsal deleted the bpo33613 branch May 19, 2021 18:57
