
[wptserve] Eliminate race condition #14024

Merged
jgraham merged 4 commits into web-platform-tests:master on Nov 14, 2018

Conversation

@jugglinmike (Contributor)

This race condition was expressed during testing sessions where the first test to use the Stash feature did so with multiple requests made in parallel.


The tests are brittle, but given the conditions we need to model, brittleness seems unavoidable.

The server code hasn't changed recently, so it's a little surprising that a race condition should suddenly become an issue in Taskcluster and in the results-collection project. Reviewing the status for each commit in master shows that this intermittent error first surfaced on November 5. Guy Fawkes is the obvious suspect, but the trail's gone cold. I turned to more technical explanations.

The most recent Chromedriver release is version 2.43, and that was published on October 16, ruling out a regression there. Chrome itself may have regressed, but I'm not set up to do any sort of bisecting, so I'm proceeding under the assumption that this is an issue in WPT.

A small incongruity between the Buildbot and Taskcluster setups is useful here. Each system experiences the issue on a different "chunk" of WPT, but they define the segments in different terms. Buildbot uses 20 "chunks" containing tests of all types (failing on number 3), while Taskcluster uses 15 "chunks" for the testharness.js tests (failing on number 13). We can narrow the set of suspect files by taking the intersection of the following two lists:

./wpt run --list-tests --this-chunk 3 --total-chunks 20 chrome
./wpt run --list-tests --this-chunk 13 --total-chunks 15 --test-type testharness chrome

Taking that intersection leaves 187 test files to consider (saved here as both.txt). Feeding those files into git log shows that only one of them changed on that day:

$ cat both.txt | xargs git log --after=2018-11-04 --before=2018-11-06 --oneline --
7e66d13 Ensure eval flag is properly transfered to context from CSPRO

content-security-policy/script-src/eval-allowed-in-report-only-mode-and-sends-report.html doesn't appear to be doing anything invalid. It does make parallel requests to a Stash-enabled endpoint, though (via checkReport.sub.js). The race condition fixed by this patch would only be expressed when the first test to use Stash did so via two parallel requests; my guess is that's uncommon but became true on November 5.
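As a rough illustration only (hypothetical names, not the wptserve code): the failure mode is the classic unsynchronized lazy initialization, where two parallel first requests can each observe the shared state as missing and each install their own copy, so one request's write is silently lost.

import threading

class RacyStash(object):
    # Hypothetical sketch of the failure mode: shared state is created lazily
    # on first use, without synchronization around the check-and-create step.
    _shared = None

    def _ensure_initialized(self):
        if RacyStash._shared is None:      # two first requests can both see None...
            RacyStash._shared = {}         # ...and each installs a fresh dict

    def put(self, key, value):
        self._ensure_initialized()
        RacyStash._shared[key] = value

    def take(self, key):
        self._ensure_initialized()
        return RacyStash._shared.pop(key, None)

# Holding a lock across the check-and-create removes the window in which two
# threads can both decide to initialize.
_init_lock = threading.Lock()

def ensure_initialized_safely():
    with _init_lock:
        if RacyStash._shared is None:
            RacyStash._shared = {}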

@gsnedders (Member)

At first glance this looks good, but I definitely need to have a closer look while more awake; thanks for tracking this down!

@jgraham (Contributor) left a comment


Excellent job diagnosing this issue and providing a fix.

I am slightly concerned about the tests; it took a lot longer to understand the test code than it did to understand the fix. If we want to keep them I think we should factor out the 90% common code, and comment it thoroughly so that it's easier to understand what is supposed to be happening at each point; a test involving two processes and two threads is inherently complex and difficult to understand.

@gsnedders (Member) left a comment


(Mostly just agree with @jgraham)

"""Ensure that delays in proxied Lock retrieval do not interfere with
initialization in parallel threads."""

class SlowLock(BaseManager):
@gsnedders (Member)


What makes this slow?

"""Ensure that delays in proxied `dict` retrieval do not interfere with
initialization in parallel threads."""

class SlowDict(BaseManager):
@gsnedders (Member)


What makes this slow?

SlowLock.register("get_dict", callable=lambda: {})
SlowLock.register("Lock", callable=handle_lock_request)

slowlock = SlowLock(("localhost", 4543), b"some key")
@gsnedders (Member)


I hate to be stupid, but what effect does this manager have?

@jugglinmike (Contributor, Author)

Thanks for the review, @jgraham and @gsnedders. I've attempted to address @jgraham's feedback by answering @gsnedders' questions via inline documentation. I've also reduced duplication by factoring out the multiprocessing target (run) and sharing it between both tests.
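For readers unfamiliar with the machinery the test doubles sit on top of, here is a self-contained sketch of the multiprocessing BaseManager pattern (illustrative names only, not the PR's code): callables registered on the manager run in its server process, and callers get back proxies to the objects those callables return. The SlowLock/SlowDict helpers appear to use the same registration hooks, but with callables that block on a lock so a delay can be induced deliberately.

import threading
from multiprocessing.managers import BaseManager

def make_dict():
    return {}

def make_lock():
    return threading.Lock()

class StashManager(BaseManager):
    pass

# Each registered name becomes a method on the manager; calling it asks the
# server process to invoke the callable and hands back a proxy to the result.
StashManager.register("get_dict", callable=make_dict)
StashManager.register("Lock", callable=make_lock)

if __name__ == "__main__":
    manager = StashManager(("localhost", 0), b"some key")
    manager.start()                      # spawns the server process
    shared = manager.get_dict()          # proxy to a dict held by the server
    lock = manager.Lock()                # proxy to a Lock held by the server
    lock.acquire()
    shared.update({"token": "value"})
    lock.release()
    print(shared.get("token"))           # -> value
    manager.shutdown()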

@jgraham (Contributor) left a comment


This looks better, and I don't want to block it landing, but there is possibly scope for one more round of cleanup.


response_lock.release()

# Wait for both threads to complete and report their stateto the test
@jgraham (Contributor)


typo: stateto

response_lock.acquire()
return threading.Lock()

SlowLock.register("get_dict", callable=lambda: {})
@jgraham (Contributor)


It still somewhat seems like you could write these tests as a parameterised test in which your parameters are get_dict and Lock, with one case having the parameters (lambda: {}, mutex_lock_request) and one having the parameters (mutex_get_dict, lambda: threading.Lock()). I also think the names with "mutex" in them make more sense than the generic handle_ names.

@jgraham (Contributor)


Or maybe blockable rather than mutex
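For what it's worth, a purely illustrative sketch of the shape being suggested, with trivial stand-ins for the blockable helpers (the reply below explains why the real helpers resist this refactor):

import threading
import pytest

def blockable_get_dict():
    # Stand-in; the real helper would block on a shared lock before returning.
    return {}

def blockable_get_lock():
    # Stand-in; the real helper would block on a shared lock before returning.
    return threading.Lock()

@pytest.mark.parametrize("get_dict,get_lock", [
    (lambda: {}, blockable_get_lock),        # only Lock retrieval is delayed
    (blockable_get_dict, threading.Lock),    # only dict retrieval is delayed
])
def test_slow_proxy_retrieval(get_dict, get_lock):
    # A real version would register get_dict and get_lock on the BaseManager
    # and drive it from a child process; here we only exercise the callables.
    assert isinstance(get_dict(), dict)
    lock = get_lock()
    assert lock.acquire(False)
    lock.release()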

@jugglinmike (Contributor, Author)

Those functions can't be parameterized like that because they need to reference locks (currently named request_lock and response_lock) which are shared with the child process created by the test.

We could also parameterize the locks, and maybe make some sort of mutex-function-producing factory function, but that kind of abstraction is contrary to our goal of making the test easier to understand.

I agree that mutex_ is more descriptive than handle_. I've updated the
function names accordingly.

@jgraham (Contributor) commented Nov 14, 2018

I mean you could make the locks globals and set them at the start of the test and unset them in the cleanup. But I'm going to merge this PR now for the correctness fix, and if we want to clean up the tests more we can do that in a later PR.
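A hedged sketch of that idea (hypothetical names; the real tests would also need the locks visible to a fork-inherited child process): a fixture installs module-level locks before each test and unsets them during cleanup, so the blockable helpers no longer have to close over test-local state.

import threading
import pytest

# Module-level locks, installed per test by the fixture below; a fork-based
# child process inherits them, so helpers and the child share the same objects.
request_lock = None
response_lock = None

@pytest.fixture
def shared_locks():
    global request_lock, response_lock
    request_lock = threading.Lock()
    response_lock = threading.Lock()
    yield request_lock, response_lock
    # Cleanup: unset the globals so one test cannot leak state into the next.
    request_lock = None
    response_lock = None

def blockable_get_lock():
    # Because the locks are globals, this helper can be passed around as a
    # plain parameter instead of being redefined inside every test.
    with request_lock:
        return threading.Lock()

def test_with_shared_locks(shared_locks):
    assert blockable_get_lock() is not None   # returns once request_lock is free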

@jgraham jgraham merged commit cbb25e2 into web-platform-tests:master Nov 14, 2018
@jugglinmike (Contributor, Author)

Thanks!
