
[wptserve] Eliminate race condition #14024

Merged
jgraham merged 4 commits into web-platform-tests:master on Nov 14, 2018

Conversation

@jugglinmike (Contributor)

This race condition was expressed during testing sessions where the first test to use the Stash feature did so with multiple requests made in parallel.


The tests are brittle, but given the conditions we need to model, brittleness seems unavoidable.

The server code hasn't changed recently, so it's a little surprising that a race condition should suddenly become an issue in Taskcluster and in the results-collection project. Reviewing the status for each commit in master shows that this intermittent error first surfaced on November 5. Guy Fawkes is the obvious suspect, but the trail's gone cold. I turned to more technical explanations.

The most recent Chromedriver release is version 2.43, and that was published on October 16, ruling out a regression there. Chrome itself may have regressed, but I'm not set up to do any sort of bisecting, so I'm proceeding under the assumption that this is an issue in WPT.

A small incongruity between the Buildbot and Taskcluster setups is useful here. Each system experiences the issue on a different "chunk" of WPT, but they define the segments in different terms. Buildbot uses 20 "chunks" containing tests of all types (failing on number 3), while Taskcluster uses 15 "chunks" for the testharness.js tests (failing on number 13). We can narrow the set of suspect files by taking the intersection of the following two lists:

./wpt run --list-tests --this-chunk 3 --total-chunks 20 chrome
./wpt run --list-tests --this-chunk 13 --total-chunks 15 --test-type testharness chrome

Taking that intersection leaves 187 test files to consider (saved here as both.txt). Feeding those files into git log shows that only one of them changed on that day:

$ cat both.txt | xargs git log --after=2018-11-04 --before=2018-11-06 --oneline --
7e66d13 Ensure eval flag is properly transfered to context from CSPRO

content-security-policy/script-src/eval-allowed-in-report-only-mode-and-sends-report.html doesn't appear to be doing anything invalid. It does make parallel requests to a Stash-enabled endpoint, though (via checkReport.sub.js). The race condition fixed by this patch would only be expressed when the first test to use Stash did so via two parallel requests; my guess is that's uncommon but became true on November 5.
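As a rough illustration only (hypothetical names, not the wptserve code): the failure mode is the classic unsynchronized lazy initialization, where two parallel first requests can each observe the shared state as missing and each install their own copy, so one request's write is silently lost.

import threading

class RacyStash(object):
    # Hypothetical sketch of the failure mode: shared state is created lazily
    # on first use, without synchronization around the check-and-create step.
    _shared = None

    def _ensure_initialized(self):
        if RacyStash._shared is None:      # two first requests can both see None...
            RacyStash._shared = {}         # ...and each installs a fresh dict

    def put(self, key, value):
        self._ensure_initialized()
        RacyStash._shared[key] = value

    def take(self, key):
        self._ensure_initialized()
        return RacyStash._shared.pop(key, None)

# Holding a lock across the check-and-create removes the window in which two
# threads can both decide to initialize.
_init_lock = threading.Lock()

def ensure_initialized_safely():
    with _init_lock:
        if RacyStash._shared is None:
            RacyStash._shared = {}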

@gsnedders (Member)

At first glance this looks good, but I definitely need to have a closer look while more awake; thanks for tracking this down!

@jgraham (Contributor) left a comment


Excellent job diagnosing this issue and providing a fix.

I am slightly concerned about the tests; it took a lot longer to understand the test code than it did to understand the fix. If we want to keep them I think we should factor out the 90% common code, and comment it thoroughly so that it's easier to understand what is supposed to be happening at each point; a test involving two processes and two threads is inherently complex and difficult to understand.

@gsnedders (Member) left a comment


(Mostly just agree with @jgraham)

"""Ensure that delays in proxied Lock retrieval do not interfere with
initialization in parallel threads."""

class SlowLock(BaseManager):
@gsnedders (Member)


What makes this slow?

"""Ensure that delays in proxied `dict` retrieval do not interfere with
initialization in parallel threads."""

class SlowDict(BaseManager):
@gsnedders (Member)


What makes this slow?

SlowLock.register("get_dict", callable=lambda: {})
SlowLock.register("Lock", callable=handle_lock_request)

slowlock = SlowLock(("localhost", 4543), b"some key")
@gsnedders (Member)


I hate to be stupid, but what effect does this manager have?

@jugglinmike (Contributor, Author)

Thanks for the review, @jgraham and @gsnedders. I've attempted to address @jgraham's feedback by answering @gsnedders' questions via inline documentation. I've also reduced duplication by factoring out the multiprocessing target (run) and sharing it between both tests.
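For readers unfamiliar with the machinery the test doubles sit on top of, here is a self-contained sketch of the multiprocessing BaseManager pattern (illustrative names only, not the PR's code): callables registered on the manager run in its server process, and callers get back proxies to the objects those callables return. The SlowLock/SlowDict helpers appear to use the same registration hooks, but with callables that block on a lock so a delay can be induced deliberately.

import threading
from multiprocessing.managers import BaseManager

def make_dict():
    return {}

def make_lock():
    return threading.Lock()

class StashManager(BaseManager):
    pass

# Each registered name becomes a method on the manager; calling it asks the
# server process to invoke the callable and hands back a proxy to the result.
StashManager.register("get_dict", callable=make_dict)
StashManager.register("Lock", callable=make_lock)

if __name__ == "__main__":
    manager = StashManager(("localhost", 0), b"some key")
    manager.start()                      # spawns the server process
    shared = manager.get_dict()          # proxy to a dict held by the server
    lock = manager.Lock()                # proxy to a Lock held by the server
    lock.acquire()
    shared.update({"token": "value"})
    lock.release()
    print(shared.get("token"))           # -> value
    manager.shutdown()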

@jgraham (Contributor) left a comment


This looks better, and I don't want to block it landing, but there is possibly scope for one more round of cleanup.


response_lock.release()

# Wait for both threads to complete and report their stateto the test
@jgraham (Contributor)


typo: stateto

response_lock.acquire()
return threading.Lock()

SlowLock.register("get_dict", callable=lambda: {})
@jgraham (Contributor)


It still somewhat seems like you could write these tests as a parameterised test in which your parameters are get_dict and Lock, with one case having the parameters (lambda: {}, mutex_lock_request) and one having the parameters (mutex_get_dict, lambda: threading.Lock()). I also think the names with "mutex" in them make more sense than the generic handle_ names.

@jgraham (Contributor)


Or maybe blockable rather than mutex
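For what it's worth, a purely illustrative sketch of the shape being suggested, with trivial stand-ins for the blockable helpers (the reply below explains why the real helpers resist this refactor):

import threading
import pytest

def blockable_get_dict():
    # Stand-in; the real helper would block on a shared lock before returning.
    return {}

def blockable_get_lock():
    # Stand-in; the real helper would block on a shared lock before returning.
    return threading.Lock()

@pytest.mark.parametrize("get_dict,get_lock", [
    (lambda: {}, blockable_get_lock),        # only Lock retrieval is delayed
    (blockable_get_dict, threading.Lock),    # only dict retrieval is delayed
])
def test_slow_proxy_retrieval(get_dict, get_lock):
    # A real version would register get_dict and get_lock on the BaseManager
    # and drive it from a child process; here we only exercise the callables.
    assert isinstance(get_dict(), dict)
    lock = get_lock()
    assert lock.acquire(False)
    lock.release()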

@jugglinmike (Contributor, Author)

Those functions can't be parameterized like that because they need to reference locks (currently named request_lock and response_lock) which are shared with the child process created by the test.

We could also parameterize the locks, and maybe make some sort of mutex-function-producing factory function, but that kind of abstraction is contrary to our goal of making the test easier to understand.

I agree that mutex_ is more descriptive than handle_. I've updated the
function names accordingly.

@jgraham (Contributor) commented Nov 14, 2018

I mean you could make the locks globals and set them at the start of the test and unset them in the cleanup. But I'm going to merge this PR now for the correctness fix, and if we want to clean up the tests more we can do that in a later PR.
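A hedged sketch of that idea (hypothetical names; the real tests would also need the locks visible to a fork-inherited child process): a fixture installs module-level locks before each test and unsets them during cleanup, so the blockable helpers no longer have to close over test-local state.

import threading
import pytest

# Module-level locks, installed per test by the fixture below; a fork-based
# child process inherits them, so helpers and the child share the same objects.
request_lock = None
response_lock = None

@pytest.fixture
def shared_locks():
    global request_lock, response_lock
    request_lock = threading.Lock()
    response_lock = threading.Lock()
    yield request_lock, response_lock
    # Cleanup: unset the globals so one test cannot leak state into the next.
    request_lock = None
    response_lock = None

def blockable_get_lock():
    # Because the locks are globals, this helper can be passed around as a
    # plain parameter instead of being redefined inside every test.
    with request_lock:
        return threading.Lock()

def test_with_shared_locks(shared_locks):
    assert blockable_get_lock() is not None   # returns once request_lock is free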

@jgraham jgraham merged commit cbb25e2 into web-platform-tests:master Nov 14, 2018
@jugglinmike (Contributor, Author)

Thanks!
