[wptserve] Eliminate race condition #14024
Conversation
This race condition was expressed during testing sessions where the first test to use the Stash feature did so with multiple requests made in parallel.
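A minimal sketch of this class of race (hypothetical code, not the actual wptserve implementation): two parallel requests lazily initializing shared state without synchronization can each create their own copy.

```python
import threading

class LazyStash:
    """Hypothetical lazy initializer illustrating the race."""

    def __init__(self):
        self._store = None
        self._lock = threading.Lock()

    def get_unsafe(self):
        # Check-then-act with no synchronization: two parallel
        # requests may both observe None and create distinct stores,
        # so one request's writes are silently lost.
        if self._store is None:
            self._store = {}
        return self._store

    def get(self):
        # The general fix: guard initialization with a lock so only
        # one thread ever creates the store.
        with self._lock:
            if self._store is None:
                self._store = {}
        return self._store
```

The PR applies this kind of synchronization to the Stash's shared state, which in wptserve lives behind a multiprocessing manager rather than a plain attribute.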
At first glance this looks good, but I definitely need to have a closer look while more awake; thanks for tracking this down!
Excellent job diagnosing this issue and providing a fix.
I am slightly concerned about the tests; it took a lot longer to understand the test code than it did to understand the fix. If we want to keep them I think we should factor out the 90% common code, and comment it thoroughly so that it's easier to understand what is supposed to be happening at each point; a test involving two processes and two threads is inherently complex and difficult to understand.
(Mostly just agree with @jgraham)
"""Ensure that delays in proxied Lock retrieval do not interfere with
initialization in parallel threads."""

class SlowLock(BaseManager):
What makes this slow?
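For context, a guess at the shape being asked about: the manager subclass itself is not slow; the slowness comes from the callable registered for "Lock", which blocks before returning. In the test that blocking is done via a lock held by the test (the quoted response_lock.acquire()); here it is simulated with a sleep.

```python
import threading
import time
from multiprocessing.managers import BaseManager

class SlowLock(BaseManager):
    # The subclass adds nothing itself; "slow" refers to the
    # behaviour of the callable registered for "Lock" below.
    pass

def handle_lock_request():
    # Stand-in for the test's blocking behaviour: the real helper
    # blocks on a lock held by the test until the test chooses to
    # let the request complete.
    time.sleep(0.05)
    return threading.Lock()

SlowLock.register("get_dict", callable=lambda: {})
SlowLock.register("Lock", callable=handle_lock_request)
```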
"""Ensure that delays in proxied `dict` retrieval do not interfere with
initialization in parallel threads."""

class SlowDict(BaseManager):
What makes this slow?
SlowLock.register("get_dict", callable=lambda: {})
SlowLock.register("Lock", callable=handle_lock_request)

slowlock = SlowLock(("localhost", 4543), b"some key")
I hate to be stupid, but what effect does this manager have?
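A rough sketch of what a BaseManager pair of this shape does (class name and port handling here are illustrative): the constructor only records an address and authkey; start() launches a server process hosting the registered objects, while a second instance with the same credentials can connect() and obtain proxies to them.

```python
from multiprocessing.managers import BaseManager

class StashManager(BaseManager):
    pass

# Registered callables run in the *server* process; clients receive
# proxies to the objects they return.
StashManager.register("get_dict", callable=dict)

if __name__ == "__main__":
    # Server side: port 0 lets the OS pick a free port.
    server = StashManager(("localhost", 0), b"some key")
    server.start()
    try:
        # Client side: same address and authkey, but connect() rather
        # than start(); method calls on the proxy are routed to the
        # server process.
        client = StashManager(server.address, b"some key")
        client.connect()
        shared = client.get_dict()
        shared.update({"answer": 42})
        print(shared.get("answer"))  # -> 42
    finally:
        server.shutdown()
```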
Thanks for the review, @jgraham and @gsnedders. I've attempted to address @jgraham's feedback by answering @gsnedders' questions via inline documentation. I've also reduced duplication by factoring out the multiprocessing target ([…])
This looks better, and I don't want to block it landing, but there is possibly scope for one more round of cleanup.
response_lock.release()
# Wait for both threads to complete and report their stateto the test |
typo: stateto
response_lock.acquire()
return threading.Lock()

SlowLock.register("get_dict", callable=lambda: {})
It still somewhat seems like you could write these tests as a parameterised test in which your parameters are get_dict and Lock, with one case having the parameters (lambda: {}, mutex_lock_request) and one having the parameters (mutex_get_dict, lambda: threading.Lock()). I also think the names with "mutex" in them make more sense than the generic handle_ names.
Or maybe blockable rather than mutex.
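A sketch of the parameterisation being discussed, without a test framework so it stands alone; the helper names follow the "blockable" suggestion and are illustrative. The real blockable helpers would coordinate with the test through shared locks, which is the complication raised in the reply.

```python
import threading

def blockable_get_dict():
    # Illustrative stand-in: the real helper would block on a lock
    # held by the test before returning, making "get_dict" the slow
    # operation in that scenario.
    return {}

def blockable_lock_request():
    # Illustrative stand-in for the slow "Lock" factory.
    return threading.Lock()

# The two scenarios as (get_dict, Lock) parameter pairs: in each,
# exactly one of the registered callables is the blockable one.
CASES = {
    "slow_lock": (lambda: {}, blockable_lock_request),
    "slow_dict": (blockable_get_dict, lambda: threading.Lock()),
}

def run_scenario(get_dict, get_lock):
    # In the real test these would be registered on a BaseManager:
    #   Manager.register("get_dict", callable=get_dict)
    #   Manager.register("Lock", callable=get_lock)
    # Here we simply exercise the factories.
    return get_dict(), get_lock()

for name, (get_dict, get_lock) in CASES.items():
    d, lock = run_scenario(get_dict, get_lock)
    assert isinstance(d, dict) and lock.acquire(blocking=False)
```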
Those functions can't be parameterized like that because they need to reference […]. We could also parameterize the locks, and maybe make some sort of […]. I agree that […].
I mean you could make the locks globals and set them at the start of the test and unset them in the cleanup. But I'm going to merge this PR now for the correctness fix, and if we want to clean up the tests more we can do that in a later PR.
Thanks!
This race condition was expressed during testing sessions where the
first test to use the Stash feature did so with multiple requests
made in parallel.
The tests are brittle, but given the conditions we need to model, brittleness seems unavoidable.
The server code hasn't changed recently, so it's a little surprising that a race condition should suddenly become an issue in Taskcluster and in the results-collection project. Reviewing the status for each commit in master shows that this intermittent error first surfaced on November 5. Guy Fawkes is the obvious suspect, but the trail's gone cold, so I turned to more technical explanations.

The most recent Chromedriver release is version 2.43, and that was published on October 16, ruling out a regression there. Chrome itself may have regressed, but I'm not set up to do any sort of bisecting, so I'm proceeding under the assumption that this is an issue in WPT.
A small incongruity between the Buildbot and Taskcluster setups is useful here. Each system experiences the issue on a different "chunk" of WPT, but they define the segments in different terms: Buildbot uses 20 "chunks" containing tests of all types (failing on number 3), while Taskcluster uses 15 "chunks" for the testharness.js tests (failing on number 13). We can narrow the set of suspect files to those that appear in both failing chunks, i.e. the intersection of the following two lists:
That returns 187 test files to consider. Feeding those files into git log shows that only one of them changed on that day: content-security-policy/script-src/eval-allowed-in-report-only-mode-and-sends-report.html.

That file doesn't appear to be doing anything invalid. It does make parallel requests to a Stash-enabled endpoint, though (via checkReport.sub.js). The race condition fixed by this patch would only be expressed when the first test to use Stash did so via two parallel requests; my guess is that's uncommon but became true on November 5.
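For illustration, the chunk-narrowing step described above can be sketched with Python sets; the file lists here are made up (apart from the test named above) and would come from each CI system's chunk manifest in practice.

```python
# Hypothetical stand-ins for the two CI systems' failing chunks.
buildbot_chunk_3 = {
    "content-security-policy/script-src/eval-allowed-in-report-only-mode-and-sends-report.html",
    "cookies/some-test.html",
}
taskcluster_chunk_13 = {
    "content-security-policy/script-src/eval-allowed-in-report-only-mode-and-sends-report.html",
    "dom/another-test.html",
}

# A suspect file must appear in BOTH failing chunks, so take the
# intersection of the two sets.
suspects = buildbot_chunk_3 & taskcluster_chunk_13
print(sorted(suspects))
```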