[wptrunner] Reject tests that almost time out in stability check #11570
Conversation
Looks okay, but I am not a Python expert. Maybe turn it on in warning mode only until timeout=long is supported?
I was hoping @gsnedders or someone could help me figure out how to check for …
The numbers in #9972 (comment) were from a very beefy workstation, and we don't yet know what typical execution times on Travis are. Can we start with a very high limit, like 25 seconds, that would probably catch no tests at all? In parallel, it would be great to start collecting the wptreport.json from all PRs, so that we can look at the distribution of execution times and compare it to execution times on other infrastructure.
Let's use 30s for now.
I'm not sure there's any easy way to get at that from here, because if I'm not mistaken we haven't got as far as loading the manifest at the point of running the stability entry point.
We almost want to count the longest duration separately for tests with normal timeout and tests with long timeout (instead of using a different threshold depending on what the slowest test is), because otherwise the longest-running test is very likely one with a long timeout, and we would miss tests with a normal timeout that are closer to timing out. As for the implementation, @gsnedders is right; it's very hard to consult the manifest here. I'm thinking perhaps wptrunner can add the timeout of the test as an additional field to its …
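(A purely illustrative sketch of the per-timeout-class idea above; NORMAL_TIMEOUT and the results iterable are assumptions, not part of the patch.)

# Sketch only: track the longest duration separately for normal-timeout
# and long-timeout tests, so the slowest long test does not hide normal
# tests that are close to their shorter limit.
NORMAL_TIMEOUT = 10  # assumed default timeout, in seconds
longest = {"normal": 0.0, "long": 0.0}
for result in results:  # 'results' is an assumed iterable of run results
    kind = "long" if result.timeout > NORMAL_TIMEOUT else "normal"
    longest[kind] = max(longest[kind], result.duration)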
I've tried to poke around a bit, but can't figure out how to do this. Any pointers?
@zcorpan Alright, this needs some plumbing; sit tight.
Caveats:
Thanks, I think I got it working (but it doesn't do anything with …).
Cool! I'm giving my r+, but please also wait for opinions from others.
@Hexcles can you merge this if there has been no review by June 29? Also, if there are review comments, can you address them, please?
I'd like to see what @jgraham has to say, but he's on holiday till after then, IIRC.
OK to wait for James.
I haven't reviewed this because I'm away, but as noted on the original issue, I'm pretty skeptical of failing Travis because of long-running tests. In particular, it's hostile to tests for features that aren't implemented in any browser we run in CI, which are more likely to time out. I think we should discuss some mechanism to warn about this kind of thing rather than hard-failing.
Indeed, "Failing by timeout" is sometimes the most natural way to write a test (for example, a fetch abort test). I was assuming that reviewers would be able to apply their judgement in these cases. |
Hmm, I didn't intend to fail the check for TIMEOUT.
So to clarify (based on my understanding):
And I think this is a reasonable approach to prevent flakiness with minimal cost to productivity.
Oh, if that's how this works then I agree that the approach makes sense :) Sorry for not twigging that the patch wasn't treating "timeout" and "nearly time out" the same (I still haven't reviewed the code, but based on the discussion I would be more inclined to log the actual test timeout, including any timeout multiplier, on …).
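As a rough sketch of that suggestion (logger, base_timeout, timeout_multiplier, and duration are assumed names, not the actual wptrunner API):

# Sketch: the effective timeout is the test's base timeout scaled by
# the harness-wide multiplier; logging both makes near-timeout runs
# easy to spot in the output.
effective_timeout = base_timeout * timeout_multiplier
logger.info("test took %.1fs of %.1fs allowed" % (duration, effective_timeout))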
if not file_result.extra:
    file_result.extra = { "test_timeout": test.timeout }
else:
    file_result.extra.test_timeout = test.timeout
Does this work? I would expect it to be file_result.extra["test_timeout"], unless extra isn't a dictionary for some reason.
You're probably right. @Hexcles?
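For reference, the dictionary-style access the review suggests would look something like this (assuming extra is a plain dict):

# Corrected sketch: always store the timeout under a string key rather
# than as an attribute on the dict.
if not file_result.extra:
    file_result.extra = {}
file_result.extra["test_timeout"] = test.timeout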
@jgraham @gsnedders I've pushed a few new commits to this PR. PTAL again. The PR now calculates the longest duration per test and checks whether it exceeds 80% of the timeout for each test individually (taking the timeout multiplier into account), instead of only checking the longest duration across all tests, since tests can have different timeouts. Besides, consistent timeouts are now allowed. @jgraham I didn't move the logic into … @zcorpan my apologies for the long delay; I was busy fixing some high-priority issues on wpt.fyi. Apparently it's been over a month and you've already come back; time does fly... Sigh.
To test:
Force-pushed from 69fd109 to 35cd861.
def is_slow(test):
    # We are interested in tests that almost time out (i.e. likely to be
    # flaky). Tests that consistently time out are not considered "slow".
    if test["status"].keys() == ['TIMEOUT']:
I think CRASH should also be excluded, because we can change the status from TIMEOUT to CRASH if we detect evidence of a crash after the test finished. In general I wonder if testing …

Also, it seems arguable that the handling of a test that sometimes times out and sometimes doesn't is wrong; if we get PASS, PASS, PASS, TIMEOUT, PASS then we'll call the test slow. But if the passes take 1s and the timeout is 10s then we are probably seeing a race condition that causes the test to hang sometimes, and the obvious solution to a slow test (add a longer timeout) isn't going to help. So I think we should either store the duration per result and filter here, or only update the duration when we see a non-TIMEOUT, non-CRASH result. The latter is easier, but the former could be used to provide more detailed logging if we want.
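A minimal sketch of the first option (run_results is an assumed iterable of (test id, status, duration) tuples, not the actual data structure):

from collections import defaultdict

# Keep the longest observed duration per (test, status), so TIMEOUT and
# CRASH runs can be filtered out when judging slowness.
longest_duration = defaultdict(lambda: defaultdict(float))
for test_id, status, duration in run_results:
    if duration > longest_duration[test_id][status]:
        longest_duration[test_id][status] = duration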
You're right about CRASH.

As for the second example (inconsistent results), it won't actually happen, as we call is_inconsistent before is_slow, so such tests will be considered flaky instead of slow. But I do see that the code can be confusing and error-prone, as the prerequisite isn't obvious. I'll tweak it a bit.
Force-pushed from c427051 to 18f3d3d.
@jgraham I think I addressed your comment this time. PTAL at … And you can see stability checks rejecting the example slow test in action: …
threshold = test["timeout"] * FLAKY_THRESHOLD
for status in ['PASS', 'FAIL', 'OK', 'ERROR', 'INTERNAL-ERROR']:
    if (status in test["longest_duration"] and
        test["longest_duration"][status] > threshold):
Weird indentation (yeah I know pyflakes likes it, but I think it's a bad rule which we don't enforce).
In the stability check (wpt run --verify), reject tests that almost time out, i.e. take more than 80% of the allowed timeout to run. These tests will be listed in a new section, "slow tests", in the output. Fixes #9972.
Previously we recorded the longest duration across all tests. However, each test can have a different timeout. Besides, test runs with different results should be considered differently; runs that time out should be ignored. Therefore, we now keep track of the longest duration for each result status of each test. Also add a unit test.
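Put together, the check described here might look roughly like the following sketch (consistent with the snippets above, but not the exact patch):

FLAKY_THRESHOLD = 0.8  # flag runs that use more than 80% of the timeout

def is_slow(test):
    # Consistent timeouts are allowed; only completed runs that came
    # close to the per-test limit count as slow.
    threshold = test["timeout"] * FLAKY_THRESHOLD
    return any(duration > threshold
               for status, duration in test["longest_duration"].items()
               if status not in ("TIMEOUT", "CRASH"))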
Force-pushed from 17ac857 to 8503633.
In #11570, we added a check to reject slow tests in stability checks. The code didn't consider an edge case where tests are skipped and don't contain "extra" info. This change silently skips a test in this check if it does not have necessary extra information.
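The guard for skipped tests could be as simple as the following sketch (field names follow the earlier snippets; 'tests' is an assumed iterable of result dicts):

for test in tests:
    extra = test.get("extra")
    if not extra or "test_timeout" not in extra:
        continue  # skipped tests carry no timing info; ignore them here
    # ... apply the slow-test check ...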
…ed, a=testonly. Automatic update from web-platform-tests: Fix stability check when tests are skipped. In web-platform-tests/wpt#11570, we added a check to reject slow tests in stability checks. The code didn't consider an edge case where tests are skipped and don't contain "extra" info. This change silently skips a test in this check if it does not have the necessary extra information. -- wpt-commits: eb145fc244f183d7c3aab7a93560330e666159e9, wpt-pr: 12884
Fixes #9972.
TODO: allow longer time for timeout=long tests.