Prevent slow tests from landing #9972
I looked a bit closer at the two examples. It seems like they haven't been changed for a while (they have always been slow). I don't know why they became slower and flaky only recently. Perhaps it's just a side effect of other unrelated Chromium or infra changes, or just bad luck. @foolip and I discussed this a bit. So here are my ideas for what we could do upstream to alleviate the problem for different failure modes:
For 2.b in particular, we can also use some heuristics on the Chromium side during import. If upstream doesn't want to block merging in this case, we can heuristically mark these slow tests as expected to time out (or even skip them) during Chromium import. WDYT?
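As an illustration, such an import-side heuristic could be a small post-import pass along these lines (a rough Python sketch; the 80% cutoff, the expectations syntax, and all names are assumptions rather than the actual Chromium import tooling):

```python
# Hypothetical post-import step: tests whose measured duration is close to
# their timeout budget get marked as expected to time out, so they don't
# flake on the Chromium bots. Paths and format are illustrative only.

SLOW_FRACTION = 0.8  # "close to timeout" cutoff, e.g. 80% of the budget


def expected_timeout_lines(durations, timeouts, bug="crbug.com/NNNNNN"):
    """durations/timeouts: dicts mapping a wpt test path -> seconds."""
    lines = []
    for test, duration in sorted(durations.items()):
        budget = timeouts.get(test)
        if budget and duration > SLOW_FRACTION * budget:
            lines.append("%s external/wpt%s [ Timeout ]" % (bug, test))
    return lines


if __name__ == "__main__":
    durations = {"/foo/slow-test.html": 9.2}
    timeouts = {"/foo/slow-test.html": 10.0}
    print("\n".join(expected_timeout_lines(durations, timeouts)))
```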
This isn't a Chromium-specific problem; we have had lots of cases where tests are intermittent for want of a […]. Generally the situation has improved since the Travis infra was set up, since, as noted, Travis is just as slow as browser infra, at least hardware-wise (we don't run debug builds on Travis, which would be slower; in Gecko infra we increase the timeout multiplier for these builds so they get a much longer timeout). I could accept rejecting a test that times out everywhere, although I'm not so worried about tests that are expected to time out forever, which should be rare, as about tests that time out because the feature isn't yet implemented in any of the Travis browsers, which seems like it could be quite common.
Yes, most of the timeout tests I've seen belong to this case. So you are saying it might be fine to reject these tests, @jgraham? (That means the test authors either have to rewrite the test or wait until at least one browser has implemented the feature before merging it.) And what do you think about 2.b in my last comment? E.g. if a test uses 80% of the timeout to finish (and consistently so), shall we give a warning or even reject it out of concern about future flakes?
Sorry, I mean I don't think we want to reject such tests. If they are the majority case we probably don't want to reject-on-timeout. I think that warning if the tests are over or close to the timeout is a good idea.
Some kind of advice or warning that depends on the time to run seems like it would be a good idea, so the first step would be to collect that information. Is there already a flag that causes the time to be written to the results of […]?
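For what it's worth, collecting that information could start from the wptreport output, which as far as I know records a per-test duration in milliseconds; a minimal sketch, assuming that field is present (flag and field names are assumptions):

```python
# Pull per-test run times out of a report produced with
# `wpt run --log-wptreport=report.json ...`. Assumes each result entry
# carries a "duration" field in milliseconds.
import json


def test_durations(report_path):
    with open(report_path) as f:
        report = json.load(f)
    return {entry["test"]: entry.get("duration", 0) / 1000.0  # seconds
            for entry in report.get("results", [])}


if __name__ == "__main__":
    durations = test_durations("report.json")
    # Print the 20 slowest tests.
    for test, secs in sorted(durations.items(), key=lambda kv: -kv[1])[:20]:
        print("%6.2fs  %s" % (secs, test))
```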
For tests that time out everywhere, I think that could be rolled into #7475. It's useful to highlight these cases as an FYI just like it is for new failing tests, but PR authors and reviewers have to judge whether that makes sense or not.
@ricea, what do you think the priority of fixing this is? Is it the most annoying thing about using WPT now, or is there anything else you'd rather see fixed? |
I'm adding the roadmap label so that we get back to it when planning the next bucket of infra work, but right now nobody is signed up to do it. |
@foolip For me personally, it's the most annoying thing, yes. |
|
OK, so we're currently not working on this, and it depends on a few other pieces falling in place before it makes sense to prioritize. Leaving it as priority:roadmap so that it will come up again in 60 days. |
#10026 is now addressed. What do we want here? Reject tests that take more than 5s to run (or 30s for `timeout=long` tests)?
@zcorpan I would prefer lower numbers, but it probably makes sense to start conservative and squeeze it lower as we get a better feel for the false negative vs. false positive rate. |
A histogram of current test times would be helpful. We don't want a limit that would prevent 10% of currently existing tests from being landed, unless perhaps that 10% is actually flaky. (Probably not.) |
I did a quick run on my workstation (which is unrealistically fast). My guess is that Travis will be 2 or 3 times slower. My graphing skills are regrettable. 90th percentile is 0.53s. I intentionally excluded the tests marked TIMEOUT in Blink.
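Given a test-to-duration mapping like the one collected above, a crude percentile and text-histogram summary is enough to pick a threshold without a plotting library; a minimal sketch (bucket width and percentile choices are arbitrary):

```python
# Print percentiles and a crude text histogram for per-test durations
# (in seconds). Bucket width and the list of percentiles are arbitrary.
def percentile(sorted_vals, p):
    idx = int(round(p / 100.0 * (len(sorted_vals) - 1)))
    return sorted_vals[idx]


def summarize(durations, bucket=0.5, max_secs=10.0):
    vals = sorted(durations.values())
    for p in (50, 90, 95, 99):
        print("p%02d: %.2fs" % (p, percentile(vals, p)))
    edge = 0.0
    while edge < max_secs:
        count = sum(1 for v in vals if edge <= v < edge + bucket)
        print("%4.1f-%4.1fs %s" % (edge, edge + bucket, "#" * count))
        edge += bucket


if __name__ == "__main__":
    summarize({"/a.html": 0.3, "/b.html": 0.6, "/c.html": 2.7, "/d.html": 6.1})
```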
Do you have a list of the tests that take 2.63s or more? |
Here's a slightly better histogram that does at least contain some details: |
@zcorpan Sure: https://gist.github.com/ricea/fe717612c3579400cd2055c94b52c0bf Disclaimer: This is from one run on one machine and not necessarily representative of typical behaviour.
Thanks! I can check the tests in that list and fix low-hanging fruit. We can also revisit when there's a way to filter by slow tests in wpt.fyi or so. |
Filed web-platform-tests/wpt.fyi#282 about exposing the information on wpt.fyi. |
@zcorpan I should have mentioned that the list of slow tests doesn't contain the tests marked TIMEOUT in Blink's test expectations. I can get you that list if you don't have it.
@ricea it sounds like you got the data using Blink run_web_tests? |
@Hexcles Yes. Specifically:
|
@ricea I have that list in https://bocoup.github.io/wpt-disabled-tests-report/#filter=%5C%5B%20timeout |
So to continue #9972 (comment), here's a proposal: In […]
I was thinking we'd just look at the average run time and if it's above some threshold, fail the job with a message saying just that. |
The trouble with forcing slow tests to be flaky is that it makes it harder to track down the problem. A clear message "I rejected this test because it was slow" will help people resolve the problem faster. |
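A minimal sketch of that idea, assuming the stability checker already has per-test durations from each of its repeated runs (the 5-second threshold is just the number floated earlier in this thread; all names are illustrative):

```python
# Fail the stability job with an explicit message when a test's average
# run time across the repeated runs exceeds a threshold. Illustrative only.
SLOW_THRESHOLD_SECS = 5.0


def check_average_durations(runs):
    """runs: list of dicts, each mapping test id -> duration in seconds."""
    rejected = []
    for test in runs[0]:
        samples = [run[test] for run in runs if test in run]
        average = sum(samples) / len(samples)
        if average > SLOW_THRESHOLD_SECS:
            rejected.append((test, average))
    for test, average in rejected:
        print("Rejected: %s averaged %.1fs per run; tests this slow tend to "
              "be flaky in CI." % (test, average))
    return not rejected  # True means the job passes
```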
Fixes #9972. TODO: allow longer time for timeout=long tests.
OK, I have implemented a check as part of the stability checker that will reject if the longest duration is >5s. I don't know how to check if a test has `timeout=long` set.
I have concerns about this. These are not performance tests, yet this issue is attempting to enforce a performance requirement. For example, what if a browser implementation makes a change that causes a large number of tests to run slower and exceed this threshold? No changes to the tests could then be made without triggering this blocking mode. Also, if a test runs fast enough in 3 browsers but is slow in 1 browser, does the slowest browser set the most constrained policy here? It seems more reasonable for potentially long-running tests to be annotated with the long timeout meta tag, and for implementors to adjust their local automation to account for that. I feel like I've seen test splitting happening where the meta tag could have been used instead.
Also, I would note that as a framework we should probably be encouraging folks to be writing more parallel async_tests instead of serialized promise_tests. Or provide a parallel promise_test method. It seems a lot of serialized tests have been added lately (at least to the components I work on). |
Another infrastructure change that could be made is a meta tag annotation that indicates that the individual tests in the file can be run independently. The harness could then support running these test cases individually or in smaller chunks. This could be controlled by flags passed to the runner or, if we want to get fancy, by automatic time measurements. I think something like this would scale a lot more easily than requiring all test writers to think about the execution time of every possible browser implementation. We want to make tests easy for people to write.
I have a strong preference for promise_tests precisely because they are serialized. It's very hard to write tests that don't have unexpected interactions when interleaved. It's very hard to debug async_tests when they do have unexpected interactions. If a set of tests would be faster when run interleaved that implies they are not CPU bound on the main thread. In some cases a test is capable of taking advantage of multiple cores or has legitimate reasons why it's not CPU bound. However, in most cases I've seen the test just contains delays. Replacing the delays with explicit synchronisation is always the best way to make such tests faster and more robust. |
Hmm. The desire for serialized tests and the desire for short tests seem at odds with one another. I have some frustration when this leads to a lot of churn splitting tests that could easily have been written to run in parallel with async_test, or could have used the meta tag to annotate a long test. As far as debugging goes, I typically comment out all unrelated test cases, promise or async, to reduce clutter in my session. Of course, our personal debugging preferences should probably not drive policy for all the browser vendors.
This has not been my experience in service workers. I've seen a number of service worker tests changed recently by the Chromium team because registering/unregistering service workers seems to take longer there than in other browsers. (This is fine and not a criticism; registration management is not really a hot function in the wild.) Many of these tests have been changed to move registration into a single promise_test up front and then leave the worker registered across all tests. I think this is undesirable because it increases coupling: the tests now share state and are no longer independent. I think this is strictly worse, but I haven't really objected because it's just been a handful of tests so far. But if this is going to be widespread policy, I feel I have to say something. Overall I think a performance requirement on tests is going to make it much harder for test writers:
|
The goal is detecting tests that are near their timeout, since they are likely to produce unstable results (sometimes time out) in CI. When the check fails, a reasonable fix is to specify timeout=long. |
If the goal is to make sure the timeout=long meta tag is used appropriately, that sounds reasonable. From the discussion of splitting tests in #11571 it seemed like this may not have been the case. I also now see James suggested this should be a warning and not a hard failure. I think that addresses a lot of my concerns with breaking workflow when the browsers change performance out from under the test. I thought we were talking about a hard failure. Sorry for getting my soapbox out. |
It's worth noting that there's already a hard performance requirement on tests, in that they have to complete in some finite amount of time. There's no way to get away from that. However, tests that run too close to the timeout end up being flaky in practice. Flaky tests kill productivity. I've seen many tests that were slow simply by accident, and if it had been caught when those tests were added it could have been fixed easily. In cases where a test is legitimately slow, adding timeout=long is the best fix, and it will save a lot of people a lot of work if it's there when the test is first introduced. |
In stability check (wpt run --verify), reject tests that almost time out, i.e. take more than 80% of the allowed timeout to run. These tests will be listed in a new section, "slow tests", in the output. Fixes #9972.
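A rough sketch of that check (the 80% factor and the "slow tests" section come from the description above; the timeout lookup, the budget values, and every name are illustrative, not the actual stability-checker code):

```python
# Flag tests whose longest observed duration exceeds 80% of their timeout
# budget (assumed here to be roughly 10s for normal tests and 60s for
# timeout=long tests).
SLOW_FRACTION = 0.8


def find_slow_tests(durations, timeouts):
    """durations: test id -> list of observed durations (s) across runs.
       timeouts:  test id -> timeout budget in seconds."""
    slow = []
    for test, samples in durations.items():
        budget = timeouts.get(test, 10.0)
        longest = max(samples)
        if longest > SLOW_FRACTION * budget:
            slow.append((test, longest, budget))
    return slow


def report_slow_tests(slow):
    if not slow:
        return True
    print("## Slow tests ##")
    for test, longest, budget in sorted(slow):
        print("  %s: longest run %.1fs of a %.0fs budget" % (test, longest, budget))
    return False  # reject the change
```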
Chromium developers are spending a lot of time triaging test flakiness caused by tests that are simply too slow. I just triaged https://bugs.chromium.org/p/chromium/issues/detail?id=820334 and https://bugs.chromium.org/p/chromium/issues/detail?id=815286 back-to-back.
As a rule of thumb, a test can take 10 times longer on a bot than it takes to run locally.
It would be nice if the CI check rejected new tests that do pass, but take too long to do so. Sometimes the only reliable way to detect failure is via a timeout, so I'm not proposing outlawing tests that time out.