Prevent slow tests from landing #9972

Closed
ricea opened this issue Mar 12, 2018 · 37 comments

Comments

ricea (Contributor) commented Mar 12, 2018

Chromium developers are spending a lot of time triaging test flakiness caused by tests that are simply too slow. I just triaged https://bugs.chromium.org/p/chromium/issues/detail?id=820334 and https://bugs.chromium.org/p/chromium/issues/detail?id=815286 back-to-back.

As a rule of thumb, a test can take 10 times longer on a bot than it takes to run locally.

It would be nice if the CI check rejected new tests that do pass, but take too long to do so. Sometimes the only reliable way to detect failure is via a timeout, so I'm not proposing outlawing tests that time out.

Hexcles added the infra label Mar 12, 2018
Hexcles (Member) commented Mar 12, 2018

If a test takes like 5 seconds on Chromium's testing bots, I suspect the test would take much longer on Travis CI and hence be very flaky (timeouts), so if we fix #9932 (block PR merging on Chrome stability checks) then we should see some immediate improvements. cc @foolip

Hexcles (Member) commented Mar 12, 2018

I looked a bit closer at the two examples. It seems like they haven't been changed for a while (they have always been slow). I don't know why they became slower and flaky only recently; perhaps it's just a side effect of other unrelated Chromium or infra changes, or just bad luck.

@foolip and I discussed this a bit. Here are my ideas of what we could do upstream to alleviate the problem for different failure modes:

  1. If a test is already flaky on Travis CI: this is the easiest one. We never want to merge flaky tests at all, and we should fix the Chrome stability check on Travis.
  2. If a test isn't flaky on Travis, and it has
    a. Consistent timeouts: perhaps we should just reject a test that times out everywhere on Travis? Though, as @ricea mentions, there might be rare but legitimate use cases for timeouts.
    b. Consistently slow but non-timeout results: we can give a warning (or even block the merge) if a test runs so slowly that its execution time is close to the timeout. The rationale is that the risk of flakiness is very high in such cases.

For 2.b in particular, we can also use some heuristics on the Chromium side during import. If upstream doesn't want to block merging in this case, we can heuristically mark these slow tests as expected to time out (or even skip them) during Chromium import. WDYT?
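A minimal sketch of the 2.b heuristic, assuming per-test runtimes are already available; the 10s/60s budgets, the 80% threshold, and the function names are illustrative assumptions, not the check that eventually landed.

```python
# Illustrative only: flag tests whose measured runtime is close to their
# timeout budget.

# Harness timeout budgets in seconds (WPT convention: 10s by default,
# 60s when a test declares timeout=long).
TIMEOUTS = {"normal": 10.0, "long": 60.0}
SLOW_RATIO = 0.8  # warn once a test uses more than 80% of its budget


def find_slow_tests(runtimes, timeout_types):
    """Yield (test, runtime, budget) for tests dangerously close to timeout.

    runtimes: dict of test id -> measured runtime in seconds
    timeout_types: dict of test id -> "normal" or "long"
    """
    for test, runtime in sorted(runtimes.items()):
        budget = TIMEOUTS[timeout_types.get(test, "normal")]
        if runtime > SLOW_RATIO * budget:
            yield test, runtime, budget


if __name__ == "__main__":
    runtimes = {"/streams/piping/very-slow.html": 8.7,
                "/fetch/api/basic.html": 1.2}
    for test, runtime, budget in find_slow_tests(runtimes, {}):
        print("SLOW: %s took %.1fs of a %.0fs budget" % (test, runtime, budget))
```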

jgraham (Contributor) commented Mar 12, 2018

This isn't a Chromium-specific problem; we have had lots of cases where tests are intermittent for want of a <meta name=timeout content=long> in Gecko infrastructure. Probably a large number of such tests just got disabled.

Generally the situation has improved since the Travis infra was set up, since, as noted, Travis is just as slow as browser vendors' infra, at least hardware-wise (we don't run debug builds on Travis, which would be slower; in Gecko infra we increase the timeout multiplier for those builds so they get a much longer timeout).

I could accept rejecting a test that times out everywhere, although I'm not so worried about tests that are expected to time out forever, which should be rare, as for tests that time out because the feature isn't yet implemented in any of the travis browsers, which seems like it could be quite common.

Hexcles (Member) commented Mar 12, 2018

as for tests that time out because the feature isn't yet implemented in any of the travis browsers, which seems like it could be quite common.

Yes, most of the timeout tests I've seen belong to this case. So you are saying it might be fine to reject these tests, @jgraham? (That means the test authors would either have to rewrite the test or wait until at least one browser has implemented the feature before merging it.)

And what do you think about 2.b in my last comment? E.g. if a test consistently uses 80% of the timeout to finish, shall we give a warning or even reject it out of concern about future flakiness?

jgraham (Contributor) commented Mar 12, 2018

Sorry, I mean I don't think we want to reject such tests. If they are the majority case we probably don't want to reject-on-timeout.

I think that warning if the tests are over or close to the timeout is a good idea.

foolip (Member) commented Mar 13, 2018

Some kind of advice or warning that depends on the time to run seems like it would be a good idea, so the first step would be to collect that information. Is there already a flag that causes the time to be written to the results of --log-wptreport?

foolip (Member) commented Mar 13, 2018

For tests that time out everywhere, I think that could be rolled into #7475. It's useful to highlight these cases as an FYI, just like it is for new failing tests, but PR authors and reviewers have to judge whether that makes sense or not.

foolip (Member) commented Mar 13, 2018

@ricea, what do you think the priority of fixing this is? Is it the most annoying thing about using WPT now, or is there anything else you'd rather see fixed?

foolip (Member) commented Mar 13, 2018

I'm adding the roadmap label so that we get back to it when planning the next bucket of infra work, but right now nobody is signed up to do it.

ricea (Contributor, Author) commented Mar 13, 2018

@foolip For me personally, it's the most annoying thing, yes.

jgraham (Contributor) commented Mar 13, 2018

--log-wptreport doesn't include the runtime; we could extend the format to include that. Alternatively, the raw report does include the runtime, as the time difference between the test start and test end actions.
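As a rough sketch of that approach: assuming a raw structured log (e.g. from wpt run --log-raw) in which each line is a JSON object with "action", "test" and a millisecond "time" field, per-test durations can be recovered like this. Worth double-checking against an actual log before relying on it.

```python
# Sketch: derive per-test runtimes from the test_start / test_end actions
# in a raw structured log. Field names are assumed from mozlog's raw
# formatter; verify against a real log.
import json


def test_durations(raw_log_path):
    starts = {}
    durations = {}
    with open(raw_log_path) as f:
        for line in f:
            entry = json.loads(line)
            action = entry.get("action")
            if action == "test_start":
                starts[entry["test"]] = entry["time"]
            elif action == "test_end":
                start = starts.pop(entry["test"], None)
                if start is not None:
                    # "time" is in milliseconds; convert to seconds.
                    durations[entry["test"]] = (entry["time"] - start) / 1000.0
    return durations


if __name__ == "__main__":
    slowest = sorted(test_durations("raw.log").items(), key=lambda kv: -kv[1])
    for test, seconds in slowest[:20]:
        print("%7.2fs  %s" % (seconds, test))
```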

foolip (Member) commented May 17, 2018

OK, so we're currently not working on this, and it depends on a few other pieces falling in place before it makes sense to prioritize. Leaving it as priority:roadmap so that it will come up again in 60 days.

zcorpan (Member) commented Jun 16, 2018

#10026 is now addressed.

What do we want here? Reject tests that take more than 5s to run (or 30s for timeout=long)?

ricea (Contributor, Author) commented Jun 18, 2018

@zcorpan I would prefer lower numbers, but it probably makes sense to start conservative and squeeze it lower as we get a better feel for the false negative vs. false positive rate.

foolip (Member) commented Jun 18, 2018

A histogram of current test times would be helpful. We don't want a limit that would prevent 10% of currently existing tests from being landed, unless perhaps that 10% is actually flaky. (Probably not.)

ricea (Contributor, Author) commented Jun 18, 2018

I did a quick run on my workstation (which is unrealistically fast). My guess is that Travis will be 2 or 3 times slower. My graphing skills are regrettable:

https://docs.google.com/spreadsheets/d/e/2PACX-1vQAPeDKvbe9VRVs5NLKpDsp-x_Q4oN1_9uadUt5YzcjZ8la3ietqNzSO6YgOmDCUa8wwOAuHbabFL9j/pubchart?oid=579431732&format=interactive

90th percentile is 0.53s
95th percentile is 0.85s
99th percentile is 2.63s
99.9th percentile is 11.05s

I intentionally excluded the tests marked TIMEOUT in Blink.
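Figures like these can be regenerated from any per-test timing dump. A small sketch, assuming a made-up "<seconds> <test path>" input format rather than the actual output of run_web_tests --timing:

```python
# Sketch: nearest-rank percentiles over a timing dump. The input format
# ("<seconds> <test path>" per line) is assumed for illustration.
import math
import sys


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending-sorted, non-empty list."""
    rank = max(1, int(math.ceil(p / 100.0 * len(sorted_values))))
    return sorted_values[rank - 1]


def main(path):
    with open(path) as f:
        timings = sorted(float(line.split()[0]) for line in f if line.strip())
    for p in (90, 95, 99, 99.9):
        print("%5.1fth percentile: %.2fs" % (p, percentile(timings, p)))


if __name__ == "__main__":
    main(sys.argv[1])
```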

zcorpan (Member) commented Jun 18, 2018

Do you have a list of the tests that take 2.63s or more?

ricea (Contributor, Author) commented Jun 18, 2018

@zcorpan Sure: https://gist.github.com/ricea/fe717612c3579400cd2055c94b52c0bf

Disclaimer: This is from one run on one machine and not necessarily representative of typical behaviour.

zcorpan (Member) commented Jun 18, 2018

Thanks! I can check the tests in that list and fix low-hanging fruit. We can also revisit when there's a way to filter by slow tests in wpt.fyi or so.

foolip (Member) commented Jun 18, 2018

Filed web-platform-tests/wpt.fyi#282 about exposing the information on wpt.fyi.

ricea (Contributor, Author) commented Jun 18, 2018

@zcorpan I should have mentioned that the list of slow tests doesn't contain the tests marked TIMEOUT in Blink's test expectations. I can get you that list if you don't have it.

Hexcles (Member) commented Jun 18, 2018

@ricea it sounds like you got the data using Blink run_web_tests?

ricea (Contributor, Author) commented Jun 18, 2018

@Hexcles Yes. Specifically:

run_web_tests external/wpt --no-ret --timing --skip-timeouts --verbose

zcorpan (Member) commented Jun 18, 2018

So to continue #9972 (comment) here's a proposal:

In wpt run --verify (which the stability checker uses), in half of the runs we change the harness timeout to 5s, or 30s for timeout=long. This way it will appear as unstable if it takes 5-10s (or 30-60s).
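A sketch of roughly that idea driven from the command line, using a 0.5 timeout multiplier for the reduced-budget run; the flags, report handling, and comparison logic here are assumptions for illustration, not how wpt run --verify is actually implemented:

```python
# Sketch: run the affected tests once at the normal timeout and once at a
# 0.5 timeout multiplier (i.e. roughly 5s / 30s budgets), then report
# tests that only time out under the reduced budget.
import json
import subprocess


def run_and_collect(tests, multiplier, report_path):
    # Exit status is ignored here; we only care about the wptreport output.
    subprocess.call(
        ["./wpt", "run", "--timeout-multiplier", str(multiplier),
         "--log-wptreport", report_path, "firefox"] + tests)
    with open(report_path) as f:
        report = json.load(f)
    return {result["test"]: result["status"] for result in report["results"]}


def timeout_prone(tests):
    normal = run_and_collect(tests, 1.0, "normal-report.json")
    reduced = run_and_collect(tests, 0.5, "reduced-report.json")
    return [t for t in tests
            if normal.get(t) != "TIMEOUT" and reduced.get(t) == "TIMEOUT"]
```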

foolip (Member) commented Jun 19, 2018

I was thinking we'd just look at the average run time and if it's above some threshold, fail the job with a message saying just that.

ricea (Contributor, Author) commented Jun 19, 2018

The trouble with forcing slow tests to be flaky is that it makes it harder to track down the problem. A clear message "I rejected this test because it was slow" will help people resolve the problem faster.

zcorpan added a commit that referenced this issue Jun 19, 2018
Fixes #9972.

TODO: allow longer time for timeout=long tests.
zcorpan (Member) commented Jun 19, 2018

OK, I have implemented a check as part of the stability checker that will reject the PR if the longest duration is >5s. I don't know how to check whether timeout=long is specified.
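One possible way to detect the annotation is to inspect the test source itself (the wpt manifest also records each test's timeout). A rough sketch covering the HTML meta tag and the // META: timeout=long header used by .any.js / .worker.js tests; the function and patterns are illustrative, not the check that was eventually added:

```python
# Sketch: detect a timeout=long annotation in a test file.
import re

META_TAG = re.compile(
    r'<meta\s+name=["\']?timeout["\']?\s+content=["\']?long["\']?', re.I)
JS_META = re.compile(r'^//\s*META:\s*timeout=long\s*$', re.M)


def has_long_timeout(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        source = f.read()
    if path.endswith(".js"):
        return bool(JS_META.search(source))
    return bool(META_TAG.search(source))


if __name__ == "__main__":
    import sys
    for test_path in sys.argv[1:]:
        label = "timeout=long" if has_long_timeout(test_path) else "default timeout"
        print(test_path, label)
```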

wanderview (Member) commented Jun 21, 2018

I have concerns about this. These are not performance tests, yet this issue is attempting to enforce a performance requirement.

For example, what if a browser implementation makes a change that causes a large number of tests to run slower and exceed this threshold? No changes to those tests could then be made without triggering this blocking mode.

Also, if a test runs fast enough in 3 browsers but is slow in 1 browser, does the slowest browser set the most constrained policy here?

It seems more reasonable to use the long-timeout meta tag to annotate potentially long-running tests and for implementors to adjust their local automation to account for that. I feel like I've seen test splitting happening where the meta tag could have been used instead.

wanderview (Member) commented:

Also, I would note that, as a framework, we should probably be encouraging folks to write more parallel async_tests instead of serialized promise_tests, or provide a parallel promise_test method. It seems a lot of serialized tests have been added lately (at least to the components I work on).

wanderview (Member) commented:

Another infrastructure change that could be made is a meta tag annotation that indicates that individual tests in the file can be run independently. The harness could then support running these test cases individually or in smaller chunks. This could be controlled by flags passed to the runner or, if we want to get fancy, automatic time measurements.

I think something like this would scale a lot more easily than requiring all test writers to think about execution time of every possible browser implementation. We want to make tests easy for people to write.

ricea (Contributor, Author) commented Jun 21, 2018

Also, I would note that as a framework we should probably be encouraging folks to be writing more parallel async_tests instead of serialized promise_tests.

I have a strong preference for promise_tests precisely because they are serialized. It's very hard to write tests that don't have unexpected interactions when interleaved. It's very hard to debug async_tests when they do have unexpected interactions.

If a set of tests would be faster when run interleaved that implies they are not CPU bound on the main thread. In some cases a test is capable of taking advantage of multiple cores or has legitimate reasons why it's not CPU bound. However, in most cases I've seen the test just contains delays. Replacing the delays with explicit synchronisation is always the best way to make such tests faster and more robust.

wanderview (Member) commented Jun 21, 2018

I have a strong preference for promise_tests precisely because they are serialized.

Hmm. The desire for serialized tests and the desire for short tests seem at odds with one another. I have some frustration when this leads to a lot of churn splitting tests that could easily have been written to run in parallel with async_test, or could have used the meta tag to annotate a long test.

As far as debugging goes I typically comment out all unrelated test cases, promise or async, to reduce clutter in my session. Of course, our personal debugging preferences should probably not drive policy for all the browser vendors.

However, in most cases I've seen the test just contains delays.

This has not been my experience in service workers. I've seen a number of service worker tests changed recently by the Chromium team because registering/unregistering service workers seems to take longer there than in other browsers. (This is fine and not a criticism. Registration management is not really a hot function in the wild.)

Many of these tests have been changed to move registration to a single promise_test up front and then leave the worker registered across all the tests. I think this is undesirable because it increases coupling between tests: they now share state and are no longer independent. I think this is strictly worse, but I haven't really objected because it's just been a handful of tests so far. But if this is going to be widespread policy, I feel I have to say something.

Overall, I think a performance requirement on tests is going to make things much harder for test writers:

  • You want to add one more case to an existing test, but it pushes you over the threshold (in one other browser)? Guess you just signed up to rewrite the whole test.
  • Some implementation got slower since the test landed and you need to change something in the test? Guess you are the lucky person to deal with that and rewrite the test to be fast enough in that slower browser.
  • Writing a test in one browser and sync'ing to the others is going to get a lot harder. We spec behavior, but generally not performance.

zcorpan (Member) commented Jun 21, 2018

The goal is detecting tests that are near their timeout, since they are likely to produce unstable results (sometimes time out) in CI. When the check fails, a reasonable fix is to specify timeout=long.

wanderview (Member) commented:

The goal is detecting tests that are near their timeout, since they are likely to produce unstable results (sometimes time out) in CI. When the check fails, a reasonable fix is to specify timeout=long.

If the goal is to make sure the timeout=long meta tag is used appropriately, that sounds reasonable. From the discussion of splitting tests in #11571 it seemed like this may not have been the case.

I also now see James suggested this should be a warning and not a hard failure. I think that addresses a lot of my concerns with breaking workflow when the browsers change performance out from under the test. I thought we were talking about a hard failure.

Sorry for getting my soapbox out.

ricea (Contributor, Author) commented Jun 22, 2018

It's worth noting that there's already a hard performance requirement on tests, in that they have to complete in some finite amount of time. There's no way to get away from that. However, tests that run too close to the timeout end up being flaky in practice. Flaky tests kill productivity. I've seen many tests that were slow simply by accident, and if that had been caught when those tests were added, it could have been fixed easily.

In cases where a test is legitimately slow, adding timeout=long is the best fix, and it will save a lot of people a lot of work if it's there when the test is first introduced.

Hexcles pushed a commit that referenced this issue Aug 30, 2018
In stability check (wpt run --verify), reject tests that almost time out, i.e. take more than 80% of the allowed timeout to run. These tests will be listed in a new section, "slow tests", in the output.

Fixes #9972.
Hexcles pushed a commit that referenced this issue Sep 5, 2018
Hexcles pushed a commit that referenced this issue Sep 5, 2018