Travis jobs for slow tests will time out, blocking merge #7073

foolip · 2017-08-31T10:31:28Z

In #7006 it looks like the changed tests taken together were too slow to finish running 10 times in the stability checker, and so all stability checker jobs failed.

The Firefox job ran for 49 minutes before:

The job exceeded the maximum time limit for jobs, and has been terminated.

The Chrome job ran for 19 minutes before:

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

It would probably ultimately fail in the same way as Firefox, but could a periodic watchdog "echo still running" allow it to run for longer?
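For illustration, a minimal sketch of such a watchdog in Python, assuming the stability checker is driven from a Python process. The `KeepAlive` class, its parameters, and the 5 minute default interval are hypothetical and not part of the existing tooling; the only fixed point is that the interval has to stay under the 10 minute no-output limit quoted above.

```python
import sys
import threading
import time


class KeepAlive:
    """Print a short message at a fixed interval until the block exits."""

    def __init__(self, interval_seconds=300.0, stream=None,
                 message="still running"):
        # interval_seconds is an assumed default, chosen to be well below
        # the 10 minute no-output limit.
        self.interval_seconds = interval_seconds
        self.stream = stream if stream is not None else sys.stdout
        self.message = message
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once the event is set,
        # so the loop ends promptly when __exit__ is called.
        while not self._stop.wait(self.interval_seconds):
            self.stream.write(self.message + "\n")
            self.stream.flush()

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        return False  # do not swallow exceptions from the wrapped block
```

Wrapping the long-running step in `with KeepAlive():` would keep output flowing while it runs. This only addresses the no-output termination; the overall job time limit would still apply.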

@bobholt

foolip · 2017-08-31T10:38:49Z

https://docs.travis-ci.com/user/customizing-the-build#Build-Timeouts documents the 50 minute timeout. Perhaps this limit could be raised by paying Travis, but that'd still mean that PR results take a very long time to come in, which is a problem in itself.

Idea: First run the tests exactly once and make note of how long that took. Only run the tests again if there's enough time. After each run, use the longest of the previous runs to estimate if it's going to be possible to run again. And add some margin of safety, so that we don't start an expected 8 minute run at the 40 minute mark. @bobholt @jgraham, WDYT?

We will still probably have cases where running all affected tests even once isn't possible in 50 minutes. Let's cross that bridge when we get there, however.
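The run-once-then-estimate idea above could look roughly like the following Python sketch. Everything here is an assumption for illustration: `run_within_budget`, `run_once`, and the `margin` factor of 1.5 are hypothetical names and values, and in practice the budget would need to be lower than 50 minutes to account for setup time already spent in the job.

```python
import time


def run_within_budget(run_once, max_runs=10, budget_seconds=50 * 60,
                      margin=1.5, clock=time.monotonic):
    """Call run_once up to max_runs times while the time budget allows.

    The first run always happens. Before each later run, the longest
    previous run (scaled by a safety margin) is compared with the time
    remaining. Returns the list of run durations in seconds.
    """
    start = clock()
    durations = []
    while len(durations) < max_runs:
        if durations:
            # Pessimistic estimate: longest run so far plus a safety margin.
            estimate = max(durations) * margin
            remaining = budget_seconds - (clock() - start)
            if estimate > remaining:
                break
        run_start = clock()
        run_once()
        durations.append(clock() - run_start)
    return durations
```

With these assumed values, an 8 minute run would not be started at the 40 minute mark: the estimate is 12 minutes and only 10 remain. The caller can compare the number of returned durations with `max_runs` to report that the stability check was only partial.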

mfalken · 2017-09-01T01:58:08Z

Regarding #7006, I just noticed Firefox might have some configuration for disabling the test(s) that were split up for taking too long:
https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/meta/service-workers/service-worker/registration.https.html.ini
https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/meta/service-workers/service-worker/register-link-element.https.html.ini

Notably, https://bugzilla.mozilla.org/show_bug.cgi?id=1351890 is disabling registration.https.html on Windows due to timeouts. #7006 split that test into multiple files, since it was also taking too long to run on Chrome. But Mozilla will probably need to update their configuration to match the new test file names. @wanderview

jgraham · 2017-09-04T09:50:02Z

The Chrome bug where there's no output is, I think, a problem with Chrome? I suppose I hadn't considered the possibility that it's just not outputting anything while the tests are running; maybe Gecko creates enough log spew that that doesn't happen.

We could of course arrange things so that we only run the tests as many times as we can fit into a 50 minute timeout, but then we get the issue that if you update a lot of tests, you end up bypassing the stability check because they only run once. That doesn't seem ideal for test stability.

foolip · 2017-09-04T09:52:59Z

Is there any way around that, short of maintaining non-Travis infrastructure that is always able to run the full test suite 10 times in 50 minutes?

jgraham · 2017-09-04T10:01:57Z

Well, with different infrastructure we could run for longer (e.g. if we hooked up to Mozilla's Taskcluster infrastructure, which I believe is possible [1], the default timeout is 3 hours). Of course there has to be some limit, and it's going to be pretty annoying if PRs take multiple hours per push to test. In theory one could parallelise, but that could be difficult given the limitations of Travis. Multiple parallel runs on a single machine might be possible, but that itself could cause intermittency, and if we are resource limited it might not give any speedup.

I don't have a great suggestion here.

[1] https://docs.taskcluster.net/manual/using/github

jgraham · 2017-09-04T10:02:34Z

Er, sorry 1 hour on TC, but it's configurable.

foolip · 2017-10-12T08:33:17Z

@lukebjerring, FYI, I had some ideas in #7073 (comment) and see also #7660 for a similar problem with changing many tests.

foolip · 2018-04-03T08:06:34Z

Closing this in favor of #7660, since it comes down to the same thing: whether there are many fast tests or fewer slow tests, it's not always possible to finish running in 50 minutes.

foolip added the infra label Aug 31, 2017

foolip mentioned this issue Aug 31, 2017

service worker WPT tests: split very big registration tests into multiple files #7006

Merged

bobholt added the priority:roadmap label Oct 4, 2017

foolip mentioned this issue Oct 10, 2017

When many tests are affected, CI stability jobs will time out #7660

Closed

foolip mentioned this issue Oct 18, 2017

Test every commit of web-platform-tests within 1 hour web-platform-tests/results-collection#164

Closed

foolip added priority:backlog and removed priority:roadmap labels Nov 17, 2017

foolip mentioned this issue Nov 27, 2017

Move /css/css-block/ to css/css-display/run-in/ #8437

Merged

foolip closed this as completed Apr 3, 2018
