Travis jobs for slow tests will time out, blocking merge #7073

Closed
foolip opened this issue Aug 31, 2017 · 8 comments

Comments

@foolip
Member

foolip commented Aug 31, 2017

In #7006 it looks like the changed tests taken together were too slow to finish running 10 times in the stability checker, and so all stability checker jobs failed.

The Firefox job ran for 49 minutes before:

The job exceeded the maximum time limit for jobs, and has been terminated.

The Chrome job ran for 19 minutes before:

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

It would probably ultimately fail in the same way as Firefox, but could a periodic watchdog "echo still running" allow it to run for longer?
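
A minimal sketch of that watchdog idea, assuming the stability checker can be wrapped in a Python callable. The name `run_with_heartbeat` and its parameters are hypothetical, not part of the existing tooling. A background thread prints a line at a fixed interval so that the log never stays silent for the 10 minutes that triggers the "no output" termination; it does nothing about the overall job time limit.

```python
import threading
import time


def run_with_heartbeat(task, interval_s=60.0, emit=print):
    """Run task() while a background thread emits a heartbeat line.

    task, interval_s and emit are illustrative names (assumptions).
    """
    done = threading.Event()

    def beat():
        # Event.wait() returns False on timeout and True once the event is set,
        # so this loop emits one line per interval until the task finishes.
        while not done.wait(interval_s):
            emit("still running")

    watchdog = threading.Thread(target=beat, daemon=True)
    watchdog.start()
    try:
        return task()
    finally:
        # Stop the heartbeat even if the task raises.
        done.set()
        watchdog.join()
```

Something like `run_with_heartbeat(run_stability_check, interval_s=300)` would keep output flowing every five minutes, where `run_stability_check` stands in for whatever actually runs the tests.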

@bobholt

@foolip foolip added the infra label Aug 31, 2017
@foolip
Member Author

foolip commented Aug 31, 2017

https://docs.travis-ci.com/user/customizing-the-build#Build-Timeouts documents the 50 minute timeout. Perhaps this limit could be raised by paying Travis, but that'd still mean that PR results take a very long time to come in, which is a problem in itself.

Idea: First run the tests exactly once and make note of how long that took. Only run the tests again if there's enough time. After each run, use the longest of the previous runs to estimate if it's going to be possible to run again. And add some margin of safety, so that we don't start an expected 8 minute run at the 40 minute mark. @bobholt @jgraham, WDYT?
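
A rough sketch of that idea, assuming the test run is available as a callable. The function name, `safety_factor`, and the injectable `clock` are hypothetical and only there to make the logic concrete and testable; the 10-run cap and the 50 minute budget come from this thread.

```python
import time


def run_while_time_allows(run_once, budget_s, max_runs=10,
                          safety_factor=1.5, clock=time.monotonic):
    """Repeat run_once() up to max_runs times without exceeding budget_s.

    The first run always happens. Each later run starts only if the longest
    run so far, scaled by safety_factor, still fits in the remaining budget.
    Returns the number of completed runs.
    """
    start = clock()
    longest = 0.0
    completed = 0
    while completed < max_runs:
        if completed > 0:
            elapsed = clock() - start
            # Estimate the next run from the slowest previous run plus margin.
            if elapsed + longest * safety_factor > budget_s:
                break
        run_start = clock()
        run_once()
        longest = max(longest, clock() - run_start)
        completed += 1
    return completed
```

With a 50 minute budget and 8 minute runs, this stops after five runs rather than starting a sixth at the 40 minute mark. The return value lets the caller report how many repetitions were actually achieved.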

We will still probably have cases where running all affected tests even once isn't possible in 50 minutes. Let's cross that bridge when we get there, however.

@mfalken
Member

mfalken commented Sep 1, 2017

Regarding #7006, I just noticed Firefox might have some configuration for disabling the test(s) that were split up for taking too long:
https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/meta/service-workers/service-worker/registration.https.html.ini
https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/meta/service-workers/service-worker/register-link-element.https.html.ini

Notably, https://bugzilla.mozilla.org/show_bug.cgi?id=1351890 is disabling registration.https.html on Windows due to timeouts. #7006 split that test into multiple files, since on Chrome it was also taking too long to run. But Mozilla will probably need to update their configuration to match the new test file names. @wanderview

@jgraham
Contributor

jgraham commented Sep 4, 2017

The Chrome bug where there's no output is, I think, a problem with Chrome? I suppose I hadn't considered the possibility that it's just not outputting anything while the tests are running; maybe Gecko creates enough log spew that that doesn't happen.

We could of course arrange things so that we only run the tests as many times as we can fit into a 50 minute timeout, but we get the issue that if you update a lot of tests you end up bypassing the stability check because they are only running once. That doesn't seem ideal for test stability.

@foolip
Member Author

foolip commented Sep 4, 2017

Is there any way around that, short of maintaining non-Travis infrastructure that is always able to run the full test suite 10 times in 50 minutes?

@jgraham
Contributor

jgraham commented Sep 4, 2017

Well, with different infrastructure we could run for longer (e.g. if we hooked up to Mozilla's Taskcluster infrastructure, which I believe is possible [1], the default timeout is 3 hours). Of course there has to be some limit, and it's going to be pretty annoying if PRs take multiple hours per push to test. In theory one could parallelise, but that could be difficult given the limitations of Travis. Multiple parallel runs on a single machine might be possible, but that itself could cause intermittency, and if we are resource limited it might not produce any speedup.

I don't have a great suggestion here.

[1] https://docs.taskcluster.net/manual/using/github

@jgraham
Contributor

jgraham commented Sep 4, 2017

Er, sorry 1 hour on TC, but it's configurable.

@foolip
Member Author

foolip commented Oct 12, 2017

@lukebjerring, FYI, I had some ideas in #7073 (comment) and see also #7660 for a similar problem with changing many tests.

@foolip
Member Author

foolip commented Apr 3, 2018

Closing this in favor of #7660, since it comes down to the same thing: whether there are many fast tests or fewer slow tests, it's not always possible to finish running in 50 minutes.

@foolip foolip closed this as completed Apr 3, 2018
5 participants
@bobholt @jgraham @foolip @mfalken and others