When many tests are affected, CI stability jobs will time out #7660
There are two issues here. One is that we should have a periodic keepalive message when the tests are still running but not producing output. The other is that if we get no output for 10 minutes we're pretty much guaranteed to hit the hard timeout (since that implies a minimum runtime of 100 minutes, which is about twice the threshold).
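The keepalive idea above could be as simple as a background thread that emits a heartbeat line while the tests run, so the CI provider never sees 10 minutes of silence. The following is a minimal sketch in Python; the function and message here are illustrative, not the actual wpt implementation:

```python
import threading

def with_keepalive(run_tests, interval_s=60):
    """Run `run_tests()` while a background thread periodically prints
    a heartbeat, so CI doesn't kill the job for lack of output.
    Hypothetical helper, not actual wpt code."""
    stop = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval_s):
            print("... tests still running", flush=True)

    t = threading.Thread(target=heartbeat, daemon=True)
    t.start()
    try:
        return run_tests()
    finally:
        stop.set()
        t.join()
```

The `finally` block ensures the heartbeat thread is always torn down, even if the test run raises.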
Changed the title a bit because in #9641 not that many files were changed, but many included the changed file, and that's what matters. Also, not all browsers necessarily fail.
Does it make sense to run with
@jgraham, what kinds of things would you expect to blow up if we do that? And would that still restart between each of the 10 runs?
Nothing should blow up; the default assumes that if you have one failing test then you don't want it to infect subsequent tests if it happens to set some bad state. In this case what you might see is that one unstable test would be more likely to make subsequent tests in the run look unstable. It won't affect restarting between runs at all. An alternate approach would be to set the expectation data after the first run and use that to avoid restarting on subsequent runs unless something actually is unstable. At the moment the way we update the metadata is really slow, so that will itself have a performance impact.
So I'm not opposed to that, but it is a limitation of the current system that we should look to address in a better way. But this issue seems to be about the "10m without output" problem, which is different to the entire run timing out and requires a different solution.
I think most of the PRs I've linked to here were actual timeouts close to 50 minutes; at least https://travis-ci.org/w3c/web-platform-tests/jobs/352958109 ended with just "The job exceeded the maximum time limit for jobs, and has been terminated". Some of the timeouts have probably been due to verbose logging (now fixed) and maybe I've misattributed some other things, but plain hangs that result in no output for 10 minutes aren't most of the problem, I think.

As for a better way, ultimately I think there are cases where it's just not a good use of resources to even try to run all affected tests 10 times, like if testharness.js has changed or if many tests are affected in some trivial way like in #9718. No matter how much Travis capacity we have, or whether we are able to reuse the wpt.fyi running infra, I think it makes sense to put a cap on resource usage per PR and just do the most useful things that can be done within those constraints. It should be possible to run all tests once so that we can actually make testharness.js changes with confidence, and maybe twice to compare with/without changes, but I wouldn't want to allocate more resources than that per PR.

I think the next step is to just buy more Travis capacity, which will increase the timeout to 120 minutes. @plehegar is looking into that.
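The per-PR cap suggested above amounts to scaling the repeat count down as the number of affected tests grows, while always allowing at least one full run. A minimal sketch, assuming a fixed time budget and a rough average per-test duration (all names and parameters here are hypothetical):

```python
def choose_repeat_count(num_affected, avg_test_time_s, budget_s, max_repeats=10):
    """Pick how many times to repeat the affected tests so the total
    estimated time stays within a per-PR budget.  Always allow at least
    one run so huge changes (e.g. to testharness.js) still get coverage.
    Hypothetical helper, not actual wpt code."""
    if num_affected == 0:
        return 0
    per_run = num_affected * avg_test_time_s
    affordable = int(budget_s // per_run)
    return max(1, min(max_repeats, affordable))
```

For a small change this yields the full 10 repeats; for a change touching a thousand tests it degrades gracefully to a single run instead of timing out.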
@stephenmcgruer can you retriage this for priority? #7660 (comment) is the only idea I have for fixing this.
I wrote a 'brief' document that summarizes the (I believe) generally agreed upon solution: https://docs.google.com/document/d/1dAlCSHUQldtgWDDTrGJR-ksm19FZZ3k8ppqc5-kSwIk/edit# . It's world-readable, comment-able by @chromiums, and I'm happy to give comment access to anyone else who wants it (I just can't make it world comment-able due to spam).
Update: unfortunately there was disagreement on the proposed approach for avoiding CI timeouts and I haven't had the time/energy to drive it to a successful conclusion. I still hope to address this, hopefully sometime in Q3 once the Taskcluster Checks migration has comfortably landed.
FWIW, my view is still that the ideal solution is to do something in the decision task that chooses to shard based on the number of files changed (I don't think we have the affected tests in the decision task?)
Well, even if we do that, we still have widely used resource files that affect way too many tests (such that we can't afford to run 10x).
So we're in a better place to do this now, since we switched to repeat-only stability checks for the CI. We can check for an impending timeout between each iteration of the test run (using previous runs to estimate running time), and bail if we're over our time limit. Now it just needs me (or someone) to put in the time to create a PR ;)
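The between-iteration check described above can be sketched as a loop that records each iteration's duration and bails out before starting an iteration that would likely blow the overall limit. This is a sketch under assumed names, not the actual wpt implementation:

```python
import time

def run_stability_check(run_once, iterations, time_limit_s):
    """Repeat `run_once()` up to `iterations` times, bailing early if
    the next iteration would likely exceed the overall time limit.
    Hypothetical helper, not actual wpt code."""
    start = time.monotonic()
    durations = []
    results = []
    for _ in range(iterations):
        elapsed = time.monotonic() - start
        if durations:
            # Pessimistic estimate: assume the next iteration takes as
            # long as the slowest one so far.
            estimate = max(durations)
            if elapsed + estimate > time_limit_s:
                # Bail cleanly rather than hit the CI hard timeout
                # mid-iteration.
                break
        iter_start = time.monotonic()
        results.append(run_once())
        durations.append(time.monotonic() - iter_start)
    return results
```

Using the slowest previous iteration as the estimate errs on the side of bailing early, which is the safe failure mode here: a partial-but-clean result beats a job killed by the CI hard timeout.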
With the changes implemented based on @stephenmcgruer's proposal, is it safe to close this issue now?
I think it's still possible for this to occur in the case where we can't run two iterations of a selection of tests inside the task timeout. But hopefully that's rarer. I think it makes sense to track any remaining failures as part of a new issue.
In #7654 I changed 778 files, and as a result all of the stability jobs failed, with messages like "No output has been received in the last 10m0s".
This is similar to #7073.