
Timeouts in Firefox #616

Open
jugglinmike opened this issue Oct 3, 2018 · 8 comments

Comments

@jugglinmike
Collaborator

Over the past month, Firefox has occasionally been timing out when running "chunk" number 8 of 20. The specific test being executed varies, but it is always located in the same directory html/the-xhtml-syntax/parsing-xhtml-documents/.

This timeout blocked collection attempts today at both 06:00 UTC and 12:00 UTC. It also blocked the two "retry" attempts I triggered manually. At this point, it seems best to give up on those builds and research the root cause directly.

@jugglinmike
Collaborator Author

This looks like an out-of-memory issue.

The workers we are using are relatively low-memory systems (2 gigabytes) running Ubuntu 18.04. The following command causes them to lock up with full memory utilization:

./wpt run --no-manifest \
  --binary _venv/browsers/nightly/firefox/firefox \
  --log-tbpl - \
  --log-tbpl-level debug \
  --webdriver-binary ./geckodriver \
  firefox \
  html/the-xhtml-syntax/parsing-xhtml-documents/xhtml-mathml-dtd-entity-1.htm \
  html/the-xhtml-syntax/parsing-xhtml-documents/xhtml-mathml-dtd-entity-2.htm

Where:

$ ./_venv/browsers/nightly/firefox/firefox --version
Mozilla Firefox 64.0a1
$ ./geckodriver --version
geckodriver 0.22.0

wptrunner executes the first test successfully, begins execution of the second, goes silent for roughly 20 minutes, and finally reports a crash.

We've been seeing variation in which test causes the build failure because Buildbot's "cutoff" for silent builds is 20 minutes. Attempting to limp through all 10 of these tests in this fashion involves passing one test, timing out on another, restarting, and moving on. That's 5 opportunities for the crash to stall the build for the full 20 minutes, each of which causes a build error. We could avoid this by increasing the Buildbot timeout, but that would increase the build time for the "chunk" from 40 minutes to 140 minutes (five 20-minute stalls on top of the usual 40), and the result data would indicate test errors for technically valid tests.

These tests are notoriously memory-intensive: @jgraham split them across multiple files to mitigate the problem last year. We might be able to address the issue with further subdivision, although it's not clear whether that amounts to WPT capitulating to the specific needs of this project. I've opened a separate issue to discuss defining a policy outside of the context of Buildbot, etc.

I'm also wondering if there's a browser bug that needs to be addressed. The system is capable of running any of these tests in isolation. That it crashes only when running one after another (i.e. serial execution in separate windows) makes it seem as though memory is not being properly deallocated. I'm just not sure if it's reasonable to expect memory to be released immediately, or even if that's what's necessary to avoid this timeout.

@jgraham, do you have any thoughts on this?

@jugglinmike
Collaborator Author

@foolip
Member

foolip commented Nov 12, 2018

Perhaps the tests can be modified to not need as much (maximum) memory? Does state leak beyond the individual tests, for example?

@jgraham
Collaborator

jgraham commented Nov 12, 2018

Locally, the peak memory usage in Firefox is about 1 GB (based on reading top rather than anything more sophisticated).

The tests basically run DOMParser on a separate document containing each character entity that's defined for XHTML. I don't think state explicitly leaks between tests, but creating all those documents in a loop is something of an (unrealistic) GC stress test.
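
In rough outline, each test does something like the following sketch (not the actual test source; entities here is a hypothetical map from entity names to their expected expansions, and the exact document shape is only illustrative):

// Sketch only: entities is a hypothetical { name: expectedText } map standing in
// for the entity tables the real tests use.
const parser = new DOMParser();
for (const [name, expected] of Object.entries(entities)) {
  // Each iteration builds and parses a standalone XHTML document whose body is
  // just the entity reference under test.
  const source =
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" ' +
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">' +
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>&' + name + ';</body></html>';
  const doc = parser.parseFromString(source, "application/xhtml+xml");
  // The test then checks that the parser expanded the entity to the expected text;
  // the document is immediately discarded, leaving thousands of short-lived
  // documents for the GC to clean up.
  console.assert(doc.documentElement.textContent === expected, name);
}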

@foolip
Member

foolip commented Nov 13, 2018

Hmm, sounds like the documents and parsers should be able to be GCd, but aren't?

@jgraham
Collaborator

jgraham commented Nov 13, 2018

I think they are GCd but in each test we create and destroy 2500 documents or so in a short timeframe, which isn't much like a typical GC load, so it seems probable that the heuristics aren't tuned for that case (but this is a pure guess; @smaug---- would know better).

@foolip
Member

foolip commented Nov 13, 2018

Ah, probably the JS engine just doesn't know what the true cost of these objects is, counting only the wrappers. At least for <canvas> and <video>, extra accounting (in the implementation, not the tests) is needed to make sure that GC is triggered in cases like these.

@jugglinmike
Collaborator Author

The EC2 workers used by this system are of type "t2.small". They have 2 GB of memory, which seems like it ought to be enough for tests that require 1 GB. But if we account for the imprecision of @jgraham's method and the needs of the operating system and the Buildbot worker process, that could explain things.

Taskcluster is running a bunch of different EC2 instances, but among them is the 15-GB m3.xlarge, so it's no surprise that we don't see the problem in that environment.
