
Timeouts in Firefox #616

Open
jugglinmike opened this issue Oct 3, 2018 · 8 comments

Comments

@jugglinmike
Collaborator

Over the past month, Firefox has occasionally been timing out when running "chunk" number 8 of 20. The specific test being executed varies, but it is always located in the same directory html/the-xhtml-syntax/parsing-xhtml-documents/.

This timeout blocked collection attempts today at both 06:00 UTC and 12:00 UTC. It also blocked the two "retry" attempts I triggered manually. At this point, it seems best to give up on those builds and research the root cause directly.

@jugglinmike
Collaborator Author

This looks like an out-of-memory issue.

The workers we are using are relatively low-memory systems (2 gigabytes) running Ubuntu 18.04. The following command causes them to lock up with full memory utilization:

./wpt run --no-manifest \
  --binary _venv/browsers/nightly/firefox/firefox \
  --log-tbpl - \
  --log-tbpl-level debug \
  --webdriver-binary ./geckodriver \
  firefox \
  html/the-xhtml-syntax/parsing-xhtml-documents/xhtml-mathml-dtd-entity-1.htm \
  html/the-xhtml-syntax/parsing-xhtml-documents/xhtml-mathml-dtd-entity-2.htm

Where:

$ ./_venv/browsers/nightly/firefox/firefox --version
Mozilla Firefox 64.0a1
$ ./geckodriver --version
geckodriver 0.22.0

wptrunner executes the first test successfully, begins execution of the second, goes silent for roughly 20 minutes, and finally reports a crash.

We've been seeing variation in which test causes the build failure because Buildbot's "cutoff" for silent builds is 20 minutes. Attempting to limp through all 10 of these tests in this fashion involves passing one test, timing out on another, restarting, and moving on. That's 5 opportunities for the crash to stall the build for the full 20 minutes, each of which causes a build error. We could avoid this by increasing the Buildbot timeout, but that would increase the build time for the "chunk" from 40 minutes to 140 minutes (five 20-minute stalls on top of the usual 40), and the result data would indicate test errors for technically valid tests.

These tests are notoriously memory-intensive: @jgraham split them across multiple files to mitigate the problem last year. We might be able to address the issue with further subdivision, although it's not clear whether that amounts to WPT capitulating to the specific needs of this project. I've opened a separate issue to discuss defining a policy outside of the context of Buildbot, etc.

I'm also wondering if there's a browser bug that needs to be addressed. The system is capable of running any of these tests in isolation. That it crashes only when running one after another (i.e. serial execution in separate windows) makes it seem as though memory is not being properly deallocated. I'm just not sure if it's reasonable to expect memory to be released immediately, or even if that's what's necessary to avoid this timeout.

@jgraham, do you have any thoughts on this?

@jugglinmike
Collaborator Author

@foolip
Member

foolip commented Nov 12, 2018

Perhaps the tests can be modified to not need as much (maximum) memory? Does state leak beyond the individual tests, for example?

@jgraham
Collaborator

jgraham commented Nov 12, 2018

Locally, the peak memory usage in Firefox is about 1 GB (based on reading top rather than anything more sophisticated).

The tests basically run DOMParser on a separate document containing each character entity that's defined for XHTML. I don't think state explicitly leaks between tests, but creating all those documents in a loop is something of an (unrealistic) GC stress test.
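
In rough outline, each test does something like the following sketch (not the actual test source; entities here is a hypothetical map from entity names to their expected expansions, and the exact document shape is only illustrative):

// Sketch only: entities is a hypothetical { name: expectedText } map standing in
// for the entity tables the real tests use.
const parser = new DOMParser();
for (const [name, expected] of Object.entries(entities)) {
  // Each iteration builds and parses a standalone XHTML document whose body is
  // just the entity reference under test.
  const source =
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" ' +
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">' +
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>&' + name + ';</body></html>';
  const doc = parser.parseFromString(source, "application/xhtml+xml");
  // The test then checks that the parser expanded the entity to the expected text;
  // the document is immediately discarded, leaving thousands of short-lived
  // documents for the GC to clean up.
  console.assert(doc.documentElement.textContent === expected, name);
}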

@foolip
Member

foolip commented Nov 13, 2018

Hmm, sounds like the documents and parsers should be able to be GCd, but aren't?

@jgraham
Collaborator

jgraham commented Nov 13, 2018

I think they are GCd but in each test we create and destroy 2500 documents or so in a short timeframe, which isn't much like a typical GC load, so it seems probable that the heuristics aren't tuned for that case (but this is a pure guess; @smaug---- would know better).

@foolip
Member

foolip commented Nov 13, 2018

Ah, probably the JS engine just doesn't know what the true cost of these objects is, counting only the wrappers. At least for <canvas> and <video>, extra accounting (in the implementation, not the tests) is needed to make sure that GC is triggered in cases like these.

@jugglinmike
Collaborator Author

The EC2 workers used by this system are of type "t2.small". They have 2 GB of memory, which seems like it ought to be enough for tests that require 1 GB. But if we account for the imprecision of @jgraham's method and the needs of the operating system and the Buildbot worker process, that could explain things.

Taskcluster is running a bunch of different EC2 instances, but among them is the 15-GB m3.xlarge, so it's no surprise that we don't see the problem in that environment.
