Fix high crash rate when crawler is run in Docker #28

englehardt · 2019-08-26T20:04:34Z

Need to rebase on #27.

As explored in openwpm/OpenWPM#255, newer versions of Firefox will often crash due to insufficient shared memory. This issue wasn't present in the Firefox 52 branch because we used to set browser.tabs.remote.autostart to False, which disables SHM usage by disabling e10s.

When running a Docker container, the shared memory size can be increased by adding the argument --shm-size=2g. Unfortunately, Kubernetes doesn't yet support setting the shared memory size (see: kubernetes/kubernetes#28272), nor does it support passing arguments to Docker. The current fix is a workaround copied from kubernetes/kubernetes#28272 (comment).
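For readers who have not followed the linked Kubernetes issue, here is a minimal sketch of the kind of workaround usually cited there: mounting a memory-backed emptyDir volume over /dev/shm. This is an assumption about the shape of the fix, not a copy of the manifest changed in this PR. The helper name `with_memory_backed_shm`, the volume name `dshm`, and the container name and image are all hypothetical.

```python
import copy
import json


def with_memory_backed_shm(pod_spec, volume_name="dshm"):
    """Return a copy of a pod spec (plain dict) with a memory-backed
    emptyDir volume mounted at /dev/shm in every container.

    Hypothetical helper for illustration only; it is not part of
    OpenWPM or of the Kubernetes client libraries.
    """
    spec = copy.deepcopy(pod_spec)

    # Declare the volume once at the pod level.
    volumes = spec.setdefault("volumes", [])
    if not any(v.get("name") == volume_name for v in volumes):
        volumes.append({"name": volume_name, "emptyDir": {"medium": "Memory"}})

    # Mount it over /dev/shm in each container that lacks such a mount.
    for container in spec.get("containers", []):
        mounts = container.setdefault("volumeMounts", [])
        if not any(m.get("mountPath") == "/dev/shm" for m in mounts):
            mounts.append({"name": volume_name, "mountPath": "/dev/shm"})

    return spec


# Hypothetical container name and image, used only to show the output shape.
example_spec = {
    "containers": [{"name": "openwpm-crawl", "image": "example/openwpm:latest"}]
}
patched = with_memory_backed_shm(example_spec)
print(json.dumps(patched, indent=2))
```

The printed structure corresponds to the `volumes` and `volumeMounts` entries that would appear in a pod or job manifest. Note that a memory-backed emptyDir counts against the memory of the pod, which is relevant to the over-commit question raised later in this thread.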

I ran a 5k site test crawl with this fix using the branch from this PR: openwpm/OpenWPM#473 . For results, see: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/165717/command/165734.

== CRAWL HISTORY ==
Total number of commands submitted:
+-------+-----+
|command|count|
+-------+-----+
|    GET| 4873|
+-------+-----+

Percentage of command failures 0.10%
Percentage of neterrors 6.48%
Percentage of command timeouts 2.85%

Summary of neterrors:
+-----------------+-----+
|            error|count|
+-----------------+-----+
|connectionFailure|   11|
|       netTimeout|  155|
|         netReset|    1|
|     fileNotFound|    1|
|      nssFailure2|    2|
|      dnsNotFound|  146|
+-----------------+-----+


Summary of exceptions:
+--------------------+-----+
|               error|count|
+--------------------+-----+
|NoSuchWindowExcep...|    1|
|InvalidSessionIdE...|    4|
+--------------------+-----+
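As a quick cross-check, the reported percentages can be recomputed from the tables above. This sketch assumes that the neterror percentage is the sum of the neterror counts over the total number of GET commands, and that the command failure percentage corresponds to the exceptions table; both assumptions match the reported figures, but the notebook's actual queries are not shown here.

```python
# Counts copied from the tables above.
total_commands = 4873

neterrors = {
    "connectionFailure": 11,
    "netTimeout": 155,
    "netReset": 1,
    "fileNotFound": 1,
    "nssFailure2": 2,
    "dnsNotFound": 146,
}

# Exception names are truncated in the notebook output and kept as-is.
exceptions = {
    "NoSuchWindowExcep...": 1,
    "InvalidSessionIdE...": 4,
}


def pct(count, total):
    """Percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


neterror_count = sum(neterrors.values())
exception_count = sum(exceptions.values())
neterror_pct = pct(neterror_count, total_commands)
exception_pct = pct(exception_count, total_commands)

print(f"neterrors:  {neterror_count} of {total_commands} = {neterror_pct:.2f}%")
print(f"exceptions: {exception_count} of {total_commands} = {exception_pct:.2f}%")
```

Under these assumptions the results agree with the 6.48% neterror rate and the 0.10% command failure rate reported above.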

Errors due to crashes still occur, but significantly less frequently. Note that I also had to increase the timeout to 120 seconds; with a 60-second timeout, ~20% of sites timed out. I'm not sure whether this workaround introduces some overhead, or whether the sites that previously crashed Firefox were resource-heavy and simply moved from crashing to timing out once the shared memory size was increased.

motin

Interesting. This is very promising, even without specifically being able to set the shared memory size. I'll happily see this as the new master.

Note: There are 2.54% records missing from the 5k though, which we can infer means that all three attempts to crawl those records failed and then resulted in a data loss(?). Is there a risk that the new shared memory leads to over-committed memory allocation on the nodes?

Btw, running two parallel crawls, one without the patch, and one with the patch, would be helpful to understand the specific effects of a single patch (should be something that we ought to do in CI...).

.:: In relation to the original seed list (n = 5000)
Percentage that did not result in a crawl_history record: 2.54%
Percentage that failed to result in a successful command: 11.74%
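The 2.54% figure is consistent with the command count in the crawl history above. A minimal sketch of the arithmetic, assuming each seed site yields at most one GET record in the crawl history (an assumption; the notebook query is not shown):

```python
seed_list_size = 5000          # n from the seed list
crawl_history_records = 4873   # GET commands reported in the crawl history

# Sites that never produced a crawl_history record.
missing = seed_list_size - crawl_history_records
missing_pct = round(100.0 * missing / seed_list_size, 2)

# 11.74% of the seed list did not end in a successful command.
unsuccessful_sites = round(0.1174 * seed_list_size)

print(f"missing records: {missing} ({missing_pct:.2f}%)")
print(f"sites without a successful command: {unsuccessful_sites}")
```

Under this assumption, the missing records account for 127 sites, and roughly 587 sites in total did not end in a successful command.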

englehardt · 2019-08-27T20:27:57Z

> Note: There are 2.54% records missing from the 5k though, which we can infer means that all three attempts to crawl those records failed and then resulted in a data loss(?). Is there a risk that the new shared memory leads to over-committed memory allocation on the nodes?

No, my comment there was wrong. I forgot about node evictions. If a node is evicted it could still result in data loss and wouldn't appear in the crawl history table. See openwpm/OpenWPM#476.

> Btw, running two parallel crawls, one without the patch, and one with the patch, would be helpful to understand the specific effects of a single patch (should be something that we ought to do in CI...).

I agree this would be useful. I filed https://github.com/mozilla/OpenWPM/issues/479 for discussion.

englehardt · 2019-08-27T21:19:19Z

Rebased on the most recent master. Merging now.

englehardt mentioned this pull request Aug 26, 2019

Investigate high crash rate of the WebExtensions crawls openwpm/OpenWPM#255

Closed

englehardt requested a review from motin August 26, 2019 20:55

motin approved these changes Aug 27, 2019

englehardt mentioned this pull request Aug 27, 2019

Add a minimum request for memory and bump parallelism #30

Merged

Increase shared memory and timeout

7f7a440

englehardt force-pushed the openwpm_issue_255 branch from 9f87ca6 to 7f7a440 Compare August 27, 2019 21:18

englehardt merged commit 48ebf1a into master Aug 27, 2019

englehardt deleted the openwpm_issue_255 branch August 27, 2019 21:19

englehardt mentioned this pull request Aug 27, 2019

Configure shared memory size explicitly in kubernetes #31

Open

motin mentioned this pull request Sep 1, 2019

Support a best practice retry-n-times crawl approach #23

Closed

englehardt mentioned this pull request Nov 11, 2020

Explore running smaller, automated crawls in CI to detect regressions #49

Closed
