Fix high crash rate when crawler is run in Docker #28

Merged 1 commit on Aug 27, 2019

Conversation

englehardt
Contributor

Need to rebase on #27.

As explored in openwpm/OpenWPM#255, newer versions of Firefox often crash because of insufficient shared memory (SHM). This issue wasn't present in the Firefox 52 branch because we used to set browser.tabs.remote.autostart to False, which disables e10s and, with it, SHM usage.

When running a Docker container, the shared memory size can be increased by adding the argument --shm-size=2g. Unfortunately, Kubernetes doesn't yet support setting the shared memory size (see: kubernetes/kubernetes#28272), nor does it support passing arguments to Docker. The current fix is a workaround copied from kubernetes/kubernetes#28272 (comment).
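For readers who have not followed that thread: the workaround most often cited in kubernetes/kubernetes#28272 is to mount a memory-backed emptyDir volume at /dev/shm. Whether this PR uses exactly that form is an assumption; the sketch below only illustrates the shape of such a pod-spec fragment. The function name, the volume name `dshm`, and the container name `openwpm-crawler` are hypothetical.

```python
import json


def shm_volume_patch(container_name, volume_name="dshm"):
    """Build a pod-spec fragment that mounts a memory-backed emptyDir at /dev/shm.

    Assumption: this mirrors the commonly cited workaround, not necessarily
    the exact manifest changed in this PR.
    """
    return {
        # emptyDir with medium "Memory" is backed by tmpfs on the node
        "volumes": [{"name": volume_name, "emptyDir": {"medium": "Memory"}}],
        "containers": [
            {
                "name": container_name,
                # Mounting over /dev/shm replaces the small default shm mount
                "volumeMounts": [{"name": volume_name, "mountPath": "/dev/shm"}],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical container name, used only for illustration
    print(json.dumps(shm_volume_patch("openwpm-crawler"), indent=2))
```

Because a memory-backed emptyDir counts against the node's memory, it would be worth watching memory usage on the nodes when trying something like this.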

I ran a 5k site test crawl with this fix using the branch from this PR: openwpm/OpenWPM#473 . For results, see: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/165717/command/165734.

== CRAWL HISTORY ==
Total number of commands submitted:
+-------+-----+
|command|count|
+-------+-----+
|    GET| 4873|
+-------+-----+

Percentage of command failures 0.10%
Percentage of neterrors 6.48%
Percentage of command timeouts 2.85%

Summary of neterrors:
+-----------------+-----+
|            error|count|
+-----------------+-----+
|connectionFailure|   11|
|       netTimeout|  155|
|         netReset|    1|
|     fileNotFound|    1|
|      nssFailure2|    2|
|      dnsNotFound|  146|
+-----------------+-----+


Summary of exceptions:
+--------------------+-----+
|               error|count|
+--------------------+-----+
|NoSuchWindowExcep...|    1|
|InvalidSessionIdE...|    4|
+--------------------+-----+
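As a sanity check, the reported percentages can be recomputed from the counts in the tables above. This is a minimal sketch, not the notebook's code; it assumes the denominator is the 4873 submitted GET commands and that the "command failures" figure corresponds to the rows of the exceptions table, which the numbers are consistent with.

```python
# Counts copied from the summary tables above
TOTAL_COMMANDS = 4873

NETERRORS = {
    "connectionFailure": 11,
    "netTimeout": 155,
    "netReset": 1,
    "fileNotFound": 1,
    "nssFailure2": 2,
    "dnsNotFound": 146,
}

# Names are truncated in the notebook output and are kept as-is
EXCEPTIONS = {
    "NoSuchWindowExcep...": 1,
    "InvalidSessionIdE...": 4,
}


def percentage(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


neterror_pct = percentage(sum(NETERRORS.values()), TOTAL_COMMANDS)
exception_pct = percentage(sum(EXCEPTIONS.values()), TOTAL_COMMANDS)

print(f"neterrors:  {sum(NETERRORS.values())} of {TOTAL_COMMANDS} = {neterror_pct:.2f}%")
print(f"exceptions: {sum(EXCEPTIONS.values())} of {TOTAL_COMMANDS} = {exception_pct:.2f}%")
```

The 316 neterrors give 6.48% and the 5 exceptions give 0.10%, matching the reported neterror and command-failure rates.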

Errors due to crashes still occur, but significantly less frequently. Note that I also had to increase the timeout to 120 seconds; at 60 seconds, ~20% of sites timed out. I'm not sure whether this workaround introduces some overhead, or whether the sites that previously crashed Firefox were simply resource-heavy and moved from crashing to timing out once the shared memory size was increased.

Contributor

@motin motin left a comment

Interesting. This is very promising, even without specifically being able to set the shared memory size. I'll happily see this as the new master.

Note: There are 2.54% records missing from the 5k though, which we can infer means that all three attempts to crawl those records failed and then resulted in a data loss(?). Is there a risk that the new shared memory leads to over-committed memory allocation on the nodes?

Btw, running two parallel crawls, one without the patch, and one with the patch, would be helpful to understand the specific effects of a single patch (should be something that we ought to do in CI...).

.:: In relation to the original seed list (n = 5000)
Percentage that did not result in a crawl_history record: 2.54%
Percentage that failed to result in a successful command: 11.74%
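These two figures can be reconciled with the crawl history summary earlier in the thread. The sketch below is illustrative arithmetic only; it assumes the 4873 GET commands in that summary each correspond to one distinct seed site.

```python
# Figures taken from the thread
SEED_SITES = 5000            # size of the original seed list
COMMANDS_IN_HISTORY = 4873   # GET commands in the crawl history summary

# Sites with no crawl_history record at all
missing = SEED_SITES - COMMANDS_IN_HISTORY
missing_pct = 100.0 * missing / SEED_SITES

# 11.74% of the seed list did not produce a successful command
unsuccessful = round(0.1174 * SEED_SITES)

# Of those, the sites that do have a record but failed
# (neterror, timeout, or exception)
failed_with_record = unsuccessful - missing

print(f"missing records:       {missing} ({missing_pct:.2f}%)")
print(f"no successful command: {unsuccessful}")
print(f"failed with a record:  {failed_with_record}")
```

The 127 missing sites account for the 2.54%. The remaining 460 unsuccessful sites are consistent with the 316 neterrors, 5 exceptions, and roughly 139 timeouts (2.85% of 4873) reported in the crawl history.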

@englehardt
Contributor Author

> Note: There are 2.54% records missing from the 5k though, which we can infer means that all three attempts to crawl those records failed and then resulted in a data loss(?). Is there a risk that the new shared memory leads to over-committed memory allocation on the nodes?

No, my comment there was wrong. I forgot about node evictions. If a node is evicted, that can still result in data loss, and the affected sites wouldn't appear in the crawl history table. See openwpm/OpenWPM#476.

> Btw, running two parallel crawls, one without the patch, and one with the patch, would be helpful to understand the specific effects of a single patch (should be something that we ought to do in CI...).

I agree this would be useful. I filed https://github.com/mozilla/OpenWPM/issues/479 for discussion.

@englehardt
Contributor Author

Rebased on the most recent master. Merging now.
