Investigate high crash rate of the WebExtensions crawls #255
How is this number calculated? Do we store information about crashes in the resulting crawl datasets? It seems like the right thing to do here is to set up Sentry or a similar solution so that patterns in these errors can be properly detected and troubleshot.
Yes, see https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/71922/command/71923
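For anyone reading along, a minimal sketch of how that number can be reproduced from a local copy of the crawl database, assuming a `crawl_history` table with `command` and `bool_success` columns and `GET` as the command value (the file name and column details are assumptions, not taken from the notebook):

```python
import sqlite3

import pandas as pd

# Placeholder path; point this at a local copy of the crawl database.
DB_PATH = "crawl-data.sqlite"

con = sqlite3.connect(DB_PATH)

# Failure rate per command type, treating bool_success != 1 as a failure.
history = pd.read_sql_query("SELECT command, bool_success FROM crawl_history", con)
failure_rate = (
    history.assign(failed=history["bool_success"] != 1)
    .groupby("command")["failed"]
    .mean()
    .sort_values(ascending=False)
)
print(failure_rate)

# Overall share of failed GET commands.
gets = history[history["command"] == "GET"]
print("GET failure rate: {:.1%}".format((gets["bool_success"] != 1).mean()))
con.close()
```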
/cc @gunesacar, who also has some data on this, comparing the 1M site crawls we used to run on FF52 and a crawl with the new webextensions instrumentation.
Yes, we got 4x more GET command failures with the webextensions instrumentation compared to the previous (November 2018) crawl. See the 7th figure from the top in this notebook: https://github.com/citp/openwpm-data-release/blob/master/Crawl-Data-Metrics.ipynb
Happy to share the crawl logs and the database. You can also download them as part of the Princeton Web Census Dataset.
This is such a great notebook. Thanks for sharing this! I am looking forward to getting Sentry set up properly (#406) so that we can systematically address this.
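Not from this thread, but as a rough sketch of what the Python side of the Sentry wiring discussed in #406 could look like (the DSN and the `visit_site` stand-in are placeholders):

```python
import sentry_sdk

# Placeholder DSN; in practice this would come from a config or environment variable.
sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
    environment="crawler",
)

def visit_site(url):
    """Stand-in for the code path that issues a GET command for a site."""
    raise RuntimeError("simulated command failure for %s" % url)

try:
    visit_site("http://example.com")
except Exception as exc:
    # Report the command failure so that patterns across sites become visible
    # in Sentry, instead of only incrementing a failure counter.
    sentry_sdk.capture_exception(exc)
```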
The notebook summarizes the following failure rates, with 100% corresponding to the 893453 crawl_history records:
An expansion of this notebook (here) states:
Corresponding data for a recent pair of crawls that use the current master branch:
In summary, this issue is about investigating the "Percentage of command failures", and that percentage seems to have dropped from 12.3% to about 6% since January. That is good news!
The fact that only 89% of the 1M crawl list ended up with a crawl_history record may be related to data loss in the S3 Aggregator, which is filed as a separate issue here: #450
These errors have been elusive. If you know sites that reliably cause these errors, please do share! I've sampled sites from the Sentry logs and have had trouble reproducing them locally, even when I purposefully run crawls in a memory- or CPU-constrained VM. I have a WIP PR that makes log levels configurable and bumps console output logs up to DEBUG by default. This will allow us to see the FirefoxExtension logs in GCP (which contain native logs from webdriver). Maybe we'll be able to trace down the cause through that.
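For illustration, a minimal sketch of a configurable console log level of the kind described above (the environment variable name is a placeholder, not the WIP PR's actual interface):

```python
import logging
import os

# Placeholder setting; the WIP PR may expose this differently.
console_level = os.environ.get("OPENWPM_CONSOLE_LOG_LEVEL", "DEBUG")

console_handler = logging.StreamHandler()
console_handler.setLevel(getattr(logging, console_level))
console_handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)

logger = logging.getLogger("openwpm")
logger.setLevel(logging.DEBUG)
logger.addHandler(console_handler)
```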
This is concerning. I see two possibilities:
Agreed. I'm planning to save the serialized exceptions of command failures in a new column in crawl_history. Will that work for you? It won't contain all of the errors that lead to a crash, but will contain all of the browser errors that are currently handled by the catch-all try-except in the browser manager process.
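As a rough sketch of that idea (the `command_error` column name, the `run_command` stand-in, and the return shape are hypothetical, not the eventual implementation in #473):

```python
import json
import traceback

def run_command(command, webdriver):
    """Stand-in for the browser manager's command dispatch."""
    command.execute(webdriver)

def execute_command_with_error_capture(command, webdriver):
    """Catch-all wrapper: on failure, return a serialized exception that could
    be written to a hypothetical `command_error` column in crawl_history."""
    try:
        run_command(command, webdriver)
        return {"bool_success": 1, "command_error": None}
    except Exception as exc:
        serialized = json.dumps({
            "type": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        })
        return {"bool_success": 0, "command_error": serialized}
```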
Ah, yes, thanks, now I remember that you explained this in a recent PR.
The sites are listed above - about a third ought to reproduce the error :)
They are listed above. I realize my previous statement was confusing. All sites have records in crawl_history, so no data loss was encountered. However, since there were 93 sites with bool_success != 1 but only 82 errors in Sentry, there ought to be 11 sites with bool_success != 1 that did not get any error reported to Sentry. I just don't know which particular sites/URLs, and it's odd that they got bool_success != 1 despite not reporting an error.
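A sketch of how those 11 sites could be identified, assuming the visited URL lives in the `arguments` column of crawl_history and that the Sentry events for this crawl can be exported as JSON lines with a `url` field (paths, column, and field names are placeholders):

```python
import json
import sqlite3

con = sqlite3.connect("crawl-data.sqlite")  # placeholder path
failed_sites = {
    row[0]
    for row in con.execute(
        "SELECT arguments FROM crawl_history "
        "WHERE command = 'GET' AND bool_success != 1"
    )
}
con.close()

# Placeholder export: one JSON object per line, each with a "url" field,
# produced from the Sentry events for this crawl.
with open("sentry_events.jsonl") as f:
    reported_sites = {json.loads(line)["url"] for line in f}

# Sites that failed but never reported an error to Sentry.
unexplained = failed_sites - reported_sites
print(len(unexplained), "failed sites with no Sentry event")
```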
Sounds like a good start! Anything that helps track down the root cause of bool_success != 1 is good :)
Got it. That's less concerning :) I'll add in the saving of command errors and we can investigate from there.
This is done in #473.
I think I found the cause of the high crash rate. I only observe it when running in a Docker container, which is why all of my earlier investigations didn't turn up anything useful (I was just inspecting a few sites locally). I believe it's caused by a very low default resource limit in the Docker container.
This should be fixed in openwpm/openwpm-crawler#28
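For context, a well-known cause of Firefox crashes inside Docker is the small default /dev/shm allocation; purely as an illustration (not necessarily what openwpm/openwpm-crawler#28 changed), a crawl container can be given more shared memory via the Docker Python SDK. The image name and size below are placeholders:

```python
import docker

client = docker.from_env()

# Illustrative only: give the crawl container a larger /dev/shm than Docker's
# small default, since Firefox relies heavily on shared memory.
container = client.containers.run(
    "openwpm-crawler:latest",  # placeholder image name
    detach=True,
    shm_size="2g",
)
print(container.id)
```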
I also created #475 to throw better error messages in the event of a Firefox process crash.
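As a rough sketch of the kind of check that turns a generic command failure into a clearer browser-crash message (the names here are hypothetical, not the contents of #475):

```python
import psutil

class BrowserCrashError(Exception):
    """Raised when the Firefox process is gone, rather than a page-level failure."""

def check_firefox_alive(firefox_pid):
    """Distinguish 'the command failed' from 'the browser process crashed'."""
    if not psutil.pid_exists(firefox_pid):
        raise BrowserCrashError(
            "Firefox process %d is no longer running; the command failure is "
            "most likely a browser crash, not a page error." % firefox_pid
        )
```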
This was fixed in openwpm/openwpm-crawler#28, openwpm/openwpm-crawler#30, and #477. In particular, openwpm/openwpm-crawler#30 (comment) gives a detailed description of why the new config parameters improve the stability of the crawler. In a recent test crawl (https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/166927/command/167004) we saw just 5 errors of this type across 100k sites.
Our most recent Firefox 60 support has a relatively high command crash rate in a 1 million site crawl (13% of `get` commands). Compare this to 3.5% for the current Firefox 52-based master branch. A (partial) log file from a recent 1M site crawl is available here: https://www.dropbox.com/s/320d4my3b2prnc8/2019_01_29_alexa1m_logs.bz2?dl=0