Investigate high crash rate of the WebExtensions crawls #255
How is this number calculated? Do we store information about crashes in the resulting crawl datasets? It seems like the right thing to do here is to set up Sentry or a similar solution so that patterns in these errors can be properly detected and troubleshot.
Yes, see https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/71922/command/71923
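For anyone reading along, a minimal sketch of how that number can be reproduced from a local copy of the crawl database, assuming a `crawl_history` table with `command` and `bool_success` columns and `GET` as the command value (the file name and column details are assumptions, not taken from the notebook):

```python
import sqlite3

import pandas as pd

# Placeholder path; point this at a local copy of the crawl database.
DB_PATH = "crawl-data.sqlite"

con = sqlite3.connect(DB_PATH)

# Failure rate per command type, treating bool_success != 1 as a failure.
history = pd.read_sql_query("SELECT command, bool_success FROM crawl_history", con)
failure_rate = (
    history.assign(failed=history["bool_success"] != 1)
    .groupby("command")["failed"]
    .mean()
    .sort_values(ascending=False)
)
print(failure_rate)

# Overall share of failed GET commands.
gets = history[history["command"] == "GET"]
print("GET failure rate: {:.1%}".format((gets["bool_success"] != 1).mean()))
con.close()
```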
/cc @gunesacar, who also has some data on this, comparing the 1M site crawls we used to run on FF52 and a crawl with the new webextensions instrumentation.
Yes, we got 4x more GET command failures with the webextensions instrumentation compared to the previous (November 2018) crawl. See the 7th figure from the top in this notebook: https://github.com/citp/openwpm-data-release/blob/master/Crawl-Data-Metrics.ipynb
Happy to share the crawl logs and the database. You can also download them as part of the Princeton Web Census Dataset.
This is such a great notebook. Thanks for sharing this! I am looking forward to getting Sentry set up properly (#406) so that we can systematically address this.
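Not from this thread, but as a rough sketch of what the Python side of the Sentry wiring discussed in #406 could look like (the DSN and the `visit_site` stand-in are placeholders):

```python
import sentry_sdk

# Placeholder DSN; in practice this would come from a config or environment variable.
sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
    environment="crawler",
)

def visit_site(url):
    """Stand-in for the code path that issues a GET command for a site."""
    raise RuntimeError("simulated command failure for %s" % url)

try:
    visit_site("http://example.com")
except Exception as exc:
    # Report the command failure so that patterns across sites become visible
    # in Sentry, instead of only incrementing a failure counter.
    sentry_sdk.capture_exception(exc)
```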
The notebook summarizes the following failure rates, with 100% corresponding to the 893453 crawl_history records:
An expansion of this notebook (here) states:
Corresponding data for a recent pair of crawls that use the current master branch:
In summary, this issue is about investigating the "Percentage of command failures", and that percentage seems to have dropped from 12.3% to about 6% since January. That is good news!
The fact that only 89% of the 1M crawl list ended up with a crawl_history record may be related to data loss in the S3 Aggregator, which is filed as a separate issue here: #450
These errors have been elusive. If you know sites that reliably cause these errors, please do share! I've sampled sites from the Sentry logs and have had trouble reproducing them locally, even when I purposefully run crawls in a memory- or CPU-constrained VM. I have a WIP PR that makes log levels configurable and bumps console output logs up to DEBUG by default. This will allow us to see the FirefoxExtension logs in GCP (which contain native logs from webdriver). Maybe we'll be able to trace down the cause through that.
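For illustration, a minimal sketch of a configurable console log level of the kind described above (the environment variable name is a placeholder, not the WIP PR's actual interface):

```python
import logging
import os

# Placeholder setting; the WIP PR may expose this differently.
console_level = os.environ.get("OPENWPM_CONSOLE_LOG_LEVEL", "DEBUG")

console_handler = logging.StreamHandler()
console_handler.setLevel(getattr(logging, console_level))
console_handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)

logger = logging.getLogger("openwpm")
logger.setLevel(logging.DEBUG)
logger.addHandler(console_handler)
```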
This is concerning. I see two possibilities:
Agreed. I'm planning to save the serialized exceptions of command failures in a new column in crawl_history. Will that work for you? It won't contain all of the errors that lead to a crash, but will contain all of the browser errors that are currently handled by the catch-all try-except in the browser manager process.
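As a rough sketch of that idea (the `command_error` column name, the `run_command` stand-in, and the return shape are hypothetical, not the eventual implementation in #473):

```python
import json
import traceback

def run_command(command, webdriver):
    """Stand-in for the browser manager's command dispatch."""
    command.execute(webdriver)

def execute_command_with_error_capture(command, webdriver):
    """Catch-all wrapper: on failure, return a serialized exception that could
    be written to a hypothetical `command_error` column in crawl_history."""
    try:
        run_command(command, webdriver)
        return {"bool_success": 1, "command_error": None}
    except Exception as exc:
        serialized = json.dumps({
            "type": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        })
        return {"bool_success": 0, "command_error": serialized}
```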
Ah, yes, thanks, now I remember that you explained this in a recent PR.
The sites are listed above - about a third ought to reproduce the error :)
They are listed above. I realize my previous statement was confusing. All sites have records in crawl_history, so no data loss was encountered. However, since there were 93 sites with bool_success != 1 but only 82 errors in Sentry, there ought to be 11 sites with bool_success != 1 that did not get any error reported to Sentry. I just don't know which particular sites/URLs, and it's odd that they got bool_success != 1 despite not reporting an error.
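A sketch of how those 11 sites could be identified, assuming the visited URL lives in the `arguments` column of crawl_history and that the Sentry events for this crawl can be exported as JSON lines with a `url` field (paths, column, and field names are placeholders):

```python
import json
import sqlite3

con = sqlite3.connect("crawl-data.sqlite")  # placeholder path
failed_sites = {
    row[0]
    for row in con.execute(
        "SELECT arguments FROM crawl_history "
        "WHERE command = 'GET' AND bool_success != 1"
    )
}
con.close()

# Placeholder export: one JSON object per line, each with a "url" field,
# produced from the Sentry events for this crawl.
with open("sentry_events.jsonl") as f:
    reported_sites = {json.loads(line)["url"] for line in f}

# Sites that failed but never reported an error to Sentry.
unexplained = failed_sites - reported_sites
print(len(unexplained), "failed sites with no Sentry event")
```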
Sounds like a good start! Anything that helps track down the root cause of bool_success != 1 is good :)
Got it. That's less concerning :) I'll add in the saving of command errors and we can investigate from there.
This is done in #473.
I think I found the cause of the high crash rate. I only observe it when running in a Docker container, which is why all of my earlier investigations didn't turn up anything useful (I was just inspecting a few sites locally). I believe it's caused by a very low default resource limit in the Docker container.
This should be fixed in openwpm/openwpm-crawler#28
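For context, a well-known cause of Firefox crashes inside Docker is the small default /dev/shm allocation; purely as an illustration (not necessarily what openwpm/openwpm-crawler#28 changed), a crawl container can be given more shared memory via the Docker Python SDK. The image name and size below are placeholders:

```python
import docker

client = docker.from_env()

# Illustrative only: give the crawl container a larger /dev/shm than Docker's
# small default, since Firefox relies heavily on shared memory.
container = client.containers.run(
    "openwpm-crawler:latest",  # placeholder image name
    detach=True,
    shm_size="2g",
)
print(container.id)
```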
I also created #475 to throw better error messages in the event of a Firefox process crash.
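As a rough sketch of the kind of check that turns a generic command failure into a clearer browser-crash message (the names here are hypothetical, not the contents of #475):

```python
import psutil

class BrowserCrashError(Exception):
    """Raised when the Firefox process is gone, rather than a page-level failure."""

def check_firefox_alive(firefox_pid):
    """Distinguish 'the command failed' from 'the browser process crashed'."""
    if not psutil.pid_exists(firefox_pid):
        raise BrowserCrashError(
            "Firefox process %d is no longer running; the command failure is "
            "most likely a browser crash, not a page error." % firefox_pid
        )
```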
This was fixed in openwpm/openwpm-crawler#28, openwpm/openwpm-crawler#30, and #477. In particular, openwpm/openwpm-crawler#30 (comment) gives a detailed description of why the new config parameters improve the stability of the crawler. In a recent test crawl (https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/166927/command/167004) we saw just 5 errors of this type across 100k sites.
Our most recent Firefox 60 support has a relatively high command crash rate in a 1 million site crawl (13% of `get` commands). Compare this to 3.5% for the current Firefox 52-based master branch. A (partial) log file from a recent 1M site crawl is available here: https://www.dropbox.com/s/320d4my3b2prnc8/2019_01_29_alexa1m_logs.bz2?dl=0