
Improved stability and error logging #449

Merged (6 commits) on Aug 13, 2019
Conversation

englehardt (Collaborator)

This addresses some of the errors observed in a recent large crawl.

This also better handles errors if we still fail after retrying (so the
aggregator process doesn't crash).
The current inter-process log handler discards `exc_info` and adds the
arguments to the exception message, which prevents Sentry from properly
grouping log messages by their exception and call information. We now send
the structured exception information when possible and fall back to the
previous approach when it cannot be serialized.
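
A minimal sketch of this idea, assuming LogRecords are pickled before being sent to the central logging process (the helper name `serialize_record` and the `exc_payload` attribute are hypothetical, not the PR's actual code):

```python
import logging
import pickle
import traceback


def serialize_record(record: logging.LogRecord) -> bytes:
    """Prepare a LogRecord for transport to the central logging process."""
    # Resolve %-style arguments up front so the receiving process does not
    # need the original (possibly unpicklable) argument objects.
    record.msg = record.getMessage()
    record.args = None
    if record.exc_info is not None:
        exc_type, exc_value, exc_tb = record.exc_info
        try:
            # Traceback objects cannot be pickled, so ship the exception
            # class and instance plus a pre-formatted traceback string.
            # The receiving process can then emit a structured event that
            # Sentry groups by exception type and location.
            record.exc_payload = pickle.dumps(
                (exc_type, exc_value, "".join(traceback.format_tb(exc_tb)))
            )
        except (pickle.PicklingError, TypeError):
            # Fall back to the previous approach: fold the traceback into
            # the message text, which Sentry can only group by string.
            record.msg += "\n" + "".join(
                traceback.format_exception(exc_type, exc_value, exc_tb)
            )
        record.exc_info = None
    return pickle.dumps(record)
```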
englehardt commented Aug 13, 2019

@motin This PR is ready for review. I ran a 10k-site crawl with this branch; here are the results:

First, some positives:

  1. Out of the 10k sites, 9489 appear in the crawl_history table, which means the aggregator is much more stable than in our previous crawls (those saw >30% data loss). 68 sites were left in the processing queue, so all of the data loss could very well have come from the single main-process crash described below.
  2. The new serialization of exceptions means the Sentry logs are much cleaner. Look at the groupings in https://sentry.prod.mozaws.net/operations/web-crawls/?query=CRAWL_REFERENCE%3A%22openwpm-crawls%2F2019-08-12_test_pr_449%22.
  3. I didn't receive any threading-related errors, perhaps as a result of more safely closing the webdriver (8903f9d)? A sketch of one such shutdown pattern follows this list.
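
(The following is purely illustrative and not necessarily what 8903f9d does: attempt a clean `driver.quit()` and only fall back to killing the browser process if that fails. `close_browser_safely` is a hypothetical helper; the `moz:processID` capability is geckodriver-specific.)

```python
import os
import signal

from selenium.common.exceptions import WebDriverException


def close_browser_safely(driver, logger):
    """Attempt a clean shutdown first; only then kill the browser process."""
    pid = driver.capabilities.get("moz:processID")  # geckodriver-specific
    try:
        driver.quit()
    except WebDriverException:
        logger.error("webdriver.quit() failed; killing the browser process",
                     exc_info=True)
        if pid is not None:
            try:
                os.kill(pid, signal.SIGKILL)
            except OSError:
                pass  # the process is already gone
```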

Remaining problems:

  1. There was still a single broken-pipe error in the main process (https://sentry.prod.mozaws.net/operations/web-crawls/issues/6101037/?query=CRAWL_REFERENCE:%22openwpm-crawls/2019-08-12_test_pr_449%22). This means there's still a crasher in the aggregator, but there's no associated Sentry log from what I can tell. I'll need to do a bit more digging in the GCP logs (https://console.cloud.google.com/logs/viewer?organizationId=442341870013&project=senglehardt-openwpm-test-1&minLogLevel=500&expandAll=false&timestamp=2019-08-13T00:15:08.481000000Z&customFacets=&limitCustomFacetWidth=true&dateRangeStart=2019-08-12T00:15:08.856Z&dateRangeEnd=2019-08-13T00:15:08.856Z&interval=P1D&resource=container&scrollTimestamp=2019-08-13T00:00:30.414629184Z).
  2. The rate of WebDriverExceptions is quite high. After ignoring the exceptions for the alert-handling code (0b9ccb7), we received almost no errors of the type `Message: Failed to decode response from marionette`. Instead, we received a bunch of `InvalidSessionIdException: Message: Tried to run command without establishing a connection`, which is just caused by the next command that interacts with the webdriver. Thus, something must be causing the webdriver/marionette session to disconnect from the browser after it finishes loading the page. Perhaps the browser has crashed at that point? If I could find a site that reliably reproduces the error, it would be easier to debug; see the probe sketch after this list.
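
As an illustration of how item 2 might be narrowed down (not code from this PR; the helper is hypothetical), one could probe the session before issuing the next command, so that an `InvalidSessionIdException` is attributed to a dead browser rather than to whichever command happens to run next:

```python
from selenium.common.exceptions import (InvalidSessionIdException,
                                        WebDriverException)


def session_is_alive(driver, logger):
    """Cheaply check whether the webdriver/marionette session still responds."""
    try:
        driver.current_url  # any lightweight round-trip command would do
        return True
    except InvalidSessionIdException:
        logger.error("Marionette session is gone; the browser most likely "
                     "crashed after the previous page load.")
        return False
    except WebDriverException:
        logger.error("WebDriver error while probing the session", exc_info=True)
        return False
```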

englehardt requested a review from motin on August 13, 2019 at 04:21
englehardt changed the title from "Improvements to S3Aggregator stability and error logging" to "Improved stability and error logging" on Aug 13, 2019
englehardt (Collaborator, Author)

Fixes #12. Partially addresses #450.

motin (Contributor) left a comment

Very good news! I am happy to merge this and have the remaining issues tackled in separate PRs.

@@ -133,7 +133,7 @@ def get_website(url, sleep, visit_id, webdriver,
         alert = webdriver.switch_to_alert()
         alert.dismiss()
         time.sleep(1)
-    except TimeoutException:
+    except (TimeoutException, WebDriverException):
motin (Contributor) commented on the diff:
This will completely suppress the exceptions that are reported in #404 and prevent us from properly addressing those errors as per #404 (comment). When addressing #404, we must remember to change this back to throwing exceptions, so that we can reproduce the root cause via tests, report upstream, etc.
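
One possible middle ground, sketched here as an illustration rather than as part of this PR (the wrapper function, `visit_id`, and `logger` are assumptions), is to keep the crawl alive but log the suppressed exception so it stays visible in the logs and Sentry:

```python
import time

from selenium.common.exceptions import TimeoutException, WebDriverException


def dismiss_alert_if_present(webdriver, visit_id, logger):
    """Dismiss a blocking alert, logging (rather than hiding) driver errors."""
    try:
        alert = webdriver.switch_to_alert()
        alert.dismiss()
        time.sleep(1)
    except TimeoutException:
        pass  # no alert appeared; nothing to do
    except WebDriverException:
        # Keep the crawl going, but record the error so the issues tracked
        # in #404 remain visible instead of vanishing.
        logger.warning(
            "WebDriverException while dismissing alert on visit %d",
            visit_id, exc_info=True,
        )
```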
