Minimize data loss during unexpected shutdown #450

Closed

motin opened this issue Aug 10, 2019 · 7 comments

@motin (Contributor) commented Aug 10, 2019

In a recent pair of 108k crawls in GCP, this is what happened:

  1. After an initial crawl run, about 83k crawl_history records were found in the S3 bucket (about 70%)
  2. After creating a new crawl list (n = 25k) with the items of the original seed list that did not yet have a crawl_history record, about 18k more, or 101k crawl_history records in total, were found in the S3 bucket (about 72% of the 25k crawled made it)
  3. After repeating step 2 with a 7k list, about 107k crawl_history records were found in total (about 85% of the 7k crawled made it)

I am filing this separately from #255, which addresses the percentage of command failures relative to the total number of commands submitted. This issue is not about the value of bool_success in the crawl_history table, but about the existence of a record at all; currently, the data for around 15-30% of crawled sites in each crawl attempt appears to be dropped.
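
As a side note for anyone reproducing step 2, here is a minimal sketch of how the re-crawl list could be derived. It assumes the crawl_history records were exported from S3 to a local Parquet file, that the seed list is a plain text file with one site per line, and that `site_url` is a placeholder for whichever column actually stores the visited site.

```python
# Sketch only (not from this issue): derive a re-crawl list as the set
# difference between the original seed list and the sites that already have
# at least one crawl_history record, regardless of bool_success.
# "site_url" is a placeholder column name; adjust to the real schema.
import pandas as pd

seed_sites = {line.strip() for line in open("seed_list.txt") if line.strip()}
history = pd.read_parquet("crawl_history.parquet")
recorded_sites = set(history["site_url"].dropna())

missing = sorted(seed_sites - recorded_sites)
with open("recrawl_list.txt", "w") as f:
    f.write("\n".join(missing))

print(f"{len(missing)} of {len(seed_sites)} seed sites have no crawl_history record")
```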

@englehardt (Collaborator):

This was partially addressed in a number of previous PRs by improving the stability of the main processes, thus reducing the instances where data loss occurs. In a recent 5k-site crawl, 127 sites (~2.5%) were not present in the final dataset.

This test crawl had no main process crashes, so I suspect that all of the lost data was due to evicted pods. Pods can be evicted due to memory pressure, and I suspect this is somewhat unavoidable. Setting a higher minimum memory allocation per crawler pod seems like it could help, as this would lower the number of pods scheduled on each node and might help prevent the over-allocation that leads to memory pressure after a while. I filed openwpm/openwpm-crawler#29 for this.

I think the final piece of this puzzle will be to make the S3Aggregator batch size configurable, lower the default a bit, and distribute batches via the job queue. Let's say the batch size is 100: a worker checks out a batch, runs through all 100 sites, commits the data synchronously (#476), and then marks the batch as completed. If the worker dies or is evicted, another worker picks up the batch.
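
A minimal sketch of that checkout flow, assuming a Redis-backed job queue; the key names, `force_flush()`, and the reliable-queue pattern are illustrative, not the actual openwpm-crawler implementation.

```python
# Illustrative sketch of the proposed batching scheme (not real openwpm-crawler
# code): each queue entry is a JSON batch of ~100 sites. A worker atomically
# moves the batch to its own "processing" list, crawls every site, commits the
# data synchronously (#476), and only then removes the batch. If the worker
# dies or is evicted, a reaper can push stale processing entries back onto the
# main queue so another worker picks them up.
import json
import redis

BATCH_QUEUE = "crawl:batches"               # hypothetical key names
PROCESSING = "crawl:processing:worker-1"

r = redis.Redis(host="redis", port=6379)

def crawl_one_batch(manager, aggregator):
    raw = r.rpoplpush(BATCH_QUEUE, PROCESSING)  # atomic batch checkout
    if raw is None:
        return False                            # queue drained
    batch = json.loads(raw)

    for site in batch["sites"]:
        manager.get(site)                       # visit the site

    aggregator.force_flush()                    # hypothetical synchronous commit
    r.lrem(PROCESSING, 1, raw)                  # mark the batch completed
    return True
```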

@englehardt (Collaborator):

With the additions in openwpm/openwpm-crawler#30, we see a relatively low data loss rate. For a crawl of 100k sites, we failed to record any data for 131, or 0.13%, of the submitted sites. I'm updating the title here to reflect that there's still more work to do to prevent any data loss.

@englehardt englehardt changed the title Address high data loss rate in S3Aggregator Minimize data loss due to S3Aggregator record batching Aug 27, 2019
@vringar (Contributor) commented Nov 12, 2019

By registering a preStop hook, we should be able to flush the aggregator before the instance gets killed.

@nhnt11 (Contributor) commented Nov 12, 2019

We can use the preStop hook to run a script that sends a signal to the task manager, which then shuts everything down and flushes the data. Ideally, we should also log the success or failure of this process.
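
A rough sketch of the crawler-side half of this, assuming the preStop script simply forwards SIGTERM to the crawl process; treating `manager.close()` as the TaskManager shutdown/flush entry point is an assumption.

```python
# Illustrative sketch: handle the SIGTERM that Kubernetes (or a preStop hook
# script) sends, shut the TaskManager down, flush data, and log the outcome.
# The shutdown has to finish within the termination grace period (30s default).
import logging
import signal
import sys

def install_graceful_shutdown(manager):
    def handle_sigterm(signum, frame):
        logging.warning("SIGTERM received; shutting down and flushing data")
        try:
            manager.close()   # assumed TaskManager shutdown/flush entry point
            logging.info("graceful shutdown succeeded")
        except Exception:
            logging.exception("graceful shutdown failed")
        finally:
            sys.exit(0)

    signal.signal(signal.SIGTERM, handle_sigterm)
```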

@vringar (Contributor) commented Nov 13, 2019

Apparently we have a default grace period of 30s to clean up after ourselves, as per the Kubernetes docs.

@englehardt englehardt changed the title Minimize data loss due to S3Aggregator record batching Minimize data loss during unexpected shutdown Nov 14, 2019
@vringar (Contributor) commented Nov 14, 2019

MainThread: Inform everybody else to shut down.
WorkerThreads: Inform the extension of the shutdown, wait for the extension to report done, save the profile, then shut down.
Aggregator: Wait for the WorkerThreads (possibly in the main thread), flush all data to storage, and report which sites have an incomplete crawl.
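
A minimal sketch of that ordering using plain threading primitives; all of the browser and aggregator method names below are placeholders, not OpenWPM's real API.

```python
# Illustrative sketch of the shutdown ordering above; method names such as
# shutdown_extension() and report_incomplete_sites() are placeholders.
import threading

shutdown_requested = threading.Event()

def main_thread_shutdown(workers, aggregator):
    shutdown_requested.set()               # 1. inform everybody else
    for w in workers:
        w.join()                           # 2. wait for all WorkerThreads
    aggregator.flush()                     # 3. flush buffered data to storage
    aggregator.report_incomplete_sites()   # 4. report sites with incomplete crawls

def worker_loop(browser):
    while not shutdown_requested.is_set():
        browser.run_next_command()
    browser.shutdown_extension()           # inform the extension of the shutdown
    browser.wait_for_extension_done()      # wait for it to report done
    browser.save_profile()                 # save the profile, then exit
```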

@englehardt (Collaborator):

This was addressed in several recent PRs.
