Minimize data loss during unexpected shutdown #450

Closed

motin opened this issue Aug 10, 2019 · 7 comments

@motin (Contributor) commented Aug 10, 2019

In a recent pair of 108k crawls in GCP, this is what happened:

  1. After an initial crawl run, about 83k crawl_history records were found in the S3 bucket (about 70%)
  2. After creating a new crawl list (n = 25k) with the items of the original seed list that did not yet have a crawl_history record, about 18k more, or 101k crawl_history records in total, were found in the S3 bucket (about 72% of the 25k crawled made it)
  3. After repeating step 2 with a 7k list, about 107k crawl_history records were found in total (about 85% of the 7k crawled made it)

I am filing this separately from #255, which addresses the percentage of command failures relative to the total number of commands submitted. This issue is not about the value of bool_success in the crawl_history table, but about the existence of a record at all; currently, the data for around 15-30% of crawled sites in each crawl attempt appears to be dropped.
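
As a side note for anyone reproducing step 2, here is a minimal sketch of how the re-crawl list could be derived. It assumes the crawl_history records were exported from S3 to a local Parquet file, that the seed list is a plain text file with one site per line, and that `site_url` is a placeholder for whichever column actually stores the visited site.

```python
# Sketch only (not from this issue): derive a re-crawl list as the set
# difference between the original seed list and the sites that already have
# at least one crawl_history record, regardless of bool_success.
# "site_url" is a placeholder column name; adjust to the real schema.
import pandas as pd

seed_sites = {line.strip() for line in open("seed_list.txt") if line.strip()}
history = pd.read_parquet("crawl_history.parquet")
recorded_sites = set(history["site_url"].dropna())

missing = sorted(seed_sites - recorded_sites)
with open("recrawl_list.txt", "w") as f:
    f.write("\n".join(missing))

print(f"{len(missing)} of {len(seed_sites)} seed sites have no crawl_history record")
```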

@englehardt (Collaborator):

This was partially addressed in a number of previous PRs by improving the stability of the main processes, thus reducing the instances where data loss occurs. In a recent 5k-site crawl, 127 sites (~2.5%) were not present in the final dataset.

This test crawl had no main process crashes, so I suspect that all of the lost data was due to evicted pods. Pods can be evicted due to memory pressure, and I suspect this is somewhat unavoidable. Setting a higher minimum memory allocation per crawler pod seems like it could help, as this would lower the number of pods scheduled on each node and might help prevent the over-allocation that leads to memory pressure after a while. I filed openwpm/openwpm-crawler#29 for this.

I think the final piece of this puzzle will be to make the S3Aggregator batch size configurable, lower the default a bit, and distribute batches via the job queue. Let's say the batch size is 100: a worker checks out a batch, runs through all 100 sites, commits the data synchronously (#476), and then marks the batch as completed. If the worker dies or is evicted, another worker picks up the batch.
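
A minimal sketch of that checkout flow, assuming a Redis-backed job queue; the key names, `force_flush()`, and the reliable-queue pattern are illustrative, not the actual openwpm-crawler implementation.

```python
# Illustrative sketch of the proposed batching scheme (not real openwpm-crawler
# code): each queue entry is a JSON batch of ~100 sites. A worker atomically
# moves the batch to its own "processing" list, crawls every site, commits the
# data synchronously (#476), and only then removes the batch. If the worker
# dies or is evicted, a reaper can push stale processing entries back onto the
# main queue so another worker picks them up.
import json
import redis

BATCH_QUEUE = "crawl:batches"               # hypothetical key names
PROCESSING = "crawl:processing:worker-1"

r = redis.Redis(host="redis", port=6379)

def crawl_one_batch(manager, aggregator):
    raw = r.rpoplpush(BATCH_QUEUE, PROCESSING)  # atomic batch checkout
    if raw is None:
        return False                            # queue drained
    batch = json.loads(raw)

    for site in batch["sites"]:
        manager.get(site)                       # visit the site

    aggregator.force_flush()                    # hypothetical synchronous commit
    r.lrem(PROCESSING, 1, raw)                  # mark the batch completed
    return True
```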

@englehardt (Collaborator):

With the additions in openwpm/openwpm-crawler#30, we see a relatively low data loss rate. For a crawl of 100k sites, we failed to record any data for 131, or 0.13%, of the submitted sites. I'm updating the title here to reflect that there's still more work to do to prevent any data loss.

@englehardt englehardt changed the title Address high data loss rate in S3Aggregator Minimize data loss due to S3Aggregator record batching Aug 27, 2019
@vringar (Contributor) commented Nov 12, 2019

By registering a preStop hook, we should be able to flush the aggregator before the instance gets killed.

@nhnt11 (Contributor) commented Nov 12, 2019

We can use the preStop hook to run a script that sends a signal to the task manager, which then shuts everything down and flushes the data. Ideally, we should also log the success or failure of this process.
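
A rough sketch of the crawler-side half of this, assuming the preStop script simply forwards SIGTERM to the crawl process; treating `manager.close()` as the TaskManager shutdown/flush entry point is an assumption.

```python
# Illustrative sketch: handle the SIGTERM that Kubernetes (or a preStop hook
# script) sends, shut the TaskManager down, flush data, and log the outcome.
# The shutdown has to finish within the termination grace period (30s default).
import logging
import signal
import sys

def install_graceful_shutdown(manager):
    def handle_sigterm(signum, frame):
        logging.warning("SIGTERM received; shutting down and flushing data")
        try:
            manager.close()   # assumed TaskManager shutdown/flush entry point
            logging.info("graceful shutdown succeeded")
        except Exception:
            logging.exception("graceful shutdown failed")
        finally:
            sys.exit(0)

    signal.signal(signal.SIGTERM, handle_sigterm)
```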

@vringar (Contributor) commented Nov 13, 2019

Apparently we have a default grace period of 30s to clean up after ourselves, as per the Kubernetes docs.

@englehardt englehardt changed the title Minimize data loss due to S3Aggregator record batching Minimize data loss during unexpected shutdown Nov 14, 2019
@vringar (Contributor) commented Nov 14, 2019

MainThread: Inform everybody else to shut down.
WorkerThreads: Inform the extension of the shutdown, wait for the extension to report done, save the profile, then shut down.
Aggregator: Wait for the WorkerThreads (possibly in the main thread), flush all data to storage, and report which sites have an incomplete crawl.
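
A minimal sketch of that ordering using plain threading primitives; all of the browser and aggregator method names below are placeholders, not OpenWPM's real API.

```python
# Illustrative sketch of the shutdown ordering above; method names such as
# shutdown_extension() and report_incomplete_sites() are placeholders.
import threading

shutdown_requested = threading.Event()

def main_thread_shutdown(workers, aggregator):
    shutdown_requested.set()               # 1. inform everybody else
    for w in workers:
        w.join()                           # 2. wait for all WorkerThreads
    aggregator.flush()                     # 3. flush buffered data to storage
    aggregator.report_incomplete_sites()   # 4. report sites with incomplete crawls

def worker_loop(browser):
    while not shutdown_requested.is_set():
        browser.run_next_command()
    browser.shutdown_extension()           # inform the extension of the shutdown
    browser.wait_for_extension_done()      # wait for it to report done
    browser.save_profile()                 # save the profile, then exit
```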

@englehardt (Collaborator):

This was addressed in several recent PRs.
