Minimize data loss during unexpected shutdown #450
This was partially addressed in a bunch of previous PRs by improving the stability of the main processes, thus lowering the instances where data loss occurs. In a recent 5k-site crawl, 127 sites (~2.5%) were not present in the final dataset. This test crawl had no main process crashes, so I suspect that all of the lost data was due to evicted nodes. Nodes can be evicted due to memory pressure, and I suspect this is somewhat unavoidable. Setting a higher minimum memory allocation per node seems like it could help, as this would lower the number of nodes that spawn in a pod and might help prevent an over-allocation that leads to memory pressure after a while. I filed openwpm/openwpm-crawler#29 for this.

I think the final piece of this puzzle will be to make the S3Aggregator batch size configurable, lower the default a bit, and distribute batches in the job queue. Let's say the batch size is 100. A worker will check out that batch, run through all 100 sites, commit the data synchronously (#476), and then mark the batch as completed. If the worker dies or is evicted, another worker will pick up the batch.
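To make that concrete, here is a minimal sketch of the batch-checkout idea, assuming a Redis-backed job queue accessed via redis-py; the queue names, the JSON batch format, and the `visit_site`/`flush_aggregator` helpers are hypothetical placeholders, not the actual openwpm-crawler code.

```python
import json

import redis

BATCH_SIZE = 100  # hypothetical default; the proposal is to make this configurable

r = redis.Redis(host="redis", port=6379)


def process_one_batch() -> bool:
    """Check out one batch, crawl it, commit the data, then mark it done."""
    # Atomically move a batch from the pending queue onto a "processing" list,
    # so a batch held by an evicted worker can later be requeued.
    raw = r.rpoplpush("crawl:pending_batches", "crawl:processing_batches")
    if raw is None:
        return False  # no work left
    sites = json.loads(raw)  # a JSON list of up to BATCH_SIZE site URLs
    for site in sites:
        visit_site(site)   # hypothetical: run the crawl commands for one site
    flush_aggregator()     # hypothetical: commit the data synchronously (#476)
    # Only once the data is durable do we remove the batch from "processing",
    # i.e. mark it as completed.
    r.lrem("crawl:processing_batches", 1, raw)
    return True
```

A separate janitor process (or the workers themselves, on startup) would then periodically move stale entries from the processing list back onto the pending queue, which is what lets another worker pick up a batch abandoned by an evicted node.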
With the additions in openwpm/openwpm-crawler#30, we see a relatively low data loss rate. For a crawl of 100k sites, we failed to record any data for 131 (0.13%) of the submitted sites. I'm updating the title here to reflect that there's still more work to do to prevent any data loss.
By registering a preStop hook, we should be able to flush the aggregator before the instance gets killed.
We can use the preStop hook to run a script that sends a signal to the task manager, which then shuts everything down and flushes data. Ideally, we should also log the success or failure of this process.
Apparently we have a default grace period of 30s to clean up after ourselves, as per the Kubernetes docs.
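As a rough illustration (not the current OpenWPM shutdown path), the crawl process could translate that signal into a clean TaskManager shutdown within the grace period. The `manager.close()` call, the signal choice, and the logging below are assumptions about how the flush would be wired up:

```python
import logging
import signal
import sys


def install_sigterm_handler(manager):
    """Flush and shut down cleanly when Kubernetes asks the pod to stop."""

    def handle_sigterm(signum, frame):
        logging.info("Received SIGTERM, flushing data before the pod is killed")
        try:
            manager.close()  # assumed to stop the browsers and flush the aggregator
            logging.info("Clean shutdown completed")
        except Exception:
            logging.exception("Clean shutdown failed")
        sys.exit(0)

    # Kubernetes sends SIGTERM after the preStop hook runs; by default we then
    # have ~30s before SIGKILL, so the flush has to finish within that window.
    signal.signal(signal.SIGTERM, handle_sigterm)
```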
MainThread: Inform everybody else to shutdown |
This was addressed in several recent PRs |
In a recent pair of 108k crawls in GCP, it seems that the data for around 15-30% of crawled sites in each crawl attempt was dropped entirely.
I am filing this separately from #255, which addresses the percentage of command failures in relation to the total number of commands submitted. This issue does not relate to the value of bool_success in the crawl_history table, but to the existence of a record at all.
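For reference, a quick way to distinguish the two failure modes is to check which submitted sites have no crawl_history record at all, as opposed to a record with bool_success = 0. The sketch below assumes a merged SQLite export; the file names are hypothetical and the exact crawl_history columns may differ between OpenWPM versions.

```python
import sqlite3

# Hypothetical inputs: the submitted site list and the merged crawl database.
submitted = {line.strip() for line in open("site_list.txt") if line.strip()}

con = sqlite3.connect("crawl-data.sqlite")
# `arguments` is assumed to hold the URL passed to the GET command.
recorded = {row[0] for row in con.execute(
    "SELECT DISTINCT arguments FROM crawl_history WHERE command = 'GET'")}

missing = submitted - recorded
print(f"{len(missing)} of {len(submitted)} submitted sites have no record at all")
```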