Add a minimum request for memory and bump parallelism #30
I need to rebase on #27 and #28 once those are merged.
This PR makes two tweaks to the Kubernetes deployment config:
1. Adds a minimum memory request of 1G per pod. I scanned through some of the older crawls and saw that most pods were using somewhere between 500MB and 900MB of memory during a 5k site crawl. As mentioned in #29 (Increase the minimum allocated memory in gcp crawls), I would allocate ~1GB of memory for each browser in my past EC2 crawls. In fact, OpenWPM monitors whether the browser process exceeds 1.5GB and kills it if it does, so this is a somewhat low request. However, 1GB should be fine for stateless crawls, since the browser's memory usage only grows over time during stateful crawls (hence the 1.5GB limit).
2. Bumps the parallelism from 100 to 300. The cluster was significantly underutilized with parallelism set to 100 but the node pool manually scaled to 15 nodes (as the README recommends): most nodes had at most half of their resources in use, and the cluster would slowly auto-scale down (wasting a lot of money in the process). At 1GB per pod, each node runs 11 pods, so a parallelism of 300 nicely fills the cluster (300 / 11 ≈ 28 nodes out of the 30-node maximum). Both tweaks are sketched below.
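For concreteness, here's a minimal sketch of what these two tweaks look like in a Kubernetes Job spec. The metadata, container name, and image below are placeholders, not the repo's actual values, and the real manifest has more to it; only `spec.parallelism` and `resources.requests.memory` are the fields this PR touches:

```yaml
# Sketch only: names and image are placeholders, not the repo's actual values.
apiVersion: batch/v1
kind: Job
metadata:
  name: openwpm-crawl              # placeholder name
spec:
  parallelism: 300                 # bumped from 100
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: crawler            # placeholder name
          image: openwpm-crawler   # placeholder image
          resources:
            requests:
              memory: "1G"         # new: minimum memory request per pod
```

Note that this is a request rather than a limit, so it only affects scheduling and bin-packing onto nodes; the 1.5GB ceiling is still enforced by OpenWPM's own monitoring, as described above.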
Here's a screenshot of the resource usage of one node for a recent 100k site crawl:
The maximum allocatable CPU is 15.89 cores and memory is 12.43GB. As you can see, the node nearly maxes out both but keeps some headroom for spikes, which tells me the minimum CPU and memory requests are well balanced. If we lowered the memory request a bit we might be able to fit more pods per node, but then they'd start to CPU throttle (assuming the pods aren't already).
Overall the crawl was super stable, so it seems that bumping the memory minimums helped. I only saw a few evicted pods when manually scanning through the nodes. This notebook gives the full health summary. Some highlights:
So we actually see more visit ids than sites submitted! I guess that's better than losing data, but in reality it means some sites were recorded twice. It's not immediately clear to me how that can happen, since we add 30 seconds to the lease time on top of the total command time. Perhaps there was some lag starting (or restarting) the browser that caused a worker to hold a job past its lease, so the job was handed out again. I suggest we just add more padding to the lease (maybe 120 seconds or more).
We only see four non-neterror browser crashes total for 100k sites, which suggests that these memory improvements and the ones in #28 are working. The number of neterrors is high, but aside from `netTimeout`, it's not clear to me there's anything we can do about them.

Here we see that only 131 sites, or 0.13% of those submitted, are missing. These are likely lost in the data aggregator cache of the evicted pods. I think this is an acceptable error rate overall, but it could be brought to zero through the changes summarized in openwpm/OpenWPM#476.