Add a minimum request for memory and bump parallelism #30
I need to rebase on #27 and #28 once those are merged.
This PR makes two tweaks to the Kubernetes deployment config:
1. Adds a minimum memory request of 1G per pod. I scanned through some of the older crawls and saw that most pods were using somewhere between 500MB and 900MB of memory during a 5k site crawl. As mentioned in #29 (Increase the minimum allocated memory in gcp crawls), I would allocate ~1GB of memory for each browser in my past EC2 crawls. In fact, OpenWPM monitors whether the browser process exceeds 1.5GB and kills it if it does, so this is a somewhat low request. However, 1GB should be fine for stateless crawls, since the browser's memory usage only grows over time during stateful crawls (hence the 1.5GB limit).
2. Bumps the parallelism from 100 to 300. The cluster was significantly underutilized with parallelism set to 100 but the node pool manually scaled to 15 nodes (as the README recommends): most nodes had at most half of their resources in use, and the cluster would slowly auto-scale down (wasting a lot of money in the process). At 1GB per pod, each node runs 11 pods, so a parallelism of 300 nicely fills the cluster (300 / 11 ≈ 28 nodes out of the 30-node maximum). Both tweaks are sketched below.
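For concreteness, here's a minimal sketch of what these two tweaks look like in a Kubernetes Job spec. The metadata, container name, and image below are placeholders, not the repo's actual values, and the real manifest has more to it; only `spec.parallelism` and `resources.requests.memory` are the fields this PR touches:

```yaml
# Sketch only: names and image are placeholders, not the repo's actual values.
apiVersion: batch/v1
kind: Job
metadata:
  name: openwpm-crawl              # placeholder name
spec:
  parallelism: 300                 # bumped from 100
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: crawler            # placeholder name
          image: openwpm-crawler   # placeholder image
          resources:
            requests:
              memory: "1G"         # new: minimum memory request per pod
```

Note that this is a request rather than a limit, so it only affects scheduling and bin-packing onto nodes; the 1.5GB ceiling is still enforced by OpenWPM's own monitoring, as described above.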
Here's a screenshot of the resource usage of one node for a recent 100k site crawl:
The maximum allocatable CPU is 15.89 cores and memory is 12.43GB. As you can see, the node nearly maxes out both but keeps some headroom for spikes, which tells me the minimum CPU and memory requests are well balanced. If we lowered the memory request a bit we might be able to fit more pods per node, but then they'd start to CPU throttle (assuming the pods aren't already).
Overall the crawl was super stable, so it seems that bumping the memory minimums helped. I only saw a few evicted pods when manually scanning through the nodes. This notebook gives the full health summary. Some highlights:
So we actually see more visit ids than sites submitted! I guess that's better than losing data, but in reality it means some sites were recorded twice. It's not immediately clear to me how that can happen, since we add 30 seconds to the lease time on top of the total command time. Perhaps there was some lag starting (or restarting) the browser that caused a worker to hold a job past its lease, so the job was handed out again. I suggest we just add more padding to the lease (maybe 120 seconds or more).
We only see four non-neterror browser crashes total for 100k sites, which suggests that these memory improvements and the ones in #28 are working. The number of neterrors is high, but aside from `netTimeout`, it's not clear to me there's anything we can do about them.

Here we see that only 131 sites, or 0.13% of those submitted, are missing. These are likely lost in the data aggregator cache of the evicted pods. I think this is an acceptable error rate overall, but it could be brought to zero through the changes summarized in openwpm/OpenWPM#476.