Add a minimum request for memory and bump parallelism #30

Merged: 1 commit merged into master on Aug 27, 2019
Conversation

englehardt (Contributor)

I need to rebase on #27 and #28 once merged.

This PR makes two tweaks to the Kubernetes deployment config:

  1. Adds a minimum memory request of 1 GB per pod. I scanned through some of the older crawls and saw that most pods were using somewhere between 500 MB and 900 MB of memory during a 5k site crawl. As mentioned in #29 (Increase the minimum allocated memory in GCP crawls), I would allocate ~1 GB of memory for each browser in my past EC2 crawls. In fact, OpenWPM monitors whether the browser process exceeds 1.5 GB and kills it if it does, so this is a somewhat low request. However, 1 GB should be fine for stateless crawls, since the browser's memory usage only grows over time during stateful crawls (hence the 1.5 GB limit).

  2. Bumps the parallelism from 100 to 300. The cluster was significantly underutilized with parallelism set to 100 and the cluster manually scaled to 15 nodes (as the README recommends): most nodes had at most half of their resources in use, and the cluster would slowly auto-scale down, wasting a lot of money in the process. At 1 GB per pod, each node will run 11 pods, so a parallelism of 300 nicely fills the cluster (300 / 11 ≈ 28 nodes out of a 30-node max). A sketch of both changes in the Job spec follows this list.
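
For reference, here's roughly what these two changes look like in a Kubernetes Job manifest. This is a minimal sketch using the standard batch/v1 Job schema; the metadata name, container name, and image are placeholders for illustration, not the actual fields of this repo's deployment config:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: openwpm-crawl            # placeholder name
spec:
  parallelism: 300               # bumped from 100
  template:
    spec:
      containers:
        - name: crawler          # placeholder name
          image: example/crawler # placeholder image, not the repo's actual one
          resources:
            requests:
              memory: "1G"       # new minimum memory request per pod
      restartPolicy: OnFailure
```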

Here's a screenshot of the resource usage of one node for a recent 100k site crawl:
[screenshot: node CPU and memory usage, captured 2019-08-26]

The maximum allocatable CPU is 15.89 cores and allocatable memory is 12.43 GB. As you can see, the node nearly maxes out both but keeps some headroom for spikes. This tells me we have the right balance between minimum CPU and memory. If we lowered the memory request a bit we might be able to run more pods per node, but then they'd start to be CPU throttled (assuming the pods aren't already).

Overall the crawl was super stable, so it seems that bumping the memory minimums helped. I only saw a few evicted pods when manually scanning through the nodes. This notebook gives the full health summary. Some highlights:

== CRAWL HISTORY ==
Total number of commands submitted:
+-------+------+
|command| count|
+-------+------+
|    GET|100447|
+-------+------+

Percentage of command failures 0.00%
Percentage of neterrors 6.96%
Percentage of command timeouts 1.44%

Summary of neterrors:
+--------------------+-----+
|               error|count|
+--------------------+-----+
|   connectionFailure|  235|
|        redirectLoop|   35|
|unknownProtocolFound|    1|
|          netTimeout| 2874|
|corruptedContentE...|    3|
|            netReset|  102|
|contentEncodingError|    3|
|        fileNotFound|    2|
|         nssFailure2|   40|
|         dnsNotFound| 3698|
+--------------------+-----+


Summary of exceptions:
+--------------------+-----+
|               error|count|
+--------------------+-----+
|NoSuchWindowExcep...|    1|
|InvalidSessionIdE...|    4|
+--------------------+-----+

So we actually see more visit ids than sites submitted! I guess that's better than losing data, but in reality it means some sites were recorded twice. It's not immediately clear to me how that can happen, since we add 30 seconds of padding to the lease time on top of the total command time. Perhaps there was some lag starting (or restarting) the browser that kept a worker on a job longer than its lease, so the job was handed out again and the site was visited a second time. I suggest we just add more padding to the lease (maybe 120 seconds or more).

We only see four non-neterror browser crashes total for 100k sites. That means these memory improvements and the ones in #28 are working. The number of neterrors is high, but aside from netTimeout, it's not clear to me there's anything we can do about them.

== SITE VISITS ==
Total number of distinct site urls in table:
+------------------+
|distinct_site_urls|
+------------------+
|             99869|
+------------------+

Here we see that only 131 of the submitted sites (0.13%) are missing. These were likely lost in the data aggregator cache of the evicted pods. I think this is an acceptable error rate overall, but it could be brought to zero through the changes summarized in openwpm/OpenWPM#476.

englehardt requested a review from motin on August 27, 2019 at 05:32.
motin (Contributor) left a comment:

Love it :)
The findings here should probably be copied to, or linked from, openwpm/OpenWPM#255.

englehardt (Contributor, Author)

Now rebased. Merging to master.
