[SPARK-4505][Core] Add a ClassTag parameter to CompactBuffer[T] #3378

zsxwing · 2014-11-20T05:57:54Z

Added a ClassTag parameter to CompactBuffer so that CompactBuffer[T] can create primitive arrays for primitive types. This significantly reduces memory usage for primitive types at the cost of only a minor performance loss.

Here is my test code:

  // Call org.apache.spark.util.SizeEstimator.estimate
  def estimateSize(obj: AnyRef): Long = {
    val c = Class.forName("org.apache.spark.util.SizeEstimator$")
    val f = c.getField("MODULE$")
    val o = f.get(c)
    val m = c.getMethod("estimate", classOf[Object])
    m.setAccessible(true)
    m.invoke(o, obj).asInstanceOf[Long]
  }

  sc.parallelize(1 to 10000).groupBy(_ => 1).foreach {
    case (k, v) =>
      println(v.getClass() + " size: " + estimateSize(v))
  }

With the previous CompactBuffer, the output was:

class org.apache.spark.util.collection.CompactBuffer size: 313358

With the new CompactBuffer, the output was:

class org.apache.spark.util.collection.CompactBuffer size: 65712

In this case, the new CompactBuffer used only about 20% of the memory of the previous one (65712 vs. 313358 bytes). This is really helpful for groupByKey when the values are primitives.
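To illustrate why a ClassTag makes the difference, here is a minimal sketch rather than the actual CompactBuffer code. `BoxedStore` and `TaggedStore` are hypothetical names introduced only for this example. Without a ClassTag the element type is erased and the backing array has to hold object references, while a ClassTag lets `new Array[T]` allocate a primitive array.

```scala
import scala.reflect.ClassTag

object ClassTagArrays {
  // Hypothetical store without a ClassTag: T is erased, so the backing
  // array holds references and every Int is kept as a boxed Integer.
  final class BoxedStore[T](capacity: Int) {
    private val data = new Array[AnyRef](capacity)
    def update(i: Int, v: T): Unit = { data(i) = v.asInstanceOf[AnyRef] }
    def apply(i: Int): T = data(i).asInstanceOf[T]
    def backingClass: Class[_] = data.getClass
  }

  // Hypothetical store with a ClassTag: new Array[T] can allocate a
  // primitive array (int[] for T = Int) instead of an Object[].
  final class TaggedStore[T: ClassTag](capacity: Int) {
    private val data = new Array[T](capacity)
    def update(i: Int, v: T): Unit = { data(i) = v }
    def apply(i: Int): T = data(i)
    def backingClass: Class[_] = data.getClass
  }
}
```

Comparing `new ClassTagArrays.TaggedStore[Int](4).backingClass` with `new ClassTagArrays.BoxedStore[Int](4).backingClass` shows the two different array classes.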


SparkQA · 2014-11-20T06:02:47Z

Test build #23661 has started for PR 3378 at commit 4abdbba.

This patch merges cleanly.

sryza · 2014-11-20T06:11:05Z

This seems like probably a great idea. Do you know what the overhead of including a classtag is? Does it mean an extra pointer per object?

zsxwing · 2014-11-20T06:27:11Z

> Does it mean an extra pointer per object?

~~No. E.g., ClassTag.Int will be shared by all CompactBuffer[Int]. The same approach has already been used in RDD.~~

Sorry. Yes, each CompactBuffer will have one extra pointer for the ClassTag.
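The two statements above can both hold, as this small sketch suggests: the tag object for Int is shared, but each buffer still stores its own reference to it. `TaggedBuffer` is a hypothetical stand-in, not the real CompactBuffer.

```scala
import scala.reflect.ClassTag

object SharedTag {
  // Hypothetical stand-in for a buffer that keeps its ClassTag in a field.
  // The field costs one reference per buffer instance.
  final class TaggedBuffer[T](implicit val tag: ClassTag[T])

  val a = new TaggedBuffer[Int]
  val b = new TaggedBuffer[Int]

  // Both fields point at the same shared tag object for Int.
  val sameInstance: Boolean = a.tag eq b.tag
}
```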

zsxwing · 2014-11-20T07:12:20Z

It's weird. I just found both the sizes of old and new CompactBuffer(1) are 56. I cannot explain why.

Then I added a field to the old CompactBuffer like this:

class CompactBuffer[T] extends Seq[T] with Serializable {
  val dummy: AnyRef = null

  // First two elements
  private var element0: T = _
  private var element1: T = _

  // ... rest of the class unchanged
}

println(estimateSize(CompactBuffer[Int](1))) also outputs 56.

aarondav · 2014-11-20T07:18:15Z

This does seem like a good change, though I'll note that I think groupBy is the only current user of this API that is able to have a primitive ClassTag. Still worthwhile, especially for future usage. I do wonder if it could have a runtime impact due to increased primitive wrapping, possibly creating a lot of short-lived garbage if it were iterated over many times.

SparkQA · 2014-11-20T07:30:03Z

Test build #23661 has finished for PR 3378 at commit 4abdbba.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-20T07:30:06Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23661/
Test PASSed.

zsxwing · 2014-11-20T07:31:18Z

> It's weird. I just found both the sizes of old and new CompactBuffer(1) are 56.

Found the cause. My JVM has UseCompressedOops enabled. In that case, because of object alignment, the sizes are the same.
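The alignment effect can be sketched with a little arithmetic. The constants below are assumptions (a typical 64-bit HotSpot layout with compressed oops: 12-byte object header, 4-byte references, 8-byte alignment), and the field counts are hypothetical. The sketch is not meant to reproduce the reported 56 bytes, which is SizeEstimator's total including referenced objects.

```scala
object AlignedSize {
  // Assumed layout constants for a 64-bit HotSpot JVM with compressed oops.
  val HeaderBytes = 12L
  val RefBytes = 4L
  val Alignment = 8L

  // Round a size up to the next multiple of the alignment.
  def alignUp(size: Long): Long =
    (size + Alignment - 1) / Alignment * Alignment

  // Shallow size of an object that has only 4-byte fields.
  def shallowSize(fourByteFields: Int): Long =
    alignUp(HeaderBytes + fourByteFields * RefBytes)
}
```

Under these assumptions, an object with four 4-byte fields needs 28 bytes, which is padded to 32, and adding a fifth field fills that padding exactly, so `AlignedSize.shallowSize(4)` and `AlignedSize.shallowSize(5)` are equal.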

JoshRosen · 2014-11-20T07:35:15Z

Ping @rxin, since this seems like the sort of optimization that you'd be interested in.

zsxwing · 2014-11-20T07:48:45Z

My motivation is that we encountered a skewed data set in which one hot key has so many values that they could not fit into memory. Spilling does not help in this case, since groupBy puts all values of a key into a single CompactBuffer. After this optimization, my job could at least run within the same memory limit.

rxin · 2014-11-20T08:25:49Z

We should definitely add a ClassTag since this can be used for primitive types. However, there might be places where we create a lot of CompactBuffers. I haven't had a chance to look at where CompactBuffers are used yet, but for those places, would it be possible to create a single ClassTag reference?

zsxwing · 2014-11-20T08:33:12Z

cogroup also uses CompactBuffer. However, it cannot supply a ClassTag because of the signature of CoGroupedRDD:

class CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: Partitioner)
  extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil)

Here rdds is a Seq[RDD[_ <: Product2[K, _]]], so the actual value types of the RDDs are not available at compile time.
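A minimal sketch of the limitation, using plain collections instead of RDDs: `newBuffer` and `buffersFor` are hypothetical helpers for illustration only. When the element type is a wildcard, there is no concrete type to derive a ClassTag from, so the code can only fall back to a reference array.

```scala
import scala.reflect.ClassTag

object ErasedInputs {
  // With a concrete T, the compiler supplies the ClassTag.
  def newBuffer[T: ClassTag](n: Int): Array[T] = new Array[T](n)

  // T = Int is known here, so this is a primitive int[] array.
  val ints: Array[Int] = newBuffer[Int](4)

  // The element type of each input is a wildcard, so no specific
  // ClassTag is available; fall back to Any (an Object[] at runtime).
  def buffersFor(inputs: Seq[Seq[_]]): Seq[Array[Any]] =
    inputs.map(in => newBuffer[Any](in.size))
}
```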

zsxwing · 2014-11-25T02:58:34Z

@rxin Is it OK to merge?

pwendell · 2014-11-30T01:22:15Z

I don't understand the architecture here as well as @rxin but this change seems like a strict improvement in its current form, so I'm gonna pull it in. LGTM.

Add a ClassTag parameter to reduce the memory usage of CompactBuffer[…

4abdbba

…T] when T is a primitive type

asfgit closed this in c062224 Nov 30, 2014

zsxwing deleted the SPARK-4505 branch November 30, 2014 08:34
