
[SPARK-3886] [PySpark] use AutoBatchedSerializer by default #2740

Closed
wants to merge 2 commits into master from davies/batchsize

Conversation

@davies (Contributor) commented Oct 9, 2014

Use AutoBatchedSerializer by default, which chooses a proper batch size based on the size of the serialized objects, so that each serialized batch falls into the range [64 KB, 640 KB].

In the JVM, the serializer also tracks the objects within a batch in order to detect duplicates, so a larger batch may cause an OOM in the JVM.
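For reference, the adaptive sizing works roughly like the sketch below: a simplified, self-contained rendering of the doubling/halving heuristic in PySpark's AutoBatchedSerializer.dump_stream. The pickle serializer and the length-prefix framing here are illustrative assumptions, not the exact Spark wire format.

import itertools
import pickle
import struct

def dump_auto_batched(iterator, stream, best_size=1 << 16):
    # Start with one object per batch, then grow or shrink the batch
    # count so each serialized batch lands between best_size (64 KB by
    # default) and 10 * best_size (640 KB).
    batch = 1
    it = iter(iterator)
    while True:
        objs = list(itertools.islice(it, batch))
        if not objs:
            break
        data = pickle.dumps(objs, 2)
        stream.write(struct.pack(">I", len(data)))  # length-prefixed frame
        stream.write(data)
        if len(data) < best_size:
            batch *= 2                # batches too small: pack more objects
        elif len(data) > best_size * 10 and batch > 1:
            batch //= 2               # batches too large: back off

The key property is that the batch size adapts to the data: many small objects get packed densely, while a stream of large objects (like the experiment later in this thread) degrades gracefully to small batches instead of buffering unboundedly.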

@SparkQA commented Oct 9, 2014

QA tests have started for PR 2740 at commit 185f2b9.

  • This patch merges cleanly.

@SparkQA commented Oct 10, 2014

QA tests have finished for PR 2740 at commit 185f2b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21556/

The diff under review (docstring for the batchSize parameter):

-        Java object. Set 1 to disable batching or -1 to use an
-        unlimited batch size.
+        Java object. Set 1 to disable batching, or 0 to choose batch size
+        based on size of objects automaticly, or -1 to use an unlimited
+        batch size.
Review comment (Contributor):

Spelling: automatically. How about "Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size"?
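For context, this docstring belongs to the batchSize argument of the SparkContext constructor. A quick usage sketch of the three modes follows; the master and app name here are placeholders:

from pyspark import SparkContext

# batchSize=0: choose the batch size automatically (the new default)
sc = SparkContext("local[2]", "batch-demo", batchSize=0)

# Other modes, per the docstring above:
#   batchSize=1   -> disable batching (one object per batch)
#   batchSize=-1  -> a single unlimited batch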

@JoshRosen (Contributor)

Aside from a minor doc typo, this looks good to me, especially since AutoBatchedSerializer already exists and has been tested.

@SparkQA commented Oct 10, 2014

QA tests have started for PR 2740 at commit 52cdb88.

  • This patch merges cleanly.

@JoshRosen (Contributor)

I tried a small experiment to test this out:

import os
from pyspark import SparkContext, SparkConf

conf = SparkConf().set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)

mb = 1000000

def inflateDataSize(x):
    # Turn each small input element into a 1 MB blob of random bytes.
    return bytearray(os.urandom(1 * mb))

sc.parallelize(range(1000), 10).map(inflateDataSize).cache().count()

Prior to this patch, the Python worker's memory consumption would steadily grow as it attempted to batch together ~100 MB of data per task (1,000 elements over 10 partitions is 100 elements, i.e. roughly 100 MB, per task), whereas now memory usage remains constant because the large objects cause smaller batches to be emitted more often.

Thanks for updating the docs. This looks good to me, so I'm going to merge it into master.

@asfgit closed this in 72f36ee on Oct 10, 2014
@SparkQA commented Oct 10, 2014

QA tests have finished for PR 2740 at commit 52cdb88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21601/

aarondav pushed a commit to aarondav/spark that referenced this pull request Oct 17, 2014

Author: Davies Liu <[email protected]>

Closes apache#2740 from davies/batchsize and squashes the following commits:

52cdb88 [Davies Liu] update docs
185f2b9 [Davies Liu] use AutoBatchedSerializer by default