
[SPARK-20801] Record accurate size of blocks in MapStatus when it's above threshold. #18031

Closed
wants to merge 3 commits

Conversation

jinxing64

What changes were proposed in this pull request?

Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus is used to store block sizes. In HighlyCompressedMapStatus, only the average size is stored for non-empty blocks, which is not good for memory control when we shuffle blocks. It makes sense to store the accurate size of a block when it is above a threshold.

How was this patch tested?

Added test in MapStatusSuite.

@jinxing64 (Author) commented May 18, 2017

To resolve the comments in #16989:

Minimum size before we consider something a large block: if the average is 10KB and some blocks are > 20KB, spilling them to disk would be highly suboptimal.

One edge case to consider is the situation where every shuffle block is just over this threshold: in this case HighlyCompressedMapStatus won't really be doing any compression.

I propose two configs: spark.shuffle.accurateBlockThreshold and spark.shuffle.accurateBlockThresholdByTimesAverage; sizes of blocks above both thresholds will be recorded accurately (see the sketch below). This way we can avoid making HighlyCompressedMapStatus too large when:

  1. All blocks are small, but a great percentage of them are > 2*avg;
  2. All blocks are of almost the same size, but they are all big.

Another idea is to keep only spark.shuffle.accurateBlockThreshold and set the default value to 2*avg. I'm not sure if this is preferred.
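
A minimal sketch of how the two proposed configs could be combined into a single cut-off; the helper name and the default values shown here are assumptions for illustration, not the patch's actual code:

import org.apache.spark.SparkConf

// Hypothetical helper: decide whether a block's size should be recorded
// exactly instead of being folded into the average. A block is treated as
// "huge" only if it clears BOTH thresholds, which keeps the map status small
// in the two corner cases listed above.
def recordAccurately(blockSize: Long, avgSize: Long, conf: SparkConf): Boolean = {
  val absoluteThreshold =
    conf.getSizeAsBytes("spark.shuffle.accurateBlockThreshold", "100m")
  val timesAverage =
    conf.getDouble("spark.shuffle.accurateBlockThresholdByTimesAverage", 2.0)
  blockSize > absoluteThreshold && blockSize > (avgSize * timesAverage).toLong
}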

@jinxing64 (Author)

I'm trying to give the user a way to control memory strictly so that no blocks are underestimated (by setting spark.shuffle.accurateBlockThreshold=0 and spark.shuffle.accurateBlockThresholdByTimesAverage=1). I'm a little bit hesitant to remove the huge blocks from the numerator in the calculation of the average size.


}
i += 1
}
Author (jinxing64):

Sorry for bringing in another while loop here. I have to calculate the average size first and only then filter out the huge blocks; I don't have a better implementation that merges the two while loops into one :(

@SparkQA commented May 18, 2017

Test build #77056 has finished for PR 18031 at commit d5b8a21.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64 (Author)

Jenkins, retest this please

@jinxing64 (Author)

Gentle ping to @JoshRosen @cloud-fan @mridulm

@SparkQA commented May 18, 2017

Test build #77058 has finished for PR 18031 at commit d5b8a21.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

jinxing64 force-pushed the SPARK-20801 branch 2 times, most recently from f6670d8 to 970421b on May 18, 2017 23:53
@SparkQA commented May 18, 2017

Test build #77069 has finished for PR 18031 at commit f6670d8.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 19, 2017

Test build #77070 has finished for PR 18031 at commit 970421b.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

* plus a bitmap for tracking which blocks are empty.
* A [[MapStatus]] implementation that stores the accurate size of huge blocks, which are larger
* than both [[config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD]] and
* [[config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE]] * averageSize. It stores the
@HyukjinKwon (Member) commented May 19, 2017

It looks like documentation generation for Javadoc 8 is failing due to these links:

[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/target/java/org/apache/spark/scheduler/HighlyCompressedMapStatus.java:4: error: reference not found
[error]  * than both {@link config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD} and
[error]                     ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/target/java/org/apache/spark/scheduler/HighlyCompressedMapStatus.java:5: error: reference not found
[error]  * {@link config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE} * averageSize. It stores the
[error]           ^

Probably we should wrap these in `...` as I did before in #16013, or find a way to make the links resolve properly.

The other errors seem spurious. Please refer to my observation in #17389 (comment).

(I think we should fix this, or at least document it somewhere.)

@jinxing64 (Author)

@HyukjinKwon Thank you so much! Really helpful 👍

@jinxing64 jinxing64 changed the title Record accurate size of blocks in MapStatus when it's above threshold. [SPARK-20801] Record accurate size of blocks in MapStatus when it's above threshold. May 19, 2017
@SparkQA commented May 19, 2017

Test build #77072 has finished for PR 18031 at commit bfea9f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val threshold2 = avgSize * Option(SparkEnv.get)
.map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
.getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
val threshold = math.max(threshold1, threshold2)
@wzhfy (Contributor) commented May 19, 2017

Just out of curiosity: is there any reason we compute the threshold this way? Is it an empirical threshold?

@wzhfy (Contributor) commented May 19, 2017

Suppose each map task produces a 90MB bucket and many small buckets (skewed data); then avgSize can be very small, and the threshold would be 100MB, because 100MB (threshold1) > 2 * avgSize (threshold2). If the number of map tasks is large (several hundreds or more), OOM can still happen, right?
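
A back-of-the-envelope illustration of this scenario; the concrete numbers (500 map tasks, ~10KB average) are assumptions for the sake of the arithmetic:

// Illustrative numbers only.
val numMapTasks = 500
val hugeBucket = 90L * 1024 * 1024               // one 90MB bucket per map task
val avgSize = 10L * 1024                         // average dominated by the tiny buckets
val threshold1 = 100L * 1024 * 1024              // spark.shuffle.accurateBlockThreshold (default 100MB)
val threshold2 = 2 * avgSize                     // accurateBlockThresholdByTimesAverage = 2
val threshold = math.max(threshold1, threshold2) // = 100MB, so the 90MB bucket is NOT marked huge

// The 90MB bucket is reported as avgSize (~10KB). A reducer fetching one such
// bucket from every map task believes it needs ~5MB...
val believedBytes = numMapTasks * avgSize        // ≈ 5MB
// ...while the real footprint is ~45,000MB (~44 GiB), so the OOM can still happen.
val actualBytes = numMapTasks * hugeBucket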

Author (jinxing64):

@wzhfy Thanks for taking the time to review this :)
This PR is based on the discussion in #16989. The idea is to avoid underestimating big blocks in HighlyCompressedMapStatus while controlling the size of HighlyCompressedMapStatus at the same time.

Author (jinxing64):

Yes, the case you mentioned above is a really good one. But setting spark.shuffle.accurateBlockThreshold means we accept sacrificing accuracy for blocks smaller than spark.shuffle.accurateBlockThreshold. If we want it to be more accurate, we can set it to a smaller value (in this case 50M), so that the size of the big bucket will be recorded accurately.

Contributor:

Yes, but my point is that these two configs are difficult for users to set. It seems we still need to adjust them case by case.

@wzhfy (Contributor) commented May 19, 2017

But I agree that with this PR, at least we have a workaround for the OOM problem.

Author (jinxing64):

Yes, this is to avoid the OOM. To adjust the value of this config, the user needs to be sophisticated. I agree that these two configs are difficult, but with the default settings we can really avoid some OOM situations (e.g. a super-huge block when skew happens).

emptyBlocks.trim()
emptyBlocks.runOptimize()
new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
Member (viirya):

It seems to me that this https://github.com/apache/spark/pull/16989/files#r117174623 is a good comment: have accurate sizes for the smaller blocks.

Contributor:

+1

Author (jinxing64):

@viirya Thanks a lot for taking the time to look into this PR :)

"remove the huge blocks from the numerator in that calculation so that you more accurately size the smaller blocks"

Yes, I think it is a really good idea to have accurate sizes for the smaller blocks. But since I'm proposing two configs (spark.shuffle.accurateBlockThreshold and spark.shuffle.accurateBlockThresholdByTimesAverage) in the current change, I have to compute the average twice: 1) the average calculated including huge blocks, so that I can filter out the huge blocks; 2) the average calculated without huge blocks, so that I can have accurate sizes for the smaller blocks (sketched below). A little bit complicated, right? How about removing spark.shuffle.accurateBlockThresholdByTimesAverage? That would simplify the logic. @cloud-fan Any ideas about this?
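
A rough sketch of the two-pass computation described here (simplified; the real code would also track empty blocks and the 1-byte size compression):

// With both configs, the average must be computed twice: once over all
// non-empty blocks just to evaluate the "timesAverage" threshold, and again
// without the huge blocks so the smaller ones are sized more accurately.
def twoPassAverage(sizes: Array[Long], absThreshold: Long, timesAvg: Double): Long = {
  val nonEmpty = sizes.filter(_ > 0)
  val avgAll = if (nonEmpty.nonEmpty) nonEmpty.sum / nonEmpty.length else 0L  // pass 1
  val threshold = math.max(absThreshold, (avgAll * timesAvg).toLong)
  val small = nonEmpty.filter(_ <= threshold)                                 // pass 2
  if (small.nonEmpty) small.sum / small.length else 0L
}

With only the absolute threshold, the cut-off is known before looking at any sizes, so a single pass can both collect the huge blocks and average the rest.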

Member:

In the current change, if almost all blocks are huge (that is, it is not a skew case), we won't mark the blocks as huge ones. Then we will still fetch them into memory?

Author (jinxing64):

With the default values (spark.shuffle.accurateBlockThreshold=100M and spark.shuffle.accurateBlockThresholdByTimesAverage=2), yes. But the user can make it stricter by setting spark.shuffle.accurateBlockThreshold=0 and spark.shuffle.accurateBlockThresholdByTimesAverage=1.

Member:
I'd tend to have just one flag and simplify the configuration.

Contributor:
+1 for one flag, let's only keep spark.shuffle.accurateBlockThreshold

*/
private[spark] class HighlyCompressedMapStatus private (
private[this] var loc: BlockManagerId,
private[this] var numNonEmptyBlocks: Int,
private[this] var emptyBlocks: RoaringBitmap,
private[this] var avgSize: Long)
private[this] var avgSize: Long,
@transient private var hugeBlockSizes: Map[Int, Byte])
Member (viirya):

I went through part of the code in #16989. It seems to me that if what we want is to know which shuffle requests should go to disk instead of memory, do we really need to record the mapping from block ids to accurate sizes?

A simpler approach could be adding a bitmap for huge blocks, and we could simply fetch those blocks to disk. Another benefit of doing this is that we avoid introducing another config, REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM, to decide which blocks go to disk.
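
A minimal sketch of this alternative (not what the PR ended up doing): huge blocks would only be flagged in a RoaringBitmap, and the reducer would route any flagged block to disk without knowing its exact size. The class and method names are made up for illustration:

import org.roaringbitmap.RoaringBitmap

// Alternative design: a bitmap of huge-block ids instead of a map of sizes.
class HugeBlockFlags(sizes: Array[Long], threshold: Long) {
  private val hugeBlocks = new RoaringBitmap()
  sizes.indices.foreach { i =>
    if (sizes(i) > threshold) hugeBlocks.add(i)
  }

  // The reducer only needs a yes/no answer to decide memory vs. disk.
  def isHuge(reduceId: Int): Boolean = hugeBlocks.contains(reduceId)
}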

Author (jinxing64):

Yes, I think it makes sense to add a bitmap for huge blocks, but I'm a little bit hesitant; I still prefer to keep hugeBlockSizes more independent from the upper-level logic. In addition, the accurate sizes of blocks can also have a positive effect on pending requests (e.g. spark.reducer.maxSizeInFlight can control the size of pending requests better).

Member (viirya):

The control of spark.reducer.maxSizeInFlight is not a big problem. It seems to me that any block considered huge should break maxSizeInFlight and can't be fetched in parallel. We don't actually need to know the accurate size of huge blocks; we just need to know that a block is huge and that it is larger than maxSizeInFlight.

Contributor:

@viirya We had this discussion before in the earlier PR (which this one is split from).
maxSizeInFlight is meant to control how much data can be fetched in parallel, and it is tuned based on network throughput, not memory (though currently they are directly dependent due to an implementation detail).
In reality it is fairly small compared to what can be held in memory (48MB is the default, IIRC). Since the memory and IO subsystems have different characteristics, using the same config to control behavior in both will lead to suboptimal behavior (for example, on large-memory systems where large amounts can be held in memory but network bandwidth is not proportionally higher).

jinxing64 force-pushed the SPARK-20801 branch 2 times, most recently from 94fa7bb to a313744 on May 22, 2017 03:04
@SparkQA commented May 22, 2017

Test build #77165 has finished for PR 18031 at commit 94fa7bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 22, 2017

Test build #77166 has finished for PR 18031 at commit a313744.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Remove the huge blocks from the calculation for average size and have accurate size for
// smaller blocks.
if (size > threshold) {
totalSize += size
Contributor:

This should be put in the else branch.
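
For clarity, a sketch of the requested fix as a fragment of the loop body above (hugeBlockSizesArray and threshold come from the surrounding snippet, and MapStatus.compressSize is the existing 1-byte size encoding):

// Only non-huge blocks should feed totalSize, so the average is no longer
// inflated by blocks whose sizes are recorded exactly.
if (size > threshold) {
  hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(size))
} else {
  totalSize += size
}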

val threshold = Option(SparkEnv.get)
.map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
.getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
while (i < totalNumBlocks) {
var size = uncompressedSizes(i)
Contributor:

Not related, but why is this a var?

@jinxing64 jinxing64 changed the title [SPARK-20801] Record accurate size of blocks in MapStatus when it's above threshold. [WIP][SPARK-20801] Record accurate size of blocks in MapStatus when it's above threshold. May 22, 2017
@SparkQA commented May 22, 2017

Test build #77169 has started for PR 18031 at commit 46fba08.

@SparkQA commented May 22, 2017

Test build #77171 has finished for PR 18031 at commit ca65544.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 22, 2017

Test build #77182 has finished for PR 18031 at commit 66aa56f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64 jinxing64 changed the title [WIP][SPARK-20801] Record accurate size of blocks in MapStatus when it's above threshold. [SPARK-20801] Record accurate size of blocks in MapStatus when it's above threshold. May 22, 2017
@jinxing64 (Author)
In the current change:

  1. There is only one config: spark.shuffle.accurateBlockThreshold.
  2. I remove the huge blocks from the numerator in the calculation of the average size (a simplified sketch of the resulting size lookup follows below).
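
A simplified sketch of what the size lookup looks like with the hugeBlockSizes map in place; it mirrors the shape of HighlyCompressedMapStatus.getSizeForBlock, but the class and the decompress parameter here are stand-ins, not the actual implementation:

import org.roaringbitmap.RoaringBitmap

// Empty blocks report 0, huge blocks report their recorded (1-byte-compressed)
// size, and everything else falls back to the average.
class SizeLookup(
    emptyBlocks: RoaringBitmap,
    avgSize: Long,
    hugeBlockSizes: Map[Int, Byte],
    decompress: Byte => Long) {

  def getSizeForBlock(reduceId: Int): Long = {
    if (emptyBlocks.contains(reduceId)) {
      0L
    } else {
      hugeBlockSizes.get(reduceId) match {
        case Some(compressed) => decompress(compressed)  // accurate size for huge blocks
        case None => avgSize                             // average for the rest
      }
    }
  }
}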

asfgit pushed a commit that referenced this pull request May 22, 2017
[SPARK-20801] Record accurate size of blocks in MapStatus when it's above threshold.

Author: jinxing <[email protected]>

Closes #18031 from jinxing64/SPARK-20801.

(cherry picked from commit 2597674)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan (Contributor)
thanks, merging to master/2.2!

@cloud-fan (Contributor)
For other reviewers: this is kind of a stability fix, so I backported it to branch-2.2.

@jinxing64 (Author)

@cloud-fan Thanks for merging!
@mridulm @JoshRosen @viirya @HyukjinKwon @wzhfy Thanks a lot for taking the time to review this PR!

liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
[SPARK-20801] Record accurate size of blocks in MapStatus when it's above threshold.

Author: jinxing <[email protected]>

Closes apache#18031 from jinxing64/SPARK-20801.