[SPARK-49741][DOCS] Add spark.shuffle.accurateBlockSkewedFactor to config docs page

### What changes were proposed in this pull request?

`spark.shuffle.accurateBlockSkewedFactor` was added in Spark 3.3.0 in https://issues.apache.org/jira/browse/SPARK-36967. It is a useful shuffle configuration that prevents `HighlyCompressedMapStatus` from wrongly estimating shuffle block sizes when the block size distribution is skewed, which can cause shuffle reducers to fetch too much data and OOM. This PR adds the config to the Spark configuration docs page to make it discoverable.
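
For illustration, a minimal sketch of enabling the config on a `SparkSession` (the app name, the `local[*]` master, and the value `5.0` are assumptions made for this example, not part of the PR; the config doc recommends keeping the factor in sync with `spark.sql.adaptive.skewJoin.skewedPartitionFactor`):

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch, not part of this PR: record accurate sizes for skewed shuffle
// blocks in HighlyCompressedMapStatus so that reducers do not fetch unexpectedly
// large blocks into memory. The value 5.0 is only an illustrative choice, set to
// the same value as spark.sql.adaptive.skewJoin.skewedPartitionFactor, as the
// config doc recommends.
val spark = SparkSession.builder()
  .appName("accurate-block-skewed-factor-example") // hypothetical app name
  .master("local[*]")
  .config("spark.shuffle.accurateBlockSkewedFactor", "5.0")
  .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5.0")
  .getOrCreate()
```

The same settings can also be passed with `--conf` on `spark-submit` instead of being hard-coded in the application.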

### Why are the changes needed?

To make this useful config discoverable by users so that they can resolve shuffle fetch OOM issues themselves.

### Does this PR introduce _any_ user-facing change?

Yes, this is a documentation-only change. Before this PR, `spark.shuffle.accurateBlockSkewedFactor` was not listed in the `Shuffle Behavior` section of [the Configuration page](https://spark.apache.org/docs/latest/configuration.html); now it is.

### How was this patch tested?

In the IDE:
<img width="1633" alt="image" src="https://github.com/user-attachments/assets/616a94b9-2408-491c-a17b-c6dbdff14465">
Updated:
<img width="1274" alt="image" src="https://github.com/user-attachments/assets/ba170e9a-eba2-4fdf-85eb-a3aebefc055e">

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#48189 from timlee0119/add-accurate-block-skewed-factor-to-doc.

Authored-by: Tim Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
timlee0119 authored and HyukjinKwon committed Sep 22, 2024
1 parent 4f640e2 commit b642096
Showing 2 changed files with 13 additions and 1 deletion.
@@ -1386,7 +1386,6 @@ package object config {

   private[spark] val SHUFFLE_ACCURATE_BLOCK_SKEWED_FACTOR =
     ConfigBuilder("spark.shuffle.accurateBlockSkewedFactor")
-      .internal()
       .doc("A shuffle block is considered as skewed and will be accurately recorded in " +
         "HighlyCompressedMapStatus if its size is larger than this factor multiplying " +
         "the median shuffle block size or SHUFFLE_ACCURATE_BLOCK_THRESHOLD. It is " +
docs/configuration.md (13 additions, 0 deletions)
@@ -1232,6 +1232,19 @@ Apart from these, the following properties are also available, and may be useful
   </td>
   <td>2.2.1</td>
 </tr>
+<tr>
+  <td><code>spark.shuffle.accurateBlockSkewedFactor</code></td>
+  <td>-1.0</td>
+  <td>
+    A shuffle block is considered as skewed and will be accurately recorded in
+    <code>HighlyCompressedMapStatus</code> if its size is larger than this factor multiplying
+    the median shuffle block size or <code>spark.shuffle.accurateBlockThreshold</code>. It is
+    recommended to set this parameter to be the same as
+    <code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code>. Set to -1.0 to disable this
+    feature by default.
+  </td>
+  <td>3.3.0</td>
+</tr>
 <tr>
   <td><code>spark.shuffle.registration.timeout</code></td>
   <td>5000</td>
