Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-35239][SQL] Coalesce shuffle partition should handle empty input RDD #32362

Closed
wants to merge 1 commit into from

Conversation

ulysses-you
Copy link
Contributor

What changes were proposed in this pull request?

Create empty partition for custom shuffle reader if input RDD is empty.

Why are the changes needed?

If input RDD partition is empty then the map output statistics will be null. And if all shuffle stage's input RDD partition is empty, we will skip it and lose the chance to coalesce partition.

We can simply create a empty partition for these custom shuffle reader to reduce the partition number.

Does this PR introduce any user-facing change?

Yes, the shuffle partition might be changed in AQE.

How was this patch tested?

add new test.

@ulysses-you
Copy link
Contributor Author

@github-actions github-actions bot added the SQL label Apr 27, 2021
// `ShuffleQueryStageExec#mapStats` returns None when the input RDD has 0 partitions,
// we should skip it when calculating the `partitionStartIndices`.
// If all input RDDs have 0 partition, we create empty partition for every shuffle reader.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all input RDDs have 0 partition, the query is very fast and we don't need to optimize?

Copy link
Contributor Author

@ulysses-you ulysses-you Apr 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logically it is. But for Spark server like SparkThriftServer, we always use large shuffle partitions (e.g. 8192) and depend on aqe to coalesce it. If some users query on a empty table it will waste too many tasks. And an another issue is that driver can be busy (unnecessary task event) with the single point problem.

@SparkQA
Copy link

SparkQA commented Apr 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42514/

@SparkQA
Copy link

SparkQA commented Apr 27, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42514/

@SparkQA
Copy link

SparkQA commented Apr 27, 2021

Test build #137993 has finished for PR 32362 at commit 9e395ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137993/

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 4ff9f1f Apr 27, 2021
@ulysses-you ulysses-you deleted the SPARK-35239 branch April 28, 2021 01:12
@ulysses-you
Copy link
Contributor Author

thanks for merging !

domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
…ut RDD

### What changes were proposed in this pull request?

Create empty partition for custom shuffle reader if input RDD is empty.

### Why are the changes needed?

If input RDD partition is empty then the map output statistics will be null. And if all shuffle stage's input RDD partition is empty, we will skip it and lose the chance to coalesce partition.

We can simply create a empty partition for these custom shuffle reader to reduce the partition number.

### Does this PR introduce _any_ user-facing change?

Yes, the shuffle partition might be changed in AQE.

### How was this patch tested?

add new test.

Closes apache#32362 from ulysses-you/SPARK-35239.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants