
[SPARK-39915][SQL] Dataset.repartition(N) may not create N partitions Non-AQE part #37706

Closed
wants to merge 3 commits into from

Conversation

ulysses-you
Contributor

What changes were proposed in this pull request?

Skip optimizing the root user-specified repartition in `PropagateEmptyRelation`.

Why are the changes needed?

Spark should preserve the final user-specified repartition, since it determines the number of output partitions.

For example:

```scala
spark.sql("select * from values(1) where 1 < rand()").repartition(1)

// before:
== Optimized Logical Plan ==
LocalTableScan <empty>, [col1#0]

// after:
== Optimized Logical Plan ==
Repartition 1, true
+- LocalRelation <empty>, [col1#0]
```

Does this PR introduce any user-facing change?

yes, the empty plan may change

How was this patch tested?

add test
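The behavior the fix preserves can be sketched in plain Scala with a toy plan model (the types `Plan`, `Repartition`, `Filter`, and `LocalRelation` below are simplified stand-ins for illustration, not Spark's actual classes):

```scala
// Toy model of PropagateEmptyRelation with the proposed guard: an empty
// subtree collapses to an empty LocalRelation, but a root user-specified
// Repartition is kept so the final output partitioning survives.
sealed trait Plan { def isEmpty: Boolean }
case class LocalRelation(rows: Seq[Int]) extends Plan {
  def isEmpty: Boolean = rows.isEmpty
}
case class Filter(child: Plan) extends Plan {
  def isEmpty: Boolean = child.isEmpty // a filter over an empty child is empty
}
case class Repartition(numPartitions: Int, child: Plan) extends Plan {
  def isEmpty: Boolean = child.isEmpty
}

def propagateEmpty(root: Plan): Plan = root match {
  // Root user-specified repartition: keep the node, collapse only its child.
  case Repartition(n, child) if child.isEmpty =>
    Repartition(n, LocalRelation(Nil))
  // Anywhere else, an empty subtree becomes an empty relation.
  case p if p.isEmpty => LocalRelation(Nil)
  case p => p
}
```

Applied to a plan shaped like the example above, this keeps the `Repartition(1, ...)` node instead of collapsing the whole tree to an empty scan.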

@ulysses-you
Contributor Author

cc @cloud-fan

@github-actions github-actions bot added the SQL label Aug 29, 2022
```scala
 */
protected def addTagForRootRepartition(plan: LogicalPlan): LogicalPlan = {
  var isRootRepartition = true
  plan.transformDownWithPruning(_.containsPattern(repartitionTreePattern)) {
```
Contributor


I think it's better to use a manual traversal here. We can stop traversal once we hit a node that is not repartition/project/filter.
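A manual traversal along those lines might look like this simplified sketch (plain Scala with toy node types, not Spark's actual `LogicalPlan` API): descend from the root, tag each repartition, and stop at the first node that is not a repartition, project, or filter.

```scala
sealed trait Plan { def children: Seq[Plan] }
case class Scan() extends Plan { def children: Seq[Plan] = Nil }
case class Project(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
case class Filter(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
case class Aggregate(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
case class Repartition(child: Plan, var rootTag: Boolean = false) extends Plan {
  def children: Seq[Plan] = Seq(child)
}

// Tag repartitions reachable from the root through only
// repartition/project/filter nodes; stop the walk at anything else.
def tagRootRepartitions(plan: Plan): Unit = plan match {
  case r: Repartition =>
    r.rootTag = true
    tagRootRepartitions(r.child)
  case p @ (_: Project | _: Filter) =>
    p.children.foreach(tagRootRepartitions)
  case _ => () // e.g. an Aggregate: everything below keeps the default behavior
}
```

Unlike a blanket `transformDown`, the walk never descends past an unrelated operator, so repartitions buried under joins or aggregates are left for the normal optimization path.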

Contributor Author


done

```scala
@@ -1310,6 +1310,14 @@ class PlannerSuite extends SharedSparkSession with AdaptiveSparkPlanHelper {
    assert(topKs.size == 1)
    assert(sorts.isEmpty)
  }

  test("SPARK-39915: Dataset.repartition(N) may not create N partitions") {
```
Contributor


This is not a planner bug... We can probably add an end-to-end test in DataFrameSuite

@cloud-fan
Contributor

cloud-fan commented Aug 30, 2022

thanks, merging to master!

@cloud-fan cloud-fan closed this in ff7ab34 Aug 30, 2022
@cloud-fan
Contributor

@ulysses-you can you open a backport PR for 3.3? it has conflicts.

ulysses-you added a commit to ulysses-you/spark that referenced this pull request Aug 30, 2022
… Non-AQE part

Skip optimizing the root user-specified repartition in `PropagateEmptyRelation`.

Spark should preserve the final user-specified repartition, since it determines the number of output partitions.

For example:

```scala
spark.sql("select * from values(1) where 1 < rand()").repartition(1)

// before:
== Optimized Logical Plan ==
LocalTableScan <empty>, [col1#0]

// after:
== Optimized Logical Plan ==
Repartition 1, true
+- LocalRelation <empty>, [col1#0]
```

yes, the empty plan may change

add test

Closes apache#37706 from ulysses-you/empty.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@ulysses-you
Contributor Author

thank you @cloud-fan, created #37730

@ulysses-you ulysses-you deleted the empty branch August 30, 2022 14:07
```scala
 * Add a [[ROOT_REPARTITION]] tag for the root user-specified repartition so this rule can
 * skip optimize it.
 */
private def addTagForRootRepartition(plan: LogicalPlan): LogicalPlan = plan match {
```
Contributor


note: we can skip this earlier with something like `if (!plan.containsPattern(REPARTITION))`
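That early exit could be sketched like so (again with toy node types; Spark's real check uses `TreePattern` bits cached on each node, which this simplified recursive scan only approximates):

```scala
sealed trait Plan { def children: Seq[Plan] }
case class Scan() extends Plan { def children: Seq[Plan] = Nil }
case class Project(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
case class Repartition(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }

// Cheap pre-check: if no repartition exists anywhere in the tree,
// the tagging pass can be skipped entirely.
def containsRepartition(plan: Plan): Boolean = plan match {
  case _: Repartition => true
  case p => p.children.exists(containsRepartition)
}

def addTagForRootRepartition(plan: Plan): Plan =
  if (!containsRepartition(plan)) plan // fast path: nothing to tag
  else plan                            // (the tagging traversal would run here)
```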

dongjoon-hyun pushed a commit that referenced this pull request Sep 9, 2022
…tions Non-AQE part

### What changes were proposed in this pull request?

backport #37706 for branch-3.3

Skip optimizing the root user-specified repartition in `PropagateEmptyRelation`.

### Why are the changes needed?

Spark should preserve the final user-specified repartition, since it determines the number of output partitions.

For example:

```scala
spark.sql("select * from values(1) where 1 < rand()").repartition(1)

// before:
== Optimized Logical Plan ==
LocalTableScan <empty>, [col1#0]

// after:
== Optimized Logical Plan ==
Repartition 1, true
+- LocalRelation <empty>, [col1#0]
```

### Does this PR introduce _any_ user-facing change?

yes, the empty plan may change

### How was this patch tested?

add test

Closes #37730 from ulysses-you/empty-3.3.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Nov 4, 2023
### What changes were proposed in this pull request?
This pr upgrade Apache Arrow from 13.0.0 to 14.0.0.

### Why are the changes needed?
The Apache Arrow 14.0.0 release brings a number of enhancements and bug fixes.

In terms of bug fixes, the release addresses several critical issues that were causing failures in integration jobs with Spark([GH-36332](apache/arrow#36332)) and problems with importing empty data arrays([GH-37056](apache/arrow#37056)). It also optimizes the process of appending variable length vectors([GH-37829](apache/arrow#37829)) and includes C++ libraries for MacOS AARCH 64 in Java-Jars([GH-38076](apache/arrow#38076)).

The new features and improvements focus on enhancing the handling and manipulation of data. This includes the introduction of DefaultVectorComparators for large types([GH-25659](apache/arrow#25659)), support for extended expressions in ScannerBuilder([GH-34252](apache/arrow#34252)), and the exposure of the VectorAppender class([GH-37246](apache/arrow#37246)).

The release also brings enhancements to the development and testing process, with the CI environment now using JDK 21([GH-36994](apache/arrow#36994)). In addition, the release introduces vector validation consistent with C++, ensuring consistency across different languages([GH-37702](apache/arrow#37702)).

Furthermore, the usability of VarChar writers and binary writers has been improved with the addition of extra input methods([GH-37705](apache/arrow#37705)), and VarCharWriter now supports writing from `Text` and `String`([GH-37706](apache/arrow#37706)). The release also adds typed getters for StructVector, improving the ease of accessing data([GH-37863](apache/arrow#37863)).

The full release notes as follows:
- https://arrow.apache.org/release/14.0.0.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43650 from LuciferYang/arrow-14.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>