[SPARK-39915][SQL] Dataset.repartition(N) may not create N partitions Non-AQE part #37706

ulysses-you · 2022-08-29T10:35:07Z

What changes were proposed in this pull request?

Skip optimizing the root user-specified repartition in PropagateEmptyRelation.

Why are the changes needed?

Spark should preserve the final repartition, because it is user-specified and determines the final output partitioning.

For example:

spark.sql("select * from values(1) where 1 < rand()").repartition(1)

// before:
== Optimized Logical Plan ==
LocalTableScan <empty>, [col1#0]

// after:
== Optimized Logical Plan ==
Repartition 1, true
+- LocalRelation <empty>, [col1#0]
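
To check the effect on the actual output, the example above can be wrapped in a small program that prints both the optimized plan and the resulting partition count. This is only a sketch: it assumes a local SparkSession with AQE disabled (this PR covers the non-AQE part), and the object and application names are illustrative. The filter never matches because rand() returns a value in [0, 1).

```scala
import org.apache.spark.sql.SparkSession

// Illustrative reproduction; the object name and app name are assumptions.
object RepartitionOnEmptyPlan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("spark-39915-check")
      // AQE is disabled because this PR addresses the non-AQE code path.
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    try {
      // Same query as in the PR description: an always-empty result,
      // followed by a user-specified repartition at the root.
      val df = spark.sql("select * from values(1) where 1 < rand()").repartition(1)

      // Shows whether the root Repartition survived optimization.
      println(df.queryExecution.optimizedPlan)

      // Number of partitions the user would actually get.
      println(s"partitions = ${df.rdd.getNumPartitions}")
    } finally {
      spark.stop()
    }
  }
}
```

With the change in place, the printed plan should keep the Repartition node at the root, matching the "after" plan above.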

Does this PR introduce any user-facing change?

Yes. The optimized plan for an empty relation may change: a root user-specified repartition is now kept.

How was this patch tested?

Added a test.

ulysses-you · 2022-08-29T10:35:20Z

cc @cloud-fan

cloud-fan · 2022-08-29T15:59:22Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala

+   */
+  protected def addTagForRootRepartition(plan: LogicalPlan): LogicalPlan = {
+    var isRootRepartition = true
+    plan.transformDownWithPruning(_.containsPattern(repartitionTreePattern)) {


I think it's better to use a manual traversal here. We can stop traversal once we hit a node that is not repartition/project/filter.
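
The manual traversal suggested here can be sketched on a toy plan model. Everything below is a hypothetical stand-in (the `Plan` classes, the `isRoot` flag, and `RootRepartitionTagger` are not Spark's classes); it only illustrates the idea of descending through project/filter nodes, tagging the first repartition, and stopping at any other node.

```scala
// Toy plan model: hypothetical stand-ins, NOT Spark's LogicalPlan classes.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Project(child: Plan) extends Plan
case class Filter(child: Plan) extends Plan
case class Join(left: Plan, right: Plan) extends Plan
// `isRoot` plays the role of a tag marking the root user-specified repartition.
case class Repartition(n: Int, child: Plan, isRoot: Boolean = false) extends Plan

object RootRepartitionTagger {
  // Descend from the root through Project/Filter only. Tag the first
  // Repartition reached and stop; any other node ends the traversal
  // and is returned unchanged.
  def tagRoot(plan: Plan): Plan = plan match {
    case p: Project     => p.copy(child = tagRoot(p.child))
    case f: Filter      => f.copy(child = tagRoot(f.child))
    case r: Repartition => r.copy(isRoot = true)
    case other          => other
  }
}
```

Compared with a full `transformDown`, this form never visits subtrees below a join or other operator, so a repartition buried deeper in the plan cannot be mistaken for the root one.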

cloud-fan · 2022-08-30T02:06:21Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

@@ -1310,6 +1310,14 @@ class PlannerSuite extends SharedSparkSession with AdaptiveSparkPlanHelper {
    assert(topKs.size == 1)
    assert(sorts.isEmpty)
  }
+
+  test("SPARK-39915: Dataset.repartition(N) may not create N partitions") {


This is not a planner bug... We can probably add an end-to-end test in DataFrameSuite

cloud-fan · 2022-08-30T06:30:34Z

thanks, merging to master!

cloud-fan · 2022-08-30T06:31:54Z

@ulysses-you can you open a backport PR for 3.3? it has conflicts.

… Non-AQE part

Closes apache#37706 from ulysses-you/empty.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

ulysses-you · 2022-08-30T14:07:48Z

thank you @cloud-fan, created #37730

cloud-fan · 2022-08-31T14:38:14Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala

+   * Add a [[ROOT_REPARTITION]] tag for the root user-specified repartition so this rule can
+   * skip optimize it.
+   */
+  private def addTagForRootRepartition(plan: LogicalPlan): LogicalPlan = plan match {


note: we can skip this earlier with something like if (!plan.containsPattern(REPARTITION))
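
The early exit suggested in this note can be illustrated on a toy tree. The names below (`Node`, `Repart`, `containsRepartition`, `EarlyExit.maybeTag`) are hypothetical and do not reflect Spark's TreeNode or TreePattern API; the sketch only assumes that "does this plan contain a repartition" is a cheap, cached check that can guard the tagging pass.

```scala
// Toy model: hypothetical stand-ins, NOT Spark's TreeNode/TreePattern API.
sealed trait Node {
  def children: Seq[Node]
  // Computed once per node and cached, so repeated checks are cheap.
  lazy val containsRepartition: Boolean =
    this.isInstanceOf[Repart] || children.exists(_.containsRepartition)
}
case class Leaf(name: String) extends Node { val children: Seq[Node] = Nil }
case class Proj(child: Node) extends Node { val children: Seq[Node] = Seq(child) }
case class Repart(n: Int, child: Node) extends Node { val children: Seq[Node] = Seq(child) }

object EarlyExit {
  // Runs `tag` only when the plan contains a repartition somewhere.
  // Returns true if the tagging pass actually ran.
  def maybeTag(plan: Node)(tag: Node => Unit): Boolean =
    if (!plan.containsRepartition) false
    else { tag(plan); true }
}
```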

…tions Non-AQE part

backport #37706 for branch-3.3

Closes #37730 from ulysses-you/empty-3.3.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>


SPARK-39915: Dataset.repartition(N) may not create N partitions Non-AQE part

489fd0a

github-actions bot added the SQL label Aug 29, 2022

cloud-fan reviewed Aug 29, 2022

manual traversal

ab8c649

cloud-fan reviewed Aug 30, 2022

cloud-fan approved these changes Aug 30, 2022

dataframe suite

97c284a

cloud-fan closed this in ff7ab34 Aug 30, 2022

ulysses-you mentioned this pull request Aug 30, 2022

[SPARK-39915][SQL][3.3] Dataset.repartition(N) may not create N partitions Non-AQE part #37730

Closed

ulysses-you deleted the empty branch August 30, 2022 14:07

cloud-fan reviewed Aug 31, 2022
