
[SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning #33664

Closed · wants to merge 3 commits
Conversation


wangyum commented Aug 6, 2021

What changes were proposed in this pull request?

Remove OptimizeSubqueries from the PartitionPruning batch so that DPP can support more cases. For example:

SELECT date_id, product_id FROM fact_sk f                                        
JOIN (select store_id + 3 as new_store_id from dim_store where country = 'US') s 
ON f.store_id = s.new_store_id                                                   
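
For reference, a minimal spark-shell sketch to reproduce the plans below (table and column names follow the example above; fact_sk is assumed to be partitioned by store_id, which is what makes it a DPP candidate):

spark.sql("CREATE TABLE fact_sk (date_id INT, product_id INT, store_id INT) USING parquet PARTITIONED BY (store_id)")
spark.sql("CREATE TABLE dim_store (store_id INT, country STRING) USING parquet")
spark.sql("""
  SELECT date_id, product_id FROM fact_sk f
  JOIN (SELECT store_id + 3 AS new_store_id FROM dim_store WHERE country = 'US') s
  ON f.store_id = s.new_store_id
""").explain()  // prints the physical plan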

Before this PR:

== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
   :- *(2) ColumnarToRow
   :  +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(true)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#274]
      +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
         +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
            +- *(1) ColumnarToRow
               +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>

After this PR:

== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
   :- *(2) ColumnarToRow
   :  +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN dynamicpruning#4007)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
   :        +- SubqueryBroadcast dynamicpruning#4007, 0, [new_store_id#3997], [id=#263]
   :           +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262]
   :              +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
   :                 +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
   :                    +- *(1) ColumnarToRow
   :                       +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
   +- ReusedExchange [new_store_id#3997], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262]

This is because OptimizeSubqueries will infer more filters on the pruning subquery, so the broadcast exchange can no longer be reused. The following is the plan when spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is disabled:

== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
   :- *(2) ColumnarToRow
   :  +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN subquery#4009)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
   :        +- Subquery subquery#4009, [id=#284]
   :           +- *(2) HashAggregate(keys=[new_store_id#3997#4008], functions=[])
   :              +- Exchange hashpartitioning(new_store_id#3997#4008, 5), ENSURE_REQUIREMENTS, [id=#280]
   :                 +- *(1) HashAggregate(keys=[new_store_id#3997 AS new_store_id#3997#4008], functions=[])
   :                    +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
   :                       +- *(1) Filter (((isnotnull(store_id#4002) AND isnotnull(country#4004)) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
   :                          +- *(1) ColumnarToRow
   :                             +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(store_id#4002), isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002..., Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(store_id), IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#305]
      +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
         +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
            +- *(1) ColumnarToRow
               +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
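
For reference, a sketch of how to toggle this flag in a session (spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly defaults to true, i.e. DPP is applied only when the broadcast exchange can be reused):

// Allow DPP to run the pruning filter as a standalone subquery,
// producing the Subquery/HashAggregate plan above.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "false")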

Why are the changes needed?

Improve DPP to support more cases.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests and a benchmark:

SQL        | Before this PR (seconds) | After this PR (seconds)
-----------|--------------------------|------------------------
TPC-DS q58 | 40                       | 20
TPC-DS q83 | 18                       | 14

github-actions bot added the SQL label Aug 6, 2021

SparkQA commented Aug 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46655/


SparkQA commented Aug 6, 2021

Test build #142143 has finished for PR 33664 at commit be2abff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46655/


wangyum commented Aug 6, 2021

retest this please.


SparkQA commented Aug 6, 2021

Test build #142158 has finished for PR 33664 at commit be2abff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46671/


SparkQA commented Aug 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46671/


SparkQA commented Aug 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46677/


wangyum commented Aug 6, 2021

cc @cloud-fan @maryannxue


SparkQA commented Aug 6, 2021

Test build #142164 has finished for PR 33664 at commit 0d7e228.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.


wangyum commented Aug 7, 2021

retest this please.


SparkQA commented Aug 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46696/


SparkQA commented Aug 7, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46696/


SparkQA commented Aug 7, 2021

Test build #142184 has finished for PR 33664 at commit 0d7e228.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan commented

This is a tricky case: inferring more filters is generally good as long as it doesn't break DPP, so this is not a simple decision.

cc @maryannxue

Comment on lines 44 to 46
Batch("PartitionPruning", Once,
PartitionPruning,
OptimizeSubqueries) :+
PartitionPruning) :+
Batch("Pushdown Filters from PartitionPruning", fixedPoint,
wangyum (Member Author) commented

Another option is:

  private val partitionPruningRules = Seq(PartitionPruning) ++
    (if (catalog.conf.dynamicPartitionPruningReuseBroadcastOnly) Nil else Seq(OptimizeSubqueries))

  Batch("PartitionPruning", Once,
    partitionPruningRules: _*) :+
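
This alternative would keep the extra filter inference whenever broadcast reuse is not required, at the cost of tying the optimizer batch to the dynamicPartitionPruningReuseBroadcastOnly config.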


SparkQA commented Aug 16, 2021

Test build #142480 has finished for PR 33664 at commit 0d7e228.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


wangyum commented Aug 19, 2021

retest this please.


cloud-fan commented Aug 19, 2021

a7a3935#diff-5221c65a64ad82c34cae68169cdb389210a9a28145058ae995b46ff4d3d4964cR39

We put this OptimizeSubqueries rule together with the DPP rule at the very beginning. That was something of a mistake: once this rule applies, we break plan reuse and thus break DPP. cc @maryannxue

This PR LGTM


SparkQA commented Aug 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47133/


SparkQA commented Aug 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47133/


SparkQA commented Aug 19, 2021

Test build #142633 has finished for PR 33664 at commit 0d7e228.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum closed this in 2310b99 Aug 19, 2021
wangyum added a commit that referenced this pull request Aug 19, 2021
[SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning

Closes #33664 from wangyum/SPARK-36444.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit 2310b99)
Signed-off-by: Yuming Wang <[email protected]>

wangyum commented Aug 19, 2021

Merged to master and branch-3.2.

wangyum deleted the SPARK-36444 branch August 19, 2021 08:47
cloud-fan added a commit that referenced this pull request Nov 10, 2022
…d optimize subqueries

### What changes were proposed in this pull request?

This is a followup to #36304 to simplify `RowLevelOperationRuntimeGroupFiltering`. It does three things:
1. run `OptimizeSubqueries` in the `PartitionPruning` batch, so that `RowLevelOperationRuntimeGroupFiltering` does not need to invoke it manually.
2. skip DPP subqueries in `OptimizeSubqueries`, to avoid the issue fixed by #33664 (see the sketch after this list).
3. make `RowLevelOperationRuntimeGroupFiltering` create `InSubquery` instead of `DynamicPruningSubquery`, so that it can be optimized by `OptimizeSubqueries` later. This also avoids the unnecessary planning overhead of `DynamicPruningSubquery`, as there is no join here and the filter can only run as a plain subquery.
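
A minimal sketch of item 2, assuming the usual shape of Spark's `OptimizeSubqueries` rule (the `optimizer` handle and the class name are simplified for illustration; this is not the exact upstream code):

```scala
import org.apache.spark.sql.catalyst.expressions.{DynamicPruningSubquery, SubqueryExpression}
import org.apache.spark.sql.catalyst.optimizer.Optimizer
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// `optimizer` is an assumed handle to the enclosing optimizer instance.
class OptimizeSubqueriesSketch(optimizer: Optimizer) extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    // Leave DPP subqueries alone: re-optimizing them can infer extra filters,
    // which breaks broadcast exchange reuse (the issue fixed by #33664).
    case d: DynamicPruningSubquery => d
    // Optimize every other subquery plan as usual.
    case s: SubqueryExpression => s.withNewPlan(optimizer.execute(s.plan))
  }
}
```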

### Why are the changes needed?

code simplification

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #38557 from cloud-fan/help.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…d optimize subqueries
