
[SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs #29950

Closed
wants to merge 12 commits

Conversation

viirya
Member

@viirya viirya commented Oct 6, 2020

What changes were proposed in this pull request?

This patch proposes to avoid collapsing adjacent Project in query optimizer if the combined Project will duplicate too many common expressions. One SQL config spark.sql.optimizer.maxCommonExprsInCollapseProject is added to set up the maximum allowed number of common expressions.

Why are the changes needed?

In some edge cases, collapsing adjacent Projects hurts performance instead of improving it. We observed such behavior in our customer Spark jobs, where one expensive expression was duplicated many times. It is hard to write an optimizer rule that could decide whether to collapse two Projects, because we don't know the cost of each expression. For now we can provide a SQL config so users can change the optimizer's behavior regarding collapsing adjacent Projects.

Note that normally, in whole-stage codegen, the Project operator de-duplicates expressions internally, but in edge cases Spark cannot do whole-stage codegen and falls back to interpreted mode. In such cases, users can use this config to avoid duplicated expressions.
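To make the duplication concrete, here is a minimal, language-neutral sketch in plain Python (not Spark code; `expensive_parse` is a hypothetical stand-in for an expensive expression such as JsonToStructs) showing how collapsing two projection layers multiplies the evaluations of a common expression:

```python
# Count evaluations of a stand-in for an expensive expression (e.g. JsonToStructs).
calls = {"n": 0}

def expensive_parse(json):
    calls["n"] += 1
    return {"a": 1, "b": 2, "c": 3}

# Two-layer plan: the lower "project" computes the struct once,
# and the upper "project" reads three fields from the alias.
def two_layer(row):
    struct = expensive_parse(row)
    return (struct["a"], struct["b"], struct["c"])

# Collapsed plan: the alias is inlined into every use site,
# so the expensive expression is evaluated once per field.
def collapsed(row):
    return (expensive_parse(row)["a"],
            expensive_parse(row)["b"],
            expensive_parse(row)["c"])

calls["n"] = 0
two_layer("{}")
print(calls["n"])  # 1 evaluation before collapsing

calls["n"] = 0
collapsed("{}")
print(calls["n"])  # 3 evaluations after collapsing
```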

Does this PR introduce any user-facing change?

Yes. Users can change the optimizer's behavior regarding collapsing Projects by setting a SQL config.

How was this patch tested?

Unit test.

@@ -766,6 +768,23 @@ object CollapseProject extends Rule[LogicalPlan] {
})
}
Member Author

We could extend this to other cases like case p @ Project(_, agg: Aggregate), but I leave it untouched for now.

@SparkQA

SparkQA commented Oct 6, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34039/

@SparkQA

SparkQA commented Oct 6, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34039/

@maropu
Member

maropu commented Oct 6, 2020

Related to #29094 ?

@viirya
Member Author

viirya commented Oct 6, 2020

Related to #29094 ?

No, after a quick scan of that PR. That PR targets driver OOM caused by too many leaf expressions in the collapsed Project. This diff cares about duplicated common expressions in the collapsed Project. Different problems, I think.

@tanelk
Contributor

tanelk commented Oct 6, 2020

Perhaps the max number of common expressions is not the best metric here?

Let's compare two cases:

  1. On the lower Project you have a JsonToStructs and on the upper Project you get 3 fields from that struct. This means 2 redundant computations, and the "metric" you are looking at is 3.

  2. On the lower Project you have two JsonToStructs and on the upper Project you get 2 fields from both structs. This also means 2 redundant computations, and the "metric" you are looking at is 2.

Adding more JsonToStructs to the lower level would increase the number of redundant computations without increasing the max value.
So as an alternative I would propose "the number of redundant computations" (sum of values in the exprMap minus its size) as the metric to use.

Although I must admit that in that case we might cache more values for the number of extra computations we save.
So both of them have their benefits.
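The two candidate metrics can be sketched as follows (plain Python, not Spark code; `expr_map` stands for the exprMap mentioned above, mapping each common expression in the lower Project to how many times the collapsed Project would reference it):

```python
def max_common_refs(expr_map):
    # Metric used by the PR: the largest reference count of any
    # single common expression after collapsing.
    return max(expr_map.values(), default=0)

def redundant_computations(expr_map):
    # Proposed alternative: sum of the values minus the map size,
    # i.e. how many extra evaluations collapsing introduces in total.
    return sum(expr_map.values()) - len(expr_map)

case1 = {"json_to_structs": 3}      # one struct, 3 fields read from it
case2 = {"json_a": 2, "json_b": 2}  # two structs, 2 fields read from each

print(max_common_refs(case1), redundant_computations(case1))  # 3 2
print(max_common_refs(case2), redundant_computations(case2))  # 2 2
```

Both cases introduce 2 redundant computations, yet the max-based metric rates them differently, which is the asymmetry described above.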

val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject

if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
getLargestNumOfCommonOutput(p1.projectList, p2.projectList) >= maxCommonExprs) {
Contributor

Perhaps this comparison should be > instead of >=, because currently the actual max value is maxCommonExprs - 1.

@SparkQA

SparkQA commented Oct 6, 2020

Test build #129432 has finished for PR 29950 at commit f418714.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Oct 6, 2020

Perhaps the max number of common expressions is not the best metric here?

Let's compare two cases:

1. On the lower Project you have a `JsonToStructs` and on the upper Project you get 3 fields from that struct. This means 2 redundant computations, and the "metric" you are looking at is 3.

2. On the lower Project you have two `JsonToStructs` and on the upper Project you get 2 fields from both structs. This also means 2 redundant computations, and the "metric" you are looking at is 2.

Adding more JsonToStructs to the lower level would increase the number of redundant computations without increasing the max value.
So as an alternative I would propose "the number of redundant computations" (sum of values in the exprMap minus its size) as the metric to use.

Although I must admit that in that case we might cache more values for the number of extra computations we save.
So both of them have their benefits.

Yes, in the case you describe, each additional JsonToStructs adds to the number of redundant computations. But overall it should not cause a noticeable performance issue, because you simply triple the running cost of each JsonToStructs (assuming each JsonToStructs has 2 redundant computations).

The number of redundant computations is misleading. If we have 100 JsonToStructs in the lower Project, each with 2 redundant computations in the upper Project, it doesn't mean we multiply the running cost of JsonToStructs by 100. In other words, it is hard to tell the performance difference between 10 and 20 redundant computations. If the redundant computations come from the same expression, then we have 10x vs. 20x the running cost; but if they come from 10 expressions, we might only have 2x to 3x the running cost.
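A rough way to see this argument in numbers (a plain-Python sketch, under the simplifying assumption that all common expressions cost the same):

```python
# Slowdown factor for the common expressions after collapsing, given the
# per-expression reference counts. Without duplication each expression is
# evaluated once, so the factor is total evaluations / number of expressions.
def slowdown(expr_refs):
    return sum(expr_refs) / len(expr_refs)

# 10 redundant computations concentrated in a single expression: 11x its cost.
print(slowdown([11]))      # 11.0

# 20 redundant computations spread across 10 expressions: only 3x overall.
print(slowdown([3] * 10))  # 3.0
```

So a larger total count of redundant computations can still mean a smaller per-expression slowdown, which is why the PR bounds the per-expression maximum instead.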

@viirya
Member Author

viirya commented Oct 6, 2020

cc @cloud-fan @dongjoon-hyun too

@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34091/

@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34092/


@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34091/

@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34092/


Comment on lines 183 to 188
val query = relation.select(
JsonToStructs(schema, options, 'json).as("struct"))
.select(
GetStructField('struct, 0).as("a"),
GetStructField('struct, 1).as("b"),
GetStructField('struct, 2).as("c")).analyze
Contributor

When using the dataset API, then it would be very common to chain withColumn calls:

dataset
    .withColumn("json", ...)
    .withColumn("a", col("json").getField("a"))
    .withColumn("b", col("json").getField("b"))
    .withColumn("c", col("json").getField("c"))

In that case the query should look more like this:

        val query = relation
          .select('json, JsonToStructs(schema, options, 'json).as("struct"))
          .select('json, 'struct, GetStructField('struct, 0).as("a"))
          .select('json, 'struct, 'a, GetStructField('struct, 1).as("b"))
          .select('json, 'struct, 'a, 'b, GetStructField('struct, 2).as("c"))
          .analyze

The CollapseProject rule uses transformUp. It seems that in that case we do not get the expected results from this optimization.

Member Author

It seems this can be fixed by using transformDown instead? It seems to me that CollapseProject does not necessarily need to use transformUp, if I'm not missing anything. cc @cloud-fan @maropu

Contributor

@tanelk tanelk Oct 9, 2020

If there is a chain of Projects P1(P2(P3(P4(...)))), then using transformDown will first merge P1 and P2 into P12, then move to its child P3 and merge it with P4 into P34. Only on the second iteration will it merge all four of these.

In this case we want to merge P123 and then see that we can't merge with P4, because we would exceed maxCommonExprsInCollapseProject.

Contributor

I think the correct way would be to use transformDown in a manner similar to recursiveRemoveSort in #21072.
So basically, when you hit the first Project, you collect all consecutive Projects until you hit the maxCommonExprsInCollapseProject limit, and merge them.
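A sketch of that one-pass, top-down grouping (plain Python with a toy plan representation and a hypothetical cost function, not the actual CollapseProject code):

```python
# Collapse a chain of adjacent "projects" (outermost first) greedily:
# keep absorbing the next child while the merged group stays within the
# limit, otherwise start a new group, mirroring the recursiveRemoveSort idea.
def collapse_chain(projects, max_common, cost):
    merged = [[projects[0]]]
    for p in projects[1:]:
        if cost(merged[-1] + [p]) <= max_common:
            merged[-1].append(p)   # safe to merge into the current group
        else:
            merged.append([p])     # would exceed the limit; keep separate
    return merged

# Toy cost: pretend the common-expression count grows with group size
# (a stand-in for getLargestNumOfCommonOutput on the merged project lists).
cost = lambda group: len(group)

# P1(P2(P3(P4))) with a limit of 3: P1..P3 merge, P4 stays separate.
print(collapse_chain(["P1", "P2", "P3", "P4"], 3, cost))
# [['P1', 'P2', 'P3'], ['P4']]
```

Unlike the fixpoint transformUp approach, a single top-down pass decides each group boundary exactly once, so the limit applies to the fully merged group rather than to pairwise intermediate merges.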

Member

hm, it sounds fine, too. Rather, it seems a top-down transformation can collapse projects in one shot just like RemoveRedundantProjects?

Member Author

Seems like we need to change to transformDown and take a recursive approach like RemoveRedundantProjects and recursiveRemoveSort for collapsing Project.

"if merging two Project, Spark SQL will skip the merging.")
.version("3.1.0")
.intConf
.createWithDefault(20)
Member

Just a question. Is there a reason to choose 20?

Member Author

No, I just picked a number that seems bad for repeating an expression.

val maxCommonExprs = SQLConf.get.maxCommonExprsInCollapseProject

if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
getLargestNumOfCommonOutput(p1.projectList, p2.projectList) > maxCommonExprs) {
Member

indentation?

@@ -124,14 +128,34 @@ object ScanOperation extends OperationHelper with PredicateHelper {
}.exists(!_.deterministic))
}

def moreThanMaxAllowedCommonOutput(
expr: Seq[NamedExpression],
Member

@dongjoon-hyun dongjoon-hyun Oct 9, 2020

indentation? It seems that there is one more space here.

// do not have common non-deterministic expressions, or do not have equal to/more than
// maximum allowed common outputs.
if (!hasCommonNonDeterministic(fields, aliases)
|| !moreThanMaxAllowedCommonOutput(fields, aliases)) {
Member

nit, you may want to move || into line 157.

Member Author

sure. thanks.


@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34198/

@viirya
Member Author

viirya commented Nov 1, 2020

gentle ping @dongjoon-hyun @cloud-fan

@dongjoon-hyun
Member

Oops. Sorry for being late, @viirya .

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35578/

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35578/

@SparkQA

SparkQA commented Nov 12, 2020

Test build #130972 has finished for PR 29950 at commit 58e71d8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Nov 12, 2020

retest this please

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35589/

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35589/

@SparkQA

SparkQA commented Nov 12, 2020

Test build #130983 has finished for PR 29950 at commit 58e71d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.version("3.1.0")
.intConf
.checkValue(_ > 0, "The value of maxCommonExprsInCollapseProject must be larger than zero.")
.createWithDefault(20)
Member

If possible, can we introduce this configuration with Int.MaxValue in 3.1.0 first? We can reduce it later.

Member Author

Sure. It is safer.

Member

+1

@SparkQA

SparkQA commented Nov 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35630/

@SparkQA

SparkQA commented Nov 13, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35630/

@SparkQA

SparkQA commented Nov 13, 2020

Test build #131024 has finished for PR 29950 at commit bbaae3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

The GitHub Action's flakiness at sql - slow tests is fixed via #30365 .

} else {
p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList))
}
def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
Member

Is there a reason to change from transformUp to transformDown? If the all test passed, it would be safe if we keep the original one.

Member

I found the previous comment about supporting withColumn. If this is designed for that, shall we add a test case for that?

// If we collapse two Projects, `JsonToStructs` will be repeated three times.
val relation = LocalRelation('json.string)
val query1 = relation.select(
JsonToStructs(schema, options, 'json).as("struct"))
Member

@dongjoon-hyun dongjoon-hyun Nov 13, 2020

indentation? Maybe, the following is better?

- val query1 = relation.select(
-   JsonToStructs(schema, options, 'json).as("struct"))
-   .select(
+ val query1 = relation.select(JsonToStructs(schema, options, 'json).as("struct"))
+   .select(

"the physical planning.")
.version("3.1.0")
.intConf
.checkValue(_ > 0, "The value of maxCommonExprsInCollapseProject must be larger than zero.")
Member

larger than zero -> positive.

})
}

// Whether the largest times common outputs from lower operator used in upper operators is
Member

upper operators -> upper operator?

}

// Whether the largest times common outputs from lower operator used in upper operators is
// larger than allowed.
Member

than allowed -> than the maximum?

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Nov 25, 2020

Test build #131729 has finished for PR 29950 at commit bbaae3e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Nov 25, 2020

retest this please

@SparkQA

SparkQA commented Nov 25, 2020

Test build #131780 has finished for PR 29950 at commit bbaae3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Dec 7, 2020

I recently generalized the subexpression elimination feature to interpreted Project and Predicate. So now both whole-stage codegen and interpreted execution support subexpression elimination, which avoids the performance issue caused by duplicating common expressions when collapsing Projects.

That said, I think this patch is less useful now. I'm closing it.

@viirya viirya closed this Dec 7, 2020
@dongjoon-hyun
Member

Thank you for your decision, @viirya !

@HyukjinKwon
Member

Thanks @viirya

@viirya viirya deleted the SPARK-32945 branch December 27, 2023 18:28
6 participants