
[SPARK-15076][SQL] Add ReorderAssociativeOperator optimizer #12850

Closed · wants to merge 5 commits

Conversation

@dongjoon-hyun (Member) commented May 2, 2016

What changes were proposed in this pull request?

This PR adds a new optimizer rule, ReorderAssociativeOperator, that takes advantage of the associative property of integral arithmetic. Currently, Spark behaves as follows:

  1. It can optimize 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a into 45 + a.
  2. It cannot optimize a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9.

This PR handles Case 2 for Add/Multiply expressions whose data types are ByteType, ShortType, IntegerType, or LongType. The following compares the physical plans before and after this change.

Before

scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain
== Physical Plan ==
WholeStageCodegen
:  +- Project [(((((((((a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8]
:     +- INPUT
+- Generate explode([1]), false, false, [a#7]
   +- Scan OneRowRelation[]
scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain
== Physical Plan ==
*Project [(((((((((a#18 * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19]
+- Generate explode([1]), false, false, [a#18]
   +- Scan OneRowRelation[]

After

scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain
== Physical Plan ==
WholeStageCodegen
:  +- Project [(a#7 + 45) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8]
:     +- INPUT
+- Generate explode([1]), false, false, [a#7]
   +- Scan OneRowRelation[]
scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain
== Physical Plan ==
*Project [(a#18 * 362880) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19]
+- Generate explode([1]), false, false, [a#18]
   +- Scan OneRowRelation[]
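
Conceptually, the rule flattens a chain of one associative operator into its operand list, folds the literal operands into a single constant, and rebuilds the expression around the remaining operands. A self-contained Scala sketch of that idea on a toy expression tree (illustrative only; the real rule works on Catalyst expressions):

// Toy expression tree, used only to illustrate the reassociation idea.
sealed trait Expr
case class Lit(v: Long) extends Expr
case class Ref(name: String) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// Flatten a nested chain of Adds into its operand list.
def flattenAdd(e: Expr): Seq[Expr] = e match {
  case Add(l, r) => flattenAdd(l) ++ flattenAdd(r)
  case other     => Seq(other)
}

// Fold all literal operands into one constant and keep the rest in order.
def reorderAdd(e: Expr): Expr = {
  val (lits, others) = flattenAdd(e).partition(_.isInstanceOf[Lit])
  if (lits.size > 1) {
    val folded = Lit(lits.collect { case Lit(v) => v }.sum)
    if (others.isEmpty) folded else Add(others.reduce(Add(_, _)), folded)
  } else e
}

// a + 1 + 2 + 3 is rewritten to a + 6:
// reorderAdd(Add(Add(Add(Ref("a"), Lit(1)), Lit(2)), Lit(3)))
//   == Add(Ref("a"), Lit(6))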

This PR was greatly generalized by @cloud-fan's key ideas; he should be credited for that work.

How was this patch tested?

Passes the Jenkins tests, including a new test suite.

@SparkQA commented May 3, 2016

Test build #57567 has finished for PR 12850 at commit 71c3c73.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 4, 2016

Test build #57781 has finished for PR 12850 at commit 6898e0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 6, 2016

Test build #58006 has finished for PR 12850 at commit a4a3ce3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 8, 2016

Test build #58113 has finished for PR 12850 at commit 3802255.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author):

Rebased to see the result with the re-enabled Hive queries.

@SparkQA commented May 10, 2016

Test build #58246 has finished for PR 12850 at commit 06e9b36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 13, 2016

Test build #58519 has finished for PR 12850 at commit 18f5a8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 16, 2016

Test build #58648 has finished for PR 12850 at commit 0b60464.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 19, 2016

Test build #58875 has finished for PR 12850 at commit 65c7db7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author):

Rebased to trigger the Jenkins tests again.

@SparkQA commented May 19, 2016

Test build #58884 has finished for PR 12850 at commit 0ffd004.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 22, 2016

Test build #59114 has finished for PR 12850 at commit eeae56d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author):

Hi, @marmbrus and @rxin.
Could you review this PR when you have some time?

@SparkQA commented May 27, 2016

Test build #59531 has finished for PR 12850 at commit 8c8ea7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Use associative property for integral type
case e if e.isInstanceOf[BinaryArithmetic] && e.dataType.isInstanceOf[IntegralType] =>
  e match {
    case Add(Add(a, b), c) if b.foldable && c.foldable => Add(a, Add(b, c))
Contributor:

What about a + 1 + b + 2? I think we need a more general approach, like reordering the Add nodes to put all the literals together.
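
That is, the desired rewrite would look like this (an illustrative sketch of the suggestion, in Catalyst-style notation):

// a + 1 + b + 2: gather the literals from anywhere in the Add chain
// and fold them into a single constant.
//   Add(Add(Add(a, Literal(1)), b), Literal(2))  ==>  Add(Add(a, b), Literal(3))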

Member Author:

Thank you for the review, @cloud-fan!
I see; that sounds great.
Let me think about how to eliminate all the constants, then.

@SparkQA commented May 31, 2016

Test build #59645 has finished for PR 12850 at commit 8956a1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author) commented May 31, 2016

Hi, @cloud-fan.
Could you review again?
This PR now provides a more generalized way to handle all foldable constants in any order.

@@ -742,6 +742,23 @@ object InferFiltersFromConstraints extends Rule[LogicalPlan] with PredicateHelpe
* equivalent [[Literal]] values.
*/
object ConstantFolding extends Rule[LogicalPlan] {
private def isAssociativelyFoldable(e: Expression): Boolean =
Contributor:

Similar to ReorderJoin, we should have a new rule ReorderAssociativeOperator to do this optimization, instead of putting it in ConstantFolding.

Member Author:

Oh, that could work.

There is a difference in the level of granularity, though.

Join-related optimizers might later be improved into cost-based optimizers, while the ConstantFolding optimizer is just about removing constants within a single expression.

Do you think it is a good idea to put those different levels of concern together?

I can do this whichever way you decide. :)

Contributor:

I think this is OK; BooleanSimplification is also a kind of constant folding, but we made a new rule for it.

Member Author:

Thank you. I see!

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-15076][SQL] Improve ConstantFolding optimizer by using integral associative property [SPARK-15076][SQL] Add ReorderAssociativeOperator optimizer May 31, 2016
@dongjoon-hyun (Member, Author):

Hi, @cloud-fan.
I have now created a new rule, ReorderAssociativeOperator, as you recommended.
The JIRA issue and PR description have been updated as well.

@SparkQA commented May 31, 2016

Test build #59691 has finished for PR 12850 at commit 4e4845c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 31, 2016

Test build #59690 has finished for PR 12850 at commit d022904.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 1, 2016

Test build #59700 has finished for PR 12850 at commit 2ebc53c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author):

Hi, @cloud-fan.
It's ready for review again; could you take a look when you have some time?
Thank you as always!

@cloud-fan (Contributor):

I discussed this with @davies offline, and here is our conclusion:

  1. This feature is not that important, as users can always do it manually, i.e. change the add/multiply order, which is not a lot of effort.
  2. Once we have this feature, users lose control of the execution order, e.g. they may add UDFs and literals together and want a deterministic execution order.
  3. There are other corner cases, like overflow.

In general, we think this feature brings too much nondeterminism compared to the benefit it provides. What do you think?

@dongjoon-hyun (Member, Author):

Thank you for the deep discussion on this. Here is my thinking.

For 1), there are machine-generated queries from BI tools, and these are an important category of queries. In many cases, BI tools (and other UI-driven tools) generate queries by simple rules, and those rules do not care about the shape of the output queries; optimization is the role of the DBMS or of Spark. So static optimizations are always important. This PR also reduces the size of the generated code.

For 2), other optimizers already remove or duplicate UDFs; Spark does not guarantee control over the execution order. As you know, we already concluded that we would leave an explicit note like the following about this (in SPARK-15282 and #13087).

Note that the user-defined functions must be deterministic. Due to optimization,
duplicate invocations may be eliminated or the function may even be invoked more times than
it is present in the query.

For 3), could you give some real problematic cases? This PR reorders only additions and multiplications, and I think it does not change the final result value: these integral types wrap on overflow, and addition and multiplication modulo 2^n are still associative and commutative, so reordering preserves the result. The following shows the behavior of current Spark (not this PR; you can see that in the physical plan).

scala> sql("select 2147483640 + a + 7 from (select explode(array(1,2,3)) a)").explain()
== Physical Plan ==
*Project [((2147483640 + a#8) + 7) AS ((2147483640 + a) + 7)#9]
+- Generate explode([1,2,3]), false, false, [a#8]
   +- Scan OneRowRelation[]

scala> sql("select 2147483640 + a + 7 from (select explode(array(1,2,3)) a)").collect()
res1: Array[org.apache.spark.sql.Row] = Array([-2147483648], [-2147483647], [-2147483646])

scala>  sql("select a + 2147483647 from (select explode(array(1,2,3)) a)").collect()
res2: Array[org.apache.spark.sql.Row] = Array([-2147483648], [-2147483647], [-2147483646])

scala> sql("select 214748364 * a from (select explode(array(1,2,3)) a)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([214748364], [429496728], [644245092])

scala> sql("select 214748364 * a * 10 from (select explode(array(1,2,3)) a)").collect()
res4: Array[org.apache.spark.sql.Row] = Array([2147483640], [-16], [2147483624])

scala> sql("select a * 2147483640 from (select explode(array(1,2,3)) a)").collect()
res5: Array[org.apache.spark.sql.Row] = Array([2147483640], [-16], [2147483624])

The optimization in this PR will behave in exactly the same way.
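
As a quick sanity check of the wrap-around argument in plain Scala (values taken from the session above):

// Two's-complement Int arithmetic is arithmetic modulo 2^32, which is still
// associative and commutative, so reassociation cannot change the result.
val a = 1
assert(2147483640 + a + 7 == a + 2147483647)   // both wrap to Int.MinValue
assert(214748364 * a * 10 == a * 2147483640)   // same wrapped product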

@dongjoon-hyun (Member, Author):

Hi, @cloud-fan and @davies.
What do you think about the above?

@cloud-fan (Contributor):

A UDF was the first thing I came up with, and yes, it must be deterministic. But since we have the deterministic property on Expression, I think it's possible for users to create non-deterministic expressions, e.g. via ScalaUDAF or other APIs we may create in the future, and then the execution order matters.

You can still improve this PR to handle the non-deterministic cases, but that will make it more complex and harder to reason about, which may not be worth it.

cc @davies

@dongjoon-hyun (Member, Author):

Thank you for the feedback. I'm really happy to have your attention!
For the non-deterministic part, we can add a single condition in isAssociativelyFoldable:
if any operand of an expression is non-deterministic, the whole expression is non-deterministic. It's easy.
It's also future-proof: even if we later add a NondeterministicScalaUDF whose deterministic == false, this optimizer will not touch expressions containing it.
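
A minimal sketch of that guard (the method name and the e.deterministic condition come from this thread; the other conditions are assumptions based on the PR description, written as if inside Catalyst's optimizer where Expression, Add, Multiply, and IntegralType are in scope):

// Sketch only, not the exact code from the PR.
private def isAssociativelyFoldable(e: Expression): Boolean =
  e.deterministic &&                         // new check: skip non-deterministic trees
  e.dataType.isInstanceOf[IntegralType] &&   // ByteType/ShortType/IntegerType/LongType
  (e.isInstanceOf[Add] || e.isInstanceOf[Multiply])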

@dongjoon-hyun (Member, Author):

I added the missing part: the e.deterministic check in isAssociativelyFoldable.

}

def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case q: LogicalPlan => q transformExpressionsDown {
Contributor:

how about

def flattenAdd(e: Expression): Seq[Expression] = e match {
  case Add(l, r) => flattenAdd(l) ++ flattenAdd(r)
  case other => other :: Nil   // must return a Seq, not a bare Expression
}

...
plan transformAllExpressions {
  case a: Add if a.deterministic && a.dataType.isInstanceOf[IntegralType] =>
    val (foldables, others) = flattenAdd(a).partition(_.foldable)   // '=' here, not '=>'
    if (foldables.size > 1) {
      val foldableExpr = foldables.reduce(Add(_, _))
      val c = Literal.create(foldableExpr.eval(), a.dataType)
      if (others.isEmpty) c else Add(others.reduce(Add(_, _)), c)
    } else {
      a
    }
}

We can duplicate some code for Multiply, and I think this may be more readable than the current version.

Member Author:

I see; that could work.
We would also need to add isSingleOperatorExpr there, though.
Otherwise, flattenAdd(Add(Multiply(1, 2), 3)) -> (3).

Contributor:

flattenAdd(Add(Multiply(1, 2), 3)) will become [Multiply(1, 2), 3], so we won't get a wrong result: the recursion only descends into Add nodes, so Multiply(1, 2) stays a single operand.

Member Author:

Oh, I see. You've generalized my PR again! Great!

@cloud-fan (Contributor):

Looks like it's not that difficult to handle all the cases. This optimization LGTM.

'b * 1 * 2 * 3 * 4,
'a + 1 + 'b + 2 + 'c + 3,
Rand(0) * 1 * 2 * 3 * 4)

Member Author:

I already added the non-deterministic case here.

@dongjoon-hyun (Member, Author):

Thank you for reconsidering this PR. I'll update it soon according to your advice.

@dongjoon-hyun (Member, Author):

@cloud-fan, following your advice, I refactored the code and added mixed (addition + multiplication) test cases. The PR description is updated as well.
Thank you so much again.

@SparkQA commented Jun 2, 2016

Test build #59788 has finished for PR 12850 at commit 8b7a0bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 2, 2016

Test build #59795 has finished for PR 12850 at commit 0acb157.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case other => other :: Nil
}

def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressionsDown {
Contributor:

We should do:

plan transform {
  case q: LogicalPlan => q transformExpressionsDown {
    ......
  }
}

Otherwise, here we only optimize the top-level plan.

Member Author:

My bad; I changed this in a hurry. I'll fix it soon.

@cloud-fan (Contributor):

cc @davies , can you take a look?

@rxin (Contributor) commented Jun 2, 2016

BTW it goes without saying ... if you do decide to merge this, don't merge it in branch-2.0.

@SparkQA commented Jun 2, 2016

Test build #59824 has finished for PR 12850 at commit 3959d57.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@asfgit asfgit closed this in 63b7f12 Jun 2, 2016
@dongjoon-hyun (Member, Author):

Oh, thank you, @cloud-fan!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-15076 branch July 20, 2016 07:37