[SPARK-6624][SQL]Add CNF Normalization as part of optimization #8200

yjshen · 2015-08-14T12:18:15Z

This PR aims at adding CNF Normalization as part of optimization.

For example:

a && b || f => (a || f) && (b || f)
a || b || c && d => (a || b || c) && (a || b || d)
a || (b && c || d) => (a || b || d) && (a || c || d)
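
The three rewrites above can be reproduced with a small standalone sketch. It uses a toy expression type (`Atom`, `And`, `Or`) invented for illustration rather than Catalyst's `Expression` classes, and it is not the code from this PR; it only shows the distribution of `Or` over `And` that the examples rely on.

```scala
object CnfSketch {
  // Toy expression tree; these are hypothetical stand-ins, not Catalyst classes.
  sealed trait Expr
  case class Atom(name: String) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // Build the CNF of (l || r), assuming l and r are already in CNF:
  // any And found on either side is pulled above the Or.
  def distribute(l: Expr, r: Expr): Expr = (l, r) match {
    case (And(a, b), _) => And(distribute(a, r), distribute(b, r))
    case (_, And(a, b)) => And(distribute(l, a), distribute(l, b))
    case _              => Or(l, r)
  }

  // Convert bottom-up: children first, then distribute at each Or.
  def toCnf(e: Expr): Expr = e match {
    case And(l, r) => And(toCnf(l), toCnf(r))
    case Or(l, r)  => distribute(toCnf(l), toCnf(r))
    case atom      => atom
  }
}
```

For instance, `CnfSketch.toCnf(Or(And(a, b), f))` yields `And(Or(a, f), Or(b, f))`, matching the first example. Negation is left out here on purpose; it is discussed separately in the review below.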

JIRA: https://issues.apache.org/jira/browse/SPARK-6624

liancheng · 2015-08-14T13:32:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+        case (l, r) => Or(l, r)
+      }
+    }
+  }


The following version may be arguably more readable:

    private def pushOrToBottom(condition: Expression): Expression = {
      condition match {
        case Or(And(innerLhs, innerRhs), rhs) =>
          And(pushOrToBottom(Or(innerLhs, rhs)), pushOrToBottom(Or(innerRhs, rhs)))
        case Or(lhs, And(innerLhs, innerRhs)) =>
          And(pushOrToBottom(Or(lhs, innerLhs)), pushOrToBottom(Or(lhs, innerRhs)))
        case _ => condition
      }
    }

Shall we also cover cases like Not(Or(x, y)) => And(Not(x), Not(y)) here?

Maybe I could do Not(Or(x, y)) => And(Not(x), Not(y)) in BooleanSimplification? Or just push Not to the bottom in a previous phase?

Oh I see. Moving the De Morgan transformation into BooleanSimplification makes sense. Please document this assumption. And you need to use optimizedPlan instead of analyzed in your test suite.

After a second thought, I tend to not move De Morgan conversion to BooleanSimplification. Coupling these two seems to be dangerous and the assumption can be easily broken by others in the future.

SparkQA · 2015-08-14T14:17:21Z

Test build #40873 has finished for PR 8200 at commit c00f3a3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yjshen · 2015-08-14T14:20:05Z

Thanks @liancheng, I will update my PR soon.

liancheng · 2015-08-14T14:23:49Z

...atalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CNFNormalizationSuite.scala

+      atoms += expression
+    }
+    atoms.sortBy(_.toString).reduce(Or)
+  }


~~Seems that this method is equivalent to:~~

    expression
      .collect { case e if !e.isInstanceOf[Or] => e }
      .sortBy(_.toString)
      .reduce(Or)

Maybe just return a Seq[Expression] without the final reduce(Or).

~~(The assumption here is that expression has already gone through CNF transformation, so that any sub-expression that is not an Or doesn't contain any Or either.)~~

Sorry, please ignore my comments above, made a mistake and the assumption is wrong...

Is it possible that we still get Not(expr) even after boolean simplification has finished? If so, simply matching !e.isInstanceOf[Or] and collecting doesn't seem right.

Actually... It seems to be correct? Since you first do a splitConjunctivePredicates and then pass in elements of the result to this method... 😵

I mean, will Not(expr) be collected as Not(expr) and expr as two separate expressions?

@yjshen Yeah, the above assumption is only correct if you do REAL CNF conversion in this PR. Currently De Morgan's law is not considered.

(This one replies to this comment.)

@yjshen If you do real CNF conversion here (namely, taking Not into consideration), then it would be OK, since e in Not(e) cannot be Or.

Your current code also suffers from the Not(Or(x, y)) case (x and y are atoms), because you're using foreachUp rather than transformUp here.

SparkQA · 2015-08-14T20:21:49Z

Test build #40897 has finished for PR 8200 at commit 3d7fc52.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yjshen · 2015-08-17T01:43:09Z

@liancheng, would you mind reviewing this again? Thanks.

chenghao-intel · 2015-08-17T01:55:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -501,6 +501,10 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
        case LessThanOrEqual(l, r) => GreaterThan(l, r)
        // not(not(e))  =>  e
        case Not(e) => e
+        // De Morgan's law: !(a || b) => !a && !b


For BooleanSimplification, this is probably not an optimization, is it? We'd probably better move this logic into predicates.scala, which is used only for join predicate push-down.

I think it's a boolean simplification because we can only do further optimisation when Not is on a leaf node; the Not(LessThanOrEqual) and Not(Not) cases above are good examples.

I mean we can't tell that And(Not(lhs), Not(rhs)) is more optimal than Not(Or(lhs, rhs)), can we? Sorry if I missed something.

As mentioned in #8200 (comment), I tend not to have this one in BooleanSimplification. I'd prefer not having CNF conversion coupled with BooleanSimplification, and do transformations related to Not in the CNF part.

@chenghao-intel The Not case added here is not for simplification. It's used to push Not predicates to the bottom, so that e in Not(e) cannot be either And or Or. CNF conversion code relies on this assumption.

Yeah, I got it. I meant the same thing: for normal expression evaluation, this code change will probably cause a performance regression. It's better to move the CNF logic into predicates.scala or patterns.scala, where it only applies to predicate push-down.
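
The "push Not to the bottom" step debated in this thread can be sketched as follows. This is a minimal illustration on a toy expression type, not the rule added to BooleanSimplification in this PR; the names `NotPushDownSketch` and `pushNotDown` are hypothetical. After it runs, the child of every `Not` is an atom, which is the assumption the CNF conversion is said to rely on.

```scala
object NotPushDownSketch {
  // Toy expression tree; hypothetical stand-ins, not Catalyst classes.
  sealed trait Expr
  case class Atom(name: String) extends Expr
  case class Not(child: Expr) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  def pushNotDown(e: Expr): Expr = e match {
    // not(not(x)) => x
    case Not(Not(x))    => pushNotDown(x)
    // De Morgan: !(l && r) => !l || !r
    case Not(And(l, r)) => Or(pushNotDown(Not(l)), pushNotDown(Not(r)))
    // De Morgan: !(l || r) => !l && !r
    case Not(Or(l, r))  => And(pushNotDown(Not(l)), pushNotDown(Not(r)))
    case And(l, r)      => And(pushNotDown(l), pushNotDown(r))
    case Or(l, r)       => Or(pushNotDown(l), pushNotDown(r))
    // Atom or Not(Atom): nothing left to push.
    case other          => other
  }
}
```

Whether this rewrite belongs in a general simplification rule or next to the CNF code is exactly the open question in the comments above; the sketch takes no position on that.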

cloud-fan · 2015-08-17T16:03:48Z

I'm new to data sources. Can someone explain how we benefit from CNF? Sorry if this is a stupid question...

liancheng · 2015-08-18T03:23:11Z

@cloud-fan Conjunctions (And predicates) are friendlier to filter push-down optimization because they don't require both branches to be convertible. Take the predicate a <= 1 AND someUdf(b) as an example: the UDF is not pushable, but it's still safe to push a <= 1. On the other hand, if you have a > 1 OR !someUdf(b), you can't push down a > 1.
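
The point about conjunctions can be made concrete with a small sketch. The types and helpers below (`Pushable`, `Udf`, `splitConjuncts`, `partition`) are hypothetical and only model the idea; they are not Spark's data source API. A filter is split on its top-level `And`s, and each conjunct is pushed only if every leaf in it is convertible.

```scala
object PushDownSketch {
  // Toy predicate tree; hypothetical stand-ins for illustration only.
  sealed trait Expr
  case class Pushable(sql: String) extends Expr // e.g. "a <= 1"
  case class Udf(name: String) extends Expr     // opaque to the data source
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // Flatten top-level conjunctions into a list of independent conjuncts.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // A conjunct is convertible only if all of its leaves are.
  def convertible(e: Expr): Boolean = e match {
    case Pushable(_) => true
    case Udf(_)      => false
    case And(l, r)   => convertible(l) && convertible(r)
    case Or(l, r)    => convertible(l) && convertible(r)
  }

  // Returns (conjuncts to push down, conjuncts to evaluate after the scan).
  def partition(e: Expr): (Seq[Expr], Seq[Expr]) =
    splitConjuncts(e).partition(convertible)
}
```

With `And(Pushable("a <= 1"), Udf("someUdf"))` the first conjunct is pushed and the UDF stays behind, while `Or(Pushable("a > 1"), Udf("someUdf"))` is a single conjunct that cannot be pushed at all. Converting to CNF increases the number of top-level conjuncts and so gives this split more to work with.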

cloud-fan · 2015-08-18T03:43:25Z

Ah got it! Thanks for the explanation :)

SparkQA · 2015-08-18T08:22:31Z

Test build #41106 has finished for PR 8200 at commit b78105d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-18T09:00:01Z

Test build #41112 has finished for PR 8200 at commit b78105d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-18T12:30:53Z

Test build #41125 has finished for PR 8200 at commit b78105d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-19T04:06:18Z

Test build #41193 has finished for PR 8200 at commit 5c8e3db.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yjshen · 2015-08-21T08:09:00Z

@liancheng, is the current version OK with you? It seems you didn't see the updates.

yjshen · 2015-09-14T09:14:28Z

Since #5700 has been merged, I will revert the PR to its original version, without De Morgan's laws.

SparkQA · 2015-09-14T13:20:28Z

Test build #42422 has finished for PR 8200 at commit 9dc64e6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2015-09-15T21:53:14Z

Is there a reason to not do all of this in the optimizer?

yjshen · 2015-09-16T02:59:20Z

@marmbrus converting a filter into CNF may lead to an expanded filter, which I think is not necessarily a general optimisation.

marmbrus · 2015-09-16T18:37:36Z

I don't think that is true. It's pretty standard to convert predicates into CNF as part of optimization: http://db.cs.berkeley.edu/papers/UCB-MS-zfong.pdf

gatorsmile · 2015-12-18T16:44:10Z

As discussed in another PR #10362, we plan to add CNF normalization into the Optimizer. Will you do it? Otherwise, I can do it. Thanks!

gatorsmile · 2015-12-21T06:17:40Z

It sounds like multiple PRs are blocked by this PR. I will submit a PR for fixing it tomorrow. Thanks!

maropu · 2015-12-21T06:19:50Z

@gatorsmile +1 and great work :))

yjshen · 2015-12-21T10:37:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -489,13 +511,13 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
        case (l, Or(r, l1)) if (Not(l) == l1) => And(l, r)
        case (Or(l, l1), r) if (l1 == Not(r)) => And(l, r)
        case (Or(l1, l), r) if (l1 == Not(r)) => And(l, r)
-        // (a || b) && (a || c)  =>  a || (b && c)
+        // (a || b) && (a || b || c)  =>  a || b


(a || b) && (a || c) => a || (b && c) is just a transformation rather than an optimization. It is only an optimization when we can eliminate one side, as in (a || b) && (a || b || c) => a || b. Besides, the original transformation is the opposite of CNF normalization.

yjshen · 2015-12-21T10:44:53Z

@marmbrus @maropu @gatorsmile I've updated my PR to add CNFNormalization into the Optimizer. I'm so sorry for the delay in my reply.

SparkQA · 2015-12-21T12:04:56Z

Test build #48111 has finished for PR 8200 at commit 7eebf6d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2015-12-21T14:48:18Z

@yjshen Welcome back!

marmbrus · 2015-12-21T19:54:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+object CNFNormalization extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case q: LogicalPlan => q transformExpressionsUp {
+      case or @ Or(left, right) => (left, right) match {


I would not make this a nested match as I think it makes it unnecessarily hard to read (and if you just use transform directly you won't have to manually handle the default case)

SparkQA · 2015-12-22T03:01:19Z

Test build #48154 has finished for PR 8200 at commit 91b2c26.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-12-22T04:51:04Z

Test build #48156 has finished for PR 8200 at commit 91b2c26.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-12-22T07:10:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+  *
+  * Refer to https://en.wikipedia.org/wiki/Conjunctive_normal_form for more information
+  */
+object CNFNormalization extends Rule[LogicalPlan] {


@marmbrus @nongli do we want to do this for all expressions? If we do, maybe we should have a feature flag for this?

Actually come to think of it, it'd be great to be able to turn on/off optimization rules for testing. Most of these can be undocumented.

This seems scary to do in general without some kind of bounding. The transformation can explode the number of expressions. Is there an easy way we can cap this?

How about using the following heuristic solution to prevent exponential explosion:

Add a simple size method to TreeNode, which returns the size (total number of nodes) of a tree:

def size: Int = 1 + children.map(_.size).sum

Give up CNF conversion once the result predicate exceeds a predefined threshold.

For example, we can stop if the size of the converted predicate is 10 times larger than the original one.
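
A minimal sketch of this heuristic, under stated assumptions: the expression type is a toy one, `size` mirrors the one-liner proposed above, and `boundedCnf` with its `maxExpansion` parameter is a hypothetical name, not something from this PR or from #10444. For simplicity the sketch converts first and checks the size afterwards, so it bounds the size of the result but not the work done to produce it; a real rule would want to abort during the conversion.

```scala
object BoundedCnfSketch {
  // Toy expression tree with the proposed size method.
  sealed trait Expr {
    def children: Seq[Expr]
    // Total number of nodes in the tree.
    def size: Int = 1 + children.map(_.size).sum
  }
  case class Atom(name: String) extends Expr {
    def children: Seq[Expr] = Nil
  }
  case class And(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  case class Or(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }

  // CNF of (l || r), assuming l and r are already in CNF.
  private def distribute(l: Expr, r: Expr): Expr = (l, r) match {
    case (And(a, b), _) => And(distribute(a, r), distribute(b, r))
    case (_, And(a, b)) => And(distribute(l, a), distribute(l, b))
    case _              => Or(l, r)
  }

  def toCnf(e: Expr): Expr = e match {
    case And(l, r) => And(toCnf(l), toCnf(r))
    case Or(l, r)  => distribute(toCnf(l), toCnf(r))
    case atom      => atom
  }

  // Keep the original predicate if the CNF form is more than
  // maxExpansion times larger (10 is the example factor from the comment).
  def boundedCnf(e: Expr, maxExpansion: Int = 10): Expr = {
    val converted = toCnf(e)
    if (converted.size > e.size * maxExpansion) e else converted
  }
}
```

Note that this sketch compares against the size of whatever input it is given, so it does not by itself address the concern about rerunning the pass against the original limit.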

(I wonder how traditional RDBMSs cope with the CNF exponential expansion issue?)

Capping the size seems reasonable. We need to make sure it continues to work even if the pass is rerun (respects the original limit).

liancheng · 2015-12-23T09:34:25Z

Made a draft PR #10444 for further discussion. It's based on the idea described in another comment of mine above to work around the exponential expansion issue of CNF normalization.

liancheng · 2015-12-23T10:41:00Z

Just realized that one of the common factor elimination rules defined in BooleanSimplification actually reverts CNF normalization, thus we may never reach fixed point. Should we remove it?

Update: According to #3784, this rule is actually useful for optimizing cartesian product into equi-join.
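
The fixed-point concern can be illustrated with two single-step rewrites that undo each other. Both functions below are simplified, hypothetical versions written for this note (the real rule in BooleanSimplification handles more shapes); they only show why alternating the two rules can cycle instead of converging.

```scala
object FixedPointSketch {
  // Toy expression tree; hypothetical stand-ins, not Catalyst classes.
  sealed trait Expr
  case class Atom(name: String) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // Common factor elimination: (a || b) && (a || c) => a || (b && c)
  def factor(e: Expr): Expr = e match {
    case And(Or(a1, b), Or(a2, c)) if a1 == a2 => Or(a1, And(b, c))
    case other                                 => other
  }

  // One CNF distribution step: a || (b && c) => (a || b) && (a || c)
  def distribute(e: Expr): Expr = e match {
    case Or(a, And(b, c)) => And(Or(a, b), Or(a, c))
    case other            => other
  }
}
```

Starting from `And(Or(a, b), Or(a, c))`, `factor` produces `Or(a, And(b, c))` and `distribute` turns it straight back, so an optimizer batch that applies both rules until nothing changes would not settle on this input.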

nongli · 2015-12-28T20:23:44Z

...lyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/BooleanSimplificationSuite.scala

@@ -116,13 +110,4 @@ class BooleanSimplificationSuite extends PlanTest with PredicateHelper {
      testRelation.where('a > 2 && ('b > 3 || 'b < 5)))
    comparePlans(actual, expected)


Can we add this case with the CNF rule enabled?

liancheng reviewed Aug 14, 2015

chenghao-intel reviewed Aug 17, 2015

yjshen force-pushed the cnf branch from 3d7fc52 to b78105d Compare August 18, 2015 07:52

cloud-fan mentioned this pull request Sep 11, 2015

[SPARK-7142][SQL]: Minor enhancement to BooleanSimplification Optimizer rule #5700

Closed

yjshen force-pushed the cnf branch from 5c8e3db to 9dc64e6 Compare September 14, 2015 11:11

codingjaguar mentioned this pull request Dec 10, 2015

[SPARK-12161][SQL] Ignore order of predicates in cache matching #10163

Closed

yjshen added 3 commits December 21, 2015 11:35

CNF transformation

0afcdd3

address comments

322f58a

merge apache-spark/master

125e97b

maropu mentioned this pull request Dec 21, 2015

[SPARK-12085] [SQL] The join condition hidden in DNF can't be pushed down to join operator #10087

Closed

make CNFNormalizaiton as a general optimization

7eebf6d

yjshen force-pushed the cnf branch from 9dc64e6 to 7eebf6d Compare December 21, 2015 10:23

yjshen reviewed Dec 21, 2015

yjshen changed the title ~~[SPARK-6624][SQL]Convert filters into CNF for data sources~~ [SPARK-6624][SQL]Add CNF Normalization as part of optimization Dec 21, 2015

marmbrus reviewed Dec 21, 2015

address Michael's comment

91b2c26

rxin reviewed Dec 22, 2015

liancheng mentioned this pull request Dec 23, 2015

[SPARK-6624][WIP] Draft of another alternative version of CNF normalization #10444

Closed

nongli reviewed Dec 28, 2015

yjshen closed this May 18, 2016

gatorsmile mentioned this pull request Sep 7, 2016

[SPARK-17357][SQL] Fix current predicate pushdown #14912

Closed

viirya mentioned this pull request Oct 20, 2016

[SPARK-17357][SPARK-6624][SQL] Convert filter predicate to CNF in Optimizer for pushdown #15558

Closed
