[SPARK-6624][SQL]Add CNF Normalization as part of optimization #8200

Closed
wants to merge 5 commits into from

@yjshen
Member

yjshen commented Aug 14, 2015

This PR aims at adding CNF Normalization as part of optimization.

For example:

a && b || f => (a || f) && (b || f)
a || b || c && d => (a || b || c) && (a || b || d)
a || (b && c || d) => (a || b || d) && (a || c || d)

JIRA: https://issues.apache.org/jira/browse/SPARK-6624

case (l, r) => Or(l, r)
}
}
}
Contributor

The following version is arguably more readable:

  private def pushOrToBottom(condition: Expression): Expression = {
    condition match {
      case Or(And(innerLhs, innerRhs), rhs) =>
        And(pushOrToBottom(Or(innerLhs, rhs)), pushOrToBottom(Or(innerRhs, rhs)))

      case Or(lhs, And(innerLhs, innerRhs)) =>
        And(pushOrToBottom(Or(lhs, innerLhs)), pushOrToBottom(Or(lhs, innerRhs)))

      case _ => condition
    }
  }

Shall we also cover cases like Not(Or(x, y)) => And(Not(x), Not(y)) here?
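
To make the suggested rewrite concrete, here is a minimal, self-contained sketch on a toy expression tree. The types Expr, Atom, And and Or and the driver toCnf are hypothetical stand-ins for illustration, not Catalyst classes, and the sketch ignores Not entirely. It assumes that both operands of an Or are already in CNF before pushOrToBottom is applied, which the bottom-up driver guarantees.

```scala
object CnfSketch {
  // Toy expression tree (hypothetical; not Catalyst's Expression hierarchy).
  sealed trait Expr
  case class Atom(name: String) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // Distribute an Or over an And found directly beneath it.
  // Assumes both operands of the Or are already in CNF.
  def pushOrToBottom(e: Expr): Expr = e match {
    case Or(And(innerLhs, innerRhs), rhs) =>
      And(pushOrToBottom(Or(innerLhs, rhs)), pushOrToBottom(Or(innerRhs, rhs)))
    case Or(lhs, And(innerLhs, innerRhs)) =>
      And(pushOrToBottom(Or(lhs, innerLhs)), pushOrToBottom(Or(lhs, innerRhs)))
    case other => other
  }

  // Bottom-up driver: normalize children first, then distribute at this node.
  def toCnf(e: Expr): Expr = e match {
    case And(l, r) => And(toCnf(l), toCnf(r))
    case Or(l, r)  => pushOrToBottom(Or(toCnf(l), toCnf(r)))
    case atom      => atom
  }
}
```

With atoms a, b and f, CnfSketch.toCnf applied to Or(And(a, b), f) yields And(Or(a, f), Or(b, f)), which is the first example in the PR description.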

Member Author

Maybe I could do Not(Or(x, y)) => And(Not(x), Not(y)) in BooleanSimplification? Or just push Not to the bottom in a preceding phase?

Contributor

Oh I see. Moving the De Morgan transformation into BooleanSimplification makes sense. Please document this assumption. And you need to use optimizedPlan instead of analyzed in your test suite.

Contributor

On second thought, I'd rather not move the De Morgan conversion into BooleanSimplification. Coupling the two seems dangerous, and the assumption could easily be broken by others in the future.

@SparkQA

SparkQA commented Aug 14, 2015

Test build #40873 has finished for PR 8200 at commit c00f3a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yjshen
Member Author

yjshen commented Aug 14, 2015

Thanks @liancheng, I will update my PR soon.

atoms += expression
}
atoms.sortBy(_.toString).reduce(Or)
}
Contributor

Seems that this method is equivalent to:

expression
  .collect { case e if !e.isInstanceOf[Or] => e }
  .sortBy(_.toString)
  .reduce(Or)

Contributor

Maybe just return a Seq[Expression] without the final reduce(Or).

Contributor

(The assumption here is that expression has already gone through CNF transformation, so that any sub-expression that is not an Or doesn't contain any Or either.)

Contributor

Sorry, please ignore my comments above. I made a mistake, and the assumption is wrong...

Member Author

Is it possible that we still get Not(expr) even after boolean simplification has finished? If so, simply collecting with !e.isInstanceOf[Or] doesn't seem correct.

Contributor

Actually... It seems to be correct? Since you first do a splitConjunctivePredicates and then pass in elements of the result to this method... 😵

Member Author

I mean, will Not(expr) be collected as Not(expr) and expr as two separate expressions?

Contributor

@yjshen Yeah, the above assumption is only correct if you do REAL CNF conversion in this PR. Currently De Morgan's laws are not considered.

(This one replies to this comment.)

Contributor

@yjshen If you do real CNF conversion here (namely, taking Not into consideration), then it would be OK, since e in Not(e) cannot be Or.

Contributor

Your current code also suffers from the Not(Or(x, y)) case (x and y are atoms), because you're using foreachUp rather than transformUp here.
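
The question in this thread is whether a whole-tree collect would emit both Not(expr) and expr as separate disjuncts. One way to sidestep that, sketched below on a toy tree, is to recurse only through Or nodes, so that anything that is not an Or is kept whole. The types and the names splitDisjuncts and sortedDisjuncts are hypothetical and used for illustration only; this is not the code in the PR.

```scala
object DisjunctSketch {
  // Toy expression tree (hypothetical; not Catalyst's Expression hierarchy).
  sealed trait Expr
  case class Atom(name: String) extends Expr
  case class Not(child: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // Descend only through Or nodes. Not(x) is returned as a single disjunct,
  // and x is never emitted on its own.
  def splitDisjuncts(e: Expr): Seq[Expr] = e match {
    case Or(l, r) => splitDisjuncts(l) ++ splitDisjuncts(r)
    case other    => Seq(other)
  }

  // Canonical order by string form, mirroring the sortBy(_.toString) in the diff above.
  def sortedDisjuncts(e: Expr): Seq[Expr] = splitDisjuncts(e).sortBy(_.toString)
}
```

Returning a Seq rather than reducing back into an Or follows the reviewer's suggestion earlier in this thread.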

@SparkQA

SparkQA commented Aug 14, 2015

Test build #40897 has finished for PR 8200 at commit 3d7fc52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yjshen
Member Author

yjshen commented Aug 17, 2015

@liancheng, would you mind reviewing this again? Thanks.

@@ -501,6 +501,10 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
case LessThanOrEqual(l, r) => GreaterThan(l, r)
// not(not(e)) => e
case Not(e) => e
// De Morgan's law: !(a || b) => !a && !b
Contributor

From BooleanSimplification's point of view, this is probably not an optimization, is it? We'd probably better move this logic into predicates.scala, where it would be used only for join predicate push-down.

Member Author

I think it is a boolean simplification, because further optimisation is only possible once Not sits on a leaf node; the Not(LessThanOrEqual) and Not(Not) cases above are good examples.

Contributor

I mean, we can't tell that And(Not(lhs), Not(rhs)) is more optimal than Not(Or(lhs, rhs)), can we? Sorry if I missed something.

Contributor

As mentioned in #8200 (comment), I tend not to have this one in BooleanSimplification. I'd prefer not having CNF conversion coupled with BooleanSimplification, and do transformations related to Not in the CNF part.

Contributor

@chenghao-intel The Not case added here is not for simplification. It's used to push Not predicates to the bottom, so that e in Not(e) cannot be either And or Or. CNF conversion code relies on this assumption.
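
The invariant described here (the e in Not(e) is never an And or an Or) can be illustrated with a small sketch that pushes Not down using De Morgan's laws and double-negation elimination. The toy types and the name pushNotDown are hypothetical; the sketch does not reproduce the rules in the diff and leaves out comparison flipping such as Not(LessThanOrEqual).

```scala
object NotPushdownSketch {
  // Toy expression tree (hypothetical; not Catalyst's Expression hierarchy).
  sealed trait Expr
  case class Atom(name: String) extends Expr
  case class Not(child: Expr) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // After this pass, every remaining Not wraps an Atom.
  def pushNotDown(e: Expr): Expr = e match {
    case Not(Not(x))    => pushNotDown(x)                                   // !!x => x
    case Not(Or(l, r))  => And(pushNotDown(Not(l)), pushNotDown(Not(r)))    // !(l || r) => !l && !r
    case Not(And(l, r)) => Or(pushNotDown(Not(l)), pushNotDown(Not(r)))     // !(l && r) => !l || !r
    case And(l, r)      => And(pushNotDown(l), pushNotDown(r))
    case Or(l, r)       => Or(pushNotDown(l), pushNotDown(r))
    case other          => other                                            // Atom or Not(Atom)
  }
}
```

Running such a pass before Or distribution is one way to establish the assumption that the CNF conversion code relies on, whichever rule ends up hosting it.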

Contributor

Yeah, I got it. I meant the same thing: for normal expression evaluation, this code change could cause a performance regression, so it would be better to move the CNF code into predicates.scala or patterns.scala, where it applies only to predicate push-down.

@cloud-fan
Contributor

I'm new to data sources. Can someone explain how we benefit from CNF? I'm sorry if this question is too stupid...

@liancheng
Contributor

@cloud-fan Conjunctions (And predicates) are friendlier to filter push-down optimization, because pushing one branch down doesn't require the other branch to be convertible. Take the predicate a <= 1 AND someUdf(b) as an example: the UDF is not pushable, but it's still safe to push a <= 1. On the other hand, if you have a > 1 OR !someUdf(b), you can't push down a > 1.
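
A minimal sketch of this argument, on a toy model where each leaf is either a pushable comparison or an opaque UDF call. All names here (Pushable, Udf, splitConjuncts, convertible, partitionForPushdown) are hypothetical and only loosely modeled on the idea of splitting a predicate into conjuncts; they are not data source API calls.

```scala
object PushdownSketch {
  // Toy predicate tree (hypothetical; not Catalyst's Expression hierarchy).
  sealed trait Expr
  case class Pushable(sql: String) extends Expr   // e.g. "a <= 1"
  case class Udf(call: String) extends Expr       // opaque to the data source
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // Top-level conjuncts can be handled independently of one another.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // A conjunct is convertible only if every leaf inside it is.
  def convertible(e: Expr): Boolean = e match {
    case Pushable(_) => true
    case Udf(_)      => false
    case And(l, r)   => convertible(l) && convertible(r)
    case Or(l, r)    => convertible(l) && convertible(r)
  }

  // Returns (conjuncts safe to push down, conjuncts that must be evaluated afterwards).
  def partitionForPushdown(e: Expr): (Seq[Expr], Seq[Expr]) =
    splitConjuncts(e).partition(convertible)
}
```

For a <= 1 AND someUdf(b), the comparison lands in the pushed group and the UDF in the residual group. For a > 1 OR !someUdf(b), the whole disjunction is a single conjunct containing a UDF, so nothing is pushed.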

@cloud-fan
Contributor

Ah got it! Thanks for the explanation :)

@SparkQA

SparkQA commented Aug 18, 2015

Test build #41106 has finished for PR 8200 at commit b78105d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 18, 2015

Test build #41112 has finished for PR 8200 at commit b78105d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 18, 2015

Test build #41125 has finished for PR 8200 at commit b78105d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 19, 2015

Test build #41193 has finished for PR 8200 at commit 5c8e3db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yjshen
Member Author

yjshen commented Aug 21, 2015

@liancheng, is the current version OK with you? It seems you haven't seen the updates.

@yjshen
Member Author

yjshen commented Sep 14, 2015

Since #5700 has been merged, I will revert the PR to its original version, without De Morgan's laws.

@SparkQA

SparkQA commented Sep 14, 2015

Test build #42422 has finished for PR 8200 at commit 9dc64e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

Is there a reason to not do all of this in the optimizer?

@yjshen
Member Author

yjshen commented Sep 16, 2015

@marmbrus converting a filter into CNF may lead to an expanded filter, which I think is not necessarily a general optimisation.

@marmbrus
Contributor

I don't think that is true. It's pretty standard to convert predicates into CNF as part of optimization: http://db.cs.berkeley.edu/papers/UCB-MS-zfong.pdf

@gatorsmile
Member

As discussed in another PR #10362, we plan to add CNF normalization into the Optimizer. Will you do it? Otherwise, I can do it. Thanks!

@gatorsmile
Member

It sounds like multiple PRs are blocked by this PR. I will submit a PR for fixing it tomorrow. Thanks!

@maropu
Member

maropu commented Dec 21, 2015

@gatorsmile +1 and great work :))

@@ -489,13 +511,13 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
case (l, Or(r, l1)) if (Not(l) == l1) => And(l, r)
case (Or(l, l1), r) if (l1 == Not(r)) => And(l, r)
case (Or(l1, l), r) if (l1 == Not(r)) => And(l, r)
// (a || b) && (a || c) => a || (b && c)
// (a || b) && (a || b || c) => a || b
Member Author

(a || b) && (a || c) => a || (b && c) is just a transformation rather than an optimization; it is only an optimization when one side can be eliminated, as in (a || b) && (a || b || c) => a || b. Besides, the original transformation is the opposite of CNF normalization.

@yjshen
Member Author

yjshen commented Dec 21, 2015

@marmbrus @maropu @gatorsmile I've updated my PR to add CNFNormalization into the Optimizer. I'm so sorry for the delay in my reply.

@yjshen changed the title from "[SPARK-6624][SQL]Convert filters into CNF for data sources" to "[SPARK-6624][SQL]Add CNF Normalization as part of optimization" on Dec 21, 2015
@SparkQA

SparkQA commented Dec 21, 2015

Test build #48111 has finished for PR 8200 at commit 7eebf6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@yjshen Welcome back!

object CNFNormalization extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case q: LogicalPlan => q transformExpressionsUp {
case or @ Or(left, right) => (left, right) match {
Contributor

I would not make this a nested match as I think it makes it unnecessarily hard to read (and if you just use transform directly you won't have to manually handle the default case)

@SparkQA

SparkQA commented Dec 22, 2015

Test build #48154 has finished for PR 8200 at commit 91b2c26.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 22, 2015

Test build #48156 has finished for PR 8200 at commit 91b2c26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*
* Refer to https://en.wikipedia.org/wiki/Conjunctive_normal_form for more information
*/
object CNFNormalization extends Rule[LogicalPlan] {
Contributor

@marmbrus @nongli do we want to do this for all expressions? If we do, maybe we should have a feature flag for this?

Contributor

Actually come to think of it, it'd be great to be able to turn on/off optimization rules for testing. Most of these can be undocumented.

Contributor

This seems scary to do in general without some kind of bounding. The transformation can explode the number of expressions. Is there an easy way we can cap this?

Contributor

How about using the following heuristic solution to prevent exponential explosion:

  1. Add a simple size method to TreeNode, which returns the size (total number of nodes) of a tree:

       def size: Int = 1 + children.map(_.size).sum

  2. Give up CNF conversion once the result predicate exceeds a predefined threshold.

For example, we can stop if the size of the converted predicate is 10 times larger than the original one.
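
The heuristic can be sketched on a toy tree as follows. Everything here is hypothetical: the types, the helper names, and the maxGrowth parameter (defaulting to the factor of 10 mentioned above). Note one simplification: this sketch converts fully and only then compares sizes, so it bounds the size of the result but not the work done during conversion; a real rule would presumably want to bail out earlier.

```scala
object CappedCnfSketch {
  // Toy expression tree with the proposed size method (hypothetical types).
  sealed trait Expr {
    def children: Seq[Expr]
    // Total number of nodes in the tree rooted here.
    def size: Int = 1 + children.map(_.size).sum
  }
  case class Atom(name: String) extends Expr { def children: Seq[Expr] = Nil }
  case class And(left: Expr, right: Expr) extends Expr { def children: Seq[Expr] = Seq(left, right) }
  case class Or(left: Expr, right: Expr) extends Expr { def children: Seq[Expr] = Seq(left, right) }

  // Distribute Or over And, assuming both operands are already in CNF.
  private def distribute(e: Expr): Expr = e match {
    case Or(And(l, r), rhs) => And(distribute(Or(l, rhs)), distribute(Or(r, rhs)))
    case Or(lhs, And(l, r)) => And(distribute(Or(lhs, l)), distribute(Or(lhs, r)))
    case other              => other
  }

  def toCnf(e: Expr): Expr = e match {
    case And(l, r) => And(toCnf(l), toCnf(r))
    case Or(l, r)  => distribute(Or(toCnf(l), toCnf(r)))
    case atom      => atom
  }

  // Keep the original predicate if the CNF form grew by more than maxGrowth times.
  def toCnfCapped(e: Expr, maxGrowth: Int = 10): Expr = {
    val converted = toCnf(e)
    if (converted.size.toLong > maxGrowth.toLong * e.size) e else converted
  }
}
```

For (a && b) || (c && d), the original tree has 7 nodes and the CNF form has 15, so the conversion is kept under the default cap and rejected with maxGrowth = 2.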

Contributor

(I wonder how traditional RDBMS copes with the CNF exponential expansion issue?)

Contributor

Capping the size seems reasonable. We need to make sure it continues to work even if the pass is rerun (respects the original limit).

@liancheng
Contributor

Made a draft PR #10444 for further discussion. It's based on the idea described in another comment of mine above, to work around the exponential expansion issue of CNF normalization.

@liancheng
Contributor

Just realized that one of the common factor elimination rules defined in BooleanSimplification actually reverts CNF normalization, so we may never reach a fixed point. Should we remove it?

Update: According to #3784, this rule is actually useful for optimizing cartesian product into equi-join.
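
The fixed-point concern can be seen in a toy model with just the two opposing rewrites. The types and the names distribute and factor are hypothetical, and each function covers only a single pattern; they are not the actual rules in BooleanSimplification or CNFNormalization.

```scala
object OscillationSketch {
  // Toy expression tree (hypothetical; not Catalyst's Expression hierarchy).
  sealed trait Expr
  case class Atom(name: String) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // CNF direction: a || (b && c) => (a || b) && (a || c)
  def distribute(e: Expr): Expr = e match {
    case Or(a, And(b, c)) => And(Or(a, b), Or(a, c))
    case other            => other
  }

  // Common-factor direction: (a || b) && (a || c) => a || (b && c)
  def factor(e: Expr): Expr = e match {
    case And(Or(a1, b), Or(a2, c)) if a1 == a2 => Or(a1, And(b, c))
    case other                                 => other
  }
}
```

Applying factor after distribute returns the original expression, and vice versa, so an executor that alternates the two until nothing changes would keep flipping between the two forms instead of converging.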

@@ -116,13 +110,4 @@ class BooleanSimplificationSuite extends PlanTest with PredicateHelper {
testRelation.where('a > 2 && ('b > 3 || 'b < 5)))
comparePlans(actual, expected)
Contributor

Can we add this case with the CNF rule enabled?

10 participants