-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6624][SQL]Add CNF Normalization as part of optimization #8200
Conversation
case (l, r) => Or(l, r) | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following version may be questionably more readable:
private def pushOrToBottom(condition: Expression): Expression = {
condition match {
case Or(And(innerLhs, innerRhs), rhs) =>
And(pushOrToBottom(Or(innerLhs, rhs)), pushOrToBottom(Or(innerRhs, rhs)))
case Or(lhs, And(innerLhs, innerRhs)) =>
And(pushOrToBottom(Or(lhs, innerLhs)), pushOrToBottom(Or(lhs, innerRhs)))
case _ => condition
}
}
Shall we also cover cases like Not(Or(x, y))
=> And(Not(x), Not(y))
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I could do Not(Or(x, y)) => And(Not(x), Not(y))
in BooleanSimplification? or just push Not to bottom as the previous phase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. Moving De Morgan transformation intto BooleanSimplification
makes sense. Please document this assumption. And you need to use optimizedPlan
instead of analyzed
in your test suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a second thought, I tend to not move De Morgan conversion to BooleanSimplification
. Coupling these two seems to be dangerous and the assumption can be easily broken by others in the future.
Test build #40873 has finished for PR 8200 at commit
|
Thanks @liancheng , I will update my PR soon. |
atoms += expression | ||
} | ||
atoms.sortBy(_.toString).reduce(Or) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that this method is equivalent to:
expression
.collect { case e if !e.isInstanceOf[Or] => e }
.sortBy(_.toString)
.reduce(Or)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just return a Seq[Expression]
without the final reduce(Or)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The assumption here is that expression
has already gone through CNF transformation, so that any sub-expression that is not an Or
doesn't contain any Or
either.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, please ignore my comments above, made a mistake and the assumption is wrong...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that possible we still get Not(expr) even we finished boolean simplification? If so, simply matching !e.isInstanceOf[Or]
and collect seems not proper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually... It seems to be correct? Since you first do a splitConjunctivePredicates
and then pass in elements of the result to this method... 😵
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, will Not(expr)
be collected as Not(expr)
and expr
as two separate expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yjshen Yeah, the above assumption is only correct if you do REAL CNF conversion in this PR. Currently De Morgan law is not considered.
(This one replies this comment.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yjshen If you do real CNF conversion here (namely, taking Not
into consideration), then it would be OK, since e
in Not(e)
cannot be Or
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your current code also suffers from the Not(Or(x, y))
case (x
and y
are atoms), because you're using foreachUp
rather than transformUp
here.
Test build #40897 has finished for PR 8200 at commit
|
@liancheng, do you mind to review this again? Thanks. |
@@ -501,6 +501,10 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { | |||
case LessThanOrEqual(l, r) => GreaterThan(l, r) | |||
// not(not(e)) => e | |||
case Not(e) => e | |||
// De Morgan's law: !(a || b) => !a && !b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To the BooleanSimplification
, this probably not an optimization, isn't it? Probably we'd better to move this logic into predicates.scala
, which used only for the cases for the join predicate push down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a boolean simplification because we can only do further optimisation when Not
is on leaf node, the above Not(LessThanOrEqual)
or Not(Not)
is a good example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean we couldn't tell And(Not(lhs), Not(rhs))
is more optimal than Not(Or(lhs, rhs))
, do we? Sorry if I missed something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in #8200 (comment), I tend not to have this one in BooleanSimplification
. I'd prefer not having CNF conversion coupled with BooleanSimplification
, and do transformations related to Not
in the CNF part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chenghao-intel The Not
case added here is not for simplification. It's used to push Not
predicates to the bottom, so that e
in Not(e)
cannot be either And
or Or
. CNF conversion code relies on this assumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I got it. I meant the same thing, for the normal expression evaluation, this code change probably cause performance regression, it's better to move the CNF stuff into the predicates.scala
or patterns.scala
, which only works for the predicate push down.
I'm new to data source, can someone explain what can we benefit from CNF? I'm sorry if this question is too stupid... |
@cloud-fan Conjunctions ( |
Ah got it! Thanks for the explanation :) |
Test build #41106 has finished for PR 8200 at commit
|
Test build #41112 has finished for PR 8200 at commit
|
Test build #41125 has finished for PR 8200 at commit
|
Test build #41193 has finished for PR 8200 at commit
|
@liancheng , is the current version OK to you? seems you didn't see the updates. |
Since #5700 has been merged, I would revert the PR to its original version, without De Morgan's laws |
Test build #42422 has finished for PR 8200 at commit
|
Is there a reason to not do all of this in the optimizer? |
@marmbrus converting a filter into CNF may lead to an expanded filter, which I think is not necessarily a general optimisation. |
I don't think that is true. Its pretty standard to convert predicates into CNF as part of optimization: http://db.cs.berkeley.edu/papers/UCB-MS-zfong.pdf |
As discussed in another PR #10362, we plan to add CNF normalization into the Optimizer. Will you do it? Otherwise, I can do it. Thanks! |
It sounds like multiple PRs are blocked by this PR. I will submit a PR for fixing it tomorrow. Thanks! |
@gatorsmile +1 and great work :)) |
@@ -489,13 +511,13 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { | |||
case (l, Or(r, l1)) if (Not(l) == l1) => And(l, r) | |||
case (Or(l, l1), r) if (l1 == Not(r)) => And(l, r) | |||
case (Or(l1, l), r) if (l1 == Not(r)) => And(l, r) | |||
// (a || b) && (a || c) => a || (b && c) | |||
// (a || b) && (a || b || c) => a || b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(a || b) && (a || c) => a || (b && c)
is just a transformation instead of optimization, it is only the case when we could eliminate one side like: (a || b) && (a || b || c) => a || b
. Besides, the original transformation is opposite to CNF Normalize.
@marmbrus @maropu @gatorsmile I've update my PR to add CNFNormalization into the Optimizer. I'm so sorry for the delay in my reply. |
Test build #48111 has finished for PR 8200 at commit
|
@yjshen Welcome back! |
object CNFNormalization extends Rule[LogicalPlan] { | ||
def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
case q: LogicalPlan => q transformExpressionsUp { | ||
case or @ Or(left, right) => (left, right) match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not make this a nested match as I think it makes it unnecessarily hard to read (and if you just use transform directly you won't have to manually handle the default case)
Test build #48154 has finished for PR 8200 at commit
|
Test build #48156 has finished for PR 8200 at commit
|
* | ||
* Refer to https://en.wikipedia.org/wiki/Conjunctive_normal_form for more information | ||
*/ | ||
object CNFNormalization extends Rule[LogicalPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually come to think of it, it'd be great to be able to turn on/off optimization rules for testing. Most of these can be undocumented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems scary to do in general without some kind of bounding. The transformation can explode the number of expressions. Is there an easy way we can cap this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using the following heuristic solution to prevent exponential explosion:
- Add a simple
size
method toTreeNode
, which returns the size (total number of nodes) of a tree:
def size: Int = 1 + children.map(_.size).sum
- Gives up CNF conversion once the result predicate exceeds a predefined threshold.
For example, we can stop if the size of the converted predicate is 10 times larger than the original one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I wonder how traditional RDBMS copes with the CNF exponential expansion issue?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capping the size seems reasonable. We need to make sure it continues to work even if the pass is rerun (respects the original limit).
Made a draft PR #10444 for further discussion. It's based on the idea described in another comment of mine above to workaround exponential expansion issue of CNF normalization. |
Just realized that one of the common factor elimination rules defined in Update: According to #3784, this rule is actually useful for optimizing cartesian product into equi-join. |
@@ -116,13 +110,4 @@ class BooleanSimplificationSuite extends PlanTest with PredicateHelper { | |||
testRelation.where('a > 2 && ('b > 3 || 'b < 5))) | |||
comparePlans(actual, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add this case with the CNF rule enabled?
This PR aims at adding CNF Normalization as part of optimization.
For example:
JIRA: https://issues.apache.org/jira/browse/SPARK-6624