-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27915][SQL][WIP] Update logical Filter's output nullability based on IsNotNull conditions #24765
Conversation
…l filter operations.
// However, if g is NOT NullIntolerant (e.g. if g(null) is non-null) then we cannot | ||
// conclude anything about x's nullability. | ||
def getExprIdIfNamed(expr: Expression): Set[ExprId] = expr match { | ||
case ne: NamedExpression => Set(ne.toAttribute.exprId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be AttributeReference
? I couldn't remember offhand how to get ExprIds
from arbitrary expressions, hence this hack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use AttributeSet
?
Test build #106056 has finished for PR 24765 at commit
|
jenkins retest this please |
// Mark this as empty. We'll evaluate the input during doConsume(). We don't want to evaluate | ||
// all the variables at the beginning to take advantage of short circuiting. | ||
override def usedInputs: AttributeSet = AttributeSet.empty | ||
|
||
// Split out all the IsNotNulls from condition. | ||
private val (notNullPreds, otherPreds) = splitConjunctivePredicates(condition).partition { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the old code here to be slightly confusing because it seemed to be using notNullPreds
for two different purposes:
- If we see
IsNotNull
conjuncts in the filter then evaluate them first / earlier because (a) these expressions are cheap to evaluate and may allow for short-circuiting and skipping more expensive expressions, and (b) evaluating these earlier allows other expressions to omit null checks (for example, if we haveIsNotNull(x)
andx * 100 < 10
then we already implicitly need to null-checkx
as part of the second expression so we might as well do the explicit null check expression first). - Given that tuples have successfully passed through the filter, we can rely on the presence of
IsNotNull
checks to default subsequent expressions' null checks tofalse
. For example, let's say we had a.filter().select()
which gets compiled into a single whole stage codegen: after tuples have passed through the filter we know that certain fields cannot possibly be null, so we can elide null checks at codegen time by just settingnullable = false
in subsequent code.
There might be some subtleties related in (1) related to non-deterministic expressions, but I think that's accounted for further down at the place where we're actually generating the checks.
In the old code, the (notNullPreds, otherPreds)
on this line was being used for both purposes: for (1) I think we could simply collect all IsNotNull
expressions, but the existing implementation of (2) relied on the additional nullIntolerant
/ a.references
checks in order to be correct.
In this PR, I've separated these two usages: the "update nullability for downstream operators" now uses the more precise condition implemented in getImpliedNotNullExprIds
, while the "optimize short-circuiting" simply checks for IsNotNull
and ignores child attributes.
Test build #106058 has finished for PR 24765 at commit
|
Test build #106057 has finished for PR 24765 at commit
|
Test build #106059 has finished for PR 24765 at commit
|
This seems to break tests in |
/cc @maropu, who submitted a very similar change ~1 year prior in #21148 (I was unaware of that PR when I created this one). Chasing down references from that PR, I discovered #23390 and #23508, both of which are concerned with fixing up nullability in attribute references; maybe one of those holds the trick to fixing the blocker identified in my previous comment. |
Yea, thanks for revisiting this, @JoshRosen! I remember we have the two suggestions from @gatorsmile and @cloud-fan in the previous discussion; 1) nullability is just a hint for the optimizer and it might be good to add a new trait for this hint. And, 2) the optimization for |
val childOutputNullability = child.output.map(a => a.exprId -> a.nullable).toMap | ||
projectList | ||
.map(_.toAttribute) | ||
.map{ a => childOutputNullability.get(a.exprId).map(a.withNullability).getOrElse(a) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to fix this part? It seems UpdateAttributeNullability
could handle this case if Filter.output
works well?
…ames in PlanTestBase.comparePlans failures ## What changes were proposed in this pull request? This pr proposes to add a prefix '*' to non-nullable attribute names in PlanTestBase.comparePlans failures. In the current master, nullability mismatches might generate the same error message for left/right logical plans like this; ``` // This failure message was extracted from apache#24765 - constraints should be inferred from aliased literals *** FAILED *** == FAIL: Plans do not match === !'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0) :- Filter (isnotnull(a#0) AND (2 <=> a#0)) :- Filter (isnotnull(a#0) AND (2 <=> a#0)) : +- LocalRelation <empty>, [a#0, b#0, c#0] : +- LocalRelation <empty>, [a#0, b#0, c#0] +- Project [2 AS two#0] +- Project [2 AS two#0] +- LocalRelation <empty>, [a#0, b#0, c#0] +- LocalRelation <empty>, [a#0, b#0, c#0] (PlanTest.scala:145) ``` With this pr, this error message is changed to one below; ``` - constraints should be inferred from aliased literals *** FAILED *** == FAIL: Plans do not match === !'Join Inner, (*two#0 = a#0) 'Join Inner, (*two#0 = *a#0) :- Filter (isnotnull(a#0) AND (2 <=> a#0)) :- Filter (isnotnull(a#0) AND (2 <=> a#0)) : +- LocalRelation <empty>, [a#0, b#0, c#0] : +- LocalRelation <empty>, [a#0, b#0, c#0] +- Project [2 AS two#0] +- Project [2 AS two#0] +- LocalRelation <empty>, [a#0, b#0, c#0] +- LocalRelation <empty>, [a#0, b#0, c#0] (PlanTest.scala:145) ``` ## How was this patch tested? N/A Closes apache#25213 from maropu/MarkForNullability. Authored-by: Takeshi Yamamuro <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
What changes were proposed in this pull request?
This PR changes the logical
Filter
operator to update its outputs' nullability when filter conditions imply that outputs cannot be null.In addition, I refined similar existing logic in the physical
FilterExec
(changing the existing code to be more precise / less conservative in its non-nullability inference) and improved propagation of inferred nullability information inProject
.This is useful because of how it composes with other optimizations: Spark has several logical and physical optimizations which leverage non-nullability, so improving nullability inference increases the value of those existing optimizations.
.schema
may change as a result of this optimization: this may have consequences in case nullability information is used by downstream systems (e.g. forCREATE TABLE
DDL).Examples
Consider the query
where
t.key
is nullable.Because of the
key IS NOT NULL
filter condition,key
will always be non-null. Prior to this patch, this query's result schema was overly-conservative, continuing to markkey
as nullable. However, if we take advantage of thekey IS NOT NULL
condition we can setnullable = false
forkey
.This was a somewhat trivial example, so let's look at some more complicated cases:
Consider
where all columns of
A
andB
are nullable. Because of the equality join condition, we know thatkey
must be non-null in both tables. In addition, the condition(A.num + B.num) > 0
can only hold if bothnum
values are not null: addition is a null-intolerant operator, meaning that it returnsnull
if any of its operands is null.Leveraging this, we should be able to mark both
key
andvalue
as non-null in the join result's schema (even though both values are nullable in the underlying input relation).Finally, let's look at an example of a non null-intolerant operator:
coalesce(a, b) IS NOT NULL
could still mean thata
orb
is null, so inwe can infer that
qux
is not null but cannot make any claims aboutfoo
orbar
's nullability.Description of changes
PredicateHelper.getImpliedNotNullExprIds(IsNotNull)
helper, which takes anIsNotNull
expression and returns theExprId
s of expressions which cannot be null. This handles simple cases likeIsNotNull(columnFromTable)
, as well as more complex cases involving expression trees (properly accounting for null-(in)tolerance).FilterExec
, but I think it was overly conservative: givenIsNotNull(x)
, it would claim thatx
and all of its descendants were not null if and only if every ancestor ofx
wasNullIntolerant
. However, even ifx
is null-tolerant we can still make claims aboutx
's non-nullability even if we can't make further claims about its children.logical.Filter
to leverage this new function to update output nullability.FilterExec
to re-use this logic. This part is a bit tricky because theFilterExec
code looks atIsNotNull
expressions both for optimizing the order of expression evaluation and for refining nullability to elide null checks in downstream operators.logical.Project
so that inferred non-nullability information from child operators is preserved.Background on related historical changes / bugs
While developing this patch, I found the following historical PRs to be useful references (note: many of these original PRs contained correctness bugs which were subsequently fixed in later PRs):
IsNotNull
conditions from expressions).FilterExec
to update output nullability based onIsNotNull
conditions.FilterExec
to add special handling forIsNotNull
expression codegen, altering evaluation order to allow for better short-circuiting.NullIntolerant
trait to generalize theIsNotNull
extraction logic.FilterExec
'sIsNotNull
-handling path to useNullIntolerant
FilterExec
's logic: it did not properly account for null-tolerant operators which were ancestors ofIsNotNull
expressions.IsNotNull
.How was this patch tested?
Added new tests for the added
PredicateHelper.getImpliedNotNullExprIds
.TODO: add new end-to-end tests reflecting the examples listed above (in order to properly test the integration of this new logic into
logical.Filter
andlogical.Project
).