
[SPARK-12218] [SQL] Invalid splitting of nested AND expressions in Data Source filter API #10362

Closed
wants to merge 1 commit into from

Conversation

yhuai
Contributor

yhuai commented Dec 17, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-12218

When creating filters for Parquet/ORC, we should not push nested AND expressions partially.

@yhuai
Contributor Author

yhuai commented Dec 17, 2015

@gatorsmile This is my fix for #10278. I feel the real problem is that we do not handle AND properly. What do you think?

@nongli
Contributor

nongli commented Dec 17, 2015

LGTM

@SparkQA

SparkQA commented Dec 17, 2015

Test build #47951 has finished for PR 10362 at commit 37c1849.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@yhuai Based on my understanding, our current data source filtering strategy is very conservative, while push-down is very aggressive. We do the filtering twice: we let the data source apply the filters first, and then Spark applies them again.

For example, given a filter A OR (B AND C), if the data source is unable to process C, we still push the filter down. The predicate the source actually evaluates is A OR B, which is less restrictive than A OR (B AND C) but does not miss any satisfying row. Spark then does another round of filtering to ensure the result satisfies A OR (B AND C).
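The soundness of this weakening can be checked with a small truth table (plain Scala, not Spark code; the names `full` and `pushed` are illustrative):

```scala
// Truth-table sketch (not Spark code): the weakened predicate A OR B
// never rejects a row that satisfies A OR (B AND C), so pushing it
// down is safe as long as Spark applies the full predicate again.
val rows = for {
  a <- Seq(true, false); b <- Seq(true, false); c <- Seq(true, false)
} yield (a, b, c)

val full   = (r: (Boolean, Boolean, Boolean)) => r._1 || (r._2 && r._3)
val pushed = (r: (Boolean, Boolean, Boolean)) => r._1 || r._2

// Every row satisfying the full predicate also passes the pushed one...
assert(rows.filter(full).forall(pushed))
// ...but not vice versa, hence the second filtering pass in Spark.
assert(!rows.filter(pushed).forall(full))
```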

I think the current strategy is OK. It will still improve performance in most cases, especially when the data source supports indexing and disk/network IO is the major concern.
In this JIRA, the root cause is that we failed to process Not. In the original code, the logic was:

        not(A and B) => not(A) and not(B)
        not(A or B) => not(A) or not(B)

The above logic is wrong: it filters out satisfying rows, so we are unable to get the correct result.
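A quick truth-table check (plain Scala, not Spark code) confirms the rewrite was wrong: De Morgan's law gives not(A) or not(B), not the conjunction the original code produced:

```scala
// not(A and B) == not(A) or not(B)   <- De Morgan, correct
// not(A and B) != not(A) and not(B)  <- the original rewrite, too strict
val inputs = for (a <- Seq(true, false); b <- Seq(true, false)) yield (a, b)

val deMorganHolds = inputs.forall { case (a, b) => !(a && b) == (!a || !b) }
val buggyHolds    = inputs.forall { case (a, b) => !(a && b) == (!a && !b) }

assert(deMorganHolds)
assert(!buggyHolds)  // disagrees, e.g. for a = true, b = false
```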

My fix does not redo De Morgan's laws, which are already applied in the Optimizer. Instead, it simply stops generating Filters when Not does not appear at the innermost level.

@liancheng Please correct me if my understanding is wrong.

@yhuai
Contributor Author

yhuai commented Dec 17, 2015

@gatorsmile My feeling is that when we process an inner AND, in general, we do not know what the outer predicate is, so we do not know whether it is safe to push just one side of the AND. That is why I am removing this partial push-down logic for inner ANDs.

@gatorsmile
Member

To be honest, I made exactly the same fix at the beginning, because this hole is so large. Then, after running more test cases, I realized the current strategy is very tricky: it can process all the cases except Not. Only the Not operator changes the inner And and Or. So I made the fix with minimal code change, stopping filtering only in the extreme cases.

Maybe we can still keep the current aggressive push-down strategy. I am wondering who did the initial design? : )

@yhuai
Contributor Author

yhuai commented Dec 18, 2015

With our current implementation, it may be fine to still push one side of an inner AND. However, it just happens to work; we cannot really say that the rule for AND is correct.

@gatorsmile
Member

The implementation is very tricky, but I thought that was intentional...

Do you want me to backport #5700 and #8716 to 1.5 in the same PR?

Thanks!

@liancheng
Contributor

@gatorsmile Discussed with @yhuai offline, and here's my two cents:

  1. Correctness

The current master and 1.6 code is correct, but unfortunately it's correct by accident rather than by design.

Essentially, the current Parquet and ORC filter push-down code is flawed because it doesn't handle Not correctly. Taking Parquet as an example: the filter push-down code within parquet-mr has a separate phase for eliminating Not (e.g., converting !(a > 1) to a <= 1). That's why it's safe for parquet-mr to prune data using only a single branch of an And when the other branch is not applicable according to row group statistics (min/max, etc.).
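The Not-elimination phase can be sketched as follows (illustrative types modeled on the idea, not parquet-mr's actual API):

```scala
// Push Not down to the leaves so And/Or branches can be pruned
// independently; e.g. !(a > 1) becomes a <= 1.
sealed trait Pred
case class Gt(col: String, v: Int) extends Pred
case class LtEq(col: String, v: Int) extends Pred
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred
case class Not(p: Pred) extends Pred

def eliminateNot(p: Pred): Pred = p match {
  case Not(Gt(c, v))   => LtEq(c, v)  // !(a > 1)  =>  a <= 1
  case Not(LtEq(c, v)) => Gt(c, v)
  case Not(And(l, r))  => Or(eliminateNot(Not(l)), eliminateNot(Not(r)))  // De Morgan
  case Not(Or(l, r))   => And(eliminateNot(Not(l)), eliminateNot(Not(r)))
  case Not(Not(q))     => eliminateNot(q)
  case And(l, r)       => And(eliminateNot(l), eliminateNot(r))
  case Or(l, r)        => Or(eliminateNot(l), eliminateNot(r))
  case leaf            => leaf
}

// !(a > 1 AND b <= 10)  =>  a <= 1 OR b > 10
assert(eliminateNot(Not(And(Gt("a", 1), LtEq("b", 10)))) ==
  Or(LtEq("a", 1), Gt("b", 10)))
```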

As you mentioned, the current logic is quite tricky, which implies that it's error-prone and hard to maintain. Here I tend to agree with @yhuai: be more conservative for better maintainability. In the long run, CNF conversion could be a promising solution.
2. Performance

Being conservative while dealing with And doesn't hurt performance in most cases for Parquet because of the way we handle Parquet filter predicate conversion. You may see that, in ParquetRelation, the predicate is first split and then each part is converted individually. So a filter predicate like

a > 1 AND b < 10 AND weird_udf(c)

is not affected: a > 1 and b < 10 are still pushed down. (Many thanks to @yhuai; I didn't notice this at first.) But filter predicates containing nested conjunction(s) are affected. E.g., nothing in the following predicate can be pushed down:

a > 1 OR (b < 10 AND weird_udf(c))

(Note that this case can be fixed once we have CNF conversion)
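The CNF fix works because OR distributes over AND: a > 1 OR (b < 10 AND weird_udf(c)) becomes (a > 1 OR b < 10) AND (a > 1 OR weird_udf(c)), and the first conjunct, being fully convertible, could then be pushed down on its own. A truth-table check of the distribution (plain Scala, not Spark code):

```scala
// A || (B && C) is logically equivalent to (A || B) && (A || C).
val inputs = for {
  a <- Seq(true, false); b <- Seq(true, false); c <- Seq(true, false)
} yield (a, b, c)

assert(inputs.forall { case (a, b, c) =>
  (a || (b && c)) == ((a || b) && (a || c))
})
```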

However, we didn't do the same thing in OrcRelation, so filter predicates in both cases are affected for ORC. I'm working on a fix for this.
3. Conservative filtering strategy

We just added an unhandledFilters API for data sources, so that Spark SQL skips the conservative second filtering pass if the underlying data source implements this method and tells Spark SQL which filters it can't handle. The Parquet and ORC data sources in Spark SQL already implement this API.
4. Follow-ups

@liancheng
Contributor

And BTW, LGTM :)


checkAnswer(
  sqlContext.read.parquet(path).where("not (a = 2 and b in ('1'))"),
  (1 to 5).map(i => Row(i, (i%2).toString)))
Nit: Spaces around all % in this test case.

@liancheng
Contributor

@yhuai Opened #10377 to fix the ORC issue we discussed offline (and mentioned in my last comment). To avoid merge conflicts, I copied your ORC fix to #10377. How about removing the ORC fix from this PR and only fix SPARK-12218 on Parquet side?

@gatorsmile
Member

Thank you for the detailed explanation! @liancheng

Actually, in the long term, I like the fix by @yhuai. : ) I do not like the tricky approach; it took me a few hours to fully understand the current implementation. I will try to work on the remaining follow-up issues.

BTW, the push-down design is very critical; it largely affects overall cost and performance. The current push-down design is driven by the implementation of the data source APIs: as long as the data source can handle a filter, we push it down. That might not always be optimal, and end users might want different strategies for various reasons. Anyway, let me think about it. Thank you!

@yhuai
Contributor Author

yhuai commented Dec 18, 2015

I am merging it to 1.5, 1.6, and master. I will fix the formatting nit while I merge it.

asfgit pushed a commit that referenced this pull request Dec 18, 2015
…a Source filter API

JIRA: https://issues.apache.org/jira/browse/SPARK-12218

When creating filters for Parquet/ORC, we should not push nested AND expressions partially.

Author: Yin Huai <[email protected]>

Closes #10362 from yhuai/SPARK-12218.

(cherry picked from commit 41ee7c5)
Signed-off-by: Yin Huai <[email protected]>
@asfgit asfgit closed this in 41ee7c5 Dec 18, 2015
asfgit pushed a commit that referenced this pull request Dec 18, 2015
(cherry picked from commit 41ee7c5)
Signed-off-by: Yin Huai <[email protected]>

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
@liancheng liancheng deleted the SPARK-12218 branch December 19, 2015 00:06
ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 28, 2015
This PR is a follow-up of PR apache#10362.

Two major changes:

1.  The fix introduced in apache#10362 is OK for Parquet, but may disable ORC PPD in many cases

    PR apache#10362 stops converting an `AND` predicate if any branch is inconvertible.  On the other hand, `OrcFilters` combines all filters into a single big conjunction first and then tries to convert it into ORC `SearchArgument`.  This means, if any filter is inconvertible, no filters can be pushed down.  This PR fixes this issue by finding out all convertible filters first before doing the actual conversion.

    The reason behind the current implementation is mostly due to the limitation of ORC `SearchArgument` builder, which is documented in this PR in detail.

2.  Copied the `AND` predicate fix for ORC from apache#10362 to avoid merge conflict.

Same as apache#10362, this PR targets master (2.0.0-SNAPSHOT), branch-1.6, and branch-1.5.
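The "find all convertible filters first" approach described in point 1 can be sketched like this (illustrative types and convertibility predicate; not OrcFilters' actual code):

```scala
// Partition filters into those we know how to convert and those we
// don't, and build the pushed-down conjunction only from the former;
// Spark evaluates the rest.
case class Filter(sql: String)

// Assumed convertibility check, purely for illustration.
def isConvertible(f: Filter): Boolean = !f.sql.contains("weird_udf")

def selectForPushdown(filters: Seq[Filter]): (Seq[Filter], Seq[Filter]) =
  filters.partition(isConvertible)

val (pushed, remaining) =
  selectForPushdown(Seq(Filter("a > 1"), Filter("weird_udf(c)"), Filter("b < 10")))

assert(pushed == Seq(Filter("a > 1"), Filter("b < 10")))
assert(remaining == Seq(Filter("weird_udf(c)")))
```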

Author: Cheng Lian <[email protected]>

Closes apache#10377 from liancheng/spark-12218.fix-orc-conjunction-ppd.
asfgit pushed a commit that referenced this pull request Nov 22, 2017
…C data source

## What changes were proposed in this pull request?

Let’s say I have a nested AND expression shown below and p2 cannot be pushed down:

(p1 AND p2) OR p3

In the current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only, and p2 is simply lost. This issue occurs with the JDBC data source and is similar to [SPARK-12218](#10362) for Parquet. When we have an AND nested below another expression, we should push either both legs or nothing.

Note that:
- The current Spark code will always split conjunctive predicate before it determines if a predicate can be pushed down or not
- If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3. There won't be nested AND expression.
- The current Spark code logic for OR is OK. It either pushes both legs or nothing.

The same translation method is also called by Data Source V2.
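The "both legs or nothing" rule for a nested AND can be sketched as a recursive translation over a toy expression tree (illustrative types; not the actual translateFilter implementation):

```scala
// A nested And converts only if BOTH children convert; otherwise rows
// matching the dropped leg would be lost under an enclosing Or.
sealed trait Expr
case class Leaf(sql: Option[String]) extends Expr  // None = not pushable
case class AndE(l: Expr, r: Expr) extends Expr
case class OrE(l: Expr, r: Expr) extends Expr

def translate(e: Expr): Option[String] = e match {
  case Leaf(s)    => s
  case AndE(l, r) => for (ls <- translate(l); rs <- translate(r)) yield s"($ls AND $rs)"
  case OrE(l, r)  => for (ls <- translate(l); rs <- translate(r)) yield s"($ls OR $rs)"
}

// (p1 AND p2) OR p3 with p2 unpushable: push nothing, not "p1 OR p3".
val pred = OrE(AndE(Leaf(Some("p1")), Leaf(None)), Leaf(Some("p3")))
assert(translate(pred).isEmpty)

// With all legs pushable the whole predicate converts.
assert(translate(AndE(Leaf(Some("p1")), Leaf(Some("p2")))) == Some("(p1 AND p2)"))
```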

## How was this patch tested?

Added new unit test cases to JDBCSuite

gatorsmile

Author: Jia Li <[email protected]>

Closes #19776 from jliwork/spark-22548.

(cherry picked from commit 881c5c8)
Signed-off-by: gatorsmile <[email protected]>
@yujiantao

yujiantao commented Jan 8, 2020

@liancheng Is there any fix for this case? We have a big performance issue, as we have many query conditions like this:
a > 1 OR (b < 10 AND weird_udf(c))
