[SPARK-12218] [SQL] Invalid splitting of nested AND expressions in Data Source filter API #10362
Conversation
@gatorsmile This is my fix for #10278. I feel the real problem is that we do not handle …
LGTM
Test build #47951 has finished for PR 10362 at commit …
@yhuai Based on my understanding, our current data source filtering strategy is very conservative, but push-down is very aggressive. We do the filtering twice: we let the data sources apply the filters first, and then Spark applies them again. For example, given a filter … I think the current strategy is OK. It will still improve performance in most cases, especially when the data sources support indexing and disk/network IO is the major concern.
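To make the "filter twice" contract above concrete, here is a minimal sketch in plain Scala (hypothetical names, not the actual Spark internals): correctness only requires that the pushed filter be implied by the full predicate, which is exactly the property that breaks when one leg of a nested AND is dropped.

```scala
object FilterTwiceSketch {
  // `pushed` is the (possibly weaker) filter evaluated by the data source;
  // `exact` is the full predicate that Spark re-evaluates on the results.
  // Correctness requires only that exact(row) implies pushed(row).
  def scanWithResidual[T](rows: Seq[T])(pushed: T => Boolean)(exact: T => Boolean): Seq[T] =
    rows.filter(pushed).filter(exact)

  def main(args: Array[String]): Unit = {
    val rows = Seq(1, 2, 3, 4, 5)
    // Pushing one leg of a top-level AND is safe: (x > 1) is implied by
    // (x > 1 AND x < 4), so re-filtering recovers the exact answer.
    val result = scanWithResidual(rows)(_ > 1)(x => x > 1 && x < 4)
    assert(result == Seq(2, 3))
  }
}
```

This is why pushing one leg of a top-level AND is safe, while pushing one leg of an AND nested under a NOT is not.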
The above logic is wrong: it filters out rows that satisfy the predicate, so we are unable to get the correct result. My fix does not redo De Morgan's law, which has already been applied in the Optimizer; it just stops generating `Filter`s if … @liancheng Please correct me if my understanding is wrong.
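A tiny sketch of the shape of this fix, with a made-up predicate ADT and `convert` function (the real change lives in the Parquet/ORC filter translation): an AND is translated only when both children translate, so a NOT above it can never observe a half-translated conjunction.

```scala
object NestedAndSketch {
  sealed trait Pred
  final case class Leaf(sql: String, convertible: Boolean) extends Pred
  final case class And(left: Pred, right: Pred) extends Pred
  final case class Not(child: Pred) extends Pred

  def convert(p: Pred): Option[String] = p match {
    case Leaf(sql, ok) => if (ok) Some(sql) else None
    case And(l, r) =>
      // Both legs or nothing: translating NOT(p1 AND p2) as NOT(p1)
      // would wrongly discard rows that satisfy the original predicate.
      for (cl <- convert(l); cr <- convert(r)) yield s"($cl AND $cr)"
    case Not(c) => convert(c).map(s => s"NOT ($s)")
  }

  def main(args: Array[String]): Unit = {
    val pred = Not(And(Leaf("a = 2", convertible = true),
                       Leaf("b IN ('1')", convertible = false)))
    // The whole predicate is skipped rather than mistranslated as NOT(a = 2).
    assert(convert(pred).isEmpty)
  }
}
```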
@gatorsmile My feeling is that when we process an inner …
To be honest, I did exactly the same fix at the beginning, because this hole is so large. Then, after running more test cases, I realized the current strategy is very tricky. It can process all the cases except … Maybe we can still keep the current aggressive push-down strategy. I am wondering who did the initial design? : )
For our current implementation, it may be fine to still push one side of an inner …
@gatorsmile Discussed with @yhuai offline, and here's my two cents:

The current master and 1.6 code is correct, but unfortunately it's correct by accident rather than by design. Essentially, the current Parquet and ORC filter push-down code is flawed because it doesn't handle … As you mentioned, the current logic is quite tricky, which implies that it's error prone and hard to maintain. Here I tend to agree with @yhuai and be more conservative for better maintainability. In the long run, CNF conversion can be a promising solution.

Being conservative while dealing with … is not affected. (Note that this case can be fixed once we have CNF conversion.) However, we didn't do the same thing in … We just added an …
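For illustration, a toy CNF rewrite (not anything in Spark, and assuming negations have already been pushed down to the leaves): once a predicate is a conjunction of clauses, each clause can be pushed down or skipped independently without affecting correctness.

```scala
object CnfSketch {
  sealed trait Pred
  final case class Leaf(name: String) extends Pred
  final case class And(l: Pred, r: Pred) extends Pred
  final case class Or(l: Pred, r: Pred) extends Pred

  // Distribute OR over AND: (p AND q) OR r ==> (p OR r) AND (q OR r).
  def toCnf(p: Pred): Pred = p match {
    case And(l, r)        => And(toCnf(l), toCnf(r))
    case Or(And(l, r), z) => And(toCnf(Or(l, z)), toCnf(Or(r, z)))
    case Or(x, And(l, r)) => And(toCnf(Or(x, l)), toCnf(Or(x, r)))
    case Or(x, y) =>
      val (cx, cy) = (toCnf(x), toCnf(y))
      // Children may have turned into ANDs after rewriting; re-distribute.
      if (cx.isInstanceOf[And] || cy.isInstanceOf[And]) toCnf(Or(cx, cy))
      else Or(cx, cy)
    case leaf => leaf
  }

  def main(args: Array[String]): Unit = {
    val p = Or(And(Leaf("p1"), Leaf("p2")), Leaf("p3"))
    // (p1 AND p2) OR p3 ==> (p1 OR p3) AND (p2 OR p3): both conjuncts
    // can now be considered for push-down independently.
    assert(toCnf(p) == And(Or(Leaf("p1"), Leaf("p3")), Or(Leaf("p2"), Leaf("p3"))))
  }
}
```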
And BTW, LGTM :)
```scala
checkAnswer(
  sqlContext.read.parquet(path).where("not (a = 2 and b in ('1'))"),
  (1 to 5).map(i => Row(i, (i%2).toString)))
```
Nit: spaces around all `%` in this test case.
@yhuai Opened #10377 to fix the ORC issue we discussed offline (and mentioned in my last comment). To avoid merge conflicts, I copied your ORC fix to #10377. How about removing the ORC fix from this PR and only fixing SPARK-12218 on the Parquet side?
Thank you for the detailed explanation, @liancheng! Actually, in the long term, I like the fix by @yhuai. : ) I do not like the tricky approach; it took me a few hours to fully understand the current implementation. I will try to work on the remaining follow-up issues. BTW, the push-down design is very critical: it largely affects the overall cost and performance. The current push-down design is based on the implementation of the data source APIs: as long as a data source can handle a filter, we push it down. That might not always be optimal; end users might prefer different strategies for various reasons. Anyway, let me think about it. Thank you!
I am merging this to 1.5, 1.6, and master. Will fix the formatting nit while I merge it.
…a Source filter API

JIRA: https://issues.apache.org/jira/browse/SPARK-12218

When creating filters for Parquet/ORC, we should not push nested AND expressions partially.

Author: Yin Huai <[email protected]>

Closes #10362 from yhuai/SPARK-12218.

(cherry picked from commit 41ee7c5)
Signed-off-by: Yin Huai <[email protected]>

…a Source filter API

JIRA: https://issues.apache.org/jira/browse/SPARK-12218

When creating filters for Parquet/ORC, we should not push nested AND expressions partially.

Author: Yin Huai <[email protected]>

Closes #10362 from yhuai/SPARK-12218.

(cherry picked from commit 41ee7c5)
Signed-off-by: Yin Huai <[email protected]>

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
This PR is a follow-up of PR apache#10362. Two major changes:

1. The fix introduced in apache#10362 is OK for Parquet, but may disable ORC PPD in many cases. PR apache#10362 stops converting an `AND` predicate if any branch is inconvertible, while `OrcFilters` combines all filters into a single big conjunction first and then tries to convert it into an ORC `SearchArgument`. This means that if any filter is inconvertible, no filters can be pushed down at all. This PR fixes the issue by finding all convertible filters first, before doing the actual conversion (see the sketch after this message). The reason behind the current implementation is mostly the limitation of the ORC `SearchArgument` builder, which is documented in this PR in detail.
2. Copied the `AND` predicate fix for ORC from apache#10362 to avoid merge conflicts.

Same as apache#10362, this PR targets master (2.0.0-SNAPSHOT), branch-1.6, and branch-1.5.

Author: Cheng Lian <[email protected]>

Closes apache#10377 from liancheng/spark-12218.fix-orc-conjunction-ppd.
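A rough sketch of the two-phase idea described above, with illustrative stand-ins rather than the real `OrcFilters`/`SearchArgument` API: probe each filter for convertibility first, then build the conjunction from the convertible subset only.

```scala
object OrcTwoPhaseSketch {
  // Stand-in for a data source filter; `isConvertible` is a pluggable probe.
  final case class Filter(sql: String)

  def buildConjunction(filters: Seq[Filter])(isConvertible: Filter => Boolean): Option[String] = {
    // Phase 1: keep only the filters that can be converted on their own.
    val convertible = filters.filter(isConvertible)
    // Phase 2: conjoin just those, so one inconvertible filter no longer
    // disables push-down for all the others.
    if (convertible.isEmpty) None
    else Some(convertible.map(_.sql).mkString("(", ") AND (", ")"))
  }

  def main(args: Array[String]): Unit = {
    val filters = Seq(Filter("a = 1"), Filter("udf(b) = 2"), Filter("c < 3"))
    val pushed = buildConjunction(filters)(f => !f.sql.contains("udf"))
    assert(pushed.contains("(a = 1) AND (c < 3)"))
  }
}
```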
…C data source

## What changes were proposed in this pull request?

Let's say I have a nested AND expression shown below and p2 cannot be pushed down,

(p1 AND p2) OR p3

In current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with the JDBC data source and is similar to [SPARK-12218](#10362) for Parquet. When we have an AND nested below another expression, we should either push both legs or nothing.

Note that:
- The current Spark code will always split a conjunctive predicate before it determines whether a predicate can be pushed down or not.
- If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3. There won't be a nested AND expression.
- The current Spark code logic for OR is OK: it either pushes both legs or nothing.

The same translation method is also called by Data Source V2.

## How was this patch tested?

Added new unit test cases to JDBCSuite.

gatorsmile

Author: Jia Li <[email protected]>

Closes #19776 from jliwork/spark-22548.

(cherry picked from commit 881c5c8)
Signed-off-by: gatorsmile <[email protected]>
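To illustrate the rule from the commit message, here is a hedged sketch with made-up names (not the actual JDBC filter translation code): an AND nested under an OR translates only when both legs translate, so (p1 AND p2) OR p3 is pushed whole or not at all, never as the incorrect p1 OR p3.

```scala
object JdbcNestedAndSketch {
  sealed trait Pred
  final case class Leaf(sql: String, convertible: Boolean) extends Pred
  final case class And(l: Pred, r: Pred) extends Pred
  final case class Or(l: Pred, r: Pred) extends Pred

  def compile(p: Pred): Option[String] = p match {
    case Leaf(sql, ok) => if (ok) Some(sql) else None
    // Both legs or nothing, for AND as well as OR.
    case And(l, r) => for (cl <- compile(l); cr <- compile(r)) yield s"($cl AND $cr)"
    case Or(l, r)  => for (cl <- compile(l); cr <- compile(r)) yield s"($cl OR $cr)"
  }

  def main(args: Array[String]): Unit = {
    val pred = Or(And(Leaf("p1", convertible = true), Leaf("p2", convertible = false)),
                  Leaf("p3", convertible = true))
    // p2 cannot be pushed, so nothing is pushed; p2 is not silently lost.
    assert(compile(pred).isEmpty)
  }
}
```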
@liancheng Is there any fix for this case? We have a big performance issue, as we have many query conditions like this.
JIRA: https://issues.apache.org/jira/browse/SPARK-12218
When creating filters for Parquet/ORC, we should not push nested AND expressions partially.
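As a quick self-contained check of the semantics behind this rule, using the row shape from the test case quoted earlier (plain Scala collections stand in for the Parquet scan):

```scala
object PartialPushdownCheck {
  def main(args: Array[String]): Unit = {
    // Rows (i, (i % 2).toString) for i = 1..5, as in the test case above.
    val rows = (1 to 5).map(i => (i, (i % 2).toString))

    // Correct evaluation of NOT(a = 2 AND b IN ('1')): all 5 rows survive,
    // because row (2, "0") already fails the inner conjunction.
    val correct = rows.filterNot { case (a, b) => a == 2 && b == "1" }
    assert(correct.size == 5)

    // Buggy partial push-down NOT(a = 2): row (2, "0") is wrongly dropped.
    val buggy = rows.filterNot { case (a, _) => a == 2 }
    assert(buggy.size == 4)
  }
}
```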