
[SPARK-12218] [SQL] Invalid splitting of nested AND expressions in Data Source filter API #10362

Closed
wants to merge 1 commit into from

Conversation

yhuai
Contributor

yhuai commented Dec 17, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-12218

When creating filters for Parquet/ORC, we should not push nested AND expressions partially.

@yhuai
Contributor Author

yhuai commented Dec 17, 2015

@gatorsmile This is my fix for #10278. I feel the real problem is that we do not handle AND properly. What do you think?

@nongli
Contributor

nongli commented Dec 17, 2015

LGTM

@SparkQA

SparkQA commented Dec 17, 2015

Test build #47951 has finished for PR 10362 at commit 37c1849.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@yhuai Based on my understanding, our current data source filtering strategy is very conservative, while push-down is very aggressive. We do the filtering twice: we let the data source apply the filters first, and then Spark applies them again.

For example, given a filter A OR (B AND C), if the data source is unable to process C, we still push the filter down. The predicate the source actually evaluates is A OR B, which is less restrictive than A OR (B AND C) but does not miss any satisfying row. Spark then does another round of filtering to ensure the result satisfies A OR (B AND C).
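The soundness of this weakening can be checked with a small truth table (plain Scala, not Spark code; the names `full` and `pushed` are illustrative):

```scala
// Truth-table sketch (not Spark code): the weakened predicate A OR B
// never rejects a row that satisfies A OR (B AND C), so pushing it
// down is safe as long as Spark applies the full predicate again.
val rows = for {
  a <- Seq(true, false); b <- Seq(true, false); c <- Seq(true, false)
} yield (a, b, c)

val full   = (r: (Boolean, Boolean, Boolean)) => r._1 || (r._2 && r._3)
val pushed = (r: (Boolean, Boolean, Boolean)) => r._1 || r._2

// Every row satisfying the full predicate also passes the pushed one...
assert(rows.filter(full).forall(pushed))
// ...but not vice versa, hence the second filtering pass in Spark.
assert(!rows.filter(pushed).forall(full))
```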

I think the current strategy is OK. It will still improve performance in most cases, especially when the data source supports indexing and disk/network IO is the major concern.
In this JIRA, the root cause is that we failed to process Not. In the original code, the logic was:

        not(A and B) => not(A) and not(B)
        not(A or B) => not(A) or not(B)

The above logic is wrong: it filters out satisfying rows, so we are unable to get the correct result.
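A quick truth-table check (plain Scala, not Spark code) confirms the rewrite was wrong: De Morgan's law gives not(A) or not(B), not the conjunction the original code produced:

```scala
// not(A and B) == not(A) or not(B)   <- De Morgan, correct
// not(A and B) != not(A) and not(B)  <- the original rewrite, too strict
val inputs = for (a <- Seq(true, false); b <- Seq(true, false)) yield (a, b)

val deMorganHolds = inputs.forall { case (a, b) => !(a && b) == (!a || !b) }
val buggyHolds    = inputs.forall { case (a, b) => !(a && b) == (!a && !b) }

assert(deMorganHolds)
assert(!buggyHolds)  // disagrees, e.g. for a = true, b = false
```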

My fix does not redo De Morgan's laws, which are already applied in the Optimizer. Instead, it simply stops generating Filters when Not does not appear at the innermost level.

@liancheng Please correct me if my understanding is wrong.

@yhuai
Contributor Author

yhuai commented Dec 17, 2015

@gatorsmile My feeling is that when we process an inner AND, in general, we do not know what the outer predicate is, so we do not know whether it is safe to push just one side of the AND. That is why I am removing this partial push-down logic for inner ANDs.

@gatorsmile
Member

To be honest, I made exactly the same fix at the beginning, because this hole is so large. Then, after running more test cases, I realized the current strategy is very tricky: it can process all the cases except Not. Only the Not operator changes the inner And and Or. So I made the fix with minimal code change, stopping filtering only in the extreme cases.

Maybe we can still keep the current aggressive push-down strategy. I am wondering who did the initial design? : )

@yhuai
Contributor Author

yhuai commented Dec 18, 2015

With our current implementation, it may be fine to still push one side of an inner AND. However, it just happens to work; we cannot really say that the rule for AND is correct.

@gatorsmile
Member

The implementation is very tricky, but I thought that was intentional...

Do you want me to backport #5700 and #8716 to 1.5 in the same PR?

Thanks!

@liancheng
Contributor

@gatorsmile Discussed with @yhuai offline, and here's my two cents:

  1. Correctness

The current master and 1.6 code is correct, but unfortunately it's correct by accident rather than by design.

Essentially, the current Parquet and ORC filter push-down code is flawed because it doesn't handle Not correctly. Taking Parquet as an example: the filter push-down code within parquet-mr has a separate phase for eliminating Not (e.g., converting !(a > 1) to a <= 1). That's why it's safe for parquet-mr to prune data using only a single branch of an And when the other branch is not applicable according to row group statistics (min/max, etc.).
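The Not-elimination phase can be sketched as follows (illustrative types modeled on the idea, not parquet-mr's actual API):

```scala
// Push Not down to the leaves so And/Or branches can be pruned
// independently; e.g. !(a > 1) becomes a <= 1.
sealed trait Pred
case class Gt(col: String, v: Int) extends Pred
case class LtEq(col: String, v: Int) extends Pred
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred
case class Not(p: Pred) extends Pred

def eliminateNot(p: Pred): Pred = p match {
  case Not(Gt(c, v))   => LtEq(c, v)  // !(a > 1)  =>  a <= 1
  case Not(LtEq(c, v)) => Gt(c, v)
  case Not(And(l, r))  => Or(eliminateNot(Not(l)), eliminateNot(Not(r)))  // De Morgan
  case Not(Or(l, r))   => And(eliminateNot(Not(l)), eliminateNot(Not(r)))
  case Not(Not(q))     => eliminateNot(q)
  case And(l, r)       => And(eliminateNot(l), eliminateNot(r))
  case Or(l, r)        => Or(eliminateNot(l), eliminateNot(r))
  case leaf            => leaf
}

// !(a > 1 AND b <= 10)  =>  a <= 1 OR b > 10
assert(eliminateNot(Not(And(Gt("a", 1), LtEq("b", 10)))) ==
  Or(LtEq("a", 1), Gt("b", 10)))
```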

As you mentioned, the current logic is quite tricky, which implies that it's error-prone and hard to maintain. Here I tend to agree with @yhuai: be more conservative for better maintainability. In the long run, CNF conversion could be a promising solution.
2. Performance

Being conservative while dealing with And doesn't hurt performance in most cases for Parquet because of the way we handle Parquet filter predicate conversion. You may see that, in ParquetRelation, the predicate is first split and then each part is converted individually. So a filter predicate like

a > 1 AND b < 10 AND weird_udf(c)

is not affected: a > 1 and b < 10 are still pushed down. (Many thanks to @yhuai; I didn't notice this at first.) But filter predicates containing nested conjunction(s) are affected. E.g., nothing in the following predicate can be pushed down:

a > 1 OR (b < 10 AND weird_udf(c))

(Note that this case can be fixed once we have CNF conversion)
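The CNF fix works because OR distributes over AND: a > 1 OR (b < 10 AND weird_udf(c)) becomes (a > 1 OR b < 10) AND (a > 1 OR weird_udf(c)), and the first conjunct, being fully convertible, could then be pushed down on its own. A truth-table check of the distribution (plain Scala, not Spark code):

```scala
// A || (B && C) is logically equivalent to (A || B) && (A || C).
val inputs = for {
  a <- Seq(true, false); b <- Seq(true, false); c <- Seq(true, false)
} yield (a, b, c)

assert(inputs.forall { case (a, b, c) =>
  (a || (b && c)) == ((a || b) && (a || c))
})
```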

However, we didn't do the same thing in OrcRelation, so filter predicates in both cases are affected for ORC. I'm working on a fix for this.
3. Conservative filtering strategy

We just added an unhandledFilters API for data sources, so that Spark SQL skips the conservative second filtering pass if the underlying data source implements this method and tells Spark SQL which filters it can't handle. The Parquet and ORC data sources in Spark SQL already implement this API.
4. Follow-ups

@liancheng
Contributor

And BTW, LGTM :)


checkAnswer(
  sqlContext.read.parquet(path).where("not (a = 2 and b in ('1'))"),
  (1 to 5).map(i => Row(i, (i%2).toString)))
Nit: Spaces around all % in this test case.

@liancheng
Contributor

@yhuai Opened #10377 to fix the ORC issue we discussed offline (and mentioned in my last comment). To avoid merge conflicts, I copied your ORC fix to #10377. How about removing the ORC fix from this PR and only fix SPARK-12218 on Parquet side?

@gatorsmile
Member

Thank you for the detailed explanation! @liancheng

Actually, in the long term, I like the fix by @yhuai. : ) I do not like the tricky approach; it took me a few hours to fully understand the current implementation. I will try to work on the remaining follow-up issues.

BTW, the push-down design is very critical; it largely affects overall cost and performance. The current push-down design is driven by the implementation of the data source APIs: as long as the data source can handle a filter, we push it down. That might not always be optimal, and end users might want different strategies for various reasons. Anyway, let me think about it. Thank you!

@yhuai
Contributor Author

yhuai commented Dec 18, 2015

I am merging it to 1.5, 1.6, and master. I will fix the formatting nit while I merge it.

asfgit pushed a commit that referenced this pull request Dec 18, 2015
…a Source filter API

JIRA: https://issues.apache.org/jira/browse/SPARK-12218

When creating filters for Parquet/ORC, we should not push nested AND expressions partially.

Author: Yin Huai <[email protected]>

Closes #10362 from yhuai/SPARK-12218.

(cherry picked from commit 41ee7c5)
Signed-off-by: Yin Huai <[email protected]>
@asfgit asfgit closed this in 41ee7c5 Dec 18, 2015
asfgit pushed a commit that referenced this pull request Dec 18, 2015
(cherry picked from commit 41ee7c5)
Signed-off-by: Yin Huai <[email protected]>

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
@liancheng liancheng deleted the SPARK-12218 branch December 19, 2015 00:06
ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 28, 2015
This PR is a follow-up of PR apache#10362.

Two major changes:

1.  The fix introduced in apache#10362 is OK for Parquet, but may disable ORC PPD in many cases

    PR apache#10362 stops converting an `AND` predicate if any branch is inconvertible.  On the other hand, `OrcFilters` combines all filters into a single big conjunction first and then tries to convert it into ORC `SearchArgument`.  This means, if any filter is inconvertible, no filters can be pushed down.  This PR fixes this issue by finding out all convertible filters first before doing the actual conversion.

    The reason behind the current implementation is mostly due to the limitation of ORC `SearchArgument` builder, which is documented in this PR in detail.

2.  Copied the `AND` predicate fix for ORC from apache#10362 to avoid merge conflict.

Same as apache#10362, this PR targets master (2.0.0-SNAPSHOT), branch-1.6, and branch-1.5.
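The "find all convertible filters first" approach described in point 1 can be sketched like this (illustrative types and convertibility predicate; not OrcFilters' actual code):

```scala
// Partition filters into those we know how to convert and those we
// don't, and build the pushed-down conjunction only from the former;
// Spark evaluates the rest.
case class Filter(sql: String)

// Assumed convertibility check, purely for illustration.
def isConvertible(f: Filter): Boolean = !f.sql.contains("weird_udf")

def selectForPushdown(filters: Seq[Filter]): (Seq[Filter], Seq[Filter]) =
  filters.partition(isConvertible)

val (pushed, remaining) =
  selectForPushdown(Seq(Filter("a > 1"), Filter("weird_udf(c)"), Filter("b < 10")))

assert(pushed == Seq(Filter("a > 1"), Filter("b < 10")))
assert(remaining == Seq(Filter("weird_udf(c)")))
```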

Author: Cheng Lian <[email protected]>

Closes apache#10377 from liancheng/spark-12218.fix-orc-conjunction-ppd.
asfgit pushed a commit that referenced this pull request Nov 22, 2017
…C data source

## What changes were proposed in this pull request?

Let’s say I have a nested AND expression shown below and p2 cannot be pushed down:

(p1 AND p2) OR p3

In the current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only, and p2 is simply lost. This issue occurs with the JDBC data source and is similar to [SPARK-12218](#10362) for Parquet. When we have an AND nested below another expression, we should push either both legs or nothing.

Note that:
- The current Spark code will always split conjunctive predicate before it determines if a predicate can be pushed down or not
- If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3. There won't be nested AND expression.
- The current Spark code logic for OR is OK. It either pushes both legs or nothing.

The same translation method is also called by Data Source V2.
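The "both legs or nothing" rule for a nested AND can be sketched as a recursive translation over a toy expression tree (illustrative types; not the actual translateFilter implementation):

```scala
// A nested And converts only if BOTH children convert; otherwise rows
// matching the dropped leg would be lost under an enclosing Or.
sealed trait Expr
case class Leaf(sql: Option[String]) extends Expr  // None = not pushable
case class AndE(l: Expr, r: Expr) extends Expr
case class OrE(l: Expr, r: Expr) extends Expr

def translate(e: Expr): Option[String] = e match {
  case Leaf(s)    => s
  case AndE(l, r) => for (ls <- translate(l); rs <- translate(r)) yield s"($ls AND $rs)"
  case OrE(l, r)  => for (ls <- translate(l); rs <- translate(r)) yield s"($ls OR $rs)"
}

// (p1 AND p2) OR p3 with p2 unpushable: push nothing, not "p1 OR p3".
val pred = OrE(AndE(Leaf(Some("p1")), Leaf(None)), Leaf(Some("p3")))
assert(translate(pred).isEmpty)

// With all legs pushable the whole predicate converts.
assert(translate(AndE(Leaf(Some("p1")), Leaf(Some("p2")))) == Some("(p1 AND p2)"))
```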

## How was this patch tested?

Added new unit test cases to JDBCSuite

gatorsmile

Author: Jia Li <[email protected]>

Closes #19776 from jliwork/spark-22548.

(cherry picked from commit 881c5c8)
Signed-off-by: gatorsmile <[email protected]>
@yujiantao

yujiantao commented Jan 8, 2020

@liancheng Is there any fix for this case? We have a big performance issue, as we have many query conditions like this:
a > 1 OR (b < 10 AND weird_udf(c))
