[SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source #19776
Conversation
Looks like a bug and it has been there for a long while. cc @cloud-fan @HyukjinKwon can you help trigger the test? Thanks.
@jliwork Can you fix the PR title? The title got cut off when pasting it in.
@viirya Thanks for letting me know, Simon. I've fixed the title. Can someone help trigger the tests please?
ok to test
@@ -497,7 +497,10 @@ object DataSourceStrategy {
       Some(sources.IsNotNull(a.name))

     case expressions.And(left, right) =>
-      (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+      for {
Let's add a small comment like the PR you pointed out.
Sure. Will do. Thanks.
Yeah. Follow what @yhuai wrote in the PR https://github.com/apache/spark/pull/10362/files
Thanks. Just did that as you suggested.
This affects correctness, should we also backport to 2.2?
@viirya I'm fine with backporting to 2.2 unless anyone objects.
@jliwork Let's see what @cloud-fan @felixcheung think about it.
Test build #84010 has finished for PR 19776 at commit
@@ -497,7 +497,11 @@ object DataSourceStrategy {
       Some(sources.IsNotNull(a.name))

     case expressions.And(left, right) =>
-      (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+      // See SPARK-12218 and PR 10362 for detailed discussion
In the comment, you need to give an example to explain why.
Sure. I have added more comments there with an example. Thanks, Sean!
Usually we don't list the PR number; just the JIRA number is enough.
@viirya I see. Thanks, Simon! I've removed the PR number from the comment.
Test build #84017 has finished for PR 19776 at commit
Test build #84018 has finished for PR 19776 at commit
Test build #84019 has finished for PR 19776 at commit
for {
  leftFilter <- translateFilter(left)
  rightFilter <- translateFilter(right)
} yield sources.And(leftFilter, rightFilter)
do we still need SPARK-12218 after this?
I would think so. SPARK-12218 put fixes into ParquetFilters.createFilter and OrcFilters.createFilter. They're similar to DataSourceStrategy.translateFilter but have different signatures customized for Parquet and ORC. For all data sources including JDBC, Parquet, etc., translateFilter is called to determine whether a predicate Expression can be pushed down as a Filter or not. Then, for Parquet and ORC, Filters get mapped to Parquet- or ORC-specific filters with their own createFilter method.
So this PR does help all data sources get the correct set of push-down predicates. Without this PR we simply got lucky with Parquet and ORC in terms of result correctness because 1) it looks like we always apply a Filter on top of the scan; 2) we end up with the same number of rows or more when one leg is missing from an AND.
The JDBC data source does not always come with a Filter on top of the scan, and that is why it exposed the bug.
We do not need to clean up the code in this PR. Let us minimize the code changes, which will also simplify the backport.
Although Catalyst predicate expressions are all converted to sources.Filter when we try to push them down, not all convertible filters can be handled by Parquet and ORC. So I think we can still face the case where only one sub-filter of an AND can be pushed down by the file format.
assert(df7.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
assert(df8.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
assert(df9.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
assert(df10.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
I'd like to create a new DataSourceStrategySuite to test translateFilter.
Sure. I can help.
They are end-to-end test cases.
If you can, we should also add such a unit test suite. In the future, we can add more unit test cases for verifying more complex cases.
I went ahead and added a new DataSourceStrategySuite to test translateFilter. Please feel free to let me know of any further comments. Thanks!
good catch! It's a long-standing bug and I think we should backport it all the way to 2.0
import org.apache.spark.sql.types._


class DataSourceStrategySuite extends QueryTest with SharedSQLContext {
extends PlanTest
Fixed. Thanks!
assertResult(Some(sources.EqualTo("cint", 1))) {
  DataSourceStrategy.translateFilter(
    expressions.EqualTo(attrInt, Literal(1)))
No need to call Literal here. It will be implicitly cast to Literal:
expressions.EqualTo(attrInt, 1))
Fixed. Thanks!
test("translate simple expression") {
  val attrInt = AttributeReference("cint", IntegerType)()
  val attrStr = AttributeReference("cstr", StringType)()
You can simplify your test cases with

import org.apache.spark.sql.catalyst.dsl.expressions._

val attrInt = 'cint.int
val attrStr = 'cstr.string
Fixed. Thanks!
assertResult(None) {
  DataSourceStrategy.translateFilter(
    expressions.LessThanOrEqual(
      expressions.Subtract(expressions.Abs(attrInt), 2), 1))
would be better to add a comment to say that abs is not supported
expressions.And(
  expressions.GreaterThan(attrInt, 1),
  expressions.LessThan(
    expressions.Abs(attrInt), 10)
ditto
Test build #84055 has finished for PR 19776 at commit
Test build #84047 has finished for PR 19776 at commit
Test build #84053 has finished for PR 19776 at commit
"WHERE NOT((THEID < 0 OR NAME != 'mary') AND (THEID != 1 OR NAME != 'fred'))") | ||
val df9 = sql("SELECT * FROM foobar " + | ||
"WHERE NOT((THEID < 0 OR NAME != 'mary') AND (THEID != 1 OR TRIM(NAME) != 'fred'))") | ||
val df10 = sql("SELECT * FROM foobar " + |
why do we need to test so many cases? As an end-to-end test, I think we only need a typical case.
LGTM except a few minor comments
@cloud-fan Thank you for your comments! I have updated the test cases as you suggested.
assertResult(None) {
  DataSourceStrategy.translateFilter(
    expressions.LessThanOrEqual(
      expressions.Subtract(expressions.Abs(attrInt), 2), 1))
can we move the comment to this line? i.e.
// `Abs` expression cannot be pushed down
expressions.Subtract(expressions.Abs(attrInt), 2), 1))
Fixed. Thanks.
Test build #84063 has finished for PR 19776 at commit
@@ -296,9 +296,15 @@ class JDBCSuite extends SparkFunSuite
  // The older versions of spark have this kind of bugs in parquet data source.
  val df1 = sql("SELECT * FROM foobar WHERE NOT (THEID != 2 AND NAME != 'mary')")
As I left a comment in #10468 (comment), the above test doesn't actually test against the SPARK-12218 issue. Maybe we can simply drop it.
Fixed.
assertResult(Some(sources.EqualTo("cint", 1))) {
  DataSourceStrategy.translateFilter(
    expressions.EqualTo(attrInt, 1))
}
Looks like we can have a small helper function:

def testTranslateFilter(catalystFilter: Expression, result: Option[sources.Filter]): Unit = {
  assertResult(result) {
    DataSourceStrategy.translateFilter(catalystFilter)
  }
}

So the tests can be rewritten as:

testTranslateFilter(expressions.EqualTo(attrInt, 1), Some(sources.EqualTo("cint", 1)))
Thanks! I've followed your suggestion and the test suite looks cleaner now.
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions
import org.apache.spark.sql.catalyst.expressions._
As you import expressions._, I think we can write EqualTo instead of expressions.EqualTo for the Catalyst predicates below? Because you always write sources.EqualTo, I think we won't confuse them.
Thanks for the suggestion. Fixed.
A few comments, otherwise LGTM.
LGTM otherwise too.
Thanks for everyone's comments! I have polished the test cases.
// ABS(cint) - 2 = 1
testTranslateFilter(LessThanOrEqual(
  // Expressions are not supported
?
good catch @_@ fixed the typo. Thanks!
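Taken together, the review comments above shape the new suite roughly as follows. This is a sketch under the assumptions discussed in this thread (PlanTest base class, the Catalyst DSL for attributes, the testTranslateFilter helper), not the exact committed file:

package org.apache.spark.sql.execution.datasources

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.PlanTest
import org.apache.spark.sql.sources
import org.apache.spark.sql.test.SharedSQLContext

class DataSourceStrategySuite extends PlanTest with SharedSQLContext {

  test("translate simple expression") {
    val attrInt = 'cint.int
    val attrStr = 'cstr.string

    testTranslateFilter(EqualTo(attrInt, 1), Some(sources.EqualTo("cint", 1)))
    testTranslateFilter(Not(EqualTo(attrStr, "a")),
      Some(sources.Not(sources.EqualTo("cstr", "a"))))
  }

  test("translate complex expression") {
    val attrInt = 'cint.int

    // ABS(cint) - 2 <= 1: `Abs` expression cannot be pushed down
    testTranslateFilter(LessThanOrEqual(Subtract(Abs(attrInt), 2), 1), None)

    // cint > 1 AND ABS(cint) < 10: one leg of the AND is untranslatable,
    // so the whole conjunction must not be pushed down
    testTranslateFilter(And(GreaterThan(attrInt, 1), LessThan(Abs(attrInt), 10)), None)
  }

  // Compares the translated result of `catalystFilter` against `result`
  private def testTranslateFilter(
      catalystFilter: Expression, result: Option[sources.Filter]): Unit = {
    assertResult(result) {
      DataSourceStrategy.translateFilter(catalystFilter)
    }
  }
}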
LGTM
Test build #84086 has finished for PR 19776 at commit
LGTM
…C data source

## What changes were proposed in this pull request?

Let’s say I have a nested AND expression shown below and p2 can not be pushed down,

(p1 AND p2) OR p3

In current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with JDBC data source and is similar to [SPARK-12218](#10362) for Parquet. When we have AND nested below another expression, we should either push both legs or nothing.

Note that:
- The current Spark code will always split conjunctive predicate before it determines if a predicate can be pushed down or not
- If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3. There won't be nested AND expression.
- The current Spark code logic for OR is OK. It either pushes both legs or nothing.

The same translation method is also called by Data Source V2.

## How was this patch tested?

Added new unit test cases to JDBCSuite

gatorsmile

Author: Jia Li <[email protected]>

Closes #19776 from jliwork/spark-22548.

(cherry picked from commit 881c5c8)
Signed-off-by: gatorsmile <[email protected]>
Thanks! Merged to master/2.2/2.1
@gatorsmile @cloud-fan @viirya @HyukjinKwon Thanks a lot! =)
Test build #84087 has finished for PR 19776 at commit
What changes were proposed in this pull request?
Let’s say I have a nested AND expression shown below and p2 cannot be pushed down:
(p1 AND p2) OR p3
In the current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with the JDBC data source and is similar to SPARK-12218 for Parquet. When we have AND nested below another expression, we should either push both legs or nothing.
Note that:

- The current Spark code will always split conjunctive predicate before it determines if a predicate can be pushed down or not
- If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3. There won't be nested AND expression.
- The current Spark code logic for OR is OK. It either pushes both legs or nothing.

The same translation method is also called by Data Source V2.
How was this patch tested?
Added new unit test cases to JDBCSuite
@gatorsmile
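For reference, the end-to-end check added to JDBCSuite looks roughly like the sketch below, based on the snippets quoted above; the committed test may differ slightly. The TRIM(NAME) predicate is there precisely because it cannot be translated, so the nested AND must not be pushed down either:

// Inside JDBCSuite (assumes the suite's H2-backed `foobar` table with rows
// such as ("fred", 1) and ("mary", 2) is already registered)
val df = sql("SELECT * FROM foobar " +
  "WHERE NOT((THEID < 0 OR NAME != 'mary') AND (THEID != 1 OR TRIM(NAME) != 'fred'))")
// Before this fix, the nested AND lost its untranslatable leg, the pushed-down
// JDBC filter became too strict, and ("fred", 1) was dropped from the result.
assert(df.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))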