[SPARK-25557][SQL] Nested column predicate pushdown for ORC #28761
Conversation
Test build #123660 has finished for PR 28761 at commit
Test build #123662 has finished for PR 28761 at commit
Basically, looks okay to me.
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcV1FilterSuite.scala
Test build #123727 has finished for PR 28761 at commit
retest this please
Test build #123732 has finished for PR 28761 at commit
retest this please
Test build #124143 has finished for PR 28761 at commit
kindly ping @dbtsai @dongjoon-hyun @cloud-fan
retest this please
Test build #124506 has finished for PR 28761 at commit
...core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcTest.scala
Test build #124634 has finished for PR 28761 at commit
…potential conflicts in dev ### What changes were proposed in this pull request? This PR proposes to partially revert the tests and some code from #27728 without touching any behaviours. Most of the test changes are restored to their pre-#27728 state by combining `withNestedDataFrame` and `withParquetDataFrame`. Basically, it addresses the comments at #27728 (comment) and my own comment in another PR at #28761 (comment). ### Why are the changes needed? For maintenance purposes and to avoid potential conflicts during backports, and also in case other code matches this one. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested. Closes #28955 from HyukjinKwon/SPARK-25556-followup. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
I'll clean up the tests more.
Test build #126759 has finished for PR 28761 at commit
retest this please
Test build #126761 has finished for PR 28761 at commit
Thank you for confirming all tests, @viirya . I'll review today. |
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScanBuilder.scala
    // mode, just skip pushdown for these fields, they will trigger Exception when reading,
    // See: SPARK-25175.
    val dedupPrimitiveFields =
      primitiveFields
indentation?
- val dedupPrimitiveFields =
- primitiveFields
+ val dedupPrimitiveFields = primitiveFields
    if (caseSensitive) {
      primitiveFields.toMap
    } else {
      // Don't consider ambiguity here, i.e. more than one field is matched in case insensitive
`is matched` -> `are matched`?
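The case-sensitivity handling being reviewed here can be sketched as a small standalone example: in case-insensitive mode, fields whose lower-cased names collide are ambiguous and must be dropped from the pushdown map. The object and method names below (`FieldDedupSketch`, `dedupFields`) are hypothetical, not Spark's actual API:

```scala
object FieldDedupSketch {
  // Map field name -> type string, deduplicating ambiguous names when
  // case-insensitive resolution is in effect.
  def dedupFields(
      caseSensitive: Boolean,
      primitiveFields: Seq[(String, String)]): Map[String, String] = {
    if (caseSensitive) {
      primitiveFields.toMap
    } else {
      // Group by the lower-cased name; keep only groups with exactly one
      // field, since more than one match in case-insensitive mode is
      // ambiguous and cannot be pushed down safely.
      primitiveFields
        .groupBy(_._1.toLowerCase(java.util.Locale.ROOT))
        .filter(_._2.size == 1)
        .map { case (lowerCased, fields) => (lowerCased, fields.head._2) }
    }
  }

  def main(args: Array[String]): Unit = {
    val fields = Seq("id" -> "int", "ID" -> "int", "name" -> "string")
    // "id"/"ID" collide case-insensitively, so only "name" survives.
    println(dedupFields(caseSensitive = false, fields))
    println(dedupFields(caseSensitive = true, fields).size)
  }
}
```

In the real code the map drives which pushdown candidates survive; the sketch only illustrates the collision rule the comment thread is about.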
@@ -78,7 +78,7 @@ abstract class OrcTest extends QueryTest with FileBasedDataSourceTest with Befor
     (f: String => Unit): Unit = withDataSourceFile(data)(f)

   /**
-   * Writes `data` to a Orc file and reads it back as a `DataFrame`,
+   * Writes `date` dataframe to a Orc file and reads it back as a `DataFrame`,
Is there a reason you change this from `data` to `date`? This is not limited to `DATE`. The original one looks correct to me.
Oops, a typo. :) Will correct it.
   * dataframes as new test data. It tests both non-nested and nested dataframes
   * which are written and read back with Orc datasource.
   *
   * This is different from [[OrcTest.withOrcDataFrame]] which does not
Do we need the `OrcTest.` prefix?
   * This method returns a map which contains ORC field name and data type. Each key
   * represents a column; `dots` are used as separators for nested columns. If any part
   * of the names contains `dots`, it is quoted to avoid confusion. See
   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.
`quote` -> `quoted`.
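The quoting rule described in the doc comment above can be sketched in a few lines: when a nested column is flattened into a dot-separated ORC name, any part that itself contains a dot is wrapped in backticks (with embedded backticks doubled), so the field `` `a.b` `` stays distinguishable from the path `a.b`. The names below (`QuotingSketch`, `toOrcColumnName`) are illustrative, not Spark's exact API:

```scala
object QuotingSketch {
  // Quote a single path segment only when it would otherwise be ambiguous.
  def quoteIfNeeded(part: String): String =
    if (part.contains(".") || part.contains("`")) {
      s"`${part.replace("`", "``")}`"
    } else {
      part
    }

  // Join nested field names into one dot-separated ORC column name.
  def toOrcColumnName(path: Seq[String]): String =
    path.map(quoteIfNeeded).mkString(".")

  def main(args: Array[String]): Unit = {
    println(toOrcColumnName(Seq("person", "age"))) // person.age
    println(toOrcColumnName(Seq("a.b", "c")))      // `a.b`.c
  }
}
```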
      case BinaryType => false
      case _: AtomicType => true
      case _ => false

  protected[sql] def getNameToOrcFieldMap(
- `OrcField` looks a little mismatched because this function returns `DataType` instead of a field. Currently, it sounds like `sToOrcField`.
- According to the behavior of this function, it ignores BinaryType, complex types, and UserDefinedType. Also, the function description doesn't mention this limitation at all. To be clearer, we had better have `Searchable` in the function name like the previous one (`isSearchableType`).
@@ -231,37 +229,37 @@ private[sql] object OrcFilters extends OrcFiltersBase {
    // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters
    // in order to distinguish predicate pushdown for nested columns.
Since we removed `quoteIfNeeded` in this file completely, I believe we can remove this old comment (231~232) together in both files, v1.2 (here) and v2.3.
    checkFilterPredicate(Literal(1) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
    checkFilterPredicate(Literal(4) <= $"_1", PredicateLeaf.Operator.LESS_THAN)
    withNestedOrcDataFrame(
      (1 to 4).map(i => Tuple1(Option(i.toDouble)))) { case (inputDF, colName, _) =>
indentation?
    }
  }

  test("filter pushdown - decimal") {
    withOrcDataFrame((1 to 4).map(i => Tuple1.apply(BigDecimal.valueOf(i)))) { implicit df =>
      checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL)
  withNestedOrcDataFrame((1 to 4)
This format looks inconsistent from your other code change. Is this intentional due to some limitation?
You mean indentation?
I mean `(1 to 4)`.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
Thanks, @viirya . I finished one round review. Could you take a look at the comments?
Thanks @dongjoon-hyun for the review. Except for #28761 (comment), I think all other comments were addressed. I will add test coverage for that later.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
Yes, @viirya. For that one, let's do it later in another PR.
+1, LGTM (except one minor function naming comment)
Test build #127180 has finished for PR 28761 at commit
Test build #127182 has finished for PR 28761 at commit
Thank you, @viirya . GitHub Action passed.
Merged to master for Apache Spark 3.1 in December.
This will help other nested column PRs, too.
Thank you, @maropu and @HyukjinKwon , too.
cc @cloud-fan , @dbtsai , @gatorsmile , too.
Thanks all.
+1 looks good to me too
… `Filter` ### What changes were proposed in this pull request? This PR aims to remove the `private[sql]` function `containsNestedColumn` from `org.apache.spark.sql.sources.Filter`. This function was introduced by #27728 to avoid nested predicate pushdown for ORC. After #28761, ORC also supports nested column predicate pushdown, so this function became unused. ### Why are the changes needed? Remove the unused `private[sql]` function `containsNestedColumn`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes #42239 from LuciferYang/SPARK-44607. Authored-by: yangjie01 <[email protected]> Signed-off-by: yangjie01 <[email protected]>
What changes were proposed in this pull request?
We added nested column predicate pushdown for Parquet in #27728. This patch extends the feature support to ORC.
Why are the changes needed?
Extending the feature to ORC for feature parity, and for better performance when pushing down predicates on nested columns.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests.
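To illustrate why pushing predicates down into ORC helps: ORC keeps min/max statistics per stripe, so a pushed-down filter such as one on a nested field like `person.age` can skip whole stripes without decoding any rows. A toy sketch of that skipping logic, under stated assumptions (the `Stripe` and `canMatchGreaterThan` names are hypothetical, not ORC's actual API):

```scala
// Minimal model of a stripe: its column statistics plus the row values.
case class Stripe(min: Int, max: Int, rows: Seq[Int])

object PushdownSketch {
  // A stripe must be read only if its statistics admit a row with x > v;
  // if max <= v, the whole stripe can be skipped without decoding it.
  def canMatchGreaterThan(s: Stripe, v: Int): Boolean = s.max > v

  def main(args: Array[String]): Unit = {
    val stripes = Seq(
      Stripe(1, 10, Seq(1, 5, 10)),
      Stripe(41, 60, Seq(41, 55)))
    // With the predicate x > 40 pushed down, only the second stripe is read.
    val scanned = stripes.filter(canMatchGreaterThan(_, 40))
    println(scanned.size)
  }
}
```

In the real implementation Spark builds an ORC `SearchArgument` from the Catalyst filters, and the ORC reader performs this stripe-level elimination; the sketch only shows the statistics-based skipping idea.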