[SPARK-25557][SQL] Nested column predicate pushdown for ORC #28761
Conversation
Test build #123660 has finished for PR 28761 at commit
Test build #123662 has finished for PR 28761 at commit
Basically, looks okay to me.
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcV1FilterSuite.scala
Test build #123727 has finished for PR 28761 at commit
retest this please
Test build #123732 has finished for PR 28761 at commit
retest this please
Test build #124143 has finished for PR 28761 at commit
kindly ping @dbtsai @dongjoon-hyun @cloud-fan
retest this please
Test build #124506 has finished for PR 28761 at commit
...core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcTest.scala
Test build #124634 has finished for PR 28761 at commit
…potential conflicts in dev ### What changes were proposed in this pull request? This PR proposes to partially revert the tests and some code from #27728 without touching any behaviours. Most of the test changes are restored to their pre-#27728 state by combining `withNestedDataFrame` and `withParquetDataFrame`. Basically, it addresses the comments at #27728 (comment) and my own comment in another PR at #28761 (comment). ### Why are the changes needed? For maintenance purposes and to avoid potential conflicts during backports, and also in case other code matches this one. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested. Closes #28955 from HyukjinKwon/SPARK-25556-followup. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
I'll clean up the tests more.
Test build #126759 has finished for PR 28761 at commit
retest this please
Test build #126761 has finished for PR 28761 at commit
Thank you for confirming all tests, @viirya . I'll review today. |
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScanBuilder.scala
    // mode, just skip pushdown for these fields, they will trigger Exception when reading,
    // See: SPARK-25175.
    val dedupPrimitiveFields =
      primitiveFields
indentation?
- val dedupPrimitiveFields =
- primitiveFields
+ val dedupPrimitiveFields = primitiveFields
    if (caseSensitive) {
      primitiveFields.toMap
    } else {
      // Don't consider ambiguity here, i.e. more than one field is matched in case insensitive
`is matched` -> `are matched`?
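The case-sensitivity handling being reviewed here can be sketched as a small standalone example: in case-insensitive mode, fields whose lower-cased names collide are ambiguous and must be dropped from the pushdown map. The object and method names below (`FieldDedupSketch`, `dedupFields`) are hypothetical, not Spark's actual API:

```scala
object FieldDedupSketch {
  // Map field name -> type string, deduplicating ambiguous names when
  // case-insensitive resolution is in effect.
  def dedupFields(
      caseSensitive: Boolean,
      primitiveFields: Seq[(String, String)]): Map[String, String] = {
    if (caseSensitive) {
      primitiveFields.toMap
    } else {
      // Group by the lower-cased name; keep only groups with exactly one
      // field, since more than one match in case-insensitive mode is
      // ambiguous and cannot be pushed down safely.
      primitiveFields
        .groupBy(_._1.toLowerCase(java.util.Locale.ROOT))
        .filter(_._2.size == 1)
        .map { case (lowerCased, fields) => (lowerCased, fields.head._2) }
    }
  }

  def main(args: Array[String]): Unit = {
    val fields = Seq("id" -> "int", "ID" -> "int", "name" -> "string")
    // "id"/"ID" collide case-insensitively, so only "name" survives.
    println(dedupFields(caseSensitive = false, fields))
    println(dedupFields(caseSensitive = true, fields).size)
  }
}
```

In the real code the map drives which pushdown candidates survive; the sketch only illustrates the collision rule the comment thread is about.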
@@ -78,7 +78,7 @@ abstract class OrcTest extends QueryTest with FileBasedDataSourceTest with Befor
     (f: String => Unit): Unit = withDataSourceFile(data)(f)

   /**
-   * Writes `data` to a Orc file and reads it back as a `DataFrame`,
+   * Writes `date` dataframe to a Orc file and reads it back as a `DataFrame`,
Is there a reason you change this from `data` to `date`? This is not limited to `DATE`. The original one looks correct to me.
Oops, a typo. :) Will correct it.
   * dataframes as new test data. It tests both non-nested and nested dataframes
   * which are written and read back with Orc datasource.
   *
   * This is different from [[OrcTest.withOrcDataFrame]] which does not
Do we need the `OrcTest.` prefix?
   * This method returns a map which contains ORC field name and data type. Each key
   * represents a column; `dots` are used as separators for nested columns. If any part
   * of the names contains `dots`, it is quoted to avoid confusion. See
   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.
`quote` -> `quoted`.
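The quoting rule described in the doc comment above can be sketched in a few lines: when a nested column is flattened into a dot-separated ORC name, any part that itself contains a dot is wrapped in backticks (with embedded backticks doubled), so the field `` `a.b` `` stays distinguishable from the path `a.b`. The names below (`QuotingSketch`, `toOrcColumnName`) are illustrative, not Spark's exact API:

```scala
object QuotingSketch {
  // Quote a single path segment only when it would otherwise be ambiguous.
  def quoteIfNeeded(part: String): String =
    if (part.contains(".") || part.contains("`")) {
      s"`${part.replace("`", "``")}`"
    } else {
      part
    }

  // Join nested field names into one dot-separated ORC column name.
  def toOrcColumnName(path: Seq[String]): String =
    path.map(quoteIfNeeded).mkString(".")

  def main(args: Array[String]): Unit = {
    println(toOrcColumnName(Seq("person", "age"))) // person.age
    println(toOrcColumnName(Seq("a.b", "c")))      // `a.b`.c
  }
}
```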
      case BinaryType => false
      case _: AtomicType => true
      case _ => false

  protected[sql] def getNameToOrcFieldMap(
- `OrcField` looks a little mismatched because this function returns `DataType` instead of a field. Currently, it sounds like `sToOrcField`.
- According to the behavior of this function, it ignores BinaryType, complex types, and UserDefinedType. Also, the function description doesn't mention this limitation at all. To be clearer, we had better have `Searchable` in the function name like the previous one (`isSearchableType`).
@@ -231,37 +229,37 @@ private[sql] object OrcFilters extends OrcFiltersBase {
    // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters
    // in order to distinguish predicate pushdown for nested columns.
Since we removed `quoteIfNeeded` in this file completely, I believe we can remove this old comment (231~232) together in both files, v1.2 (here) and v2.3.
    checkFilterPredicate(Literal(1) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
    checkFilterPredicate(Literal(4) <= $"_1", PredicateLeaf.Operator.LESS_THAN)
    withNestedOrcDataFrame(
      (1 to 4).map(i => Tuple1(Option(i.toDouble)))) { case (inputDF, colName, _) =>
indentation?
    }
  }

  test("filter pushdown - decimal") {
    withOrcDataFrame((1 to 4).map(i => Tuple1.apply(BigDecimal.valueOf(i)))) { implicit df =>
      checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL)
  withNestedOrcDataFrame((1 to 4)
This format looks inconsistent from your other code change. Is this intentional due to some limitation?
You mean indentation?
I mean `(1 to 4)`.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
Thanks, @viirya . I finished one round review. Could you take a look at the comments?
Thanks @dongjoon-hyun for the review. Except for #28761 (comment), I think all other comments were addressed. I will add test coverage for that later.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
Yes, @viirya. For that one, let's do it later in another PR.
+1, LGTM (except one minor function naming comment)
Test build #127180 has finished for PR 28761 at commit
Test build #127182 has finished for PR 28761 at commit
Thank you, @viirya . GitHub Action passed.
Merged to master for Apache Spark 3.1 in December.
This will help other nested column PRs, too.
Thank you, @maropu and @HyukjinKwon , too.
cc @cloud-fan , @dbtsai , @gatorsmile , too.
Thanks all.
+1 looks good to me too
… `Filter` ### What changes were proposed in this pull request? This PR aims to remove the `private[sql]` function `containsNestedColumn` from `org.apache.spark.sql.sources.Filter`. This function was introduced by #27728 to avoid nested predicate pushdown for ORC. After #28761, ORC also supports nested column predicate pushdown, so this function became unused. ### Why are the changes needed? Remove the unused `private[sql]` function `containsNestedColumn`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes #42239 from LuciferYang/SPARK-44607. Authored-by: yangjie01 <[email protected]> Signed-off-by: yangjie01 <[email protected]>
What changes were proposed in this pull request?
We added nested column predicate pushdown for Parquet in #27728. This patch extends the feature support to ORC.
Why are the changes needed?
Extending the feature to ORC for feature parity, and for better performance when pushing down predicates on nested columns.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests.
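To illustrate why pushing predicates down into ORC helps: ORC keeps min/max statistics per stripe, so a pushed-down filter such as one on a nested field like `person.age` can skip whole stripes without decoding any rows. A toy sketch of that skipping logic, under stated assumptions (the `Stripe` and `canMatchGreaterThan` names are hypothetical, not ORC's actual API):

```scala
// Minimal model of a stripe: its column statistics plus the row values.
case class Stripe(min: Int, max: Int, rows: Seq[Int])

object PushdownSketch {
  // A stripe must be read only if its statistics admit a row with x > v;
  // if max <= v, the whole stripe can be skipped without decoding it.
  def canMatchGreaterThan(s: Stripe, v: Int): Boolean = s.max > v

  def main(args: Array[String]): Unit = {
    val stripes = Seq(
      Stripe(1, 10, Seq(1, 5, 10)),
      Stripe(41, 60, Seq(41, 55)))
    // With the predicate x > 40 pushed down, only the second stripe is read.
    val scanned = stripes.filter(canMatchGreaterThan(_, 40))
    println(scanned.size)
  }
}
```

In the real implementation Spark builds an ORC `SearchArgument` from the Catalyst filters, and the ORC reader performs this stripe-level elimination; the sketch only shows the statistics-based skipping idea.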