
[SPARK-25557][SQL] Nested column predicate pushdown for ORC #28761

Closed
wants to merge 9 commits into from

Conversation

viirya
Member

@viirya viirya commented Jun 9, 2020

What changes were proposed in this pull request?

We added nested column predicate pushdown for Parquet in #27728. This patch extends the feature support to ORC.

Why are the changes needed?

Extending the feature to ORC gives feature parity with Parquet, and better performance when filters on nested columns can be pushed down.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests.
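As background for reviewers: predicate pushdown on a nested column works by flattening the nested schema into dotted leaf names (a filter on `a.b` is keyed by the name "a.b") and mapping each name to its leaf type. The sketch below illustrates that flattening with hypothetical stand-in types; it is not Spark's actual `OrcFilters` code.

```scala
// Minimal, self-contained sketch (hypothetical types, not Spark's classes):
// flatten a nested schema into the dotted-name -> leaf-type pairs that
// predicate pushdown consults when deciding which filters can be pushed.
sealed trait DType
case object IntT extends DType
case object StringT extends DType
case class StructT(fields: Seq[(String, DType)]) extends DType

def leafFields(dt: DType, prefix: Seq[String] = Nil): Seq[(String, DType)] = dt match {
  case StructT(fields) =>
    // Recurse into struct fields, extending the dotted path.
    fields.flatMap { case (name, t) => leafFields(t, prefix :+ name) }
  case leaf =>
    // A primitive leaf: its pushdown name is the dot-joined path.
    Seq((prefix.mkString("."), leaf))
}
```

For a schema like `struct<a: struct<b: int>, c: string>` this yields the entries `("a.b", IntT)` and `("c", StringT)`, so a filter on `a.b` can be matched against the map.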

@SparkQA

SparkQA commented Jun 9, 2020

Test build #123660 has finished for PR 28761 at commit 1486382.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 9, 2020

Test build #123662 has finished for PR 28761 at commit e12939e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jun 9, 2020

Member

@maropu maropu left a comment

Basically, looks okay to me.

@SparkQA

SparkQA commented Jun 10, 2020

Test build #123727 has finished for PR 28761 at commit bd691ed.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jun 10, 2020

retest this please

@SparkQA

SparkQA commented Jun 10, 2020

Test build #123732 has finished for PR 28761 at commit bd691ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jun 16, 2020

retest this please

@SparkQA

SparkQA commented Jun 17, 2020

Test build #124143 has finished for PR 28761 at commit bd691ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 25, 2020

kindly ping @dbtsai @dongjoon-hyun @cloud-fan

@maropu
Member

maropu commented Jun 25, 2020

retest this please

@SparkQA

SparkQA commented Jun 25, 2020

Test build #124506 has finished for PR 28761 at commit bd691ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124634 has finished for PR 28761 at commit e76b5f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon added a commit that referenced this pull request Jul 1, 2020
…potential conflicts in dev

### What changes were proposed in this pull request?

This PR proposes to partially revert the tests and some code at #27728 without touching any behaviours.

Most of the test changes are reverted to their pre-#27728 form by combining `withNestedDataFrame` and `withParquetDataFrame`.

Basically, it addresses the comments at #27728 (comment) and my own comment in another PR at #28761 (comment).

### Why are the changes needed?

For maintenance purposes and to avoid potential conflicts during backports, and also in case other code matches this pattern.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested.

Closes #28955 from HyukjinKwon/SPARK-25556-followup.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Jul 1, 2020
…potential conflicts in dev
(cherry picked from commit 8194d9e)
Signed-off-by: HyukjinKwon <[email protected]>
@viirya
Member Author

viirya commented Jul 29, 2020

I'll clean up the tests more.

@SparkQA

SparkQA commented Jul 29, 2020

Test build #126759 has finished for PR 28761 at commit 7175e7c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jul 29, 2020

Test build #126761 has finished for PR 28761 at commit 7175e7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you for confirming all tests, @viirya . I'll review today.

// mode, just skip pushdown for these fields, they will trigger Exception when reading,
// See: SPARK-25175.
val dedupPrimitiveFields =
primitiveFields
Member

indentation?

- val dedupPrimitiveFields =
- primitiveFields
+ val dedupPrimitiveFields = primitiveFields

if (caseSensitive) {
primitiveFields.toMap
} else {
// Don't consider ambiguity here, i.e. more than one field is matched in case insensitive
Member

is matched -> are matched?
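The quoted comment concerns the case-insensitive path: if more than one ORC field matches a name after lower-casing, pushdown is simply skipped for that name rather than picking one arbitrarily. A minimal sketch of that dedup step (hypothetical helper, not the PR's exact code):

```scala
// Keep only field names that are unambiguous once lower-cased; fields that
// collide case-insensitively are dropped, so no predicate is pushed for them.
// Here a field is modeled as (name, typeName) for illustration.
def dedupCaseInsensitive(fields: Seq[(String, String)]): Map[String, String] =
  fields
    .groupBy { case (name, _) => name.toLowerCase }
    .collect { case (lowerName, Seq((_, tpe))) => lowerName -> tpe }
```

Ambiguous names (e.g. both `col` and `COL` present) simply disappear from the map, so no filter referencing them is pushed down.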

@@ -78,7 +78,7 @@ abstract class OrcTest extends QueryTest with FileBasedDataSourceTest with Befor
(f: String => Unit): Unit = withDataSourceFile(data)(f)

/**
* Writes `data` to a Orc file and reads it back as a `DataFrame`,
* Writes `date` dataframe to a Orc file and reads it back as a `DataFrame`,
Member

Is there a reason you changed this from `data` to `date`? This helper is not limited to dates. The original looks correct to me.

Member Author

@viirya viirya Aug 6, 2020

Oops, a typo. :) Will correct it.

* dataframes as new test data. It tests both non-nested and nested dataframes
* which are written and read back with Orc datasource.
*
* This is different from [[OrcTest.withOrcDataFrame]] which does not
Member

Do we need OrcTest. prefix?

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-25557][SQL][test-hadoop2.7][test-hive1.2] Nested column predicate pushdown for ORC [SPARK-25557][SQL] Nested column predicate pushdown for ORC Aug 6, 2020
* This method returns a map which contains ORC field name and data type. Each key
* represents a column; `dots` are used as separators for nested columns. If any part
* of the names contains `dots`, it is quoted to avoid confusion. See
* `org.apache.spark.sql.connector.catalog.quote` for implementation details.
Member

quote -> quoted.
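For context on the quoting the doc comment describes, here is a hedged sketch of the rule (modeled on Spark's `quoteIfNeeded`; the helper names here are illustrative): a path part containing a dot or backtick is wrapped in backticks, with embedded backticks doubled, so a nested path and a single column whose name contains a literal dot stay distinguishable once joined with dots.

```scala
// Wrap a name part in backticks when it contains a dot or backtick,
// doubling any embedded backticks, then join path parts with dots.
def quoteIfNeeded(part: String): String =
  if (part.contains(".") || part.contains("`")) s"`${part.replace("`", "``")}`"
  else part

def toDottedName(path: Seq[String]): String = path.map(quoteIfNeeded).mkString(".")
```

So `toDottedName(Seq("a", "b"))` gives `a.b` (a nested path), while `toDottedName(Seq("a.b"))` gives `` `a.b` `` (one column literally named a.b).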

case BinaryType => false
case _: AtomicType => true
case _ => false
protected[sql] def getNameToOrcFieldMap(
Member

@dongjoon-hyun dongjoon-hyun Aug 6, 2020

  1. OrcField looks a little mismatched because this function returns a DataType instead of a field; as it stands, it reads like ToOrcField.
  2. Judging by its behavior, this function ignores BinaryType, complex types, and UserDefinedType, but the function description doesn't mention that limitation at all. To be clearer, we had better keep Searchable in the function name, like the previous one (isSearchableType).

@@ -231,37 +229,37 @@ private[sql] object OrcFilters extends OrcFiltersBase {
// Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters
// in order to distinguish predicate pushdown for nested columns.
Member

Since we removed quoteIfNeeded in this file completely, I believe we can remove this old comment (231~232) together in both files v1.2(here) and v2.3.

checkFilterPredicate(Literal(1) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
checkFilterPredicate(Literal(4) <= $"_1", PredicateLeaf.Operator.LESS_THAN)
withNestedOrcDataFrame(
(1 to 4).map(i => Tuple1(Option(i.toDouble)))) { case (inputDF, colName, _) =>
Member

indentation?

}
}

test("filter pushdown - decimal") {
withOrcDataFrame((1 to 4).map(i => Tuple1.apply(BigDecimal.valueOf(i)))) { implicit df =>
checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL)
withNestedOrcDataFrame((1 to 4)
Member

This format looks inconsistent with your other code changes. Is this intentional due to some limitation?

Member Author

You mean indentation?

Member

I mean (1 to 4).

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thanks, @viirya . I finished one round review. Could you take a look at the comments?

@viirya
Member Author

viirya commented Aug 7, 2020

Thanks @dongjoon-hyun for the review. Except for #28761 (comment), I think all other comments were addressed. I will add test coverage for that later.

@dongjoon-hyun
Member

Yes, @viirya . For that one, let's do later in another PR.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM (except one minor function naming comment)

@SparkQA

SparkQA commented Aug 7, 2020

Test build #127180 has finished for PR 28761 at commit 558db46.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 7, 2020

Test build #127182 has finished for PR 28761 at commit dc77290.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @viirya . GitHub Action passed.
Merged to master for Apache Spark 3.1, due in December.
This will help other nested column PRs, too.

Thank you, @maropu and @HyukjinKwon , too.

cc @cloud-fan , @dbtsai , @gatorsmile , too.

@viirya
Member Author

viirya commented Aug 7, 2020

Thanks all.

@HyukjinKwon
Member

+1 looks good to me too

LuciferYang added a commit that referenced this pull request Aug 2, 2023
… `Filter`

### What changes were proposed in this pull request?
This PR aims to remove the `private[sql]` function `containsNestedColumn` from `org.apache.spark.sql.sources.Filter`.
This function was introduced by #27728 to avoid nested predicate pushdown for Orc.
After #28761, Orc also supports nested column predicate pushdown, so this function has become unused.

### Why are the changes needed?
Remove the unused `private[sql]` function `containsNestedColumn`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #42239 from LuciferYang/SPARK-44607.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
@viirya viirya deleted the SPARK-25557 branch December 27, 2023 18:29
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
… `Filter`