[HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication #4877

alexeykudinkin · 2022-02-23T03:40:45Z

What is the purpose of the pull request

NOTE: This PR is stacked on top of #4818

Refactoring Spark DataSource Relations to avoid code duplication. The following Relations were in scope:

BaseFileOnlyViewRelation
MergeOnReadSnapshotRelation
MergeOnReadIncrementalRelation

Brief change log

See above

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

xushiyan · 2022-03-11T14:43:24Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala

+    sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.filterPushdown", "true")
+    sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "true")
+    sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "true")
+  }


this one was false

Correct. There's no reason to disable vectorization.

Confirmed this with @YannByron

Yep, enableVectorizedReader was false. But as discussed with @alexeykudinkin before, we need to enable this to speed things up.

alexeykudinkin · 2022-03-14T20:22:33Z

...spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/HoodieHadoopFSUtils.scala

+                                            blockLocations: Array[SerializableBlockLocation])
+
+  /** Checks if we should filter out this path name. */
+  def shouldFilterOutPathName(pathName: String): Boolean = {


This is the only thing that changed compared to Spark's HadoopFSUtils.

alexeykudinkin · 2022-03-15T05:14:57Z

@hudi-bot run azure

alexeykudinkin · 2022-03-16T03:50:52Z

@hudi-bot run azure

alexeykudinkin · 2022-03-16T16:50:31Z

@hudi-bot run azure

hudi-bot · 2022-03-16T21:10:58Z

CI report:

fa17c3d UNKNOWN
379c451 UNKNOWN
be26058 UNKNOWN
40e5a85 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

xushiyan

LGTM direction-wise. The extracted BaseRelation logic is a bit hard to examine line by line. Testing will be a more effective way to verify it. If the results are OK, it's good to land this.

xushiyan · 2022-03-18T18:50:46Z

...spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/HoodieHadoopFSUtils.scala

+    val exclude = (pathName.startsWith("_") && !pathName.contains("=")) || pathName.endsWith("._COPYING_")
+    val include = pathName.startsWith("_common_metadata") || pathName.startsWith("_metadata")
+    exclude && !include


Should some utils from MDT be the source of truth for these rules instead? The Spark side does not own these, and it would also avoid copying this over for different Spark versions.

Right now this is mostly about filtering out Spark-specific stuff. We can replace it with our own utils when there is a need for it, but for now the goal of borrowing this class was to override its behavior of filtering out files starting with ".".
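The effect of the borrowed filter can be checked in isolation. The following is a minimal sketch that wraps the three lines quoted above in a standalone object. The object name `PathNameFilterSketch` and the sample file names are hypothetical; the real method lives in `HoodieHadoopFSUtils`, and this sketch only assumes that the quoted lines make up its whole body.

```scala
// Hypothetical standalone wrapper around the filter quoted in this thread.
object PathNameFilterSketch {

  /** Returns true if a file or directory with this name should be skipped during listing. */
  def shouldFilterOutPathName(pathName: String): Boolean = {
    // Names starting with "_" are skipped unless they look like partition dirs ("_col=value");
    // in-flight copies ending with "._COPYING_" are skipped as well.
    // Per the discussion above, names starting with "." are deliberately NOT excluded here.
    val exclude = (pathName.startsWith("_") && !pathName.contains("=")) || pathName.endsWith("._COPYING_")
    // Parquet summary files are always kept.
    val include = pathName.startsWith("_common_metadata") || pathName.startsWith("_metadata")
    exclude && !include
  }

  def main(args: Array[String]): Unit = {
    // The file names below are made up for illustration.
    Seq("_SUCCESS", "_metadata", "_year=2022", ".hidden.log", "part-0.parquet._COPYING_")
      .foreach(name => println(s"$name -> ${shouldFilterOutPathName(name)}"))
  }
}
```

With this rule a name such as `.hidden.log` passes through the listing, while `_SUCCESS` and in-flight `._COPYING_` files are still dropped.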

xushiyan · 2022-03-18T19:17:06Z

...k-common/src/main/scala/org/apache/spark/execution/datasources/HoodieInMemoryFileIndex.scala

+      hadoopConf = hadoopConf,
+      filter = new PathFilterWrapper(filter),
+      ignoreMissingFiles = sparkSession.sessionState.conf.ignoreMissingFiles,
+      // NOTE: We're disabling fetching Block Info to speed up file listing


We may need a special token here to mark the parts changed in Hudi's codebase, for easier maintenance. // NOTE: is not special enough. What about // HUDI NOTE: ? This can apply to any other incoming code variation.

Not sure I understand your point here: what do you suggest this token be used for?

I meant that when we want to understand which parts of the code were modified in Hudi, we can search for a special token and find the relevant code. NOTE: might come from the original code base, so I wanted to make it special.

Gotcha. It's gonna be tough to identify all such places with markers. Instead, I'm referencing the respective Spark release version this was borrowed from, so that we can simply diff against it and see what has changed.

XuQianJin-Stars · 2022-03-19T00:34:31Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala

+    sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "true")
+    // TODO(HUDI-3639) vectorized reader has to be disabled to make sure MORIncrementalRelation is working properly
+    sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "false")
+  }


Should spark.sql.parquet.enableVectorizedReader be set to true to enable vectorization acceleration?

Please take a look at the TODO note I've added to it. We can't do that because MOR Incremental Relation relies on Parquet filtering, which doesn't work with the vectorized reader.
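The interaction behind this TODO can be modeled without Spark on the classpath. The sketch below is a plain-Scala model, not Spark's API: `ParquetReadConfSketch` and `recordLevelFilteringActive` are hypothetical names. The rule it encodes is the one stated in Spark's documentation for `spark.sql.parquet.recordLevelFilter.enabled`, which only takes effect when filter pushdown is enabled and the vectorized reader is not used. It also assumes that `spark.sql.parquet.filterPushdown` is still set to "true", as in the earlier revision of this hunk.

```scala
// Plain-Scala model of the three Parquet settings discussed in this thread (not Spark's API).
object ParquetReadConfSketch {
  val FilterPushdown = "spark.sql.parquet.filterPushdown"
  val RecordLevelFilter = "spark.sql.parquet.recordLevelFilter.enabled"
  val VectorizedReader = "spark.sql.parquet.enableVectorizedReader"

  // Settings as of the final revision quoted above; filterPushdown is assumed unchanged.
  val mergedConf: Map[String, String] = Map(
    FilterPushdown -> "true",
    RecordLevelFilter -> "true",
    VectorizedReader -> "false"
  )

  // Missing keys are treated as false, which is a simplification of Spark's real defaults.
  private def enabled(conf: Map[String, String], key: String): Boolean =
    conf.get(key).exists(_.toBoolean)

  /** Hypothetical helper: true when pushed-down filters are applied record by record. */
  def recordLevelFilteringActive(conf: Map[String, String]): Boolean =
    enabled(conf, FilterPushdown) && enabled(conf, RecordLevelFilter) && !enabled(conf, VectorizedReader)

  def main(args: Array[String]): Unit = {
    println(recordLevelFilteringActive(mergedConf))
    println(recordLevelFilteringActive(mergedConf + (VectorizedReader -> "true")))
  }
}
```

In this model, turning the vectorized reader back on switches record-level filtering off, which matches the trade-off described in the reply above.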

XuQianJin-Stars · 2022-03-19T01:18:46Z

+1 LGTM

xushiyan · 2022-03-19T05:31:41Z

Manually tested this patch in Spark 3.2.1 using the quickstart examples, and it passed. Landing this.

alexeykudinkin changed the title ~~[HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication~~ [HUDI-3457][Stacked on 4818] Refactored Spark DataSource Relations to avoid code duplication Feb 23, 2022

alexeykudinkin force-pushed the ak/spkds-ref-2 branch from af2c977 to 2eb193a Compare February 24, 2022 21:30

alexeykudinkin commented Feb 24, 2022

alexeykudinkin force-pushed the ak/spkds-ref-2 branch from 2eb193a to d875e41 Compare February 24, 2022 22:24

alexeykudinkin force-pushed the ak/spkds-ref-2 branch from d875e41 to 2940f46 Compare March 11, 2022 02:40

xushiyan reviewed Mar 11, 2022

alexeykudinkin changed the title ~~[HUDI-3457][Stacked on 4818] Refactored Spark DataSource Relations to avoid code duplication~~ [Stacked on 4818] Refactored Spark DataSource Relations to avoid code duplication Mar 11, 2022

alexeykudinkin changed the title ~~[Stacked on 4818] Refactored Spark DataSource Relations to avoid code duplication~~ [HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication Mar 11, 2022

alexeykudinkin force-pushed the ak/spkds-ref-2 branch from 2940f46 to c71cfab Compare March 12, 2022 01:43

alexeykudinkin commented Mar 14, 2022

alexeykudinkin force-pushed the ak/spkds-ref-2 branch from be26058 to ec7e1b3 Compare March 16, 2022 16:53

Alexey Kudinkin added 16 commits March 16, 2022 12:06

Abstracted & unified buildScan functionality for COW/MOR Relations;

edeea57

Moved `buildScan` impl into `HoodieBaseRelation`

BaseFileOnlyViewRelation > BaseFileRelation

b5cf9f0

Fixing compilation

787b6a3

Extracted common converter utils to HoodieCommonUtils;

d5d3a3a

Tidying up

Abstracted common functionality;

1bf0933

Tidying up

Extracted common functionality to lists latest base files into `HoodieBaseRelation`

bc639ed

Streamlined MergeOnReadSnapshotRelation to re-use common functionality, avoid duplication

c86bba7

Killing dead code;

70356e5

Tidying up;

Further simplified MergeOnReadSnapshotRelation

0b2d604

lint

1bc09d3

Cleaned up & streamlined MergeOnReadIncrementalRelation

f8aa085

Tidying up

b9fa316

Extract most of the incremental-specific aspects into a trait that could be shared w/ COW impl

804bb96

Fixing compilation

899db46

Cleaning up unnecessary filtering

48af420

After rebase fixes

6027652

Alexey Kudinkin added 17 commits March 16, 2022 12:06

Scaffolded HoodieInMemoryFileIndex and replicated `HoodieHadoopFSUtils` to override default behavior of InMemoryFileIndex filtering out all files starting w/ "."

1d45bf0

Fixed usages

b7a4f8b

Moved tests

40b0c05

Missing licenses

0ab3b9b

Disabling linter

86e8fe3

Fixed compilation for Spark 2.x

83bd0ea

Added missing scala-docs

fe8c7a8

Fixed incorrect casting

dcd693d

Fixed partition path handling for MOR Incremental Relation

35eb6df

Fixed HoodieIncrementalRelationTrait to extend HoodieBaseRelation to be able to appropriately invoke super

eee5151

Handle the case when there are no commits to handle in Incremental Relation

d85be0b

Return empty RDD in case there's no file-splits to handle

f966aec

Cleaned up listLatestBaseFiles

71b1435

Added TODO

e39f963

Fixing file handle leak

f37854b

Disabled vectorized reader to make sure MOR Incremental Relation works properly

b0aa03e

Fixed Parquet column-projection tests

40e5a85

alexeykudinkin force-pushed the ak/spkds-ref-2 branch from ec7e1b3 to 40e5a85 Compare March 16, 2022 19:06

nsivabalan added the priority:blocker label Mar 16, 2022

xushiyan approved these changes Mar 18, 2022

XuQianJin-Stars reviewed Mar 19, 2022

XuQianJin-Stars approved these changes Mar 19, 2022

xushiyan merged commit 099c2c0 into apache:master Mar 19, 2022
