[HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication #4877

Merged (33 commits) on Mar 19, 2022

@alexeykudinkin (Contributor) commented Feb 23, 2022

What is the purpose of the pull request

NOTE: This PR is stacked on top of #4818

Refactoring Spark DataSource Relations to avoid code duplication. The following Relations were in scope:

  • BaseFileOnlyViewRelation
  • MergeOnReadSnapshotRelation
  • MergeOnReadIncrementalRelation

Brief change log

See above

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin changed the title from "[HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication" to "[HUDI-3457][Stacked on 4818] Refactored Spark DataSource Relations to avoid code duplication" on Feb 23, 2022

sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.filterPushdown", "true")
sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "true")
sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "true")
}

Member

This one was previously false.

Contributor Author

Correct. There's no reason to disable vectorization.

Confirmed this with @YannByron

Contributor

Yep, enableVectorizedReader was false. But as discussed with @alexeykudinkin before, we need to enable it to speed up reads.

@alexeykudinkin changed the title from "[HUDI-3457][Stacked on 4818] Refactored Spark DataSource Relations to avoid code duplication" to "[Stacked on 4818] Refactored Spark DataSource Relations to avoid code duplication" on Mar 11, 2022
@alexeykudinkin changed the title from "[Stacked on 4818] Refactored Spark DataSource Relations to avoid code duplication" to "[HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication" on Mar 11, 2022

blockLocations: Array[SerializableBlockLocation])

/** Checks if we should filter out this path name. */
def shouldFilterOutPathName(pathName: String): Boolean = {

Contributor Author

This is the only thing that changed as compared to Spark's HadoopFsUtils

@alexeykudinkin (Contributor Author)

@hudi-bot run azure

@xushiyan (Member) left a comment

LGTM direction-wise. The extracted BaseRelation logic is a bit hard to examine line by line; testing will be a more effective way to verify it. If the results are OK, it's good to land this.

Comment on lines +366 to +368
val exclude = (pathName.startsWith("_") && !pathName.contains("=")) || pathName.endsWith("._COPYING_")
val include = pathName.startsWith("_common_metadata") || pathName.startsWith("_metadata")
exclude && !include

Member

Should some utils from MDT be the source of truth for these rules instead? The Spark side does not own them, and that would also avoid copying this code across different Spark versions.

Contributor Author

Right now this is mostly about filtering out Spark-specific stuff. We can replace it with our own utils when there is a need for it, but for now the goal of borrowing this class was to override its behavior of filtering out files starting with ".".
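
To make the rule under discussion concrete, here is a minimal standalone sketch that wraps the three lines quoted in the review (+366 to +368) and runs them over a few file names. The object name and the sample names are hypothetical and chosen only to exercise each branch; this is an illustration of the quoted predicate, not the full class from the PR.

```scala
object PathNameFilterSketch {

  /** Returns true when the name should be skipped during file listing. */
  def shouldFilterOutPathName(pathName: String): Boolean = {
    // Underscore-prefixed names are excluded unless they contain '='
    // (which makes them look like a partition directory); names ending
    // in "._COPYING_" are excluded as well.
    val exclude = (pathName.startsWith("_") && !pathName.contains("=")) || pathName.endsWith("._COPYING_")
    // Parquet summary files are kept even though they start with '_'.
    val include = pathName.startsWith("_common_metadata") || pathName.startsWith("_metadata")
    exclude && !include
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample names, not taken from the PR.
    val samples = Seq(
      "_SUCCESS",
      "_metadata",
      "_year=2022",
      "part-0000.parquet._COPYING_",
      ".hidden_file",
      "part-0000.parquet"
    )
    samples.foreach { name =>
      println(s"$name -> filtered out: ${shouldFilterOutPathName(name)}")
    }
  }
}
```

Note that a name with a leading "." is not filtered out by this predicate, which matches the stated goal of the borrowed copy.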

hadoopConf = hadoopConf,
filter = new PathFilterWrapper(filter),
ignoreMissingFiles = sparkSession.sessionState.conf.ignoreMissingFiles,
// NOTE: We're disabling fetching Block Info to speed up file listing

Member

We may need a special token here to mark the parts changed in Hudi's codebase, for easier maintenance. // NOTE: is not special enough. What about // HUDI NOTE: ? This can apply to any other incoming code variation.

Contributor Author

Not sure I understand your point here: what do you suggest this token be used for?

Member

I meant that when we want to understand which parts of the code are modified in Hudi, we can search for a special token and find the relevant code. NOTE: might come from the original codebase, so I wanted to make the token distinct.

Contributor Author

Gotcha. It's going to be tough to identify all such places with markers. Instead, I'm referencing the respective Spark release version this is borrowed from, so that we can simply diff against it and see what has changed.

sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "true")
// TODO(HUDI-3639) vectorized reader has to be disabled to make sure MORIncrementalRelation is working properly
sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "false")
}

Contributor

Should spark.sql.parquet.enableVectorizedReader be set to true to enable vectorization acceleration?

@alexeykudinkin (Contributor Author), Mar 19, 2022

Please take a look at the TODO note I've added to it. We can't do that because MOR Incremental Relation relies on Parquet filtering, which doesn't work with the vectorized reader.

@XuQianJin-Stars (Contributor)

+1 LGTM

@xushiyan (Member)

Manually tested this patch on Spark 3.2.1 using the quickstart examples, and it passed. Landing this.

@xushiyan xushiyan merged commit 099c2c0 into apache:master Mar 19, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022