Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns #4818

Merged
merged 40 commits into from
Mar 10, 2022

Conversation

alexeykudinkin
Copy link
Contributor

@alexeykudinkin alexeykudinkin commented Feb 15, 2022

Tips

What is the purpose of the pull request

NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant

This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow

Brief change log

  • Introduced HoodieBaseRDD as a base for all custom RDD impls
  • Extracted common fields/methods to HoodieBaseRelation
  • Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
  • Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin alexeykudinkin changed the title [WIP][HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns [WIP][HUDI-3396][Stacked on 4789] Make sure BaseFileOnlyViewRelation only reads projected columns Feb 15, 2022
@alexeykudinkin alexeykudinkin force-pushed the ak/spkds-ref-1 branch 2 times, most recently from 37d994c to acfa5bc Compare February 19, 2022 00:56
@alexeykudinkin
Copy link
Contributor Author

@hudi-bot run azure

@alexeykudinkin alexeykudinkin changed the title [WIP][HUDI-3396][Stacked on 4789] Make sure BaseFileOnlyViewRelation only reads projected columns [HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns Feb 24, 2022
Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job on the patch!
mostly minor comments and some clarifications.

@@ -348,6 +355,21 @@ protected HoodieWriteConfig getConfig(Boolean autoCommit, Boolean rollbackUsingM
.withRollbackUsingMarkers(rollbackUsingMarkers);
}

protected Dataset<Row> toDataset(List<HoodieRecord> records) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we take in AvroSchema as an argument. may be create another overloaded method which calls into this w/ some default for avro schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this addressed ?

@alexeykudinkin
Copy link
Contributor Author

@hudi-bot run azure

@@ -348,6 +355,21 @@ protected HoodieWriteConfig getConfig(Boolean autoCommit, Boolean rollbackUsingM
.withRollbackUsingMarkers(rollbackUsingMarkers);
}

protected Dataset<Row> toDataset(List<HoodieRecord> records) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this addressed ?

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left couple of comments

@alexeykudinkin
Copy link
Contributor Author

@hudi-bot run azure

@hudi-bot
Copy link

hudi-bot commented Mar 8, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But I will test this patch out and report the results. since this touches core read path, wanted to ensure things are ok.

@nsivabalan nsivabalan merged commit 034adda into apache:master Mar 10, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
… columns (apache#4818)

NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant

This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow

Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
… columns (apache#4818)

NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant

This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow

Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants