merge master in #2

Merged · merged 349 commits · Jan 24, 2022
Conversation

fengjian428 (Owner)
Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick one of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

nsivabalan and others added 30 commits November 25, 2021 16:06
…ssed t/h properties file (#4090)

* Rebased `DFSPropertiesConfiguration` to access Hadoop config in lieu of FS to avoid confusion

* Fixed `readConfig` to take Hadoop's `Configuration` instead of FS;
Fixing usages

* Added test for local FS access

* Rebase to use `FSUtils.getFs`

* Combine properties provided as a file along w/ overrides provided from the CLI

* Added helper utilities to `HoodieClusteringConfig`;
Make sure corresponding config methods fallback to defaults;

* Fixed DeltaStreamer usage to respect properly combined configuration;
Abstracted `HoodieClusteringConfig.from` convenience utility to init Clustering config from `Properties`

* Tidying up

* `lint`

* Reverting changes to `HoodieWriteConfig`

* Tidying up

* Fixed incorrect merge of the props

* Converted `HoodieConfig` to wrap around `Properties` into `TypedProperties`

* Fixed compilation

* Fixed compilation
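
The file-plus-CLI-overrides merge described in this commit boils down to a precedence rule: values from the properties file form the base, and CLI-provided values win. Below is a minimal sketch of that rule using plain `java.util.Properties`; the `mergeProps` helper is hypothetical and is not Hudi's actual `DFSPropertiesConfiguration` API.

```java
import java.util.Properties;

public class PropsMergeSketch {

  // Hypothetical helper: start from the file-backed props, then layer CLI overrides on top.
  static Properties mergeProps(Properties fromFile, Properties cliOverrides) {
    Properties combined = new Properties();
    combined.putAll(fromFile);      // base configuration read from the properties file
    combined.putAll(cliOverrides);  // CLI-provided values take precedence
    return combined;
  }

  public static void main(String[] args) {
    Properties fromFile = new Properties();
    fromFile.setProperty("hoodie.clustering.inline", "false");

    Properties cli = new Properties();
    cli.setProperty("hoodie.clustering.inline", "true");

    // The CLI override wins: prints "true"
    System.out.println(mergeProps(fromFile, cli).getProperty("hoodie.clustering.inline"));
  }
}
```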
* [HUDI-2852] Table metadata returns empty for non-existent partition

* add unit test

* fix code checkstyle

Co-authored-by: wangminchao <[email protected]>
* `ZCurveOptimizeHelper` > `ZOrderingIndexHelper`;
Moved Z-index helper under `hudi.index.zorder` package

* Tidying up `ZOrderingIndexHelper`

* Fixing compilation

* Fixed index new/original table merging sequence to always prefer values from new index;
Cleaned up `HoodieSparkUtils`

* Added test for `mergeIndexSql`

* Abstracted Z-index name composition w/in `ZOrderingIndexHelper`;

* Fixed `DataSkippingUtils` to interrupt pruning in case data filter contains non-indexed column reference

* Properly handle exceptions originating during pruning in `HoodieFileIndex`

* Make sure no errors are logged upon encountering `AnalysisException`

* Cleaned up Z-index updating sequence;
Tidying up comments, java-docs;

* Fixed Z-index to properly handle changes of the list of clustered columns

* Tidying up

* `lint`

* Suppressing `JavaDocStyle` first sentence check

* Fixed compilation

* Fixing incorrect `DecimalType` conversion

* Refactored test `TestTableLayoutOptimization`
  - Added Z-index table composition test (against fixtures)
  - Separated out GC test;
Tidying up

* Fixed tests re-shuffling column order for Z-Index table `DataFrame` to align w/ the one loaded from JSON

* Scaffolded `DataTypeUtils` to do basic checks of Spark types;
Added proper compatibility checking b/w old/new index-tables

* Added test for Z-index tables merging

* Fixed import being shaded by creating internal `hudi.util` package

* Fixed packaging for `TestOptimizeTable`

* Revised `updateMetadataIndex` seq to provide Z-index updating process w/ source table schema

* Make sure existing Z-index table schema is sync'd to source table's one

* Fixed shaded refs

* Fixed tests

* Fixed type conversion of Parquet provided metadata values into Spark expected schemas

* Fixed `composeIndexSchema` utility to produce the proper schema

* Added more tests for Z-index:
  - Checking that Z-index table is built correctly
  - Checking that Z-index tables are merged correctly (during update)

* Fixing source table

* Fixing tests to read from Parquet w/ proper schema

* Refactored `ParquetUtils` utility reading stats from Parquet footers

* Fixed incorrect handling of Decimals extracted from Parquet footers

* Worked around issues in javac failing to compile stream's collection

* Fixed handling of `Date` type

* Fixed handling of `DateType` to be parsed as `LocalDate`

* Updated fixture;
Make sure test loads Z-index fixture using proper schema

* Removed superfluous schema adjusting when reading from Parquet, since Spark is actually able to perfectly restore the schema (given the Parquet was previously written by Spark as well)

* Fixing race-condition in Parquet's `DateStringifier` trying to share a `SimpleDateFormat` object, which is inherently not thread-safe

* Tidying up

* Make sure schema is used upon reading to validate input files are in the appropriate format;
Tidying up;

* Worked around javac (1.8) inability to infer expression type properly

* Updated fixtures;
Tidying up

* Fixing compilation after rebase

* Assert clustering has run in Z-order layout optimization testing

* Tidying up exception messages

* XXX

* Added test validating Z-index lookup filter correctness

* Added more test-cases;
Tidying up

* Added tests for string expressions

* Fixed incorrect Z-index filter lookup translations

* Added more test-cases

* Added proper handling of complex negations of AND/OR expressions by pushing the NOT operator down into inner expressions (see the sketch after this commit list)

* Added `-target:jvm-1.8` for `hudi-spark` module

* Adding more tests

* Added tests for non-indexed columns

* Properly handle non-indexed columns by falling back to a re-write of the containing expression as `TrueLiteral` instead

* Fixed tests

* Removing the parquet test files and disabling corresponding tests

Co-authored-by: Vinoth Chandar <[email protected]>
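
As referenced above, the NOT push-down works by applying De Morgan's laws so that negation lands on leaf predicates, where a non-indexed column can safely degrade to a `TrueLiteral`. The sketch below uses hypothetical stand-in expression classes, not Hudi's or Spark Catalyst's actual types:

```java
// Hypothetical expression tree; stand-ins for Catalyst expressions.
abstract class Expr {}
class And extends Expr { final Expr l, r; And(Expr l, Expr r) { this.l = l; this.r = r; } }
class Or  extends Expr { final Expr l, r; Or(Expr l, Expr r)  { this.l = l; this.r = r; } }
class Not extends Expr { final Expr c;    Not(Expr c)         { this.c = c; } }
class Leaf extends Expr { final String p; Leaf(String p)      { this.p = p; } }

public class NotPushDownSketch {
  // Push NOT into AND/OR children via De Morgan's laws:
  //   NOT(a AND b) -> NOT(a) OR NOT(b);  NOT(a OR b) -> NOT(a) AND NOT(b)
  static Expr pushNotDown(Expr e) {
    if (e instanceof Not) {
      Expr c = ((Not) e).c;
      if (c instanceof And) {
        return new Or(pushNotDown(new Not(((And) c).l)), pushNotDown(new Not(((And) c).r)));
      }
      if (c instanceof Or) {
        return new And(pushNotDown(new Not(((Or) c).l)), pushNotDown(new Not(((Or) c).r)));
      }
      if (c instanceof Not) {
        return pushNotDown(((Not) c).c); // double negation cancels out
      }
      return e; // NOT over a leaf: left to the leaf-level translation
    }
    if (e instanceof And) {
      return new And(pushNotDown(((And) e).l), pushNotDown(((And) e).r));
    }
    if (e instanceof Or) {
      return new Or(pushNotDown(((Or) e).l), pushNotDown(((Or) e).r));
    }
    return e;
  }
}
```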
- Changes the default config of marker type (HoodieWriteConfig.MARKERS_TYPE or hoodie.write.markers.type) from DIRECT to TIMELINE_SERVER_BASED for the Spark engine.
- Adds engine-specific marker type configs: Spark -> TIMELINE_SERVER_BASED, Flink -> DIRECT, Java -> DIRECT.
- Uses DIRECT markers as well for Spark structured streaming, since the timeline server is only available for the first mini-batch.
- Fixes the marker creation method for non-partitioned tables in TimelineServerBasedWriteMarkers.
- Adds a fallback to direct markers in WriteMarkersFactory even when TIMELINE_SERVER_BASED is configured: when HDFS is used or the embedded timeline server is disabled, direct markers are used instead.
- Fixes the closing of the timeline service.
- Fixes tests that depend on markers, mainly by starting the timeline service for each test.
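
For illustration, a minimal sketch of pinning the marker mechanism explicitly, assuming only the `hoodie.write.markers.type` key quoted above; the properties-based setup is illustrative, not a prescribed API:

```java
import java.util.Properties;

public class MarkerConfigSketch {
  public static void main(String[] args) {
    Properties writeProps = new Properties();
    // Timeline-server-based markers: the new Spark default per this change.
    writeProps.setProperty("hoodie.write.markers.type", "TIMELINE_SERVER_BASED");
    // Flink/Java engines and Spark structured streaming use DIRECT instead:
    // writeProps.setProperty("hoodie.write.markers.type", "DIRECT");
    System.out.println(writeProps.getProperty("hoodie.write.markers.type"));
  }
}
```

Note that, per the fallback described above, TIMELINE_SERVER_BASED silently degrades to DIRECT on HDFS or when the embedded timeline server is disabled.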
…ync (#4129)

* Fix README with current limitations of hive sync

* Fix README with current limitations of hive sync

* Fix dep issue

* Fix Copy on Write flow

Co-authored-by: Rajesh Mahindra <[email protected]>
* Modified BitCaskDiskMap close function

* Moved iterators' cleanup into the finally block

* Update BitCaskDiskMap.java
…ng race for write client & add locking for upgrade (#4114)

Co-authored-by: Sivabalan Narayanan <[email protected]>
…to 'DefaultHoodieRecordPayload' (#4115)" (#4169)

This reverts commit 88067f5.
Alexey Kudinkin and others added 27 commits January 16, 2022 22:46
* [MINOR] Add instructions to build and upload Docker Demo images

* Add local test instruction
…I in FileSystemBackedTableMetadata (#4643)


Co-authored-by: yuezhang <[email protected]>
fengjian428 merged commit c281194 into fengjian428:master on Jan 24, 2022
yihua pushed a commit that referenced this pull request Sep 7, 2024
…ernalWriterHelper::write(...) (apache#10272)

Issue:
There are two configs which, when set in a certain manner, throw exceptions or trip asserts:
1. Configs to disable populating metadata fields (for each row)
2. Configs to drop partition columns (to save storage space) from a row

With #1 and #2, partition paths cannot be deduced using partition columns (as the partition columns are dropped higher up the stack).
BulkInsertDataInternalWriterHelper::write(...) relied on metadata fields to extract the partition path in such cases,
but with #1 that is not possible, resulting in asserts/exceptions.

The fix is to push the dropping of partition columns down the stack, until after the partition path is computed.
The fix manipulates the raw 'InternalRow' structure by copying only the relevant fields into a new 'InternalRow'.
Each row is processed individually to drop the partition columns and copy it to a new 'InternalRow', as sketched below.
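
A hedged illustration of the fix's ordering, using a plain `Object[]` as a stand-in for Spark's `InternalRow`; the helper is hypothetical and only captures the "drop only after deriving the partition path" sequencing:

```java
import java.util.ArrayList;
import java.util.List;

public class DropPartitionColsSketch {

  // Project away the partition columns into a new row, keeping everything else.
  static Object[] dropColumns(Object[] row, int[] partitionColOrdinals) {
    List<Object> kept = new ArrayList<>();
    outer:
    for (int i = 0; i < row.length; i++) {
      for (int ordinal : partitionColOrdinals) {
        if (ordinal == i) {
          continue outer; // skip partition columns; the path was already derived
        }
      }
      kept.add(row[i]);
    }
    return kept.toArray();
  }

  public static void main(String[] args) {
    Object[] row = {"key-1", "2022-01-24", 42};        // (record key, partition col, payload)
    String partitionPath = (String) row[1];            // 1) derive path while the column is present
    Object[] written = dropColumns(row, new int[]{1}); // 2) only then drop it from the written row
    System.out.println(partitionPath + " -> " + written.length + " fields written");
  }
}
```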

Co-authored-by: Vinaykumar Bhat <[email protected]>
yihua pushed a commit that referenced this pull request Sep 7, 2024
There are a couple of issues in how functional indexes are managed.
1. HoodieSparkFunctionalIndexClient::create(...) was failing to register a functional index if a (different) functional
index was already created. Fixed this check by looking up the index name in the FunctionalIndexMetadata.
2. HoodieTableConfig `TABLE_METADATA_PARTITIONS` and `TABLE_METADATA_PARTITIONS_INFLIGHT` should actually store the metadata
partition path. While the path is contained in the `MetadataPartitionType` for most of the indexes, it is not correct for
functional indexes: MetadataPartitionType.FUNCTIONAL_INDEX only stores the prefix (i.e. func_index_). The actual partition
path needs to be extracted from the index name (see the sketch below).
3. Because of #2, most of the helper methods that operate on metadata partitions should take the partition path (and not the partition type).

This PR addresses the problems listed above. This fix is required to add SQL support for secondary indexes (the configs
for which will be based on the functional-index config).

Note that there are still issues with some functional-index operations (like drop index / delete partition)
because of the issues listed here. Those will be fixed in a subsequent PR.
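
A hypothetical sketch of point 2, assuming only the func_index_ prefix quoted above; the names below follow the commit text but are illustrative, not Hudi's actual API:

```java
public class FunctionalIndexPathSketch {

  static final String FUNC_INDEX_PREFIX = "func_index_";

  // Derive the concrete metadata partition path from the index name, since
  // MetadataPartitionType.FUNCTIONAL_INDEX carries only the shared prefix.
  static String partitionPathFor(String indexName) {
    return indexName.startsWith(FUNC_INDEX_PREFIX) ? indexName : FUNC_INDEX_PREFIX + indexName;
  }

  public static void main(String[] args) {
    // Helpers tracking TABLE_METADATA_PARTITIONS should store this path, not the
    // partition type, so distinct functional indexes remain distinguishable.
    System.out.println(partitionPathFor("ts_hour")); // func_index_ts_hour
  }
}
```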

Co-authored-by: Vinaykumar Bhat <[email protected]>