
merge upstream master #1

Merged
merged 2,294 commits into pushpavanthar:master on Feb 2, 2023

Conversation

pushpavanthar
Owner

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick one of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking them into sub-tasks under an umbrella JIRA.

nsivabalan and others added 30 commits November 22, 2022 22:34
* [HUDI-5252] ClusteringCommitSink supports rolling back clustering
…7269)

* For COW table STRICT insert mode, PK uniqueness should be honored irrespective of precombine field.
… path (#6358)

Currently, HoodieParquetReader does not specify the projected schema properly when reading Parquet files, which ends up failing when the provided schema is not equal to the schema of the file being read (even though it might be a proper projection, i.e. a subset of it), as in the common CDC case where a column is dropped from the RDS and the schema of the new batch lacks the old column (a sketch of such a projected read follows the change list below).

To address the original issue described in HUDI-4588, we also have to relax the constraints imposed by the TableSchemaResolver.isSchemaCompatible method, which does not allow columns to be evolved by way of dropping columns.

After addressing the original problem, a considerable number of new issues were discovered, including the following:

 - Writer schemas were deduced differently for different flows (for example, Bulk Insert vs other ops)
 - The writer's schema was not reconciled against the table's schema in terms of nullability (this is asserted within AvroSchemaConverter, which is now invoked as part of the projection)
 - (After enabling schema validations) Incorrect schema handling within HoodieWriteHandle was detected (there were 2 ambiguous schema references, set incorrectly, creating confusion)
 - (After enabling schema validations) Incorrect schema handling within MergeHelper impls, where the writer's schema was used as an existing file's reader schema (failing when these two diverge)

Changes:

 - Adding missing schema projection when reading Parquet file (using AvroParquetReader)
 - Relaxing schema evolution constraints to allow columns to be dropped
 - Revisiting schema reconciliation logic to make sure it's coherent
 - Streamlining schema handling in HoodieSparkSqlWriter to make sure it's uniform for all operations (it isn't applied properly for Bulk-insert at the moment)
 - Added comprehensive test for Basic Schema Evolution (columns being added, dropped)
 - Fixing HoodieWriteHandle impls to properly handle writer schema and avoid duplication
 - Fixing MergeHelper impls to properly handle schema evolution
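A minimal sketch (not the Hudi implementation) of the projected-read idea from the first change above: reading a Parquet file through a narrower Avro schema with `AvroParquetReader`. The `ProjectedParquetRead` object, the path handling, and the use of `AvroReadSupport.setAvroReadSchema` are illustrative assumptions.

```
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}

object ProjectedParquetRead {
  // Read a Parquet file while projecting onto a (possibly narrower) Avro schema.
  // `projectedSchema` may omit columns that exist in the file, e.g. after a column drop.
  def readProjected(path: String, projectedSchema: Schema): Seq[GenericRecord] = {
    val conf = new Configuration()
    // Ask parquet-avro to materialize only the requested columns
    AvroReadSupport.setRequestedProjection(conf, projectedSchema)
    AvroReadSupport.setAvroReadSchema(conf, projectedSchema)
    val reader = AvroParquetReader.builder[GenericRecord](new Path(path)).withConf(conf).build()
    try Iterator.continually(reader.read()).takeWhile(_ != null).toVector
    finally reader.close()
  }
}
```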
…service fails (#7243)

After the files are written, table services like clustering and compaction can fail, which causes the sync to the metaserver not to happen. This patch adds a config which, when set to false, keeps the deltastreamer from failing so that the sync to the metaserver still occurs; a warning is logged with the exception that occurred. To use this new behavior, set hoodie.fail.writes.on.inline.table.service.exception to false.

Co-authored-by: Jonathan Vexler <=>
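A hedged illustration of opting into the new behavior from a Spark datasource write; only the config key quoted in the commit message above comes from the source, while the table name and base path are placeholders.

```
import org.apache.spark.sql.{DataFrame, SaveMode}

object LenientTableServiceWrite {
  // Placeholder table name and base path; only the config key comes from the commit above.
  def write(df: DataFrame, basePath: String): Unit = {
    df.write.format("hudi")
      .option("hoodie.table.name", "my_table")
      // Log a warning and continue instead of failing the write when an inline
      // table service (clustering/compaction) throws.
      .option("hoodie.fail.writes.on.inline.table.service.exception", "false")
      .mode(SaveMode.Append)
      .save(basePath)
  }
}
```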
…cords issue if it contains delta files while still splittable (#7264)
… flows (#7230)

Add good test coverage for some of the core user flows w/ spark data source writes.
)

Addressing invalid semantics of MOR iterators not actually being idempotent: i.e., calling `hasNext` multiple times was advancing the iterator, therefore potentially skipping elements, for example in cases like:

```
// [[isEmpty]] will invoke [[hasNext]] to check whether Iterator has any elements
if (iter.isEmpty) {
  // ...
} else {
  // Here [[map]] will invoke [[hasNext]] again, therefore skipping the elements
  iter.map { /* ... */ }
}
```
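A minimal sketch, independent of Hudi's actual iterators, of the idempotency contract being restored: `hasNext` buffers the next element instead of advancing the underlying iterator on every call.

```
// Wraps any iterator so that repeated `hasNext` calls are side-effect free:
// the next element is buffered rather than the underlying iterator being advanced.
class IdempotentIterator[T](underlying: Iterator[T]) extends Iterator[T] {
  private var buffered: Option[T] = None

  override def hasNext: Boolean = {
    if (buffered.isEmpty && underlying.hasNext) {
      buffered = Some(underlying.next())
    }
    buffered.isDefined
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("next() on empty iterator")
    val elem = buffered.get
    buffered = None
    elem
  }
}
```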
…eIterator (#7340)

* Unify RecordIterator and HoodieParquetReader with ClosableIterator
* Add a factory clazz for RecordIterator
* Add more documents
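A generic sketch of the unified closable-iterator shape described above; the `CloseableRecordIterator` trait and its factory are hypothetical names, not Hudi's `ClosableIterator` API.

```
import java.io.Closeable

// An iterator that also owns a resource (e.g. a file reader) and must be closed by the caller.
trait CloseableRecordIterator[T] extends Iterator[T] with Closeable

object CloseableRecordIterator {
  // Hypothetical factory mirroring the "factory clazz for RecordIterator" mentioned above.
  def wrapping[T](iter: Iterator[T])(onClose: => Unit): CloseableRecordIterator[T] =
    new CloseableRecordIterator[T] {
      override def hasNext: Boolean = iter.hasNext
      override def next(): T = iter.next()
      override def close(): Unit = onClose
    }
}
```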
yihua and others added 29 commits January 28, 2023 22:33
Fixes deploy_staging_jars.sh to generate all hudi-utilities-slim-bundle jars.
Cleaning up some of the recently introduced configs:

 - Shortening file-listing mode override for Spark's FileIndex
 - Fixing Disruptor's write buffer limit config
 - Scoped CANONICALIZE_NULLABLE config to HoodieSparkSqlWriter
)

- Ensures that Hudi CLI commands which require launching Spark can be executed with hudi-cli-bundle
…esystem instance instead (#7685)

Should use `writeClient.getHoodieTable().getHoodieView()` to determine the fileSystemView
…cluding exception) (#7799)

It looks like we fail to close the writeClient in some of the failure cases when writing via spark-ds and spark-sql.
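A minimal, generic sketch of the fix's intent, i.e. closing the write client on failure paths as well; `createWriteClient` and `doWrite` are hypothetical stand-ins for the real calls.

```
object WriteClientLifecycle {
  // `createWriteClient` and `doWrite` are hypothetical stand-ins for the real calls.
  def writeAndClose[C <: AutoCloseable](createWriteClient: () => C)(doWrite: C => Unit): Unit = {
    val client = createWriteClient()
    try doWrite(client)
    finally client.close() // runs on success and on every failure path
  }
}
```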
…m data table timeline w/ metadata reads (#7798)

Fixing the metadata table to read rollback info even with an empty completed rollback meta file.
This change addresses a few performance regressions in `HoodieSparkRecord` identified during our recent benchmarking:

1. `HoodieSparkRecord` rewrites records using `rewriteRecord` and `rewriteRecordWithNewSchema`, which do schema traversals for every record. Instead we should do the schema traversal only once and produce a transformer that directly creates the new record from the old one.

2. `HoodieRecord`s could currently be rewritten multiple times even in cases when just the meta-fields need to be mixed into the schema (in that case, `HoodieSparkRecord` simply wraps the source `InternalRow` into a `HoodieInternalRow` holding the meta-fields). This is problematic because a) `UnsafeProjection` re-uses a mutable row (as a buffer) to avoid allocating small objects, leading to b) recursive overwriting of the same row.

3. Records are currently copied for every Executor, even the Simple one, which does not buffer any records and therefore doesn't require records to be copied.

To address the aforementioned gaps, the following changes have been implemented (a simplified sketch of the first point follows below):

 1. The row-writing utils have been revisited to decouple `RowWriter` generation from its actual application (to the source row); that way the actual application is much more efficient. Additionally, a considerable number of row-writing utilities have been eliminated as they were purely duplicative.

 2. The `HoodieRecord.rewriteRecord` API is renamed to `prependMetaFields` to clearly disambiguate it from `rewriteRecordWithSchema`.

 3. The `WriteHandle` and `HoodieMergeHelper` implementations are substantially simplified and streamlined, accommodating being rebased onto `prependMetaFields`.
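A simplified sketch, using plain Avro `GenericRecord` types, of decoupling the one-time schema traversal from the per-record rewrite as described in point 1; `RecordRewriter` and `buildRewriter` are illustrative names, not Hudi's row-writer utilities.

```
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import scala.collection.JavaConverters._

// Traverse the target schema a single time to build a per-record transformer,
// instead of re-traversing schemas inside a rewrite call for every record.
object RecordRewriter {
  def buildRewriter(source: Schema, target: Schema): GenericRecord => GenericRecord = {
    // One-time traversal: resolve, per target field, where to copy the value from.
    val copiers: Array[GenericRecord => AnyRef] =
      target.getFields.asScala.map { f =>
        Option(source.getField(f.name())) match {
          case Some(srcField) => (r: GenericRecord) => r.get(srcField.pos())
          case None           => (_: GenericRecord) => null.asInstanceOf[AnyRef] // newly added column
        }
      }.toArray

    // Per-record application: no schema traversal, just positional copies.
    (record: GenericRecord) => {
      val out = new GenericData.Record(target)
      var i = 0
      while (i < copiers.length) { out.put(i, copiers(i)(record)); i += 1 }
      out
    }
  }
}
```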
…lt (#7787)

* [HUDI-5646] Guard dropping columns by a config, do not allow by default

* Replaced superfluous `isSchemaCompatible` override by explicitly specifying whether column drop should be allowed;

* Revisited `HoodieSparkSqlWriter` to avoid (unnecessary) schema handling for delete operations

* Remove meta-fields from latest table schema during analysis

* Disable schema validation when partition columns are dropped

---------

Co-authored-by: Alexey Kudinkin <[email protected]>
Co-authored-by: sivabalan <[email protected]>
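A hedged example of explicitly re-enabling column drops after this change; the config key shown is an assumption (verify it against the config classes of your Hudi version), and the table name and base path are placeholders.

```
import org.apache.spark.sql.{DataFrame, SaveMode}

object WriteAllowingColumnDrop {
  // Column drops are disallowed by default after this change; the key below is an
  // assumed name for the guarding config and should be verified against your version.
  def write(df: DataFrame, basePath: String): Unit = {
    df.write.format("hudi")
      .option("hoodie.table.name", "my_table") // placeholder
      .option("hoodie.datasource.write.schema.allow.auto.evolution.column.drop", "true") // assumed key
      .mode(SaveMode.Append)
      .save(basePath)
  }
}
```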
…ource (#7810)

This is restoring existing behavior for the DeltaStreamer Incremental Source, as the change in #7769 removed the _hoodie_partition_path field from the dataset, making it impossible to access from the DS Transformers, for example.
)

This is addressing a misconfiguration of the Kryo object used specifically to serialize Spark's internal structures (like `Expression`s): previously we were using a default `SparkConf` instance to configure it, whereas we should have used the one provided by `SparkEnv`.
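A minimal sketch of the intended wiring, assuming Spark's `KryoSerializer`: build the Kryo instance from the configuration carried by the active `SparkEnv` rather than from a fresh default `SparkConf`. This is not Hudi's actual serializer setup.

```
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkEnv
import org.apache.spark.serializer.KryoSerializer

object KryoFromSparkEnv {
  // Use the configuration actually in effect on this JVM (registrators, reference
  // tracking, etc.) instead of a freshly constructed default SparkConf.
  def newConfiguredKryo(): Kryo =
    new KryoSerializer(SparkEnv.get.conf).newKryo()
}
```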
…fault (#7813)

* Remove `COMBINE_BEFORE_INSERT` config being overridden for insert operations

* Revisited Spark SQL feature configuration to allow a dichotomy of having (see the sketch after this list):
  - (Feature-)specific "default" configuration (that could be overridden by the user)
  - "Overriding" configuration (that could NOT be overridden by the user)

* Restoring existing behavior for Insert Into to deduplicate by default (if pre-combine is specified)

* Fixing compilation

* Fixing compilation (one more time)

* Fixing options combination ordering
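A simplified sketch of the configuration layering described above: feature defaults can be overridden by user-supplied options, while "overriding" configs always win. The map-based composition and the single key shown are illustrative.

```
// Later maps take precedence: user options can override feature defaults,
// but "overriding" configs always win. Keys shown are illustrative.
object SqlWriteOptions {
  def combine(featureDefaults: Map[String, String],
              userOptions: Map[String, String],
              overriding: Map[String, String]): Map[String, String] =
    featureDefaults ++ userOptions ++ overriding

  // e.g. INSERT INTO deduplicates by default when a pre-combine field is configured,
  // yet a user-supplied "false" would still be honored for this key:
  val example: Map[String, String] = combine(
    featureDefaults = Map("hoodie.combine.before.insert" -> "true"),
    userOptions = Map.empty,
    overriding = Map.empty
  )
}
```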
… helper (#7818)

`deduceShuffleParallelism` returns 0 in some situations, which should never occur.
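A minimal sketch of the guard implied above, keeping the deduced shuffle parallelism strictly positive; the fallback order and parameter names are assumptions for illustration only.

```
object ShuffleParallelism {
  // Fallback order and parameter names are assumptions for illustration only.
  def deduce(configured: Int, inputPartitions: Int, sessionDefault: Int): Int = {
    val deduced =
      if (configured > 0) configured
      else if (inputPartitions > 0) inputPartitions
      else sessionDefault
    math.max(deduced, 1) // never return 0
  }
}
```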
pushpavanthar merged commit 49ab579 into pushpavanthar:master on Feb 2, 2023