
[HUDI-4303] Use Hive sentinel value as partition default to avoid casting err #5954

Merged · 1 commit · Jul 23, 2022

Conversation

@codope (Member) commented Jun 24, 2022

What is the purpose of the pull request

When we define a table in Hive with a partitioning column, all NULL values within that column appear as __HIVE_DEFAULT_PARTITION__. It is a sentinel value understood by both Hive and Spark. Previously, we used the string "default" as the default partition value, which resulted in casting errors as described in HUDI-4303 and HUDI-4159. This PR fixes that behavior.
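To illustrate why the literal "default" caused casting errors, here is a minimal sketch (hypothetical function names, not Hudi's or Spark's actual code): query engines cast a partition-path segment back to the partition column's type, and the Hive sentinel is special-cased to NULL while an arbitrary string is not.

```python
# Hypothetical sketch of how a query engine decodes a partition-path
# segment back into a typed partition-column value. Names are
# illustrative; this is not Hudi's or Spark's actual implementation.

HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def parse_partition_value(segment, col_type):
    """Decode one partition-path segment for a column of type col_type."""
    if segment == HIVE_DEFAULT_PARTITION:
        return None  # the sentinel is special-cased to NULL; no cast attempted
    return col_type(segment)  # e.g. int("default") raises ValueError

# The sentinel round-trips cleanly for a typed (e.g. INT) partition column:
print(parse_partition_value(HIVE_DEFAULT_PARTITION, int))  # None
print(parse_partition_value("2022", int))                  # 2022

# ...whereas the old literal "default" cannot be cast to INT:
try:
    parse_partition_value("default", int)
except ValueError as e:
    print("casting error:", e)
```

This is the shape of the failure reported in HUDI-4303/HUDI-4159: the moment the partition column is anything but a string, the "default" segment fails the cast.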

Brief change log


  • Change default partition value to __HIVE_DEFAULT_PARTITION__.
  • Fix tests.


Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codope force-pushed the HUDI-4303-default-partition branch from 2fec052 to 3ac6065 on June 24, 2022 at 08:12
@codope (Member, Author) commented Jun 24, 2022

@danny0405 Can you please review the Flink side of the config change? It used to be a different value previously: https://github.com/apache/hudi/blob/release-0.9.0/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java#L75

@nsivabalan (Contributor) commented:

This might be a backwards-incompatible change, right? Until 0.11.1, records with a null partition value went into the "default" partition; after this patch, they will all go into "__HIVE_DEFAULT_PARTITION__". Did we think of an upgrade step to rewrite those partitions, if any, or of providing a hudi-cli command to fix them?

@nsivabalan added the priority:blocker and hive (Issues related to hive) labels on Jun 24, 2022
@danny0405 (Contributor) commented:

This might be a backwards-incompatible change, right? Until 0.11.1, records with a null partition value went into the "default" partition; after this patch, they will all go into "__HIVE_DEFAULT_PARTITION__". Did we think of an upgrade step to rewrite those partitions, if any, or of providing a hudi-cli command to fix them?

I would consider it a bug.

@danny0405 (Contributor) left a review comment:

+1

@xushiyan (Member) commented Jul 14, 2022

@codope could you rebase and let CI run again?

Important matter: users would need to migrate their tables if they upgrade to 0.12, right? Otherwise, data will be corrupted. We need to think of some migration strategy, maybe via an upgrade handler?

How would Presto/Trino handle this partition path? This default value would couple Hudi tables with Hive and Spark. I feel we should keep engine-agnostic default values, while configuring query engines to interpret them (somehow, if doable).

@codope force-pushed the HUDI-4303-default-partition branch from 3ac6065 to db14123 on July 15, 2022 at 10:28
@codope (Member, Author) commented Jul 15, 2022

Rebased.
@xushiyan @nsivabalan This change is an incompatible one, but it restores the correct default value that we had in previous versions. The default value has changed in the past as well, so it appears this does not affect users in a critical way. Keeping the default value consistent with Hive is important, as Presto, Trino, and Spark all use the same default partition value. As for the incompatibility, I think it would be better to keep this out of the upgrade path; instead, we can write a hudi-cli command.

@danny0405 (Contributor) commented:

There are test failures; maybe we can rebase on the latest master code and run the CI again.

@codope force-pushed the HUDI-4303-default-partition branch from db14123 to 61c0eb6 on July 18, 2022 at 04:56
@nsivabalan (Contributor) commented:

So, do we know why we changed the default after 0.9.0? What change was made, and was any rationale added?

@codope force-pushed the HUDI-4303-default-partition branch 2 times, most recently from a9afcb8 to 4afe206 on July 21, 2022 at 12:09
@yihua (Contributor) commented Jul 21, 2022

@codope looks like IT still fails after rerunning. Could you check the failure?

@apache apache deleted a comment from hudi-bot Jul 21, 2022
@danny0405 (Contributor) commented:

@codope looks like IT still fails after rerunning. Could you check the failure?

The hudi-flink test case is flaky and I'm fixing it in #6181; someone needs to take care of the other flaky tests.

@codope (Member, Author) commented Jul 22, 2022

@codope looks like IT still fails after rerunning. Could you check the failure?

The hudi-flink test case is flaky and I'm fixing it in #6181; someone needs to take care of the other flaky tests.

@danny0405 The test failure in this PR, testHoodieFlinkCompactorWithPlanSelectStrategy, started flaking after #6066 landed. Should we revert the other one?

@codope force-pushed the HUDI-4303-default-partition branch from 4afe206 to 1f1b6c6 on July 22, 2022 at 18:39
@hudi-bot

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

@yihua yihua merged commit a36762a into apache:master Jul 23, 2022
@xushiyan (Member) commented:

@codope @danny0405 let's figure out a smooth upgrade plan. This will severely affect user experience if we just roll it out as is.

@codope (Member, Author) commented Jul 27, 2022

So, do we know why we changed the default after 0.9.0? What change was made, and was any rationale added?

Please check this discussion: #3693 (comment)
The intent of the author of #3693 was to make the default partition value the same in KeygenUtils and PartitionPathEncodeUtils, but __HIVE_DEFAULT_PARTITION__ should have been chosen instead of "default" as the default value when the partition path is null.
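The alignment being described can be sketched as follows (hypothetical names, not the actual Hudi classes): once both code paths derive the partition path from one shared sentinel constant, they cannot diverge the way the two utilities did.

```python
# Illustrative sketch: key generation and partition-path encoding share
# one sentinel constant, so a null/empty partition value maps to the
# same path segment on both code paths. Names are hypothetical.

HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def keygen_partition_path(value):
    """Partition path as derived during key generation."""
    return HIVE_DEFAULT_PARTITION if value in (None, "") else str(value)

def encode_partition_path(value):
    """Partition path as derived during path encoding; same shared rule."""
    return keygen_partition_path(value)

# Both paths agree for null, empty, and regular values:
for v in (None, "", 2022):
    assert keygen_partition_path(v) == encode_partition_path(v)
print(keygen_partition_path(None))  # __HIVE_DEFAULT_PARTITION__
```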

@codope (Member, Author) commented Jul 27, 2022

@codope @danny0405 let's figure out a smooth upgrade plan. This will severely affect user experience if we just roll it out as is.

Working on it. Reopened HUDI-4303.

fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
vinishjail97 pushed a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023
* Revert master (apache#5925)

* Revert "udate"

This reverts commit 092e35c.

* Revert "[HUDI-3475] Initialize hudi table management module."

This reverts commit 4640a3b.

* [HUDI-4279] Strength the remote fs view lagging check when latest commit refresh is enabled (apache#5917)

Signed-off-by: LinMingQiang <[email protected]>

* [minor] following 4270, add unit tests for the keys lost case (apache#5918)

* [HUDI-3508] Add call procedure for FileSystemViewCommand (apache#5929)

* [HUDI-3508] Add call procedure for FileSystemView

* minor

Co-authored-by: jiimmyzhan <[email protected]>

* [HUDI-4299] Fix problem about hudi-example-java run failed on idea. (apache#5936)

* [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups (apache#5941)

* [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups

* Separate out incremental sync fsview test with clustering

* [HUDI-3509] Add call procedure for HoodieLogFileCommand (apache#5949)

Co-authored-by: zhanshaoxiong <[email protected]>

* [HUDI-4273] Support inline schedule clustering for Flink stream (apache#5890)

* [HUDI-4273] Support inline schedule clustering for Flink stream

* delete deprecated clustering plan strategy and add clustering ITTest

* [HUDI-3735] TestHoodieSparkMergeOnReadTableRollback is flaky (apache#5874)

* [HUDI-4260] Change KEYGEN_CLASS_NAME without default value (apache#5877)

* Change KEYGEN_CLASS_NAME without default value

Co-authored-by: [email protected] <loukey_7821>

* [HUDI-3512] Add call procedure for StatsCommand (apache#5955)

Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>

* [TEST][DO_NOT_MERGE]fix random failed for ci (apache#5948)

* Revert "[TEST][DO_NOT_MERGE]fix random failed for ci (apache#5948)" (apache#5971)

This reverts commit e8fbd4d.

* [HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting (apache#5966)

* Fixed Dictionary encoding config not being properly propagated to Parquet writer (making it unable to apply it, substantially bloating the storage footprint)

* [HUDI-4296] Fix the bug that TestHoodieSparkSqlWriter.testSchemaEvolutionForTableType is flaky (apache#5973)

* [HUDI-3502] Support hdfs parquet import command based on Call Produce Command (apache#5956)

* [MINOR] Remove -T option from CI build (apache#5972)

* [HUDI-5246] Bumping mysql connector version due to security vulnerability (apache#5851)

* [HUDI-4309] Spark3.2 custom parser should not throw exception (apache#5947)

* [HUDI-4316] Support for spillable diskmap configuration when constructing HoodieMergedLogRecordScanner (apache#5959)

* [HUDI-4315] Do not throw exception in BaseSpark3Adapter#toTableIdentifier (apache#5957)

* [HUDI-3504] Support bootstrap command based on Call Produce Command (apache#5977)

* [HUDI-4311] Fix Flink lose data on some rollback scene (apache#5950)

* [HUDI-4291] Fix flaky TestCleanPlanExecutor#testKeepLatestFileVersions (apache#5930)

* [HUDI-3506] Add call procedure for CommitsCommand (apache#5974)

* [HUDI-3506] Add call procedure for CommitsCommand

Co-authored-by: superche <[email protected]>

* [HUDI-4325] fix spark sql procedure cause ParseException with semicolon (apache#5982)

* [HUDI-4325] fix saprk sql procedure cause ParseException with semicolon

* [HUDI-4333] fix HoodieFileIndex's listFiles method log print skipping percent NaN (apache#5990)

* [HUDI-4332] The current instant may be wrong under some extreme conditions in AppendWriteFunction. (apache#5988)

* [HUDI-4320] Make sure `HoodieStorageConfig.PARQUET_WRITE_LEGACY_FORMAT_ENABLED` could be specified by the writer (apache#5970)

Fixed sequence determining whether Parquet's legacy-format writing property should be overridden to only kick in when it has not been explicitly specified by the caller

* [HUDI-1176] Upgrade hudi to log4j2 (apache#5366)

* Move to log4j2

cr: https://code.amazon.com/reviews/CR-71010705

* Upgrade unit tests to log4j2

* update exclusion

Co-authored-by: Brandon Scheller <[email protected]>

* [HUDI-4334] close SparkRDDWriteClient after usage in Create/Delete/RollbackSavepointsProcedure (apache#5994)

* [HUDI-1575] Claim RFC-56: Early Conflict Detection For Multi-writer (apache#6002)

Co-authored-by: yuezhang <[email protected]>

* [MINOR] Make CLI 'commit rollback' using rollbackUsingMarkers false as default (apache#5174)

Co-authored-by: yuezhang <[email protected]>

* [HUDI-4331] Allow loading external config file from class loader (apache#5987)

Co-authored-by: Wenning Ding <[email protected]>

* [HUDI-4336] Fix records overwritten bug with binary primary key (apache#5996)

* [MINOR] Following apache#2070, Fix BindException when running tests on shared machines. (apache#5951)

* [HUDI-4346] Fix params not update BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED (apache#5999)

* [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeseria… (apache#5907)

* [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer

* add ut

Co-authored-by: wangzixuan.wzxuan <[email protected]>

* [HUDI-3984] Remove mandatory check of partiton path for cli command (apache#5458)

* [HUDI-3634] Could read empty or partial HoodieCommitMetaData in downstream if using HDFS (apache#5048)

Add the differentiated logic of creating immutable file in HDFS by first creating the file.tmp and then renaming the file

* [HUDI-3953]Flink Hudi module should support low-level source and sink api (apache#5445)

Co-authored-by: jerryyue <[email protected]>

* [HUDI-4353] Column stats data skipping for flink (apache#6026)

* [HUDI-3505] Add call procedure for UpgradeOrDowngradeCommand (apache#6012)

Co-authored-by: superche <[email protected]>

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5854)

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <[email protected]>
Co-authored-by: jian.feng <[email protected]>

* [HUDI-3511] Add call procedure for MetadataCommand (apache#6018)

* [HUDI-3730] Add ConfigTool#toMap UT (apache#6035)

Co-authored-by: voonhou.su <[email protected]>

* [MINOR] Improve variable names (apache#6039)

* [HUDI-3116]Add a new HoodieDropPartitionsTool to let users drop table partitions through a standalone job. (apache#4459)

Co-authored-by: yuezhang <[email protected]>

* [HUDI-4360] Fix HoodieDropPartitionsTool based on refactored meta sync (apache#6043)

* [HUDI-3836] Improve the way of fetching metadata partitions from table (apache#5286)

Co-authored-by: xicm <[email protected]>

* [HUDI-4359] Support show_fs_path_detail command on Call Produce Command (apache#6042)

* [HUDI-4356] Fix the error when sync hive in CTAS (apache#6029)

* [HUDI-4219] Merge Into when update expression "col=s.col+2" on precombine cause exception (apache#5828)

* [HUDI-4357] Support flink 1.15.x (apache#6050)

* [HUDI-4152] Flink offline compaction support compacting multi compaction plan at once (apache#5677)

* [HUDI-4152] Flink offline compaction allow compact multi compaction plan at once

* [HUDI-4152] Fix exception for duplicated uid when multi compaction plan are compacted

* [HUDI-4152] Provider UT & IT for compact multi compaction plan

* [HUDI-4152] Put multi compaction plans into one compaction plan source

* [HUDI-4152] InstantCompactionPlanSelectStrategy allow multi instant by using comma

* [HUDI-4152] Add IT for InstantCompactionPlanSelectStrategy

* [HUDI-4309] fix spark32 repartition error (apache#6033)

* [HUDI-4366] Synchronous cleaning for flink bounded source (apache#6051)

* [minor] following 4152, refactor the clazz about plan selection strategy (apache#6060)

* [HUDI-4367] Support copyToTable on call (apache#6054)

* [HUDI-4335] Bug fixes in AWSGlueCatalogSyncClient post schema evolution. (apache#5995)

* fix for updateTableParameters which is not excluding partition columns and updateTableProperties boolean check

* Fix - serde parameters getting overrided on table property update

* removing stale syncConfig

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (apache#6017)

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields.

* fix comments

Co-authored-by: public (bdcee5037027) <[email protected]>

* [HUDI-3500] Add call procedure for RepairsCommand (apache#6053)

* [HUDI-2150] Rename/Restructure configs for better modularity (apache#6061)

- Move clean related configuration to HoodieCleanConfig
- Move Archival related configuration to HoodieArchivalConfig
- hoodie.compaction.payload.class move this to HoodiePayloadConfig

* [MINOR] Bump xalan from 2.7.1 to 2.7.2 (apache#6062)

Bumps xalan from 2.7.1 to 2.7.2.

---
updated-dependencies:
- dependency-name: xalan:xalan
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)

* [HUDI-4324] Remove use_jdbc config from hudi sync
* Users should use HIVE_SYNC_MODE instead

* [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs (apache#5695)

* [HUDI-4146] RFC for Improve Hive/Meta sync class design and hierarchies

Co-authored-by: jian.feng <[email protected]>
Co-authored-by: Raymond Xu <[email protected]>

* [HUDI-4323] Make database table names optional in sync tool (apache#6073)

* [HUDI-4323] Make database table names optional in sync tool
* Infer from these properties from the table config

* [MINOR] Update RFCs status (apache#6078)

* [HUDI-4298] When reading the mor table with QUERY_TYPE_SNAPSHOT,Unabl… (apache#5937)

* [HUDI-4298] Add test case for reading mor table

Signed-off-by: LinMingQiang <[email protected]>

* [HUDI-4379] Bump Flink versions to 1.14.5 and 1.15.1 (apache#6080)

* [HUDI-4391] Incremental read from archived commits for flink (apache#6096)

* [RFC-51] [HUDI-3478] Hudi to support Change-Data-Capture (apache#5436)



Co-authored-by: Raymond Xu <[email protected]>

* [HUDI-4393] Add marker file for target file when flink merge handle rolls over (apache#6103)

* [HUDI-4399][RFC-57] Claim RFC 57 for DeltaStreamer proto support (apache#6112)

* [HUDI-4397] Flink Inline Cluster and Compact plan distribute strategy changed from rebalance to hash to avoid potential multiple threads accessing the same file (apache#6106)

Co-authored-by: jerryyue <[email protected]>

* [MINOR] Disable TestHiveSyncGlobalCommitTool (apache#6119)

* [HUDI-4403] Fix the end input metadata for bounded source (apache#6116)

* [HUDI-4408] Reuse old rollover file as base file for flink merge handle (apache#6120)

* [HUDI-3503]  Add call procedure for CleanCommand (apache#6065)

* [HUDI-3503] Add call procedure for CleanCommand
Co-authored-by: simonssu <[email protected]>

* [HUDI-4249] Fixing in-memory `HoodieData` implementation to operate lazily  (apache#5855)

* [HUDI-4170] Make user can use hoodie.datasource.read.paths to read necessary files (apache#5722)

* Rebase codes

* Move listFileSlices to HoodieBaseRelation

* Fix review

* Fix style

* Fix bug

* Fix file group count issue with metadata partitions (apache#5892)

* [HUDI-4098] Support HMS for flink HudiCatalog (apache#6082)

* [HUDI-4098]Support HMS for flink HudiCatalog

* [HUDI-4409] Improve LockManager wait logic when catch exception (apache#6122)

* [HUDI-4065] Add FileBasedLockProvider (apache#6071)

* [HUDI-4416] Default database path for hoodie hive catalog (apache#6136)

* [HUDI-4372] Enable matadata table by default for flink (apache#6066)

* [HUDI-4401] Skip HBase version check (apache#6114)

* Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

* [HUDI-4427] Add a computed column IT test (apache#6150)

* [HUDI-4146][RFC-55] Update config changes proposal (apache#6162)

* [HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (apache#5428)

Currently, all Hudi Relations bear performance gap relative to Spark's HadoopFsRelation 
and the reason to that is SchemaPruning optimization rule (pruning nested schemas) 
that is unfortunately predicated on usage of HadoopFsRelation, meaning that it's 
not applied in cases when any other relation is used.

This change is porting this rule to Hudi relations (MOR, Incremental, etc) 
by the virtue of leveraging HoodieSparkSessionExtensions mechanism 
injecting modified version of the original SchemaPruning rule 
that is adopted to work w/ Hudi's custom relations.

- Added customOptimizerRules to HoodieAnalysis
- Added NestedSchemaPrunning Spark's Optimizer rule
- Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas)
- Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions
- Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource
- Disabled fallback to HadoopFsRelation

* [HUDI-3764] Allow loading external configs while querying Hudi tables with Spark (apache#4915)

Currently when doing Hudi queries w/ Spark, it won't 
load the external configurations. Say if customers enabled 
metadata listing in their global config file, then this would 
let them actually query w/o metadata feature enabled. 
This PR fixes this issue and allows loading global 
configs during the Hudi reading phase.

Co-authored-by: Wenning Ding <[email protected]>

* [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (apache#5470)

* [MINOR] Add logger for HoodieCopyOnWriteTableInputFormat (apache#6161)

Co-authored-by: Wenning Ding <[email protected]>

* [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table (apache#6113)

Fixes the missing bloom filters in metadata table in the non-partitioned table due to incorrect record key generation, because of wrong file names when generating the metadata payload for the bloom filter.

* [HUDI-4204] Fixing NPE with row writer path and with OCC (apache#5850)

* [HUDI-4247] Upgrading protocol buffers version for presto bundle (apache#5852)

* [MINOR] Fix result missing information issue in commits_compare Procedure (apache#6165)

Co-authored-by: superche <[email protected]>

* [HUDI-4404] Fix insert into dynamic partition write misalignment (apache#6124)

* [MINOR] Fallback to default for hive-style partitioning, url-encoding configs (apache#6175)

- Fixes broken ITTestHoodieDemo#testParquetDemo

* [MINOR] Fix CI issue with TestHiveSyncTool (apache#6110)

* [HUDI-4039] Make sure all builtin `KeyGenerator`s properly implement Spark specific APIs (apache#5523)

This set of changes makes sure that all builtin KeyGenerators properly implement Spark-specific APIs in a performant way (minimizing key-generators overhead)

* [MINOR] Disable Flink compactor IT test (apache#6189)

* Revert "[MINOR] Fix CI issue with TestHiveSyncTool (apache#6110)" (apache#6192)

This reverts commit d5c904e.

* [HUDI-3979] Optimize out mandatory columns when no merging is performed (apache#5430)

For MOR, when no merging is performed there is no point in reading either primary-key or pre-combine-key values (unless query is referencing these). Avoiding reading these allows to potentially save substantial resources wasted for reading it out.

* [HUDI-4303] Use Hive sentinel value as partition default to avoid type caste issues (apache#5954)

* Revert "[HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)" (apache#6160)

This reverts commit 046044c.

* [HUDI-4435] Fix Avro field not found issue introduced by Avro 1.10 (apache#6155)

Co-authored-by: Wenning Ding <[email protected]>

* [HUDI-4437] Fix test conflicts by clearing file system cache (apache#6123)

Co-authored-by: jian.feng <[email protected]>
Co-authored-by: jian.feng <[email protected]>
Co-authored-by: Raymond Xu <[email protected]>

* [HUDI-4436] Invalidate cached table in Spark after write (apache#6159)

Co-authored-by: Ryan Pifer <[email protected]>

* [MINOR] Fix Call Procedure code style (apache#6186)

* Fix Call Procedure code style.
Co-authored-by: superche <[email protected]>

* [MINOR] Bump CI timeout to 150m (apache#6198)

* [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partition column is missing from schema (apache#6163)

Co-authored-by: Ryan Pifer <[email protected]>

* [HUDI-4071] Make NONE sort mode as default for bulk insert (apache#6195)

* [HUDI-4420] Fixing table schema delineation on partition/data schema for Spark relations  (apache#5708)

* [HUDI-4448] Remove the latest commit refresh for timeline server (apache#6179)

* [HUDI-4450] Revert the checkpoint abort notification (apache#6181)

* [HUDI-4439] Fix Amazon CloudWatch reporter for metadata enabled tables (apache#6164)

Co-authored-by: Udit Mehrotra <[email protected]>
Co-authored-by: Y Ethan Guo <[email protected]>

* [HUDI-4348] fix merge into sql data quality in concurrent scene (apache#6020)

* [HUDI-3510] Add sync validate procedure (apache#6200)

* [HUDI-3510] Add sync validate procedure

Co-authored-by: simonssu <[email protected]>

* [MINOR] Fix typos in Spark client related classes (apache#6204)

* [HUDI-4456] Close FileSystem in SparkClientFunctionalTestHarness  (apache#6201)

* [MINOR] Only log stdout output for non-zero exit from commands in IT (apache#6199)

* [HUDI-4458] Add a converter cache for flink ColumnStatsIndices (apache#6205)

* [HUDI-4071] Match ROLLBACK_USING_MARKERS_ENABLE in sql as datasource (apache#6206)

Co-authored-by: superche <[email protected]>

* [HUDI-4455] Improve test classes for TestHiveSyncTool (apache#6202)

Improve HiveTestService, HiveTestUtil, and related classes.

* [HUDI-4456] Clean up test resources (apache#6203)

* [HUDI-3884] Support archival beyond savepoint commits (apache#5837)


Co-authored-by: sivabalan <[email protected]>

* [HUDI-4250][HUDI-4202] Optimize performance of Column Stats Index reading in Data Skipping  (apache#5746)

We provide an alternative way of fetching Column Stats Index within the reading process to avoid the penalty of a more heavy-weight execution scheduled through a Spark engine.

* [HUDI-4471] Relocate AWSDmsAvroPayload class to hudi-common

* [HUDI-4474] Infer metasync configs (apache#6217)

- infer repeated sync configs from original configs
  - `META_SYNC_BASE_FILE_FORMAT`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.BASE_FILE_FORMAT`
  - `META_SYNC_ASSUME_DATE_PARTITION`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ASSUME_DATE_PARTITIONING`
  - `META_SYNC_DECODE_PARTITION`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.URL_ENCODE_PARTITIONING`
  - `META_SYNC_USE_FILE_LISTING_FROM_METADATA`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE`

As proposed in https://github.com/apache/hudi/blob/master/rfc/rfc-55/rfc-55.md#compatible-changes

* [HUDI-4210] Create custom hbase index to solve data skew issue on hbase regions (apache#5797)

* [HUDI-3730] Keep metasync configs backward compatible (apache#6221)

* [HUDI-4469] Flip reuse flag to true in HoodieBackedTableMetadata to improve file listing (apache#6214)

* [HUDI-4186] Support Hudi with Spark 3.3.0 (apache#5943)

Co-authored-by: Shawn Chang <[email protected]>

* [HUDI-4126] Disable file splits for Bootstrap real time queries (via InputFormat) (apache#6219)


Co-authored-by: Udit Mehrotra <[email protected]>
Co-authored-by: Raymond Xu <[email protected]>

* [HUDI-4490] Make AWSDmsAvroPayload class backwards compatible (apache#6229)

Co-authored-by: Rahil Chertara <[email protected]>

* [HUDI-4484] Add default lock config options for flink metadata table (apache#6222)

* [HUDI-4494] keep the fields' order when data is written out of order (apache#6233)

* [MINOR] Minor changes around Spark 3.3 support (apache#6231)

Co-authored-by: Shawn Chang <[email protected]>

* [HUDI-4081][HUDI-4472] Addressing Spark SQL vs Spark DS performance gap (apache#6213)

* [HUDI-4495] Fix handling of S3 paths incompatible with java URI standards (apache#6237)

* [HUDI-4499] Tweak default retry times for flink metadata table lock (apache#6238)

* [HUDI-4221] Optimzing getAllPartitionPaths  (apache#6234)

- Levering spark par for dir processing

* Moving to 0.13.0-SNAPSHOT on master branch.

* [HUDI-4504] Disable metadata table by default for flink (apache#6241)

* [HUDI-4505] Returns instead of throws if lock file exists for FileSystemBasedLockProvider (apache#6242)

To avoid unnecessary exception throws

* [HUDI-4507] Improve file name extraction logic in metadata utils (apache#6250)

* [MINOR] Fix convertPathWithScheme tests (apache#6251)

* [MINOR] Add license header (apache#6247)

Add license header to TestConfigUtils

* [HUDI-4025] Add Presto and Trino query node to validate queries (apache#5578)

* Add Presto and Trino query nodes to hudi-integ-test
* Add yamls for query validation
* Add presto-jdbc and trino-jdbc to integ-test-bundle

* [HUDI-4518] Free lock if allocated but not acquired (apache#6272)

If the lock is not null but its state has not yet transitioned to 
ACQUIRED, retry fails because the lock is not de-allocated. 
See issue apache#5702

* [HUDI-4510] Repair config "hive_sync.metastore.uris" in flink sql hive schema sync is not effective (apache#6257)

* [HUDI-3848] Fixing minor bug in listing based rollback request generation (apache#6244)

* [HUDI-4512][HUDI-4513] Fix bundle name for spark3 profile (apache#6261)

* [HUDI-4501] Throwing exception when restore is attempted with hoodie.arhive.beyond.savepoint is enabled (apache#6239)

* [HUDI-4516] fix Task not serializable error when run HoodieCleaner after one failure (apache#6265)


Co-authored-by: jian.feng <[email protected]>

* remove test resources (apache#6147)

Co-authored-by: root <[email protected]>

* [HUDI-4477] Adjust partition number of flink sink task (apache#6218)

Co-authored-by: lewinma <[email protected]>

* [HUDI-4298] Mor table reading for base and log files lost sequence of events (apache#6286)

* [HUDI-4298] Mor table reading for base and log files lost sequence of events

Signed-off-by: HunterXHunter <[email protected]>

* [HUDI-4525] Fixing Spark 3.3 `AvroSerializer` implementation (apache#6279)

* [HUDI-4447] fix no partitioned path extractor error when sync meta (apache#6263)

* [HUDI-4520] Support qualified table 'db.table' in call procedures (apache#6274)

* [HUDI-4531] Wrong partition path for flink hive catalog when the partition fields are not in the last (apache#6292)

* [HUDI-4487] support to create ro/rt table by spark sql (apache#6262)

* [HUDI-4533] Fix RunCleanProcedure's ArrayIndexOutOfBoundsException (apache#6293)

* [HUDI-4536] ClusteringOperator causes the NullPointerException when writing with BulkInsertWriterHelper in clustering (apache#6298)

* [HUDI-4385] Support online compaction in the flink batch mode write (apache#6093)

* [HUDI-4385] Support online compaction in the flink batch mode write

Signed-off-by: HunterXHunter <[email protected]>

* [HUDI-4530] fix default payloadclass in mor is different with cow (apache#6288)

* [HUDI-4545] Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload (apache#6306)

* [HUDI-4544] support retain hour cleaning policy for flink (apache#6300)

* [HUDI-4547] Fix SortOperatorGen sort indices (apache#6309)

Signed-off-by: HunterXHunter <[email protected]>

* [HUDI-4470] Remove spark dataPrefetch disabled prop in DefaultSource

* [HUDI-4540] Cover different table types in functional tests of Spark structured streaming (apache#6317)

* [HUDI-4514] optimize CTAS to adapt to saveAsTable api in different modes (apache#6295)

* [HUDI-4474] Fix inferring props for meta sync (apache#6310)

- HoodieConfig#setDefaults looks up declared fields, so 
  should pass static class for reflection, otherwise, subclasses 
  of HoodieSyncConfig won't set defaults properly.
- Pass all write client configs of deltastreamer to meta sync
- Make org.apache.hudi.hive.MultiPartKeysValueExtractor 
  default for deltastreamer, to align with SQL and flink

* [HUDI-4550] Fallback to listing based rollback for completed instant (apache#6313)

Ideally, rollback is not triggered for completed instants. 
However, if it gets triggered due to some extraneous condition 
or forced while rollback strategy still configured to be marker-based, 
then fallback to listing-based rollback instead of failing.

- CTOR changes in rollback plan and action executors.
- Change in condition to determine whether to use marker-based rollback.
- Added UT to cover the scenario.

* [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value (apache#6248)

- Added FourToFiveUpgradeHandler to detect hudi tables with "default" partition and throw an exception.
- Added a new write config ("hoodie.skip.default.partition.validation") which, when enabled, bypasses the above validation. If users have a hudi table where the "default" partition was created intentionally and not as a sentinel, they can enable this config to get past the validation.
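The sentinel substitution at the heart of HUDI-4303 can be sketched as follows; the helper name is hypothetical, while the sentinel string itself is the value understood by both Hive and Spark:

```java
public class PartitionDefaults {
    // Sentinel value both Hive and Spark recognize for null/empty
    // partition values (replaces the old "default" string).
    public static final String HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__";

    // Hypothetical helper: substitute the sentinel when the partition
    // value is null or empty, so typed partition columns don't hit
    // casting errors on a literal like "default".
    public static String toStoredPartition(String value) {
        return (value == null || value.isEmpty()) ? HIVE_DEFAULT_PARTITION : value;
    }
}
```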

* [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis (apache#6307)

* [HUDI-4534] Fixing upgrade to reload Metaclient for deltastreamer writes (apache#6296)

* [HUDI-4517] If no marker type file, fallback to timeline based marker (apache#6266)

- If the MARKERS.type file is not present, the logic assumes that direct markers are stored, which causes read failures in certain cases even when the timeline-server-based marker is enabled. This PR handles the failure by falling back to the timeline-based marker in such cases.

* [HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles… (apache#5884)

- Adding request retry to RemoteHoodieTableFileSystemView. Users can enable using the new configs added.

* [HUDI-4464] Clear warnings in Azure CI (apache#6210)


Co-authored-by: jian.feng <[email protected]>

* [MINOR] Update PR description template (apache#6323)

* [HUDI-4508] Repair the exception when reading optimized query for mor in hive and presto/trino (apache#6254)

In a MOR table, a file slice may have only log files and no base file 
before the file slice is compacted. In this case, a read-optimized 
query matches the condition !baseFileOpt.isPresent() in HoodieCopyOnWriteTableInputFormat.createFileStatusUnchecked() 
and throws IllegalStateException.

Instead of throwing an exception, 
it is more suitable to return nothing for such a file slice.
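A minimal sketch of the Optional-based handling, with illustrative names rather than the actual Hudi API:

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class ReadOptimizedListing {
    // Instead of throwing when a file slice has only log files (no base
    // file yet, pre-compaction), a read-optimized view can simply return
    // an empty listing for that slice.
    public static List<String> baseFilesToRead(Optional<String> baseFileOpt) {
        return baseFileOpt.map(Collections::singletonList)
                          .orElse(Collections.emptyList());
    }
}
```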

Co-authored-by: sivabalan <[email protected]>

* [HUDI-4548] Unpack the column max/min to string instead of Utf8 for Mor table (apache#6311)

* [HUDI-4447] fix SQL metasync when performing delete table operation (apache#6180)

* [HUDI-4424] Add new compaction trigger strategy: NUM_COMMITS_AFTER_REQ… (apache#6144)

* [MINOR] improve flink dummySink's parallelism (apache#6325)

* [HUDI-4568] Shade dropwizard metrics-core in hudi-aws-bundle (apache#6327)

* [HUDI-4572] Fix 'Not a valid schema field: ts' error in HoodieFlinkCompactor if precombine field is not ts (apache#6331)

Co-authored-by: jian.feng <[email protected]>

* [HUDI-4570] Fix hive sync path error due to reuse of storage descriptors. (apache#6329)

* [HUDI-4571] Fix partition extractor infer function when partition field mismatch (apache#6333)

Infer META_SYNC_PARTITION_FIELDS and 
META_SYNC_PARTITION_EXTRACTOR_CLASS 
from hoodie.table.partition.fields first. 
If not set, then from hoodie.datasource.write.partitionpath.field.

Co-authored-by: Raymond Xu <[email protected]>

* [HUDI-4570] Add test for updating multiple partitions in hive sync (apache#6340)

* [MINOR] Fix wrong key to determine sync sql cascade (apache#6339)

* [HUDI-4581] Claim RFC-58 for data skipping integration with query engines (apache#6346)

* [HUDI-4577] Adding test coverage for `DELETE FROM`, Spark Quickstart guide (apache#6318)

* [HUDI-4556] Improve functional test coverage of column stats index (apache#6319)

* [HUDI-4558] Fix lost 'hoodie.table.keygenerator.class' in hoodie.properties (apache#6320)

Co-authored-by: 吴文池 <[email protected]>

* [HUDI-4543] Support natural order when table schema contains a field named 'ts' (apache#6246)

* be able to disable precombine field when table schema contains a field named ts

Co-authored-by: jian yonghua <[email protected]>

* [HUDI-4569][RFC-58] Claim RFC-58 for adding a new feature named 'Multiple event_time Fields Latest Verification in a Single Table' for Hudi (apache#6328)

Co-authored-by: XinyaoTian <[email protected]>

* [HUDI-3503] Support more features in call procedure CleanCommand (apache#6353)

* [HUDI-4590] Add hudi-aws dependency to hudi-flink-bundle. (apache#6356)

* [MINOR] fix potential npe in spark writer (apache#6363)

Co-authored-by: zhanshaoxiong <[email protected]>

* fix bug in cli show fsview all (apache#6314)

* [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency (apache#6228)

* [HUDI-4611] Fix the duplicate creation of config in HoodieFlinkStreamer (apache#6369)

Co-authored-by: linfey <[email protected]>

* [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table (apache#6141)

* Spark support MOR read archived commits for incremental query

* [MINOR] fix progress field calculation logic in HoodieLogRecordReader (apache#6291)

* [HUDI-4608] Fix upgrade command in Hudi CLI (apache#6374)

* [HUDI-4609] Improve usability of upgrade/downgrade commands in Hudi CLI (apache#6377)

* [HUDI-4574] Fixed timeline based marker thread safety issue (apache#6383)

* fixed timeline based markers thread safety issue
* add document for TimelineBasedMarkers thread safety issues

* [HUDI-4621] Add validation that bucket index fields should be subset of primary keys (apache#6396)

* check bucket index fields

Co-authored-by: 吴文池 <[email protected]>

* [HUDI-4354] Add --force-empty-sync flag to deltastreamer (apache#6027)

* [HUDI-4601] Read error from MOR table after compaction with timestamp partitioning (apache#6365)

* read error from mor after compaction

Co-authored-by: 吴文池 <[email protected]>

* [MINOR] Update DOAP with 0.12.0 Release (apache#6413)

* [HUDI-4529] Tweak some default config options for flink (apache#6287)

* [HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)

* [HUDI-4551] Tweak the default parallelism of flink pipeline to execution env  parallelism (apache#6312)

* [MINOR] Improve code style of CLI Command classes (apache#6427)

* [HUDI-3625] Claim RFC-60 for Federated Storage Layer (apache#6440)

* [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar (apache#6386)

- Adding PulsarSource to DeltaStreamer to support ingesting from Apache Pulsar.
- Current implementation of PulsarSource is relying on "pulsar-spark-connector" to ingest using Spark instead of building similar pipeline from scratch.

* [HUDI-3579] Add timeline commands in hudi-cli (apache#5139)

* [HUDI-4638] Rename payload clazz and preCombine field options for flink sql (apache#6434)

* Revert "[HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)" (apache#6449)

This reverts commit 9055b2f.

* [HUDI-4643] MergeInto syntax WHEN MATCHED is optional but must be set (apache#6443)

* [HUDI-4644] Change default flink profile to 1.15.x (apache#6445)

* [HUDI-4678] Claim RFC-61 for Snapshot view management (apache#6461)

Co-authored-by: jian.feng <[email protected]>

* [HUDI-4676] infer cleaner policy when write concurrency mode is OCC (apache#6459)

* [HUDI-4676] infer cleaner policy when write concurrency mode is OCC
Co-authored-by: jian.feng <[email protected]>

* [HUDI-4683] Use enum class value for default value in flink options (apache#6453)

* [HUDI-4584] Cleaning up Spark utilities (apache#6351)

Cleans up Spark utilities and removes duplication

* [HUDI-4686] Flip option 'write.ignore.failed' to default false (apache#6467)

Also fix the flaky test

* [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy (apache#6267)

* [HUDI-4637] Fix release thread in RateLimiter not being terminated (apache#6433)

* [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid conflicts with flink table core (apache#6481)

* [HUDI-4687] Add show_invalid_parquet procedure (apache#6480)

Co-authored-by: zhanshaoxiong <[email protected]>

* [HUDI-4584] Fixing `SQLConf` not being propagated to executor (apache#6352)

Fixes `HoodieSparkUtils.createRDD` to make sure `SQLConf` is properly propagated to the executor (required by `AvroSerializer`)

* [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies (apache#6170)

* Merge OSS master

* resolve build issues

* fix checkstyle issue

* [HUDI-4665] Flipping default for "ignore failed batch" config in streaming sink to false (apache#6450)

* [HUDI-4713] Fix flaky ITTestHoodieDataSource#testAppendWrite (apache#6490)

* add back in internal customization for s3EventsHoodieIncrSource

* [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat (apache#6494)

* Revert "[HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles… (apache#5884)" (apache#6501)

This reverts commit 660177b.

Signed-off-by: LinMingQiang <[email protected]>
Signed-off-by: HunterXHunter <[email protected]>
Co-authored-by: Zhaojing Yu <[email protected]>
Co-authored-by: LinMingQiang <[email protected]>
Co-authored-by: Danny Chan <[email protected]>
Co-authored-by: jiz <[email protected]>
Co-authored-by: jiimmyzhan <[email protected]>
Co-authored-by: Forus <[email protected]>
Co-authored-by: Sagar Sumit <[email protected]>
Co-authored-by: xi chaomin <[email protected]>
Co-authored-by: luokey <[email protected]>
Co-authored-by: zhanshaoxiong <[email protected]>
Co-authored-by: xiarixiaoyao <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>
Co-authored-by: ForwardXu <[email protected]>
Co-authored-by: Shiyan Xu <[email protected]>
Co-authored-by: Sivabalan Narayanan <[email protected]>
Co-authored-by: cxzl25 <[email protected]>
Co-authored-by: leesf <[email protected]>
Co-authored-by: 吴祥平 <[email protected]>
Co-authored-by: superche <[email protected]>
Co-authored-by: superche <[email protected]>
Co-authored-by: KnightChess <[email protected]>
Co-authored-by: BruceLin <[email protected]>
Co-authored-by: bschell <[email protected]>
Co-authored-by: Brandon Scheller <[email protected]>
Co-authored-by: Teng <[email protected]>
Co-authored-by: YueZhang <[email protected]>
Co-authored-by: yuezhang <[email protected]>
Co-authored-by: yuezhang <[email protected]>
Co-authored-by: wenningd <[email protected]>
Co-authored-by: Wenning Ding <[email protected]>
Co-authored-by: luoyajun <[email protected]>
Co-authored-by: RexAn <[email protected]>
Co-authored-by: komao <[email protected]>
Co-authored-by: wangzixuan.wzxuan <[email protected]>
Co-authored-by: miomiocat <[email protected]>
Co-authored-by: JerryYue-M <[email protected]>
Co-authored-by: jerryyue <[email protected]>
Co-authored-by: jian.feng <[email protected]>
Co-authored-by: jian.feng <[email protected]>
Co-authored-by: voonhous <[email protected]>
Co-authored-by: voonhou.su <[email protected]>
Co-authored-by: Y Ethan Guo <[email protected]>
Co-authored-by: xicm <[email protected]>
Co-authored-by: 董可伦 <[email protected]>
Co-authored-by: shenjiayu17 <[email protected]>
Co-authored-by: Lanyuanxiaoyao <[email protected]>
Co-authored-by: 苏承祥 <[email protected]>
Co-authored-by: Kumud Kumar Srivatsava Tirupati <[email protected]>
Co-authored-by: liujinhui <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: 冯健 <[email protected]>
Co-authored-by: Luning (Lucas) Wang <[email protected]>
Co-authored-by: Yann Byron <[email protected]>
Co-authored-by: simonsssu <[email protected]>
Co-authored-by: Bo Cui <[email protected]>
Co-authored-by: Rahil Chertara <[email protected]>
Co-authored-by: Rahil C <[email protected]>
Co-authored-by: Ryan Pifer <[email protected]>
Co-authored-by: Udit Mehrotra <[email protected]>
Co-authored-by: simonssu <[email protected]>
Co-authored-by: Vander <[email protected]>
Co-authored-by: Dongwook Kwon <[email protected]>
Co-authored-by: Shawn Chang <[email protected]>
Co-authored-by: Shawn Chang <[email protected]>
Co-authored-by: 5herhom <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: F7753 <[email protected]>
Co-authored-by: lewinma <[email protected]>
Co-authored-by: Nicholas Jiang <[email protected]>
Co-authored-by: Yonghua Jian_deepnova <[email protected]>
Co-authored-by: 5herhom <[email protected]>
Co-authored-by: RexXiong <[email protected]>
Co-authored-by: Pratyaksh Sharma <[email protected]>
Co-authored-by: wuwenchi <[email protected]>
Co-authored-by: 吴文池 <[email protected]>
Co-authored-by: jian yonghua <[email protected]>
Co-authored-by: Xinyao Tian (Richard) <[email protected]>
Co-authored-by: XinyaoTian <[email protected]>
Co-authored-by: vamshigv <[email protected]>
Co-authored-by: feiyang_deepnova <[email protected]>
Co-authored-by: linfey <[email protected]>
Co-authored-by: novisfff <[email protected]>
Co-authored-by: Qi Ji <[email protected]>
Co-authored-by: hehuiyuan <[email protected]>
Co-authored-by: Zouxxyy <[email protected]>
vinishjail97 added a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023
* [HUDI-3984] Remove mandatory check of partition path for cli command (apache#5458)

* [HUDI-3634] Could read empty or partial HoodieCommitMetaData in downstream if using HDFS (apache#5048)

Add the differentiated logic of creating immutable file in HDFS by first creating the file.tmp and then renaming the file

* [HUDI-3953] Flink Hudi module should support low-level source and sink api (apache#5445)

Co-authored-by: jerryyue <[email protected]>

* [HUDI-4353] Column stats data skipping for flink (apache#6026)

* [HUDI-3505] Add call procedure for UpgradeOrDowngradeCommand (apache#6012)

Co-authored-by: superche <[email protected]>

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5854)

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <[email protected]>
Co-authored-by: jian.feng <[email protected]>

* [HUDI-3511] Add call procedure for MetadataCommand (apache#6018)

* [HUDI-3730] Add ConfigTool#toMap UT (apache#6035)

Co-authored-by: voonhou.su <[email protected]>

* [MINOR] Improve variable names (apache#6039)

* [HUDI-3116] Add a new HoodieDropPartitionsTool to let users drop table partitions through a standalone job. (apache#4459)

Co-authored-by: yuezhang <[email protected]>

* [HUDI-4360] Fix HoodieDropPartitionsTool based on refactored meta sync (apache#6043)

* [HUDI-3836] Improve the way of fetching metadata partitions from table (apache#5286)

Co-authored-by: xicm <[email protected]>

* [HUDI-4359] Support show_fs_path_detail command on Call Produce Command (apache#6042)

* [HUDI-4356] Fix the error when sync hive in CTAS (apache#6029)

* [HUDI-4219] Merge Into when update expression "col=s.col+2" on precombine causes exception (apache#5828)

* [HUDI-4357] Support flink 1.15.x (apache#6050)

* [HUDI-4152] Flink offline compaction support compacting multi compaction plan at once (apache#5677)

* [HUDI-4152] Flink offline compaction allow compact multi compaction plan at once

* [HUDI-4152] Fix exception for duplicated uid when multi compaction plan are compacted

* [HUDI-4152] Provide UT & IT for compact multi compaction plan

* [HUDI-4152] Put multi compaction plans into one compaction plan source

* [HUDI-4152] InstantCompactionPlanSelectStrategy allow multi instant by using comma

* [HUDI-4152] Add IT for InstantCompactionPlanSelectStrategy

* [HUDI-4309] fix spark32 repartition error (apache#6033)

* [HUDI-4366] Synchronous cleaning for flink bounded source (apache#6051)

* [minor] Following 4152, refactor the classes for plan selection strategy (apache#6060)

* [HUDI-4367] Support copyToTable on call (apache#6054)

* [HUDI-4335] Bug fixes in AWSGlueCatalogSyncClient post schema evolution. (apache#5995)

* Fix updateTableParameters not excluding partition columns, and fix the updateTableProperties boolean check

* Fix serde parameters getting overridden on table property update

* Remove stale syncConfig

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (apache#6017)

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields.

* fix comments

Co-authored-by: public (bdcee5037027) <[email protected]>

* [HUDI-3500] Add call procedure for RepairsCommand (apache#6053)

* [HUDI-2150] Rename/Restructure configs for better modularity (apache#6061)

- Move clean related configuration to HoodieCleanConfig
- Move Archival related configuration to HoodieArchivalConfig
- hoodie.compaction.payload.class move this to HoodiePayloadConfig

* [MINOR] Bump xalan from 2.7.1 to 2.7.2 (apache#6062)

Bumps xalan from 2.7.1 to 2.7.2.

---
updated-dependencies:
- dependency-name: xalan:xalan
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)

* [HUDI-4324] Remove use_jdbc config from hudi sync
* Users should use HIVE_SYNC_MODE instead

* [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs (apache#5695)

* [HUDI-4146] RFC for Improve Hive/Meta sync class design and hierarchies

Co-authored-by: jian.feng <[email protected]>
Co-authored-by: Raymond Xu <[email protected]>

* [HUDI-4323] Make database table names optional in sync tool (apache#6073)

* [HUDI-4323] Make database table names optional in sync tool
* Infer from these properties from the table config

* [MINOR] Update RFCs status (apache#6078)

* [HUDI-4298] When reading the mor table with QUERY_TYPE_SNAPSHOT,Unabl… (apache#5937)

* [HUDI-4298] Add test case for reading mor table

Signed-off-by: LinMingQiang <[email protected]>

* [HUDI-4379] Bump Flink versions to 1.14.5 and 1.15.1 (apache#6080)

* [HUDI-4391] Incremental read from archived commits for flink (apache#6096)

* [RFC-51] [HUDI-3478] Hudi to support Change-Data-Capture (apache#5436)



Co-authored-by: Raymond Xu <[email protected]>

* [HUDI-4393] Add marker file for target file when flink merge handle rolls over (apache#6103)

* [HUDI-4399][RFC-57] Claim RFC 57 for DeltaStreamer proto support (apache#6112)

* [HUDI-4397] Flink Inline Cluster and Compact plan distribute strategy changed from rebalance to hash to avoid potential multiple threads accessing the same file (apache#6106)

Co-authored-by: jerryyue <[email protected]>

* [MINOR] Disable TestHiveSyncGlobalCommitTool (apache#6119)

* [HUDI-4403] Fix the end input metadata for bounded source (apache#6116)

* [HUDI-4408] Reuse old rollover file as base file for flink merge handle (apache#6120)

* [HUDI-3503]  Add call procedure for CleanCommand (apache#6065)

* [HUDI-3503] Add call procedure for CleanCommand
Co-authored-by: simonssu <[email protected]>

* [HUDI-4249] Fixing in-memory `HoodieData` implementation to operate lazily  (apache#5855)

* [HUDI-4170] Allow users to use hoodie.datasource.read.paths to read necessary files (apache#5722)

* Rebase codes

* Move listFileSlices to HoodieBaseRelation

* Fix review

* Fix style

* Fix bug

* Fix file group count issue with metadata partitions (apache#5892)

* [HUDI-4098] Support HMS for flink HudiCatalog (apache#6082)

* [HUDI-4098] Support HMS for flink HudiCatalog

* [HUDI-4409] Improve LockManager wait logic when catch exception (apache#6122)

* [HUDI-4065] Add FileBasedLockProvider (apache#6071)

* [HUDI-4416] Default database path for hoodie hive catalog (apache#6136)

* [HUDI-4372] Enable metadata table by default for flink (apache#6066)

* [HUDI-4401] Skip HBase version check (apache#6114)

* Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

* [HUDI-4427] Add a computed column IT test (apache#6150)

* [HUDI-4146][RFC-55] Update config changes proposal (apache#6162)

* [HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (apache#5428)

Currently, all Hudi Relations bear a performance gap relative to Spark's HadoopFsRelation, 
and the reason for that is the SchemaPruning optimization rule (pruning nested schemas), 
which is unfortunately predicated on the usage of HadoopFsRelation, meaning that it's 
not applied when any other relation is used.

This change ports this rule to Hudi relations (MOR, Incremental, etc.) 
by leveraging the HoodieSparkSessionExtensions mechanism, 
injecting a modified version of the original SchemaPruning rule 
adapted to work with Hudi's custom relations.

- Added customOptimizerRules to HoodieAnalysis
- Added NestedSchemaPrunning Spark's Optimizer rule
- Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas)
- Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions
- Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource
- Disabled fallback to HadoopFsRelation

* [HUDI-3764] Allow loading external configs while querying Hudi tables with Spark (apache#4915)

Currently, when doing Hudi queries with Spark, external 
configurations are not loaded. Say customers enabled 
metadata listing in their global config file; this would 
then let them actually query without the metadata feature enabled. 
This PR fixes this issue and allows loading global 
configs during the Hudi reading phase.

Co-authored-by: Wenning Ding <[email protected]>

* [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (apache#5470)

* [MINOR] Add logger for HoodieCopyOnWriteTableInputFormat (apache#6161)

Co-authored-by: Wenning Ding <[email protected]>

* [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table (apache#6113)

Fixes the missing bloom filters in metadata table in the non-partitioned table due to incorrect record key generation, because of wrong file names when generating the metadata payload for the bloom filter.

* [HUDI-4204] Fixing NPE with row writer path and with OCC (apache#5850)

* [HUDI-4247] Upgrading protocol buffers version for presto bundle (apache#5852)

* [MINOR] Fix result missing information issue in commits_compare Procedure (apache#6165)

Co-authored-by: superche <[email protected]>

* [HUDI-4404] Fix insert into dynamic partition write misalignment (apache#6124)

* [MINOR] Fallback to default for hive-style partitioning, url-encoding configs (apache#6175)

- Fixes broken ITTestHoodieDemo#testParquetDemo

* [MINOR] Fix CI issue with TestHiveSyncTool (apache#6110)

* [HUDI-4039] Make sure all builtin `KeyGenerator`s properly implement Spark specific APIs (apache#5523)

This set of changes makes sure that all builtin KeyGenerators properly implement Spark-specific APIs in a performant way (minimizing key-generators overhead)

* [MINOR] Disable Flink compactor IT test (apache#6189)

* Revert "[MINOR] Fix CI issue with TestHiveSyncTool (apache#6110)" (apache#6192)

This reverts commit d5c904e.

* [HUDI-3979] Optimize out mandatory columns when no merging is performed (apache#5430)

For MOR, when no merging is performed there is no point in reading either primary-key or pre-combine-key values (unless query is referencing these). Avoiding reading these allows to potentially save substantial resources wasted for reading it out.

* [HUDI-4303] Use Hive sentinel value as partition default to avoid type casting issues (apache#5954)

* Revert "[HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)" (apache#6160)

This reverts commit 046044c.

* [HUDI-4435] Fix Avro field not found issue introduced by Avro 1.10 (apache#6155)

Co-authored-by: Wenning Ding <[email protected]>

* [HUDI-4437] Fix test conflicts by clearing file system cache (apache#6123)

Co-authored-by: jian.feng <[email protected]>
Co-authored-by: jian.feng <[email protected]>
Co-authored-by: Raymond Xu <[email protected]>

* [HUDI-4436] Invalidate cached table in Spark after write (apache#6159)

Co-authored-by: Ryan Pifer <[email protected]>

* [MINOR] Fix Call Procedure code style (apache#6186)

* Fix Call Procedure code style.
Co-authored-by: superche <[email protected]>

* [MINOR] Bump CI timeout to 150m (apache#6198)

* [HUDI-4440] Treat bootstrapped table as non-partitioned in HudiFileIndex if partition column is missing from schema (apache#6163)

Co-authored-by: Ryan Pifer <[email protected]>

* [HUDI-4071] Make NONE sort mode as default for bulk insert (apache#6195)

* [HUDI-4420] Fixing table schema delineation on partition/data schema for Spark relations  (apache#5708)

* [HUDI-4448] Remove the latest commit refresh for timeline server (apache#6179)

* [HUDI-4450] Revert the checkpoint abort notification (apache#6181)

* [HUDI-4439] Fix Amazon CloudWatch reporter for metadata enabled tables (apache#6164)

Co-authored-by: Udit Mehrotra <[email protected]>
Co-authored-by: Y Ethan Guo <[email protected]>

* [HUDI-4348] fix merge into sql data quality in concurrent scenarios (apache#6020)

* [HUDI-3510] Add sync validate procedure (apache#6200)

* [HUDI-3510] Add sync validate procedure

Co-authored-by: simonssu <[email protected]>

* [MINOR] Fix typos in Spark client related classes (apache#6204)

* [HUDI-4456] Close FileSystem in SparkClientFunctionalTestHarness  (apache#6201)

* [MINOR] Only log stdout output for non-zero exit from commands in IT (apache#6199)

* [HUDI-4458] Add a converter cache for flink ColumnStatsIndices (apache#6205)

* [HUDI-4071] Match ROLLBACK_USING_MARKERS_ENABLE in sql as datasource (apache#6206)

Co-authored-by: superche <[email protected]>

* [HUDI-4455] Improve test classes for TestHiveSyncTool (apache#6202)

Improve HiveTestService, HiveTestUtil, and related classes.

* [HUDI-4456] Clean up test resources (apache#6203)

* [HUDI-3884] Support archival beyond savepoint commits (apache#5837)


Co-authored-by: sivabalan <[email protected]>

* [HUDI-4250][HUDI-4202] Optimize performance of Column Stats Index reading in Data Skipping  (apache#5746)

We provide an alternative way of fetching Column Stats Index within the reading process to avoid the penalty of a more heavy-weight execution scheduled through a Spark engine.

* [HUDI-4471] Relocate AWSDmsAvroPayload class to hudi-common

* [HUDI-4474] Infer metasync configs (apache#6217)

- infer repeated sync configs from original configs
  - `META_SYNC_BASE_FILE_FORMAT`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.BASE_FILE_FORMAT`
  - `META_SYNC_ASSUME_DATE_PARTITION`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ASSUME_DATE_PARTITIONING`
  - `META_SYNC_DECODE_PARTITION`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.URL_ENCODE_PARTITIONING`
  - `META_SYNC_USE_FILE_LISTING_FROM_METADATA`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE`

As proposed in https://github.com/apache/hudi/blob/master/rfc/rfc-55/rfc-55.md#compatible-changes
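The inference pattern listed above boils down to a keyed fallback lookup: use the sync-specific key if set, otherwise derive it from the table/write config it maps to. A minimal sketch with generic key names (not the actual Hudi config keys):

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigInference {
    // Hypothetical sketch: resolve a sync config by first checking the
    // sync-specific key, then falling back to the source key it can be
    // inferred from, and finally to a default.
    public static String inferOrDefault(Map<String, String> props,
                                        String syncKey,
                                        String sourceKey,
                                        String defaultValue) {
        if (props.containsKey(syncKey)) {
            return props.get(syncKey);
        }
        return props.getOrDefault(sourceKey, defaultValue);
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("source.key", "parquet");
        // sync.key unset -> inferred from source.key
        System.out.println(inferOrDefault(props, "sync.key", "source.key", "orc"));
    }
}
```

An explicitly set sync key always wins; the fallback only fills gaps, matching the "infer from original configs" behavior described in RFC-55.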

* [HUDI-4210] Create custom hbase index to solve data skew issue on hbase regions (apache#5797)

* [HUDI-3730] Keep metasync configs backward compatible (apache#6221)

* [HUDI-4469] Flip reuse flag to true in HoodieBackedTableMetadata to improve file listing (apache#6214)

* [HUDI-4186] Support Hudi with Spark 3.3.0 (apache#5943)

Co-authored-by: Shawn Chang <[email protected]>

* [HUDI-4126] Disable file splits for Bootstrap real time queries (via InputFormat) (apache#6219)


Co-authored-by: Udit Mehrotra <[email protected]>
Co-authored-by: Raymond Xu <[email protected]>

* [HUDI-4490] Make AWSDmsAvroPayload class backwards compatible (apache#6229)

Co-authored-by: Rahil Chertara <[email protected]>

* [HUDI-4484] Add default lock config options for flink metadata table (apache#6222)

* [HUDI-4494] keep the fields' order when data is written out of order (apache#6233)

* [MINOR] Minor changes around Spark 3.3 support (apache#6231)

Co-authored-by: Shawn Chang <[email protected]>

* [HUDI-4081][HUDI-4472] Addressing Spark SQL vs Spark DS performance gap (apache#6213)

* [HUDI-4495] Fix handling of S3 paths incompatible with java URI standards (apache#6237)

* [HUDI-4499] Tweak default retry times for flink metadata table lock (apache#6238)

* [HUDI-4221] Optimizing getAllPartitionPaths (apache#6234)

- Leveraging Spark parallelism for directory processing

* Moving to 0.13.0-SNAPSHOT on master branch.

* [HUDI-4504] Disable metadata table by default for flink (apache#6241)

* [HUDI-4505] Returns instead of throws if lock file exists for FileSystemBasedLockProvider (apache#6242)

To avoid unnecessary exception throws

* [HUDI-4507] Improve file name extraction logic in metadata utils (apache#6250)

* [MINOR] Fix convertPathWithScheme tests (apache#6251)

* [MINOR] Add license header (apache#6247)

Add license header to TestConfigUtils

* [HUDI-4025] Add Presto and Trino query node to validate queries (apache#5578)

* Add Presto and Trino query nodes to hudi-integ-test
* Add yamls for query validation
* Add presto-jdbc and trino-jdbc to integ-test-bundle

* [HUDI-4518] Free lock if allocated but not acquired (apache#6272)

If the lock is not null but its state has not yet transitioned to 
ACQUIRED, retry fails because the lock is not de-allocated. 
See issue apache#5702

* [HUDI-4510] Fix config "hive_sync.metastore.uris" not taking effect in flink sql hive schema sync (apache#6257)

* [HUDI-3848] Fixing minor bug in listing based rollback request generation (apache#6244)

* [HUDI-4512][HUDI-4513] Fix bundle name for spark3 profile (apache#6261)

* [HUDI-4501] Throw exception when restore is attempted with hoodie.archive.beyond.savepoint enabled (apache#6239)

* [HUDI-4516] fix Task not serializable error when running HoodieCleaner after one failure (apache#6265)


Co-authored-by: jian.feng <[email protected]>

* remove test resources (apache#6147)

Co-authored-by: root <[email protected]>

* [HUDI-4477] Adjust partition number of flink sink task (apache#6218)

Co-authored-by: lewinma <[email protected]>

* [HUDI-4298] Mor table reading for base and log files lost sequence of events (apache#6286)

* [HUDI-4298] Mor table reading for base and log files lost sequence of events

Signed-off-by: HunterXHunter <[email protected]>

* [HUDI-4525] Fixing Spark 3.3 `AvroSerializer` implementation (apache#6279)

* [HUDI-4447] fix no partitioned path extractor error when syncing meta (apache#6263)

* [HUDI-4520] Support qualified table 'db.table' in call procedures (apache#6274)

* [HUDI-4531] Wrong partition path for flink hive catalog when the partition fields are not at the end (apache#6292)

* [HUDI-4487] support creating ro/rt tables via spark sql (apache#6262)

* [HUDI-4533] Fix RunCleanProcedure's ArrayIndexOutOfBoundsException (apache#6293)

* [HUDI-4536] ClusteringOperator causes the NullPointerException when writing with BulkInsertWriterHelper in clustering (apache#6298)

* [HUDI-4385] Support online compaction in the flink batch mode write (apache#6093)

* [HUDI-4385] Support online compaction in the flink batch mode write

Signed-off-by: HunterXHunter <[email protected]>

* [HUDI-4530] Fix default payload class in MOR differing from COW (apache#6288)

* [HUDI-4545] Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload (apache#6306)

* [HUDI-4544] support retain hour cleaning policy for flink (apache#6300)

* [HUDI-4547] Fix SortOperatorGen sort indices (apache#6309)

Signed-off-by: HunterXHunter <[email protected]>

* [HUDI-4470] Remove spark dataPrefetch disabled prop in DefaultSource

* [HUDI-4540] Cover different table types in functional tests of Spark structured streaming (apache#6317)

* [HUDI-4514] optimize CTAS to adapt to saveAsTable api in different modes (apache#6295)

* [HUDI-4474] Fix inferring props for meta sync (apache#6310)

- HoodieConfig#setDefaults looks up declared fields, so 
  should pass static class for reflection, otherwise, subclasses 
  of HoodieSyncConfig won't set defaults properly.
- Pass all write client configs of deltastreamer to meta sync
- Make org.apache.hudi.hive.MultiPartKeysValueExtractor 
  default for deltastreamer, to align with SQL and flink

* [HUDI-4550] Fallback to listing based rollback for completed instant (apache#6313)

Ideally, rollback is not triggered for completed instants. 
However, if it gets triggered due to some extraneous condition 
or forced while rollback strategy still configured to be marker-based, 
then fallback to listing-based rollback instead of failing.

- CTOR changes in rollback plan and action executors.
- Change in condition to determine whether to use marker-based rollback.
- Added UT to cover the scenario.
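The fallback decision above can be sketched as a simple predicate (names and states below are illustrative, not Hudi's actual types): marker-based rollback only applies to incomplete instants, because markers are deleted once an instant completes.

```java
public class RollbackStrategyDemo {
  public enum State { REQUESTED, INFLIGHT, COMPLETED }

  // Sketch of the fallback rule: marker-based rollback is used only when
  // it is configured AND the instant never completed; a completed instant
  // falls back to listing-based rollback instead of failing.
  public static boolean useMarkerBasedRollback(boolean markerBasedConfigured, State instantState) {
    return markerBasedConfigured && instantState != State.COMPLETED;
  }

  public static void main(String[] args) {
    System.out.println(useMarkerBasedRollback(true, State.COMPLETED)); // false: use listing
    System.out.println(useMarkerBasedRollback(true, State.INFLIGHT));  // true: markers exist
  }
}
```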

* [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value (apache#6248)

- Added FourToFiveUpgradeHandler to detect Hudi tables with a "default" partition and throw an exception.
- Added a new write config ("hoodie.skip.default.partition.validation") which, when enabled, bypasses the above validation. If users have a Hudi table where the "default" partition was created intentionally and not as a sentinel, they can enable this config to get past the validation.
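A minimal sketch of that upgrade-time check (the helper and its signature are illustrative, only the config key comes from the description above): fail when the legacy "default" partition is present unless the user explicitly opts out.

```java
import java.util.List;

public class DefaultPartitionCheckDemo {
  // Config key from the description above; the rest of this class is a sketch.
  public static final String SKIP_VALIDATION_KEY = "hoodie.skip.default.partition.validation";

  // Throws if the legacy "default" partition exists, unless validation is skipped.
  public static void validatePartitions(List<String> partitions, boolean skipValidation) {
    if (!skipValidation && partitions.contains("default")) {
      throw new IllegalStateException(
          "Table contains legacy \"default\" partition; repair it to "
              + "__HIVE_DEFAULT_PARTITION__ or enable " + SKIP_VALIDATION_KEY);
    }
  }

  public static void main(String[] args) {
    // Opted out: passes even with the legacy partition present.
    validatePartitions(List.of("2022/06/24", "default"), true);
    try {
      validatePartitions(List.of("2022/06/24", "default"), false);
    } catch (IllegalStateException e) {
      System.out.println("validation failed as expected");
    }
  }
}
```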

* [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis (apache#6307)

* [HUDI-4534] Fixing upgrade to reload Metaclient for deltastreamer writes (apache#6296)

* [HUDI-4517] If no marker type file, fallback to timeline based marker (apache#6266)

- If the MARKERS.type file is not present, the logic assumes that direct markers are stored, which causes read failures in certain cases even where the timeline-server-based marker is enabled. This PR handles the failure by falling back to the timeline-based marker in such cases.

* [HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles… (apache#5884)

- Adding request retry to RemoteHoodieTableFileSystemView. Users can enable using the new configs added.

* [HUDI-4464] Clear warnings in Azure CI (apache#6210)


Co-authored-by: jian.feng <[email protected]>

* [MINOR] Update PR description template (apache#6323)

* [HUDI-4508] Repair the exception when reading optimized query for mor in hive and presto/trino (apache#6254)

In a MOR table, a file slice may have only log files and no base file 
before the file slice is compacted. In this case, a read-optimized 
query will match the condition !baseFileOpt.isPresent() in HoodieCopyOnWriteTableInputFormat.createFileStatusUnchecked() 
and throw IllegalStateException.

Instead of throwing an exception, 
it is more suitable for the query to return nothing for that file slice.
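The behavior change can be sketched with simplified stand-in types (not Hudi's actual file-slice model): a read-optimized query over a log-only slice now yields no rows instead of throwing.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class ReadOptimizedSliceDemo {
  // Read-optimized queries only look at base files; a slice with no base
  // file (logs only, not yet compacted) simply contributes no rows.
  public static List<String> readOptimizedRows(Optional<String> baseFile) {
    if (!baseFile.isPresent()) {
      return Collections.emptyList(); // was: throw new IllegalStateException(...)
    }
    return List.of("rows-from-" + baseFile.get());
  }

  public static void main(String[] args) {
    System.out.println(readOptimizedRows(Optional.empty()));            // []
    System.out.println(readOptimizedRows(Optional.of("base.parquet"))); // [rows-from-base.parquet]
  }
}
```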

Co-authored-by: sivabalan <[email protected]>

* [HUDI-4548] Unpack the column max/min to string instead of Utf8 for Mor table (apache#6311)

* [HUDI-4447] Fix SQL meta sync when performing the delete table operation (apache#6180)

* [HUDI-4424] Add new compaction trigger strategy: NUM_COMMITS_AFTER_REQ… (apache#6144)

* [MINOR] improve flink dummySink's parallelism (apache#6325)

* [HUDI-4568] Shade dropwizard metrics-core in hudi-aws-bundle (apache#6327)

* [HUDI-4572] Fix 'Not a valid schema field: ts' error in HoodieFlinkCompactor if precombine field is not ts (apache#6331)

Co-authored-by: jian.feng <[email protected]>

* [HUDI-4570] Fix hive sync path error due to reuse of storage descriptors. (apache#6329)

* [HUDI-4571] Fix partition extractor infer function when partition field mismatch (apache#6333)

Infer META_SYNC_PARTITION_FIELDS and 
META_SYNC_PARTITION_EXTRACTOR_CLASS 
from hoodie.table.partition.fields first. 
If not set, then from hoodie.datasource.write.partitionpath.field.
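The inference precedence above can be sketched as a simple lookup (the helper below is illustrative; only the two config keys come from the description):

```java
import java.util.Map;

public class PartitionFieldInferenceDemo {
  // Prefer the table config hoodie.table.partition.fields; fall back to
  // the write config hoodie.datasource.write.partitionpath.field.
  public static String inferPartitionFields(Map<String, String> props) {
    String fromTable = props.get("hoodie.table.partition.fields");
    if (fromTable != null && !fromTable.isEmpty()) {
      return fromTable;
    }
    return props.getOrDefault("hoodie.datasource.write.partitionpath.field", "");
  }

  public static void main(String[] args) {
    System.out.println(inferPartitionFields(
        Map.of("hoodie.table.partition.fields", "dt,hh")));            // dt,hh
    System.out.println(inferPartitionFields(
        Map.of("hoodie.datasource.write.partitionpath.field", "dt"))); // dt
  }
}
```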

Co-authored-by: Raymond Xu <[email protected]>

* [HUDI-4570] Add test for updating multiple partitions in hive sync (apache#6340)

* [MINOR] Fix wrong key to determine sync sql cascade (apache#6339)

* [HUDI-4581] Claim RFC-58 for data skipping integration with query engines (apache#6346)

* [HUDI-4577] Adding test coverage for `DELETE FROM`, Spark Quickstart guide (apache#6318)

* [HUDI-4556] Improve functional test coverage of column stats index (apache#6319)

* [HUDI-4558] Fix lost 'hoodie.table.keygenerator.class' in hoodie.properties (apache#6320)

Co-authored-by: 吴文池 <[email protected]>

* [HUDI-4543] Support natural order when table schema contains a field named 'ts' (apache#6246)

* be able to disable precombine field when table schema contains a field named ts

Co-authored-by: jian yonghua <[email protected]>

* [HUDI-4569][RFC-58] Claim RFC-58 for adding a new feature named 'Multiple event_time Fields Latest Verification in a Single Table' for Hudi (apache#6328)

Co-authored-by: XinyaoTian <[email protected]>

* [HUDI-3503] Support more feature to call procedure CleanCommand (apache#6353)

* [HUDI-4590] Add hudi-aws dependency to hudi-flink-bundle. (apache#6356)

* [MINOR] fix potential npe in spark writer (apache#6363)

Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>

* Fix bug in CLI show fsview all (apache#6314)

* [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency (apache#6228)

* [HUDI-4611] Fix the duplicate creation of config in HoodieFlinkStreamer (apache#6369)

Co-authored-by: linfey <[email protected]>

* [HUDI-3189] Fall back to full table scan with incremental query when files are cleaned up or archived for MOR table (apache#6141)

* Spark supports reading archived commits for incremental queries on MOR tables

* [MINOR] Fix progress field calculation logic in HoodieLogRecordReader (apache#6291)

* [HUDI-4608] Fix upgrade command in Hudi CLI (apache#6374)

* [HUDI-4609] Improve usability of upgrade/downgrade commands in Hudi CLI (apache#6377)

* [HUDI-4574] Fixed timeline based marker thread safety issue (apache#6383)

* Fixed timeline-based marker thread safety issue
* Added documentation for timeline-based marker thread safety issues

* [HUDI-4621] Add validation that bucket index fields should be subset of primary keys (apache#6396)

* check bucket index fields

Co-authored-by: 吴文池 <[email protected]>

* [HUDI-4354] Add --force-empty-sync flag to deltastreamer (apache#6027)

* [HUDI-4601] Read error from MOR table after compaction with timestamp partitioning (apache#6365)

* read error from mor after compaction

Co-authored-by: 吴文池 <[email protected]>

* [MINOR] Update DOAP with 0.12.0 Release (apache#6413)

* [HUDI-4529] Tweak some default config options for flink (apache#6287)

* [HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)

* [HUDI-4551] Tweak the default parallelism of flink pipeline to execution env  parallelism (apache#6312)

* [MINOR] Improve code style of CLI Command classes (apache#6427)

* [HUDI-3625] Claim RFC-60 for Federated Storage Layer (apache#6440)

* [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar (apache#6386)

- Adding PulsarSource to DeltaStreamer to support ingesting from Apache Pulsar.
- Current implementation of PulsarSource is relying on "pulsar-spark-connector" to ingest using Spark instead of building similar pipeline from scratch.

* [HUDI-3579] Add timeline commands in hudi-cli (apache#5139)

* [HUDI-4638] Rename payload clazz and preCombine field options for flink sql (apache#6434)

* Revert "[HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)" (apache#6449)

This reverts commit 9055b2f.

* [HUDI-4643] MergeInto syntax WHEN MATCHED is optional but must be set (apache#6443)

* [HUDI-4644] Change default flink profile to 1.15.x (apache#6445)

* [HUDI-4678] Claim RFC-61 for Snapshot view management (apache#6461)

Co-authored-by: jian.feng <[email protected]>

* [HUDI-4676] infer cleaner policy when write concurrency mode is OCC (apache#6459)

* [HUDI-4676] infer cleaner policy when write concurrency mode is OCC
Co-authored-by: jian.feng <[email protected]>

* [HUDI-4683] Use enum class value for default value in flink options (apache#6453)

* [HUDI-4584] Cleaning up Spark utilities (apache#6351)

Cleans up Spark utilities and removes duplication

* [HUDI-4686] Flip option 'write.ignore.failed' to default false (apache#6467)

Also fix the flaky test

* [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy (apache#6267)

* [HUDI-4637] Fix release thread in RateLimiter not being terminated (apache#6433)

* [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid conflicts with flink table core (apache#6481)

* [HUDI-4687] Add show_invalid_parquet procedure (apache#6480)

Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>

* [HUDI-4584] Fixing `SQLConf` not being propagated to executor (apache#6352)

Fixes `HoodieSparkUtils.createRDD` to make sure `SQLConf` is properly propagated to the executor (required by `AvroSerializer`)

* [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies (apache#6170)

* [HUDI-4665] Flipping default for "ignore failed batch" config in streaming sink to false (apache#6450)

* [HUDI-4713] Fix flaky ITTestHoodieDataSource#testAppendWrite (apache#6490)

* [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat (apache#6494)

* Revert "[HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles… (apache#5884)" (apache#6501)

This reverts commit 660177b.

* [Stacked on 6386] Fixing `DebeziumSource` to properly commit offsets; (apache#6416)

* [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer (apache#6111)

* [HUDI-4703] Use the historical schema to respond to time travel queries (apache#6499)

* [HUDI-4703] Use the historical schema to respond to time travel queries

* [HUDI-4549]  Remove avro from hudi-hive-sync-bundle and hudi-aws-bundle (apache#6472)

* Remove avro shading from hudi-hive-sync-bundle
   and hudi-aws-bundle.

Co-authored-by: Raymond Xu <[email protected]>

* [HUDI-4482] remove guava and use caffeine instead for cache (apache#6240)

* [HUDI-4483] Fix checkstyle in integ-test module (apache#6523)

* [HUDI-4340] Fix 'not parsable text' DateTimeParseException by adding a method parseDateFromInstantTimeSafely for parsing timestamps when outputting metrics (apache#6000)

* [DOCS] Add docs about javax.security.auth.login.LoginException when starting Hudi Sink Connector (apache#6255)

* [HUDI-4327] Fixing flaky deltastreamer test (testCleanerDeleteReplacedDataWithArchive) (apache#6533)

* [HUDI-4730] Fix batch job being unable to clean old commit files (apache#6515)

* [HUDI-4370] Fix batch job being unable to clean old commit files

Co-authored-by: jian.feng <[email protected]>

* [HUDI-4740] Add metadata fields for hive catalog #createTable (apache#6541)

* [HUDI-4695] Fixing flaky TestInlineCompaction#testCompactionRetryOnFailureBasedOnTime (apache#6534)

* [HUDI-4193] change protoc version to unblock hudi compilation on m1 mac (apache#6535)

* [HUDI-4438] Fix flaky TestCopyOnWriteActionExecutor#testPartitionMetafileFormat (apache#6546)

* [MINOR] Fix typo in HoodieArchivalConfig (apache#6542)

* [HUDI-4582] Support batch synchronization of partitions to HMS to avoid timeout (apache#6347)


Co-authored-by: xxhua <[email protected]>

* [HUDI-4742] Fix wrong AWS Glue partition location in updatePartition (apache#6545)

Co-authored-by: xxhua <[email protected]>

* [HUDI-4418] Add support for ProtoKafkaSource (apache#6135)

- Adds PROTO to Source.SourceType enum.
- Handles PROTO type in SourceFormatAdapter by converting to Avro from proto Message objects. 
   Conversion to Row goes Proto -> Avro -> Row currently.
- Added ProtoClassBasedSchemaProvider to generate schemas for a proto class that is currently on the classpath.
- Added ProtoKafkaSource which parses byte[] into a class that is on the classpath.
- Added ProtoConversionUtil which exposes methods for creating schemas and 
   translating from Proto messages to Avro GenericRecords.
- Added KafkaSource which provides a base class for the other Kafka sources to use.
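The two-hop conversion path described above can be sketched with simplified stand-in types (records below are illustrative; the real code works with Protobuf Message, Avro GenericRecord, and Spark Row):

```java
public class SourceFormatAdapterDemo {
  // Simplified stand-ins for the three representations.
  public record Proto(String payload) {}
  public record Avro(String payload) {}
  public record Row(String payload) {}

  public static Avro protoToAvro(Proto p) {
    return new Avro(p.payload());
  }

  public static Row avroToRow(Avro a) {
    return new Row(a.payload());
  }

  // There is no direct Proto -> Row conversion yet; it goes through Avro.
  public static Row protoToRow(Proto p) {
    return avroToRow(protoToAvro(p));
  }

  public static void main(String[] args) {
    System.out.println(protoToRow(new Proto("event-1")).payload()); // event-1
  }
}
```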

* [HUDI-4642] Adding support to hudi-cli to repair deprecated partition (apache#6438)

* [HUDI-4751] Fix owner instants for transaction manager api callers (apache#6549)

* [HUDI-4739] Wrong value returned when key's length equals 1 (apache#6539)

* extracts key fields

Co-authored-by: 吴文池 <[email protected]>

* [HUDI-4528] Add diff tool to compare commit metadata (apache#6485)

* Add diff tool to compare commit metadata
* Add partition level info to commits and compaction command
* Partition support for compaction archived timeline
* Add diff command test

* [HUDI-4648] Support rename partition through CLI (apache#6569)

* [HUDI-4775] Fixing incremental source for MOR table (apache#6587)

* Fixing incremental source for MOR table

* Remove unused import

Co-authored-by: Sagar Sumit <[email protected]>

* [HUDI-4694] Print testcase running time for CI jobs (apache#6586)

* [RFC] Claim RFC-62 for Diagnostic Reporter (apache#6599)

Co-authored-by: yuezhang <[email protected]>

* [MINOR] Following HUDI-4739, fix the extraction of simple record keys (apache#6594)

* [HUDI-4619] Add a remote request retry mechanism for 'Remotehoodietablefilesystemview'. (apache#6393)

* [HUDI-4720] Fix HoodieInternalRow return wrong num of fields when source not contains meta fields (apache#6500)

Co-authored-by: wangzixuan.wzxuan <[email protected]>

* [HUDI-4389] Make HoodieStreamingSink idempotent (apache#6098)

* Support checkpoint and idempotent writes in HoodieStreamingSink

- Use batchId as the checkpoint key and add to commit metadata
- Support multi-writer for checkpoint data model

* Walk back previous commits until checkpoint is found

* Handle delete operation and fix test
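The walk-back described above can be sketched as follows (the metadata layout and helper names are illustrative, not Hudi's actual API): commits are scanned from newest to oldest until one carrying the sink's batch-id checkpoint is found, and a replayed batch is skipped rather than re-written.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class StreamingSinkCheckpointDemo {
  // Walk commits newest-first until one contains the checkpoint key.
  // Some commits (e.g. from other writers) may not carry it at all.
  public static Optional<Long> latestCommittedBatchId(
      List<Map<String, String>> commitsNewestFirst, String checkpointKey) {
    for (Map<String, String> commitMetadata : commitsNewestFirst) {
      String batchId = commitMetadata.get(checkpointKey);
      if (batchId != null) {
        return Optional.of(Long.parseLong(batchId));
      }
    }
    return Optional.empty();
  }

  // A batch replayed by Spark after recovery is skipped, making the sink idempotent.
  public static boolean shouldSkipBatch(long incomingBatchId, Optional<Long> lastCommitted) {
    return lastCommitted.isPresent() && incomingBatchId <= lastCommitted.get();
  }

  public static void main(String[] args) {
    List<Map<String, String>> commits = List.of(
        Map.of(),                    // newest commit without the checkpoint key
        Map.of("batchId", "7"),
        Map.of("batchId", "6"));
    Optional<Long> last = latestCommittedBatchId(commits, "batchId");
    System.out.println(shouldSkipBatch(7, last)); // true: already committed
    System.out.println(shouldSkipBatch(8, last)); // false: new batch
  }
}
```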

* [MINOR] Remove redundant braces (apache#6604)

* [HUDI-4618] Separate log word for CommitUitls class (apache#6392)

* [HUDI-4776] Fix MERGE INTO using unresolved assignment (apache#6589)

* [HUDI-4795] Fix KryoException when bulk inserting into a non-bucket-index Hudi table

Co-authored-by: hbg <[email protected]>

* [HUDI-4615] Return checkpoint as null for empty data from events queue.  (apache#6387)


Co-authored-by: sivabalan <[email protected]>

* [HUDI-4782] Support TIMESTAMP_LTZ type for flink (apache#6607)

* [HUDI-4731] Shutdown CloudWatch reporter when query completes (apache#6468)

* [HUDI-4793] Fixing ScalaTest tests to properly respect Log4j2 configs (apache#6617)

* [HUDI-4766] Strengthen flink clustering job (apache#6566)

* Allow rollbacks if required during clustering
* Allow size to be defined in Long instead of Integer
* Fix bug where clustering will produce files of 120MB in the same filegroup
* Added clean task
* Fix scheduling config to be consistent with that with compaction
* Fix filter mode getting ignored issue
* Add --instant-time parameter
* Prevent the "no execute() calls" exception from being thrown (clustering & compaction)

* Apply upstream changes

* Fix compilation issues

* Fix checkstyle

Signed-off-by: LinMingQiang <[email protected]>
Signed-off-by: HunterXHunter <[email protected]>
Co-authored-by: miomiocat <[email protected]>
Co-authored-by: RexAn <[email protected]>
Co-authored-by: JerryYue-M <[email protected]>
Co-authored-by: jerryyue <[email protected]>
Co-authored-by: Danny Chan <[email protected]>
Co-authored-by: superche <[email protected]>
Co-authored-by: superche <[email protected]>
Co-authored-by: Shiyan Xu <[email protected]>
Co-authored-by: jian.feng <[email protected]>
Co-authored-by: jian.feng <[email protected]>
Co-authored-by: voonhous <[email protected]>
Co-authored-by: voonhou.su <[email protected]>
Co-authored-by: YueZhang <[email protected]>
Co-authored-by: yuezhang <[email protected]>
Co-authored-by: Y Ethan Guo <[email protected]>
Co-authored-by: xi chaomin <[email protected]>
Co-authored-by: xicm <[email protected]>
Co-authored-by: ForwardXu <[email protected]>
Co-authored-by: 董可伦 <[email protected]>
Co-authored-by: shenjiayu17 <[email protected]>
Co-authored-by: Lanyuanxiaoyao <[email protected]>
Co-authored-by: KnightChess <[email protected]>
Co-authored-by: 苏承祥 <[email protected]>
Co-authored-by: Kumud Kumar Srivatsava Tirupati <[email protected]>
Co-authored-by: xiarixiaoyao <[email protected]>
Co-authored-by: liujinhui <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: 冯健 <[email protected]>
Co-authored-by: Sagar Sumit <[email protected]>
Co-authored-by: HunterXHunter <[email protected]>
Co-authored-by: Luning (Lucas) Wang <[email protected]>
Co-authored-by: Yann Byron <[email protected]>
Co-authored-by: Tim Brown <[email protected]>
Co-authored-by: simonsssu <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>
Co-authored-by: Sivabalan Narayanan <[email protected]>
Co-authored-by: Bo Cui <[email protected]>
Co-authored-by: Rahil Chertara <[email protected]>
Co-authored-by: wenningd <[email protected]>
Co-authored-by: Wenning Ding <[email protected]>
Co-authored-by: Rahil C <[email protected]>
Co-authored-by: Ryan Pifer <[email protected]>
Co-authored-by: Udit Mehrotra <[email protected]>
Co-authored-by: simonssu <[email protected]>
Co-authored-by: Vander <[email protected]>
Co-authored-by: Tim Brown <[email protected]>
Co-authored-by: Dongwook Kwon <[email protected]>
Co-authored-by: Shawn Chang <[email protected]>
Co-authored-by: Shawn Chang <[email protected]>
Co-authored-by: 5herhom <[email protected]>
Co-authored-by: 吴祥平 <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: F7753 <[email protected]>
Co-authored-by: lewinma <[email protected]>
Co-authored-by: shaoxiong.zhan <[email protected]>
Co-authored-by: Nicholas Jiang <[email protected]>
Co-authored-by: Yonghua Jian_deepnova <[email protected]>
Co-authored-by: leesf <[email protected]>
Co-authored-by: 5herhom <[email protected]>
Co-authored-by: RexXiong <[email protected]>
Co-authored-by: BruceLin <[email protected]>
Co-authored-by: Pratyaksh Sharma <[email protected]>
Co-authored-by: wuwenchi <[email protected]>
Co-authored-by: 吴文池 <[email protected]>
Co-authored-by: jian yonghua <[email protected]>
Co-authored-by: Xinyao Tian (Richard) <[email protected]>
Co-authored-by: XinyaoTian <[email protected]>
Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>
Co-authored-by: vamshigv <[email protected]>
Co-authored-by: feiyang_deepnova <[email protected]>
Co-authored-by: linfey <[email protected]>
Co-authored-by: novisfff <[email protected]>
Co-authored-by: Qi Ji <[email protected]>
Co-authored-by: hehuiyuan <[email protected]>
Co-authored-by: Zouxxyy <[email protected]>
Co-authored-by: Teng <[email protected]>
Co-authored-by: leandro-rouberte <[email protected]>
Co-authored-by: Jon Vexler <[email protected]>
Co-authored-by: smilecrazy <[email protected]>
Co-authored-by: xxhua <[email protected]>
Co-authored-by: komao <[email protected]>
Co-authored-by: wangzixuan.wzxuan <[email protected]>
Co-authored-by: felixYyu <[email protected]>
Co-authored-by: Bingeng Huang <[email protected]>
Co-authored-by: hbg <[email protected]>
Co-authored-by: Vinish Reddy <[email protected]>
Co-authored-by: junyuc25 <[email protected]>
Co-authored-by: rmahindra123 <[email protected]>