[HUDI-7098] Adding max bytes per partition with cloud stores source in DS #10100

Merged (2 commits) on Nov 19, 2023

nsivabalan
Contributor

@nsivabalan nsivabalan commented Nov 15, 2023

Change Logs

Adds a max-bytes-per-partition limit for the cloud stores source in DS.

Impact

Adds a max-bytes-per-partition limit for the cloud stores source in DS. This should help the cloud stores source scale for large ingests.
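For context, here is a minimal sketch of how a per-partition byte cap can translate into a partition count for a batch of files. It is not the code from this patch: `num_partitions` and its parameters are hypothetical names, and it assumes the total size of the batch is known before the read is planned.

```python
def num_partitions(total_bytes: int, max_bytes_per_partition: int) -> int:
    """Hypothetical helper: smallest partition count that keeps the
    average partition size at or below the configured cap."""
    if max_bytes_per_partition <= 0:
        raise ValueError("max_bytes_per_partition must be positive")
    if total_bytes < 0:
        raise ValueError("total_bytes must not be negative")
    # Integer ceiling division avoids float rounding on very large byte counts.
    needed = (total_bytes + max_bytes_per_partition - 1) // max_bytes_per_partition
    # Always return at least one partition, even for an empty batch.
    return max(1, needed)
```

Under these assumptions, a larger batch simply gets more partitions instead of larger ones, which is what keeps individual tasks bounded as ingest volume grows.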

Risk level (write none, low, medium or high below)

none

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default values of existing configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan self-assigned this Nov 15, 2023
@nsivabalan
Contributor Author

hey @lokeshj1703 : can you review the patch?
hey @xushiyan : once the patch is approved, can you take care of landing it, please?

@nsivabalan nsivabalan added the priority:critical production down; pipelines stalled; Need help asap. label Nov 15, 2023
@nsivabalan
Contributor Author

@xushiyan : good to review again.

@apache apache deleted a comment from hudi-bot Nov 18, 2023
@xushiyan
Member

rerunning CI

@hudi-bot

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@xushiyan xushiyan merged commit 3913dca into apache:master Nov 19, 2023
30 checks passed
jonvex pushed a commit to jonvex/hudi that referenced this pull request Nov 29, 2023
commit dfa3bde
Merge: bfc0a85 473cf9a
Author: Jonathan Vexler <=>
Date:   Wed Nov 29 15:01:45 2023 -0500

    Merge branch 'master' into fg_reader_implement_bootstrap

commit bfc0a85
Author: Jonathan Vexler <=>
Date:   Wed Nov 29 14:55:57 2023 -0500

    fix bug with nested required fields due to spark nested schema pruning bug

commit 473cf9a
Author: Rajesh Mahindra <[email protected]>
Date:   Wed Nov 29 08:37:40 2023 -0800

    [HUDI-7138] Fix error table writer and schema registry provider (apache#10173)

    ---------

    Co-authored-by: rmahindra123 <[email protected]>

commit 91eabab
Author: Lin Liu <[email protected]>
Date:   Tue Nov 28 23:49:37 2023 -0800

    [HUDI-7103] Support time travel queies for COW tables (apache#10109)

    This is based on HadoopFsRelation.

commit b300728
Author: Rajesh Mahindra <[email protected]>
Date:   Tue Nov 28 22:31:12 2023 -0800

    [HUDI-7086] Fix the default for gcp pub sub max sync time to 1min (apache#10171)

    Co-authored-by: rmahindra123 <[email protected]>

commit 8370c62
Author: Shiyan Xu <[email protected]>
Date:   Tue Nov 28 22:31:34 2023 -0600

    [HUDI-7149] Add a dbt example project with CDC capability (apache#10192)

commit 817d81a
Author: zhuanshenbsj1 <[email protected]>
Date:   Wed Nov 29 11:46:20 2023 +0800

    [MINOR] Add log to print wrong number of instant metadata files (apache#10196)

commit cadeade
Author: leixin <[email protected]>
Date:   Wed Nov 29 11:45:24 2023 +0800

    [minor] when metric prefix length is 0 ignore the metric prefix (apache#10190)

    Co-authored-by: leixin1 <[email protected]>

commit 91daa7d
Author: Lin Liu <[email protected]>
Date:   Tue Nov 28 19:03:50 2023 -0800

    [HUDI-7102] Fix bugs related to time travel queries (apache#10102)

commit d1dfa5b
Author: Dongsj <[email protected]>
Date:   Wed Nov 29 10:49:38 2023 +0800

    [HUDI-7148] Add an additional fix to the potential thread insecurity problem of heartbeat client (apache#10188)

    Co-authored-by: dongsj <[email protected]>

commit b0b711e
Author: Jonathan Vexler <=>
Date:   Tue Nov 28 21:35:20 2023 -0500

    nested schema kinda fix

commit 77cfb3a
Author: YueZhang <[email protected]>
Date:   Wed Nov 29 09:46:53 2023 +0800

    [HUDI-7147] Fix CDC write flush bug (apache#10186)

    * Using iterator instead of values to avoid unsupported operation exception

    * check style

commit b144ee0
Author: Jon Vexler <[email protected]>
Date:   Tue Nov 28 14:23:46 2023 -0500

    Update hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala

    Co-authored-by: Sagar Sumit <[email protected]>

commit 89fab14
Author: Jonathan Vexler <=>
Date:   Tue Nov 28 14:23:03 2023 -0500

    fix failing tests and address some of sagar pr review

commit 675abf1
Author: Tim Brown <[email protected]>
Date:   Mon Nov 27 23:21:56 2023 -0600

    [MINOR] Schema Converter should use default identity transform if not specified (apache#10178)

commit 5450aff
Author: Jonathan Vexler <=>
Date:   Mon Nov 27 22:21:06 2023 -0500

    disable vector for bootstrap

commit fb062df
Author: Danny Chan <[email protected]>
Date:   Tue Nov 28 10:52:33 2023 +0800

    [Minor] Fix the flaky tests in TestRemoteHoodieTableFileSystemView (apache#10179)

commit 3ae4d30
Author: Jonathan Vexler <=>
Date:   Mon Nov 27 21:07:17 2023 -0500

    fix various issues that caused failing tests

commit a045da6
Author: Jonathan Vexler <=>
Date:   Mon Nov 27 18:00:46 2023 -0500

    see if this works

commit 91be81a
Author: Jonathan Vexler <=>
Date:   Mon Nov 27 17:07:30 2023 -0500

    use java to create unary operator

commit c22d1db
Merge: 38b2603 4c3a1db
Author: Jonathan Vexler <=>
Date:   Mon Nov 27 15:56:39 2023 -0500

    Merge branch 'master' into fg_reader_implement_bootstrap

commit 38b2603
Author: Jonathan Vexler <=>
Date:   Mon Nov 27 15:42:22 2023 -0500

    set precombine in test

commit 2a9a363
Author: Jonathan Vexler <=>
Date:   Mon Nov 27 13:27:38 2023 -0500

    try to fix scala2.11 unary operator issue

commit 60bdf14
Author: Jonathan Vexler <=>
Date:   Mon Nov 27 13:02:16 2023 -0500

    try fix ci

commit 4c3a1db
Author: majian <[email protected]>
Date:   Mon Nov 27 16:44:25 2023 +0800

    [HUDI-7110][FOLLOW-UP] Improve call procedure for show column stats information (apache#10169)

commit 499423c
Author: zhuanshenbsj1 <[email protected]>
Date:   Sun Nov 26 10:13:46 2023 +0800

    [HUDI-7041] Optimize the memory usage of timeline server for table service (apache#10002)

commit 4f875ed
Author: Y Ethan Guo <[email protected]>
Date:   Sat Nov 25 15:10:37 2023 -0800

    [HUDI-7139] Fix operation type for bulk insert with row writer in Hudi Streamer (apache#10175)

    This commit fixes the bug which causes the `operationType` to be null in the commit metadata of bulk insert operation with row writer enabled in Hudi Streamer (`hoodie.datasource.write.row.writer.enable=true`).  `HoodieStreamerDatasetBulkInsertCommitActionExecutor` is updated so that `#preExecute` and `#afterExecute` should run the same logic as regular bulk insert operation without row writer.

commit 332e7e8
Author: harshal <[email protected]>
Date:   Sat Nov 25 14:04:29 2023 +0530

    [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync (apache#10158)

    ---------

    Co-authored-by: sivabalan <[email protected]>

commit 86232d2
Author: Sivabalan Narayanan <[email protected]>
Date:   Thu Nov 23 19:27:50 2023 -0800

    [HUDI-7095] Making perf enhancements to JSON serde (apache#10097)

commit a7fd27c
Author: Sivabalan Narayanan <[email protected]>
Date:   Thu Nov 23 19:20:01 2023 -0800

    [HUDI-7086] Scaling gcs event source (apache#10073)

    -  Scaling gcs event source

    ---------

    Co-authored-by: rmahindra123 <[email protected]>

commit bb42c4b
Author: Sivabalan Narayanan <[email protected]>
Date:   Thu Nov 23 18:33:32 2023 -0800

    [HUDI-7097] Fix instantiation of Hms Uri with HiveSync tool (apache#10099)

commit 0b7f47a
Author: Jonathan Vexler <=>
Date:   Thu Nov 23 16:27:36 2023 -0500

    decently working

commit bcb974b
Author: VitoMakarevich <[email protected]>
Date:   Thu Nov 23 11:22:14 2023 +0100

    [HUDI-7034] Fix refresh table/view (apache#10151)

    * [HUDI-7034] Refresh index fix - remove cached file slices within partitions

    ---------

    Co-authored-by: vmakarevich <[email protected]>
    Co-authored-by: Sagar Sumit <[email protected]>

commit b77eff2
Author: Lokesh Jain <[email protected]>
Date:   Thu Nov 23 10:47:40 2023 +0530

    [HUDI-7120] Performance improvements in deltastreamer executor code path (apache#10135)

commit 405be17
Author: Sivabalan Narayanan <[email protected]>
Date:   Wed Nov 22 21:00:33 2023 -0800

    [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) (apache#10095)

    * Making misc fixes to deltastreamer sources

    * Fixing test failures

    * adding inference to CloudSourceconfig... cloud.data.datafile.format

    * Fix the tests for s3 events source

    * Fix the tests for s3 events source

    ---------

    Co-authored-by: rmahindra123 <[email protected]>

commit 3d21285
Author: Tim Brown <[email protected]>
Date:   Wed Nov 22 22:51:14 2023 -0600

    [HUDI-7112] Reuse existing timeline server and performance improvements (apache#10122)

    - Reuse timeline server across tables.

    ---------

    Co-authored-by: sivabalan <[email protected]>

commit 72ff9a7
Author: Rajesh Mahindra <[email protected]>
Date:   Wed Nov 22 20:49:15 2023 -0800

    [HUDI-7052] Fix partition key validation for custom key generators. (apache#10014)

    ---------

    Co-authored-by: rmahindra123 <[email protected]>

commit 8d6d043
Author: majian <[email protected]>
Date:   Thu Nov 23 10:08:17 2023 +0800

    [HUDI-7110] Add call procedure for show column stats information (apache#10120)

commit aabaa99
Author: huangxiaoping <[email protected]>
Date:   Thu Nov 23 09:06:45 2023 +0800

    [MINOR] Remove unused import (apache#10159)

commit f88a73f
Author: Y Ethan Guo <[email protected]>
Date:   Wed Nov 22 10:48:48 2023 -0800

    [HUDI-7123] Improve CI scripts (apache#10136)

    Improves the CI scripts in the following aspects:
    - Removes `hudi-common` tests from `test-spark` job in GH CI as they are already covered by Azure CI
    - Removes unnecesary bundle validation jobs and adds new bundle validation images (`flink1153hive313spark323`, `flink1162hive313spark331`)
    - Updates `validate-release-candidate-bundles` jobs
    - Moves functional tests of `hudi-spark-datasource/hudi-spark` from job 4 (3 hours) to job 2 (1 hour) in Azure CI to rebalance the finish time.

commit 38c87b7
Author: harshal <[email protected]>
Date:   Wed Nov 22 20:53:42 2023 +0530

    [HUDI-7004] Add support of snapshotLoadQuerySplitter in s3/gcs sources (apache#10152)

commit d0edfb5
Author: Sivabalan Narayanan <[email protected]>
Date:   Wed Nov 22 10:22:53 2023 -0500

    [HUDI-6961] Fixing DefaultHoodieRecordPayload to honor deletion based on meta field as well as custome delete marker (apache#10150)

    - Fixing DefaultHoodieRecordPayload to honor deletion based on meta field as well as custom delete marker across all delete apis

commit cda9dbc
Author: Jing Zhang <[email protected]>
Date:   Wed Nov 22 18:04:39 2023 +0800

    [HUDI-7129] Fix bug when upgrade from table version three using UpgradeOrDowngradeProcedure (apache#10147)

commit 18f7181
Author: Shiyan Xu <[email protected]>
Date:   Wed Nov 22 02:00:27 2023 -0600

    [HUDI-7133] Improve dbt example for better guidance (apache#10155)

commit c5af85d
Author: Sivabalan Narayanan <[email protected]>
Date:   Wed Nov 22 01:33:49 2023 -0500

    [HUDI-7096] Improving incremental query to fetch partitions based on commit metadata (apache#10098)

commit 2522f6d
Author: xuzifu666 <[email protected]>
Date:   Wed Nov 22 11:53:21 2023 +0800

    [HUDI-7128] DeleteMarkerProcedures support delete in batch mode (apache#10148)

    Co-authored-by: xuyu <[email protected]>

commit a1afcdd
Author: Tim Brown <[email protected]>
Date:   Tue Nov 21 14:58:12 2023 -0600

    [HUDI-7115] Add in new options for the bigquery sync (apache#10125)

    - Add in new options for the bigquery sync

commit 35cd873
Author: Sivabalan Narayanan <[email protected]>
Date:   Tue Nov 21 13:11:21 2023 -0500

    [HUDI-7084] Fixing schema retrieval for table w/ no commits (apache#10069)

    * Fixing schema retrieval for table w/ no commits

    * fixing compilation failure

commit 74793d5
Author: Rajesh Mahindra <[email protected]>
Date:   Tue Nov 21 09:53:12 2023 -0800

    [HUDI-7106] Fix sqs deletes, deltasync service close and error table default configs. (apache#10117)

    Co-authored-by: rmahindra123 <[email protected]>

commit b981877
Author: harshal <[email protected]>
Date:   Tue Nov 21 22:52:28 2023 +0530

    [HUDI-7003] Add option to fallback to full table scan if files are deleted due to cleaner (apache#9941)

commit 600fd4d
Author: Akira Ajisaka <[email protected]>
Date:   Wed Nov 22 01:24:37 2023 +0900

    [HUDI-6734] Add back HUDI-5409: Avoid file index and use fs view cache in COW input format (apache#9567)

    * [HUDI-6734] Add back HUDI-5409: Avoid file index and use fs view cache in COW input format

    This reverts commit 2567ada.

     Conflicts:
    	hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java
    	hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieMergeOnReadTableInputFormat.java

    * Always use file index if files partition is available

    ---------

    Co-authored-by: Sagar Sumit <[email protected]>

commit 9e2500c
Author: Sivabalan Narayanan <[email protected]>
Date:   Tue Nov 21 09:55:23 2023 -0500

    [HUDI-7083] Adding support for multiple tables with Prometheus Reporter (apache#10068)

    * Adding support for multiple tables with Prometheus Reporter

    * Fixing closure of http server

    * Remove entry from port-collector registry map after stopping http server

    ---------

    Co-authored-by: Sagar Sumit <[email protected]>

commit baffe1d
Author: Sivabalan Narayanan <[email protected]>
Date:   Tue Nov 21 09:32:39 2023 -0500

    [MINOR] Misc fixes in deltastreamer (apache#10067)

commit 0c4f3a3
Author: Sivabalan Narayanan <[email protected]>
Date:   Tue Nov 21 02:17:13 2023 -0500

    [HUDI-7127] Fixing set up and tear down in tests (apache#10146)

commit eaba114
Author: Akira Ajisaka <[email protected]>
Date:   Tue Nov 21 11:37:47 2023 +0900

    [HUDI-7107] Reused MetricsReporter fails to publish metrics in Spark streaming job (apache#10132)

commit 578e756
Author: Jing Zhang <[email protected]>
Date:   Tue Nov 21 10:04:33 2023 +0800

    [HUDI-7118] Set conf 'spark.sql.parquet.enableVectorizedReader' to true automatically only if the value is not explicitly set (apache#10134)

commit d24220a
Author: Jing Zhang <[email protected]>
Date:   Tue Nov 21 09:56:07 2023 +0800

    [HUDI-7111] Fix performance regression of tag when written into simple bucket index table (apache#10130)

commit 84990ae
Author: Rajesh Mahindra <[email protected]>
Date:   Mon Nov 20 11:17:45 2023 -0800

    Fix schema refresh for KafkaAvroSchemaDeserializer (apache#10118)

    Co-authored-by: rmahindra123 <[email protected]>

commit 979132b
Author: majian <[email protected]>
Date:   Mon Nov 20 10:43:11 2023 +0800

    [HUDI-7099] Providing metrics for archive and defining some string constants (apache#10101)

commit 3225625
Author: Fabio Buso <[email protected]>
Date:   Mon Nov 20 03:19:41 2023 +0100

    [MINOR] Add Hopsworks File System to StorageSchemes (apache#10141)

commit 3913dca
Author: Sivabalan Narayanan <[email protected]>
Date:   Sat Nov 18 23:50:37 2023 -0500

    [HUDI-7098] Add max bytes per partition with cloud stores source in DS (apache#10100)

commit 4c295b2
Author: hehuiyuan <[email protected]>
Date:   Sun Nov 19 09:43:52 2023 +0800

    [HUDI-7119] Don't write precombine field to hoodie.properties when the ts field does not exist for append mode (apache#10133)

commit b2f4493
Author: Jing Zhang <[email protected]>
Date:   Sun Nov 19 09:35:54 2023 +0800

    [HUDI-7072] Remove support for Flink 1.13 (apache#10052)

commit dfe1674
Author: Sagar Lakshmipathy <[email protected]>
Date:   Fri Nov 17 18:43:07 2023 -0800

    [Minor] Fixed twitter link to redirect to twitter (apache#10139)

commit f58d9cb
Author: Jonathan Vexler <=>
Date:   Fri Nov 17 18:10:00 2023 -0500

    current point

commit 184858b
Author: Jonathan Vexler <=>
Date:   Fri Nov 17 16:21:56 2023 -0500

    non-working. Want to review with team that this makes sense

commit 8240b6a
Author: Y Ethan Guo <[email protected]>
Date:   Fri Nov 17 11:20:57 2023 -0800

    [HUDI-7113] Update release scripts and docs for Spark 3.5 support (apache#10123)

commit 216aeb4
Author: Danny Chan <[email protected]>
Date:   Fri Nov 17 14:35:17 2023 +0800

    [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 (apache#10126)

commit 3d0c450
Author: YueZhang <[email protected]>
Date:   Fri Nov 17 09:48:59 2023 +0800

    [HUDI-7109] Fix Flink may re-use a committed instant in append mode (apache#10119)

commit f06ff5b
Author: hehuiyuan <[email protected]>
Date:   Fri Nov 17 09:43:21 2023 +0800

    [HUDI-7090] Set the maxParallelism for singleton operator  (apache#10090)

commit faa73e9
Author: Y Ethan Guo <[email protected]>
Date:   Thu Nov 16 12:12:22 2023 -0800

    [MINOR] Disable failed test on master (apache#10124)

commit 6cc39bf
Author: Sivabalan Narayanan <[email protected]>
Date:   Thu Nov 16 06:00:54 2023 -0500

    [MINOR] Removing unnecessary guards to row writer (apache#10004)

commit 4ea752f
Author: voonhous <[email protected]>
Date:   Thu Nov 16 16:53:28 2023 +0800

    [MINOR] Modified description to include missing trigger strategy (apache#10114)

commit 874b5de
Author: Shawn Chang <[email protected]>
Date:   Wed Nov 15 21:57:14 2023 -0800

    [HUDI-6806] Support Spark 3.5.0 (apache#9717)

    ---------

    Co-authored-by: Shawn Chang <[email protected]>
    Co-authored-by: Y Ethan Guo <[email protected]>

commit 35af64d
Author: Shawn Chang <[email protected]>
Date:   Wed Nov 15 18:36:42 2023 -0800

    [Minor] Throw exceptions when cleaner/compactor fail (apache#10108)

    Co-authored-by: Shawn Chang <[email protected]>

commit bada5d9
Author: Shawn Chang <[email protected]>
Date:   Wed Nov 15 16:50:38 2023 -0800

    [HUDI-5936] Fix serialization problem when FileStatus is not serializable (apache#10065)

    Co-authored-by: Shawn Chang <[email protected]>

commit dcd5a81
Author: majian <[email protected]>
Date:   Wed Nov 15 16:10:15 2023 +0800

    [HUDI-7069] Optimize metaclient construction and include table config options (apache#10048)

commit f218e54
Author: Jing Zhang <[email protected]>
Date:   Wed Nov 15 16:07:04 2023 +0800

    [MINOR] Add detailed error logs in RunCompactionProcedure (apache#10070)

    * add detailed error logs in RunCompactionProcedure
    * only print 100 error file paths into logs

commit 2185abb
Author: Jing Zhang <[email protected]>
Date:   Wed Nov 15 16:03:23 2023 +0800

    [HUDI-7094] AlterTableAddColumnCommand/AlterTableChangeColumnCommand update table with ro/rt suffix (apache#10094)

commit abd3afc
Author: Hussein Awala <[email protected]>
Date:   Wed Nov 15 06:55:47 2023 +0200

    [HUDI-6695] Use the AWS provider chain in Glue sync and add a new provider for STS assume role (apache#9260)

commit 424e0ce
Author: chao chen <[email protected]>
Date:   Wed Nov 15 12:20:10 2023 +0800

    [HUDI-7050] Flink HoodieHiveCatalog supports hadoop parameters (apache#10013)

commit 19b3e7f
Author: leixin <[email protected]>
Date:   Wed Nov 15 09:24:29 2023 +0800

    [Minor] Throws an exception when using bulk_insert and stream mode (apache#10082)

    Co-authored-by: leixin1 <[email protected]>
Labels
priority:critical (production down; pipelines stalled; Need help asap.)
release-0.14.1
Projects
Status: ✅ Done
3 participants