[HUDI-7098] Adding max bytes per partition with cloud stores source in DS #10100
Merged
hey @lokeshj1703: can you review the patch?
nsivabalan added the priority:critical (production down; pipelines stalled; Need help asap.) label Nov 15, 2023
xushiyan reviewed Nov 16, 2023
...ities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudStoreIngestionConfig.java
@xushiyan: good to review again.
xushiyan approved these changes Nov 18, 2023
rerunning CI
nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Nov 23, 2023
jonvex pushed a commit to jonvex/hudi that referenced this pull request Nov 29, 2023
commit dfa3bde Merge: bfc0a85 473cf9a Author: Jonathan Vexler <=> Date: Wed Nov 29 15:01:45 2023 -0500 Merge branch 'master' into fg_reader_implement_bootstrap commit bfc0a85 Author: Jonathan Vexler <=> Date: Wed Nov 29 14:55:57 2023 -0500 fix bug with nested required fields due to spark nested schema pruning bug commit 473cf9a Author: Rajesh Mahindra <[email protected]> Date: Wed Nov 29 08:37:40 2023 -0800 [HUDI-7138] Fix error table writer and schema registry provider (apache#10173) --------- Co-authored-by: rmahindra123 <[email protected]> commit 91eabab Author: Lin Liu <[email protected]> Date: Tue Nov 28 23:49:37 2023 -0800 [HUDI-7103] Support time travel queies for COW tables (apache#10109) This is based on HadoopFsRelation. commit b300728 Author: Rajesh Mahindra <[email protected]> Date: Tue Nov 28 22:31:12 2023 -0800 [HUDI-7086] Fix the default for gcp pub sub max sync time to 1min (apache#10171) Co-authored-by: rmahindra123 <[email protected]> commit 8370c62 Author: Shiyan Xu <[email protected]> Date: Tue Nov 28 22:31:34 2023 -0600 [HUDI-7149] Add a dbt example project with CDC capability (apache#10192) commit 817d81a Author: zhuanshenbsj1 <[email protected]> Date: Wed Nov 29 11:46:20 2023 +0800 [MINOR] Add log to print wrong number of instant metadata files (apache#10196) commit cadeade Author: leixin <[email protected]> Date: Wed Nov 29 11:45:24 2023 +0800 [minor] when metric prefix length is 0 ignore the metric prefix (apache#10190) Co-authored-by: leixin1 <[email protected]> commit 91daa7d Author: Lin Liu <[email protected]> Date: Tue Nov 28 19:03:50 2023 -0800 [HUDI-7102] Fix bugs related to time travel queries (apache#10102) commit d1dfa5b Author: Dongsj <[email protected]> Date: Wed Nov 29 10:49:38 2023 +0800 [HUDI-7148] Add an additional fix to the potential thread insecurity problem of heartbeat client (apache#10188) Co-authored-by: dongsj <[email protected]> commit b0b711e Author: Jonathan Vexler <=> Date: Tue Nov 28 21:35:20 2023 -0500 nested 
schema kinda fix commit 77cfb3a Author: YueZhang <[email protected]> Date: Wed Nov 29 09:46:53 2023 +0800 [HUDI-7147] Fix CDC write flush bug (apache#10186) * Using iterator instead of values to avoid unsupported operation exception * check style commit b144ee0 Author: Jon Vexler <[email protected]> Date: Tue Nov 28 14:23:46 2023 -0500 Update hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala Co-authored-by: Sagar Sumit <[email protected]> commit 89fab14 Author: Jonathan Vexler <=> Date: Tue Nov 28 14:23:03 2023 -0500 fix failing tests and address some of sagar pr review commit 675abf1 Author: Tim Brown <[email protected]> Date: Mon Nov 27 23:21:56 2023 -0600 [MINOR] Schema Converter should use default identity transform if not specified (apache#10178) commit 5450aff Author: Jonathan Vexler <=> Date: Mon Nov 27 22:21:06 2023 -0500 disable vector for bootstrap commit fb062df Author: Danny Chan <[email protected]> Date: Tue Nov 28 10:52:33 2023 +0800 [Minor] Fix the flaky tests in TestRemoteHoodieTableFileSystemView (apache#10179) commit 3ae4d30 Author: Jonathan Vexler <=> Date: Mon Nov 27 21:07:17 2023 -0500 fix various issues that caused failing tests commit a045da6 Author: Jonathan Vexler <=> Date: Mon Nov 27 18:00:46 2023 -0500 see if this works commit 91be81a Author: Jonathan Vexler <=> Date: Mon Nov 27 17:07:30 2023 -0500 use java to create unary operator commit c22d1db Merge: 38b2603 4c3a1db Author: Jonathan Vexler <=> Date: Mon Nov 27 15:56:39 2023 -0500 Merge branch 'master' into fg_reader_implement_bootstrap commit 38b2603 Author: Jonathan Vexler <=> Date: Mon Nov 27 15:42:22 2023 -0500 set precombine in test commit 2a9a363 Author: Jonathan Vexler <=> Date: Mon Nov 27 13:27:38 2023 -0500 try to fix scala2.11 unary operator issue commit 60bdf14 Author: Jonathan Vexler <=> Date: Mon Nov 27 13:02:16 2023 -0500 try fix ci commit 4c3a1db Author: majian 
<[email protected]> Date: Mon Nov 27 16:44:25 2023 +0800 [HUDI-7110][FOLLOW-UP] Improve call procedure for show column stats information (apache#10169) commit 499423c Author: zhuanshenbsj1 <[email protected]> Date: Sun Nov 26 10:13:46 2023 +0800 [HUDI-7041] Optimize the memory usage of timeline server for table service (apache#10002) commit 4f875ed Author: Y Ethan Guo <[email protected]> Date: Sat Nov 25 15:10:37 2023 -0800 [HUDI-7139] Fix operation type for bulk insert with row writer in Hudi Streamer (apache#10175) This commit fixes the bug which causes the `operationType` to be null in the commit metadata of bulk insert operation with row writer enabled in Hudi Streamer (`hoodie.datasource.write.row.writer.enable=true`). `HoodieStreamerDatasetBulkInsertCommitActionExecutor` is updated so that `#preExecute` and `#afterExecute` should run the same logic as regular bulk insert operation without row writer. commit 332e7e8 Author: harshal <[email protected]> Date: Sat Nov 25 14:04:29 2023 +0530 [HUDI-7006] Reduce unnecessary is_empty rdd calls in StreamSync (apache#10158) --------- Co-authored-by: sivabalan <[email protected]> commit 86232d2 Author: Sivabalan Narayanan <[email protected]> Date: Thu Nov 23 19:27:50 2023 -0800 [HUDI-7095] Making perf enhancements to JSON serde (apache#10097) commit a7fd27c Author: Sivabalan Narayanan <[email protected]> Date: Thu Nov 23 19:20:01 2023 -0800 [HUDI-7086] Scaling gcs event source (apache#10073) - Scaling gcs event source --------- Co-authored-by: rmahindra123 <[email protected]> commit bb42c4b Author: Sivabalan Narayanan <[email protected]> Date: Thu Nov 23 18:33:32 2023 -0800 [HUDI-7097] Fix instantiation of Hms Uri with HiveSync tool (apache#10099) commit 0b7f47a Author: Jonathan Vexler <=> Date: Thu Nov 23 16:27:36 2023 -0500 decently working commit bcb974b Author: VitoMakarevich <[email protected]> Date: Thu Nov 23 11:22:14 2023 +0100 [HUDI-7034] Fix refresh table/view (apache#10151) * [HUDI-7034] Refresh index fix - 
remove cached file slices within partitions --------- Co-authored-by: vmakarevich <[email protected]> Co-authored-by: Sagar Sumit <[email protected]> commit b77eff2 Author: Lokesh Jain <[email protected]> Date: Thu Nov 23 10:47:40 2023 +0530 [HUDI-7120] Performance improvements in deltastreamer executor code path (apache#10135) commit 405be17 Author: Sivabalan Narayanan <[email protected]> Date: Wed Nov 22 21:00:33 2023 -0800 [MINOR] Making misc fixes to deltastreamer sources(S3 and GCS) (apache#10095) * Making misc fixes to deltastreamer sources * Fixing test failures * adding inference to CloudSourceconfig... cloud.data.datafile.format * Fix the tests for s3 events source * Fix the tests for s3 events source --------- Co-authored-by: rmahindra123 <[email protected]> commit 3d21285 Author: Tim Brown <[email protected]> Date: Wed Nov 22 22:51:14 2023 -0600 [HUDI-7112] Reuse existing timeline server and performance improvements (apache#10122) - Reuse timeline server across tables. --------- Co-authored-by: sivabalan <[email protected]> commit 72ff9a7 Author: Rajesh Mahindra <[email protected]> Date: Wed Nov 22 20:49:15 2023 -0800 [HUDI-7052] Fix partition key validation for custom key generators. 
(apache#10014) --------- Co-authored-by: rmahindra123 <[email protected]> commit 8d6d043 Author: majian <[email protected]> Date: Thu Nov 23 10:08:17 2023 +0800 [HUDI-7110] Add call procedure for show column stats information (apache#10120) commit aabaa99 Author: huangxiaoping <[email protected]> Date: Thu Nov 23 09:06:45 2023 +0800 [MINOR] Remove unused import (apache#10159) commit f88a73f Author: Y Ethan Guo <[email protected]> Date: Wed Nov 22 10:48:48 2023 -0800 [HUDI-7123] Improve CI scripts (apache#10136) Improves the CI scripts in the following aspects: - Removes `hudi-common` tests from `test-spark` job in GH CI as they are already covered by Azure CI - Removes unnecesary bundle validation jobs and adds new bundle validation images (`flink1153hive313spark323`, `flink1162hive313spark331`) - Updates `validate-release-candidate-bundles` jobs - Moves functional tests of `hudi-spark-datasource/hudi-spark` from job 4 (3 hours) to job 2 (1 hour) in Azure CI to rebalance the finish time. 
commit 38c87b7 Author: harshal <[email protected]> Date: Wed Nov 22 20:53:42 2023 +0530 [HUDI-7004] Add support of snapshotLoadQuerySplitter in s3/gcs sources (apache#10152) commit d0edfb5 Author: Sivabalan Narayanan <[email protected]> Date: Wed Nov 22 10:22:53 2023 -0500 [HUDI-6961] Fixing DefaultHoodieRecordPayload to honor deletion based on meta field as well as custome delete marker (apache#10150) - Fixing DefaultHoodieRecordPayload to honor deletion based on meta field as well as custom delete marker across all delete apis commit cda9dbc Author: Jing Zhang <[email protected]> Date: Wed Nov 22 18:04:39 2023 +0800 [HUDI-7129] Fix bug when upgrade from table version three using UpgradeOrDowngradeProcedure (apache#10147) commit 18f7181 Author: Shiyan Xu <[email protected]> Date: Wed Nov 22 02:00:27 2023 -0600 [HUDI-7133] Improve dbt example for better guidance (apache#10155) commit c5af85d Author: Sivabalan Narayanan <[email protected]> Date: Wed Nov 22 01:33:49 2023 -0500 [HUDI-7096] Improving incremental query to fetch partitions based on commit metadata (apache#10098) commit 2522f6d Author: xuzifu666 <[email protected]> Date: Wed Nov 22 11:53:21 2023 +0800 [HUDI-7128] DeleteMarkerProcedures support delete in batch mode (apache#10148) Co-authored-by: xuyu <[email protected]> commit a1afcdd Author: Tim Brown <[email protected]> Date: Tue Nov 21 14:58:12 2023 -0600 [HUDI-7115] Add in new options for the bigquery sync (apache#10125) - Add in new options for the bigquery sync commit 35cd873 Author: Sivabalan Narayanan <[email protected]> Date: Tue Nov 21 13:11:21 2023 -0500 [HUDI-7084] Fixing schema retrieval for table w/ no commits (apache#10069) * Fixing schema retrieval for table w/ no commits * fixing compilation failure commit 74793d5 Author: Rajesh Mahindra <[email protected]> Date: Tue Nov 21 09:53:12 2023 -0800 [HUDI-7106] Fix sqs deletes, deltasync service close and error table default configs. 
(apache#10117) Co-authored-by: rmahindra123 <[email protected]> commit b981877 Author: harshal <[email protected]> Date: Tue Nov 21 22:52:28 2023 +0530 [HUDI-7003] Add option to fallback to full table scan if files are deleted due to cleaner (apache#9941) commit 600fd4d Author: Akira Ajisaka <[email protected]> Date: Wed Nov 22 01:24:37 2023 +0900 [HUDI-6734] Add back HUDI-5409: Avoid file index and use fs view cache in COW input format (apache#9567) * [HUDI-6734] Add back HUDI-5409: Avoid file index and use fs view cache in COW input format This reverts commit 2567ada. Conflicts: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieMergeOnReadTableInputFormat.java * Always use file index if files partition is available --------- Co-authored-by: Sagar Sumit <[email protected]> commit 9e2500c Author: Sivabalan Narayanan <[email protected]> Date: Tue Nov 21 09:55:23 2023 -0500 [HUDI-7083] Adding support for multiple tables with Prometheus Reporter (apache#10068) * Adding support for multiple tables with Prometheus Reporter * Fixing closure of http server * Remove entry from port-collector registry map after stopping http server --------- Co-authored-by: Sagar Sumit <[email protected]> commit baffe1d Author: Sivabalan Narayanan <[email protected]> Date: Tue Nov 21 09:32:39 2023 -0500 [MINOR] Misc fixes in deltastreamer (apache#10067) commit 0c4f3a3 Author: Sivabalan Narayanan <[email protected]> Date: Tue Nov 21 02:17:13 2023 -0500 [HUDI-7127] Fixing set up and tear down in tests (apache#10146) commit eaba114 Author: Akira Ajisaka <[email protected]> Date: Tue Nov 21 11:37:47 2023 +0900 [HUDI-7107] Reused MetricsReporter fails to publish metrics in Spark streaming job (apache#10132) commit 578e756 Author: Jing Zhang <[email protected]> Date: Tue Nov 21 10:04:33 2023 +0800 [HUDI-7118] Set conf 'spark.sql.parquet.enableVectorizedReader' to true automatically only if 
the value is not explicitly set (apache#10134) commit d24220a Author: Jing Zhang <[email protected]> Date: Tue Nov 21 09:56:07 2023 +0800 [HUDI-7111] Fix performance regression of tag when written into simple bucket index table (apache#10130) commit 84990ae Author: Rajesh Mahindra <[email protected]> Date: Mon Nov 20 11:17:45 2023 -0800 Fix schema refresh for KafkaAvroSchemaDeserializer (apache#10118) Co-authored-by: rmahindra123 <[email protected]> commit 979132b Author: majian <[email protected]> Date: Mon Nov 20 10:43:11 2023 +0800 [HUDI-7099] Providing metrics for archive and defining some string constants (apache#10101) commit 3225625 Author: Fabio Buso <[email protected]> Date: Mon Nov 20 03:19:41 2023 +0100 [MINOR] Add Hopsworks File System to StorageSchemes (apache#10141) commit 3913dca Author: Sivabalan Narayanan <[email protected]> Date: Sat Nov 18 23:50:37 2023 -0500 [HUDI-7098] Add max bytes per partition with cloud stores source in DS (apache#10100) commit 4c295b2 Author: hehuiyuan <[email protected]> Date: Sun Nov 19 09:43:52 2023 +0800 [HUDI-7119] Don't write precombine field to hoodie.properties when the ts field does not exist for append mode (apache#10133) commit b2f4493 Author: Jing Zhang <[email protected]> Date: Sun Nov 19 09:35:54 2023 +0800 [HUDI-7072] Remove support for Flink 1.13 (apache#10052) commit dfe1674 Author: Sagar Lakshmipathy <[email protected]> Date: Fri Nov 17 18:43:07 2023 -0800 [Minor] Fixed twitter link to redirect to twitter (apache#10139) commit f58d9cb Author: Jonathan Vexler <=> Date: Fri Nov 17 18:10:00 2023 -0500 current point commit 184858b Author: Jonathan Vexler <=> Date: Fri Nov 17 16:21:56 2023 -0500 non-working. 
Want to review with team that this makes sense commit 8240b6a Author: Y Ethan Guo <[email protected]> Date: Fri Nov 17 11:20:57 2023 -0800 [HUDI-7113] Update release scripts and docs for Spark 3.5 support (apache#10123) commit 216aeb4 Author: Danny Chan <[email protected]> Date: Fri Nov 17 14:35:17 2023 +0800 [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 (apache#10126) commit 3d0c450 Author: YueZhang <[email protected]> Date: Fri Nov 17 09:48:59 2023 +0800 [HUDI-7109] Fix Flink may re-use a committed instant in append mode (apache#10119) commit f06ff5b Author: hehuiyuan <[email protected]> Date: Fri Nov 17 09:43:21 2023 +0800 [HUDI-7090] Set the maxParallelism for singleton operator (apache#10090) commit faa73e9 Author: Y Ethan Guo <[email protected]> Date: Thu Nov 16 12:12:22 2023 -0800 [MINOR] Disable failed test on master (apache#10124) commit 6cc39bf Author: Sivabalan Narayanan <[email protected]> Date: Thu Nov 16 06:00:54 2023 -0500 [MINOR] Removing unnecessary guards to row writer (apache#10004) commit 4ea752f Author: voonhous <[email protected]> Date: Thu Nov 16 16:53:28 2023 +0800 [MINOR] Modified description to include missing trigger strategy (apache#10114) commit 874b5de Author: Shawn Chang <[email protected]> Date: Wed Nov 15 21:57:14 2023 -0800 [HUDI-6806] Support Spark 3.5.0 (apache#9717) --------- Co-authored-by: Shawn Chang <[email protected]> Co-authored-by: Y Ethan Guo <[email protected]> commit 35af64d Author: Shawn Chang <[email protected]> Date: Wed Nov 15 18:36:42 2023 -0800 [Minor] Throw exceptions when cleaner/compactor fail (apache#10108) Co-authored-by: Shawn Chang <[email protected]> commit bada5d9 Author: Shawn Chang <[email protected]> Date: Wed Nov 15 16:50:38 2023 -0800 [HUDI-5936] Fix serialization problem when FileStatus is not serializable (apache#10065) Co-authored-by: Shawn Chang <[email protected]> commit dcd5a81 Author: majian <[email protected]> Date: Wed Nov 15 16:10:15 2023 +0800 [HUDI-7069] Optimize metaclient 
construction and include table config options (apache#10048) commit f218e54 Author: Jing Zhang <[email protected]> Date: Wed Nov 15 16:07:04 2023 +0800 [MINOR] Add detailed error logs in RunCompactionProcedure (apache#10070) * add detailed error logs in RunCompactionProcedure * only print 100 error file paths into logs commit 2185abb Author: Jing Zhang <[email protected]> Date: Wed Nov 15 16:03:23 2023 +0800 [HUDI-7094] AlterTableAddColumnCommand/AlterTableChangeColumnCommand update table with ro/rt suffix (apache#10094) commit abd3afc Author: Hussein Awala <[email protected]> Date: Wed Nov 15 06:55:47 2023 +0200 [HUDI-6695] Use the AWS provider chain in Glue sync and add a new provider for STS assume role (apache#9260) commit 424e0ce Author: chao chen <[email protected]> Date: Wed Nov 15 12:20:10 2023 +0800 [HUDI-7050] Flink HoodieHiveCatalog supports hadoop parameters (apache#10013) commit 19b3e7f Author: leixin <[email protected]> Date: Wed Nov 15 09:24:29 2023 +0800 [Minor] Throws an exception when using bulk_insert and stream mode (apache#10082) Co-authored-by: leixin1 <[email protected]>
Change Logs
Adds a max bytes per partition setting to the cloud stores source in DS.
Impact
Adds a max bytes per partition setting to the cloud stores source in DS. This should help the cloud stores source scale for large ingests.
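For reviewers who want a picture of what a per-partition byte cap implies, here is a minimal sketch of one way such a cap could translate into a partition count. It is not taken from the patch: the class `PartitionSizing`, the method `numPartitions`, and both parameter names are hypothetical, and the assumption is simply that the source knows the total size of the files in a batch and divides it by the cap, rounding up.

```java
// Illustrative sketch only; not the code in this PR.
// PartitionSizing, numPartitions, totalBytes and maxBytesPerPartition
// are hypothetical names chosen for this example.
final class PartitionSizing {
    private PartitionSizing() {
    }

    /**
     * Returns how many partitions are needed so that, on average, no
     * partition holds more than maxBytesPerPartition bytes.
     */
    static int numPartitions(long totalBytes, long maxBytesPerPartition) {
        if (maxBytesPerPartition <= 0) {
            throw new IllegalArgumentException("maxBytesPerPartition must be positive");
        }
        if (totalBytes <= 0) {
            // An empty batch still needs one partition to stay valid.
            return 1;
        }
        // Ceiling division written without addition to avoid long overflow.
        long quotient = totalBytes / maxBytesPerPartition;
        long partitions = (totalBytes % maxBytesPerPartition == 0) ? quotient : quotient + 1;
        return (int) Math.min(Integer.MAX_VALUE, partitions);
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024;
        long cap = 128L * 1024 * 1024; // assumed cap of 128 MiB, for illustration
        System.out.println(numPartitions(tenGiB, cap));
    }
}
```

With these illustrative numbers, a 10 GiB batch under a 128 MiB cap yields 80 partitions. A larger batch then gets proportionally more partitions instead of a fixed count, which is the scaling behaviour the Impact note describes.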
Risk level (write none, low, medium or high below)
none
Documentation Update
Contributor's checklist