forked from apache/hudi
merge upstream master #1
Merged
Conversation
…ith hudi incr source (#7132)
… path (#6358) Currently, HoodieParquetReader does not specify the projected schema properly when reading Parquet files, which ends up failing when the provided schema is not equal to the schema of the file being read (even though it might be a proper projection, i.e. a subset of it), as in the common CDC case where a column is dropped from the RDS and the schema of the new batch lacks the old column. To address the original issue described in HUDI-4588, we also have to relax the constraint imposed by TableSchemaResolver.isSchemaCompatible that does not allow columns to be evolved by dropping them. After addressing the original problem, a considerable number of new issues were discovered, including the following:
- Writer schemas were deduced differently for different flows (for example, Bulk Insert vs other operations)
- The writer's schema was not reconciled against the table's schema in terms of nullability (this is asserted within AvroSchemaConverter, which is now invoked as part of the projection)
- (After enabling schema validations) Incorrect schema handling within HoodieWriteHandle was detected (there were two ambiguous schema references, set incorrectly, creating confusion)
- (After enabling schema validations) Incorrect schema handling within MergeHelper implementations, where the writer's schema was used as an existing file's reader schema (failing when the two diverge)
Changes:
- Add the missing schema projection when reading Parquet files (using AvroParquetReader; see the projection sketch below)
- Relax schema evolution constraints to allow columns to be dropped
- Revisit the schema reconciliation logic to make sure it is coherent
- Streamline schema handling in HoodieSparkSqlWriter so it is uniform for all operations (it is not applied properly for Bulk Insert at the moment)
- Add a comprehensive test for basic schema evolution (columns being added and dropped)
- Fix HoodieWriteHandle implementations to properly handle the writer schema and avoid duplication
- Fix MergeHelper implementations to properly handle schema evolution
Co-authored-by: zhuanshenbsj1 <[email protected]>
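A minimal sketch of the projection idea referenced in the commit above, assuming parquet-avro's AvroParquetReader and AvroReadSupport (the classes named in the commit): the requested subset schema is handed to the reader so reading succeeds even when the file contains columns the projection omits. The helper class and method names here are illustrative, not the actual Hudi change.
```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ProjectedParquetRead {
  // Open a Parquet file with an explicitly requested (projected) Avro schema, so
  // that reading works even when the file has columns the projection omits.
  public static ParquetReader<GenericRecord> open(Path file, Schema projectedSchema) throws IOException {
    Configuration conf = new Configuration();
    AvroReadSupport.setRequestedProjection(conf, projectedSchema);
    return AvroParquetReader.<GenericRecord>builder(file)
        .withConf(conf)
        .build();
  }
}
```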
…service fails (#7243) After the files are written, table services like clustering and compaction can fail, which causes the sync to the metaserver to be skipped. This patch adds a config that, when set to false, prevents the deltastreamer from failing in that case, so the sync to the metaserver still occurs; a warning is logged with the exception that occurred. To use this new behavior, set hoodie.fail.writes.on.inline.table.service.exception to false. Co-authored-by: Jonathan Vexler <=>
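A small sketch, based only on the config name quoted above, of how the new behavior might be opted into via deltastreamer properties; the surrounding wiring is illustrative.
```java
import java.util.Properties;

public class InlineTableServiceFailureConfig {
  // Build properties that keep the deltastreamer alive when an inline table
  // service (clustering/compaction) fails; the failure is then only logged.
  public static Properties props() {
    Properties props = new Properties();
    props.setProperty("hoodie.fail.writes.on.inline.table.service.exception", "false");
    return props;
  }
}
```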
…orming a LATEST streaming read (#6920)
…cords issue if it contains delta files while still splittable (#7264)
…tFoundException of InLineFileSystem (#7124)
… flows (#7230) Add good test coverage for some of the core user flows w/ spark data source writes.
) Addressing an invalid semantic of MOR iterators: they were not actually idempotent, i.e. calling `hasNext` multiple times advanced the iterator and could therefore skip elements, for example in cases like:
```
// [[isEmpty]] will invoke [[hasNext]] to check whether the Iterator has any elements
if (iter.isEmpty) {
  // ...
} else {
  // Here [[map]] will invoke [[hasNext]] again, therefore skipping the elements
  iter.map { /* ... */ }
}
```
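A hedged sketch of the look-ahead pattern that makes `hasNext` idempotent (this shows the general technique, not necessarily the exact shape of the MOR iterator fix): the next element is fetched at most once and cached until `next` consumes it.
```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Illustrative only: an iterator whose hasNext() can be called any number of
// times without advancing the underlying source.
public abstract class IdempotentIterator<T> implements Iterator<T> {
  private T buffered;
  private boolean hasBuffered;

  // Subclasses advance the underlying source here; return null when exhausted
  // (null elements are not supported in this sketch).
  protected abstract T readNext();

  @Override
  public boolean hasNext() {
    if (!hasBuffered) {
      buffered = readNext();
      hasBuffered = buffered != null;
    }
    return hasBuffered;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    hasBuffered = false;
    T result = buffered;
    buffered = null;
    return result;
  }
}
```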
…eIterator (#7340) * Unify RecordIterator and HoodieParquetReader with ClosableIterator * Add a factory clazz for RecordIterator * Add more documents
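For context, a closable iterator in this spirit is simply an `Iterator` that is also `AutoCloseable`; a minimal sketch follows (the exact Hudi interface may differ).
```java
import java.util.Iterator;

// Illustrative sketch: lets record iterators backed by file readers release
// their resources, e.g. via try-with-resources.
public interface ClosableIterator<T> extends Iterator<T>, AutoCloseable {
  @Override
  void close(); // narrowed so callers need not handle a checked exception
}
```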
Fixes deploy_staging_jars.sh to generate all hudi-utilities-slim-bundle jars.
Co-authored-by: hbg <[email protected]>
Cleaning up some of the recently introduced configs:
- Shortening the file-listing mode override for Spark's FileIndex
- Fixing the Disruptor's write buffer limit config
- Scoping the CANONICALIZE_NULLABLE config to HoodieSparkSqlWriter
…esystem instance instead (#7685) Should use `writeClient.getHoodieTable().getHoodieView()` to determine the fileSystemView
…cluding exception) (#7799) It looks like we fail to close the writeClient in some of the failure cases when writing via spark-ds and spark-sql writes.
…m data table timeline w/ metadata reads (#7798) Fixing the metadata table to read rollback info even with an empty rollback-completed meta file.
Co-authored-by: jameswei <[email protected]>
This change addresses a few performance regressions in `HoodieSparkRecord` identified during our recent benchmarking:
1. `HoodieSparkRecord` rewrites records using `rewriteRecord` and `rewriteRecordWithNewSchema`, which traverse the schema for every record. Instead, we should traverse the schema only once and produce a transformer that directly creates the new record from the old one.
2. `HoodieRecord`s can currently be rewritten multiple times even when only meta-fields need to be mixed into the schema (in that case, `HoodieSparkRecord` simply wraps the source `InternalRow` into a `HoodieInternalRow` holding the meta-fields). This is problematic because a) `UnsafeProjection` re-uses a mutable row (as a buffer) to avoid allocating small objects, leading to b) recursive overwriting of the same row.
3. Records are currently copied for every Executor, even for the Simple one, which does not buffer any records and therefore does not require records to be copied.
To address the aforementioned gaps, the following changes have been implemented:
1. The row-writing utils have been revisited to decouple `RowWriter` generation from its actual application to the source row, making the per-row application much more efficient (see the sketch below). A considerable number of row-writing utilities have also been eliminated as purely duplicative.
2. The `HoodieRecord.rewriteRecord` API is renamed to `prependMetaFields` to clearly disambiguate it from `rewriteRecordWithSchema`.
3. The `WriteHandle` and `HoodieMergeHelper` implementations are substantially simplified and streamlined while being rebased onto `prependMetaFields`.
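A minimal sketch of point 1 above, the "traverse the schema once, apply per record" idea; the `Row` type and helper names are placeholders, not Hudi's actual row-writing utilities.
```java
import java.util.function.UnaryOperator;

public final class RowWriterExample {
  // Stand-in record type, for illustration only.
  public static final class Row {
    final Object[] values;
    Row(Object[] values) { this.values = values; }
  }

  // Expensive step, done once per write: inspect the target shape (here just the
  // number of meta-fields to prepend) and produce a reusable transformer.
  public static UnaryOperator<Row> buildRowWriter(int metaFieldCount) {
    return row -> {
      Object[] withMeta = new Object[metaFieldCount + row.values.length];
      System.arraycopy(row.values, 0, withMeta, metaFieldCount, row.values.length);
      return new Row(withMeta); // meta-field slots left null here, to be filled later
    };
  }

  public static void main(String[] args) {
    UnaryOperator<Row> writer = buildRowWriter(5);                   // once per write
    Row rewritten = writer.apply(new Row(new Object[] {"key", 42})); // per record
    System.out.println(rewritten.values.length);                     // prints 7
  }
}
```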
…alter table reports an error (#7706) Co-authored-by: danny0405 <[email protected]>
…InstantTime/RunClean/RunCompactionProcedure (#7655)
…lt (#7787) [HUDI-5646] Guard dropping columns by a config, and do not allow it by default (a small sketch of the guard follows below):
- Replaced the superfluous `isSchemaCompatible` override by explicitly specifying whether column drops should be allowed
- Revisited `HoodieSparkSqlWriter` to avoid (unnecessary) schema handling for delete operations
- Remove meta-fields from the latest table schema during analysis
- Disable schema validation when partition columns are dropped
Co-authored-by: Alexey Kudinkin <[email protected]> Co-authored-by: sivabalan <[email protected]>
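A hedged sketch of the guard described above: column drops are rejected unless explicitly allowed. Names here are illustrative; the real check lives in Hudi's schema compatibility/validation code.
```java
import java.util.HashSet;
import java.util.Set;

public final class ColumnDropGuardExample {
  // Returns true only if every column of the old schema survives, unless the
  // caller explicitly allows columns to be dropped.
  public static boolean isCompatible(Set<String> oldColumns, Set<String> newColumns,
                                     boolean allowColumnDrops) {
    if (allowColumnDrops) {
      return true; // dropping columns explicitly permitted by config
    }
    Set<String> missing = new HashSet<>(oldColumns);
    missing.removeAll(newColumns);
    return missing.isEmpty(); // by default, no column may disappear
  }
}
```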
…fault (#7813)
- Remove the `COMBINE_BEFORE_INSERT` config being overridden for insert operations
- Revisit the Spark SQL feature configuration to allow a dichotomy of (feature-)specific "default" configuration that can be overridden by the user, and "overriding" configuration that can NOT be overridden by the user (see the sketch below)
- Restore the existing behavior for Insert Into of deduplicating by default (if a pre-combine field is specified)
- Fix compilation
- Fix compilation (one more time)
- Fix the option combination ordering
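A small sketch of the configuration dichotomy described above, assuming plain string maps: feature defaults lose to user options, which in turn lose to non-overridable options. The class and method names are illustrative.
```java
import java.util.HashMap;
import java.util.Map;

public final class SqlWriteConfigLayering {
  // Precedence: feature defaults < user-provided options < non-overridable options.
  public static Map<String, String> combine(Map<String, String> featureDefaults,
                                            Map<String, String> userOptions,
                                            Map<String, String> overriding) {
    Map<String, String> combined = new HashMap<>(featureDefaults); // lowest precedence
    combined.putAll(userOptions);                                  // user beats defaults
    combined.putAll(overriding);                                   // always wins
    return combined;
  }
}
```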
… helper (#7818) `deduceShuffleParallelism` returns 0 in some situations, which should never happen.
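A hedged sketch of the guard implied by the fix: never hand back a shuffle parallelism of 0. The method name mirrors the commit message; the logic is illustrative, not the actual helper.
```java
public final class ShuffleParallelismGuard {
  // Pick an explicit setting if present, otherwise what was inferred from the
  // input, and never return 0 as the shuffle parallelism.
  public static int deduceShuffleParallelism(int configured, int inferredFromInput, int fallback) {
    int deduced = configured > 0 ? configured : inferredFromInput;
    return deduced > 0 ? deduced : fallback;
  }
}
```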
Tips
What is the purpose of the pull request
(For example: This pull request adds quick-start document.)
Brief change log
(for example:)
Verify this pull request
(Please pick one of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking the change into sub-tasks under an umbrella JIRA.