
merge upstream master #1

Merged
merged 2,294 commits into pushpavanthar:master on Feb 2, 2023

Conversation

pushpavanthar
Owner

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick one of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking them into sub-tasks under an umbrella JIRA.

nsivabalan and others added 30 commits November 22, 2022 22:34
* [HUDI-5252] ClusteringCommitSink supports rolling back clustering
…7269)

* For COW table STRICT insert mode, PK uniqueness should be honored irrespective of precombine field.
… path (#6358)

Currently, HoodieParquetReader does not specify the projected schema properly when reading Parquet files, which ends up failing when the provided schema is not equal to the schema of the file being read (even though it might be a proper projection, i.e. a subset of it), as in the common CDC case where a column is dropped from the RDS and the schema of the new batch lacks the old column (a sketch of such a projected read follows the change list below).

To address the original issue described in HUDI-4588, we also have to relax the constraints imposed by the TableSchemaResolver.isSchemaCompatible method, which does not allow columns to be evolved by way of dropping columns.

After addressing the original problem, a considerable number of new issues were discovered, including the following:

 - Writer schemas were deduced differently for different flows (for example, Bulk Insert vs other ops)
 - The writer's schema was not reconciled against the table's schema in terms of nullability (this is asserted within AvroSchemaConverter, which is now invoked as part of the projection)
 - (After enabling schema validations) Incorrect schema handling within HoodieWriteHandle was detected (there were 2 ambiguous schema references, set incorrectly, creating confusion)
 - (After enabling schema validations) Incorrect schema handling within MergeHelper impls, where the writer's schema was used as an existing file's reader schema (failing when these two diverge)

Changes:

 - Adding missing schema projection when reading Parquet file (using AvroParquetReader)
 - Relaxing schema evolution constraints to allow columns to be dropped
 - Revisiting schema reconciliation logic to make sure it's coherent
 - Streamlining schema handling in HoodieSparkSqlWriter to make sure it's uniform for all operations (it isn't applied properly for Bulk-insert at the moment)
 - Added comprehensive test for Basic Schema Evolution (columns being added, dropped)
 - Fixing HoodieWriteHandle impls to properly handle writer schema and avoid duplication
 - Fixing MergeHelper impls to properly handle schema evolution
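A minimal sketch (not the Hudi implementation) of the projected-read idea from the first change above: reading a Parquet file through a narrower Avro schema with `AvroParquetReader`. The `ProjectedParquetRead` object, the path handling, and the use of `AvroReadSupport.setAvroReadSchema` are illustrative assumptions.

```
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}

object ProjectedParquetRead {
  // Read a Parquet file while projecting onto a (possibly narrower) Avro schema.
  // `projectedSchema` may omit columns that exist in the file, e.g. after a column drop.
  def readProjected(path: String, projectedSchema: Schema): Seq[GenericRecord] = {
    val conf = new Configuration()
    // Ask parquet-avro to materialize only the requested columns
    AvroReadSupport.setRequestedProjection(conf, projectedSchema)
    AvroReadSupport.setAvroReadSchema(conf, projectedSchema)
    val reader = AvroParquetReader.builder[GenericRecord](new Path(path)).withConf(conf).build()
    try Iterator.continually(reader.read()).takeWhile(_ != null).toVector
    finally reader.close()
  }
}
```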
…service fails (#7243)

After the files are written, table services like clustering and compaction can fail, which causes the sync to the metaserver not to happen. This patch adds a config which, when set to false, keeps the deltastreamer from failing so that the sync to the metaserver still occurs; a warning is logged with the exception that occurred. To use this new behavior, set hoodie.fail.writes.on.inline.table.service.exception to false.

Co-authored-by: Jonathan Vexler <=>
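A hedged illustration of opting into the new behavior from a Spark datasource write; only the config key quoted in the commit message above comes from the source, while the table name and base path are placeholders.

```
import org.apache.spark.sql.{DataFrame, SaveMode}

object LenientTableServiceWrite {
  // Placeholder table name and base path; only the config key comes from the commit above.
  def write(df: DataFrame, basePath: String): Unit = {
    df.write.format("hudi")
      .option("hoodie.table.name", "my_table")
      // Log a warning and continue instead of failing the write when an inline
      // table service (clustering/compaction) throws.
      .option("hoodie.fail.writes.on.inline.table.service.exception", "false")
      .mode(SaveMode.Append)
      .save(basePath)
  }
}
```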
…cords issue if it contains delta files while still splittable (#7264)
… flows (#7230)

Add good test coverage for some of the core user flows w/ spark data source writes.
)

Addressing invalid semantics of MOR iterators not actually being idempotent: i.e., calling `hasNext` multiple times was advancing the iterator, therefore potentially skipping elements, for example in cases like:

```
// [[isEmpty]] will invoke [[hasNext]] to check whether Iterator has any elements
if (iter.isEmpty) {
  // ...
} else {
  // Here [[map]] will invoke [[hasNext]] again, therefore skipping the elements
  iter.map { /* ... */ }
}
```
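A minimal sketch, independent of Hudi's actual iterators, of the idempotency contract being restored: `hasNext` buffers the next element instead of advancing the underlying iterator on every call.

```
// Wraps any iterator so that repeated `hasNext` calls are side-effect free:
// the next element is buffered rather than the underlying iterator being advanced.
class IdempotentIterator[T](underlying: Iterator[T]) extends Iterator[T] {
  private var buffered: Option[T] = None

  override def hasNext: Boolean = {
    if (buffered.isEmpty && underlying.hasNext) {
      buffered = Some(underlying.next())
    }
    buffered.isDefined
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("next() on empty iterator")
    val elem = buffered.get
    buffered = None
    elem
  }
}
```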
…eIterator (#7340)

* Unify RecordIterator and HoodieParquetReader with ClosableIterator
* Add a factory clazz for RecordIterator
* Add more documents
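A generic sketch of the unified closable-iterator shape described above; the `CloseableRecordIterator` trait and its factory are hypothetical names, not Hudi's `ClosableIterator` API.

```
import java.io.Closeable

// An iterator that also owns a resource (e.g. a file reader) and must be closed by the caller.
trait CloseableRecordIterator[T] extends Iterator[T] with Closeable

object CloseableRecordIterator {
  // Hypothetical factory mirroring the "factory clazz for RecordIterator" mentioned above.
  def wrapping[T](iter: Iterator[T])(onClose: => Unit): CloseableRecordIterator[T] =
    new CloseableRecordIterator[T] {
      override def hasNext: Boolean = iter.hasNext
      override def next(): T = iter.next()
      override def close(): Unit = onClose
    }
}
```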
yihua and others added 29 commits January 28, 2023 22:33
Fixes deploy_staging_jars.sh to generate all hudi-utilities-slim-bundle jars.
Cleaning up some of the recently introduced configs:

 - Shortening file-listing mode override for Spark's FileIndex
 - Fixing Disruptor's write buffer limit config
 - Scoped CANONICALIZE_NULLABLE config to HoodieSparkSqlWriter
)

- Ensures that Hudi CLI commands which require launching Spark can be executed with hudi-cli-bundle
…esystem instance instead (#7685)

Should use `writeClient.getHoodieTable().getHoodieView()` to determine the fileSystemView
…cluding exception) (#7799)

It looks like we fail to close the writeClient in some of the failure cases when writing via spark-ds and spark-sql.
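A minimal, generic sketch of the fix's intent, i.e. closing the write client on failure paths as well; `createWriteClient` and `doWrite` are hypothetical stand-ins for the real calls.

```
object WriteClientLifecycle {
  // `createWriteClient` and `doWrite` are hypothetical stand-ins for the real calls.
  def writeAndClose[C <: AutoCloseable](createWriteClient: () => C)(doWrite: C => Unit): Unit = {
    val client = createWriteClient()
    try doWrite(client)
    finally client.close() // runs on success and on every failure path
  }
}
```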
…m data table timeline w/ metadata reads (#7798)

Fixing the metadata table to read rollback info even with an empty completed rollback meta file.
This change addresses a few performance regressions in `HoodieSparkRecord` identified during our recent benchmarking:

1. `HoodieSparkRecord` rewrites records using `rewriteRecord` and `rewriteRecordWithNewSchema`, which do schema traversals for every record. Instead we should do the schema traversal only once and produce a transformer that directly creates the new record from the old one.

2. `HoodieRecord`s could currently be rewritten multiple times even in cases when just the meta-fields need to be mixed into the schema (in that case, `HoodieSparkRecord` simply wraps the source `InternalRow` into a `HoodieInternalRow` holding the meta-fields). This is problematic because a) `UnsafeProjection` re-uses a mutable row (as a buffer) to avoid allocating small objects, leading to b) recursive overwriting of the same row.

3. Records are currently copied for every Executor, even the Simple one, which does not buffer any records and therefore doesn't require records to be copied.

To address the aforementioned gaps, the following changes have been implemented (a simplified sketch of the first point follows below):

 1. The row-writing utils have been revisited to decouple `RowWriter` generation from its actual application (to the source row); that way the actual application is much more efficient. Additionally, a considerable number of row-writing utilities have been eliminated as they were purely duplicative.

 2. The `HoodieRecord.rewriteRecord` API is renamed to `prependMetaFields` to clearly disambiguate it from `rewriteRecordWithSchema`.

 3. The `WriteHandle` and `HoodieMergeHelper` implementations are substantially simplified and streamlined, accommodating being rebased onto `prependMetaFields`.
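A simplified sketch, using plain Avro `GenericRecord` types, of decoupling the one-time schema traversal from the per-record rewrite as described in point 1; `RecordRewriter` and `buildRewriter` are illustrative names, not Hudi's row-writer utilities.

```
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import scala.collection.JavaConverters._

// Traverse the target schema a single time to build a per-record transformer,
// instead of re-traversing schemas inside a rewrite call for every record.
object RecordRewriter {
  def buildRewriter(source: Schema, target: Schema): GenericRecord => GenericRecord = {
    // One-time traversal: resolve, per target field, where to copy the value from.
    val copiers: Array[GenericRecord => AnyRef] =
      target.getFields.asScala.map { f =>
        Option(source.getField(f.name())) match {
          case Some(srcField) => (r: GenericRecord) => r.get(srcField.pos())
          case None           => (_: GenericRecord) => null.asInstanceOf[AnyRef] // newly added column
        }
      }.toArray

    // Per-record application: no schema traversal, just positional copies.
    (record: GenericRecord) => {
      val out = new GenericData.Record(target)
      var i = 0
      while (i < copiers.length) { out.put(i, copiers(i)(record)); i += 1 }
      out
    }
  }
}
```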
…lt (#7787)

* [HUDI-5646] Guard dropping columns by a config, do not allow by default

* Replaced superfluous `isSchemaCompatible` override by explicitly specifying whether column drop should be allowed;

* Revisited `HoodieSparkSqlWriter` to avoid (unnecessary) schema handling for delete operations

* Remove meta-fields from latest table schema during analysis

* Disable schema validation when partition columns are dropped

---------

Co-authored-by: Alexey Kudinkin <[email protected]>
Co-authored-by: sivabalan <[email protected]>
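A hedged example of explicitly re-enabling column drops after this change; the config key shown is an assumption (verify it against the config classes of your Hudi version), and the table name and base path are placeholders.

```
import org.apache.spark.sql.{DataFrame, SaveMode}

object WriteAllowingColumnDrop {
  // Column drops are disallowed by default after this change; the key below is an
  // assumed name for the guarding config and should be verified against your version.
  def write(df: DataFrame, basePath: String): Unit = {
    df.write.format("hudi")
      .option("hoodie.table.name", "my_table") // placeholder
      .option("hoodie.datasource.write.schema.allow.auto.evolution.column.drop", "true") // assumed key
      .mode(SaveMode.Append)
      .save(basePath)
  }
}
```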
…ource (#7810)

This is restoring existing behavior for the DeltaStreamer Incremental Source, as the change in #7769 removed the _hoodie_partition_path field from the dataset, making it impossible to access from the DS Transformers, for example.
)

This is addressing a misconfiguration of the Kryo object used specifically to serialize Spark's internal structures (like `Expression`s): previously we were using a default `SparkConf` instance to configure it, whereas we should have used the one provided by `SparkEnv`.
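A minimal sketch of the intended wiring, assuming Spark's `KryoSerializer`: build the Kryo instance from the configuration carried by the active `SparkEnv` rather than from a fresh default `SparkConf`. This is not Hudi's actual serializer setup.

```
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkEnv
import org.apache.spark.serializer.KryoSerializer

object KryoFromSparkEnv {
  // Use the configuration actually in effect on this JVM (registrators, reference
  // tracking, etc.) instead of a freshly constructed default SparkConf.
  def newConfiguredKryo(): Kryo =
    new KryoSerializer(SparkEnv.get.conf).newKryo()
}
```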
…fault (#7813)

* Remove `COMBINE_BEFORE_INSERT` config being overridden for insert operations

* Revisited Spark SQL feature configuration to allow a dichotomy of having (see the sketch after this list):
  - (Feature-)specific "default" configuration (that could be overridden by the user)
  - "Overriding" configuration (that could NOT be overridden by the user)

* Restoring existing behavior for Insert Into to deduplicate by default (if pre-combine is specified)

* Fixing compilation

* Fixing compilation (one more time)

* Fixing options combination ordering
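A simplified sketch of the configuration layering described above: feature defaults can be overridden by user-supplied options, while "overriding" configs always win. The map-based composition and the single key shown are illustrative.

```
// Later maps take precedence: user options can override feature defaults,
// but "overriding" configs always win. Keys shown are illustrative.
object SqlWriteOptions {
  def combine(featureDefaults: Map[String, String],
              userOptions: Map[String, String],
              overriding: Map[String, String]): Map[String, String] =
    featureDefaults ++ userOptions ++ overriding

  // e.g. INSERT INTO deduplicates by default when a pre-combine field is configured,
  // yet a user-supplied "false" would still be honored for this key:
  val example: Map[String, String] = combine(
    featureDefaults = Map("hoodie.combine.before.insert" -> "true"),
    userOptions = Map.empty,
    overriding = Map.empty
  )
}
```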
… helper (#7818)

`deduceShuffleParallelism` returns 0 in some situations, which should never occur.
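A minimal sketch of the guard implied above, keeping the deduced shuffle parallelism strictly positive; the fallback order and parameter names are assumptions for illustration only.

```
object ShuffleParallelism {
  // Fallback order and parameter names are assumptions for illustration only.
  def deduce(configured: Int, inputPartitions: Int, sessionDefault: Int): Int = {
    val deduced =
      if (configured > 0) configured
      else if (inputPartitions > 0) inputPartitions
      else sessionDefault
    math.max(deduced, 1) // never return 0
  }
}
```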
pushpavanthar merged commit 49ab579 into pushpavanthar:master on Feb 2, 2023