
Disable bootstrap precombine #1

Open
wants to merge 2,284 commits into master

Conversation

@a49a a49a commented Feb 16, 2023

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

TJX2014 and others added 30 commits September 22, 2022 18:49
…instant (apache#6574)

* Keep a clustering running at the same time
* Simplify filtering logic

Co-authored-by: dongsj <[email protected]>
…e#6550)

As part of adding support for Spark 3.3 in Hudi 0.12, much of the logic
from the Spark 3.2 module was simply copied over.

This PR is rectifying that by:
1. Creating new module "hudi-spark3.2plus-common"
    (that is shared across Spark 3.2 and Spark 3.3)
2. Moving shared components under "hudi-spark3.2plus-common"
…the log file to be too large (apache#6602)

* hoodie.logfile.max.size does not take effect, causing log files to grow too large

Co-authored-by: [email protected] <loukey_7821>
…dd nest type (apache#6486)

InternalSchemaChangeApplier#applyAddChange forgets to remove the parent name when calling ColumnAddChange#addColumns
…6634)

* [HUDI-4813] Fix key generator inference not working on the Spark SQL side

Co-authored-by: xiaoxingstack <[email protected]>
… MOR snapshot query after delete operations with test (apache#6688)

Co-authored-by: Rahil Chertara <[email protected]>
yihua and others added 30 commits December 14, 2022 10:52
…nsert (apache#7396)

This PR adjusts the NONE sort mode for bulk insert so that, by default, coalesce is not applied, matching the default Parquet write behavior. The NONE sort mode still applies coalesce for clustering, since the clustering operation relies on bulk insert and the specified number of output Spark partitions to write a specific number of files.
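
A minimal sketch of the write path described above, assuming a hypothetical table and path: with the sort mode set to NONE, the writer now keeps the incoming Spark partitioning instead of coalescing.

    // Hypothetical sketch: bulk insert with NONE sort mode. After this change,
    // the incoming DataFrame partitioning is preserved (no coalesce) by default.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("bulk-insert-none-sort")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "row-1", 1676505600L)).toDF("id", "name", "ts")

    df.write.format("hudi")
      .option("hoodie.table.name", "demo_bulk")
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      // NONE skips sorting; with this PR it also skips coalesce by default.
      .option("hoodie.bulkinsert.sort.mode", "NONE")
      .mode(SaveMode.Overwrite)
      .save("/tmp/hudi/demo_bulk")
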
If someone has enabled schema on read by mistake and never actually renamed or dropped a column, it should be feasible to disable schema on read. This patch fixes that: essentially, on both the read and write paths, if the "hoodie.schema.on.read.enable" config is not set, Hudi falls back to the regular code path. It might fail, or users might miss data, if they have performed any irrevocable changes like renames; but for the rest, this should work.
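
To illustrate the fallback just described (the path is a placeholder): a reader that never sets the flag takes the regular code path, so disabling schema on read amounts to leaving the config unset.

    // Hypothetical sketch: with "hoodie.schema.on.read.enable" left unset,
    // both read and write paths fall back to the regular (non-evolution) code path.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("schema-on-read-fallback")
      .master("local[*]")
      .getOrCreate()

    // No "hoodie.schema.on.read.enable" option: regular read path.
    val regular = spark.read.format("hudi").load("/tmp/hudi/demo_mor")

    // Explicitly opting in would take the schema-on-read path instead:
    val evolved = spark.read.format("hudi")
      .option("hoodie.schema.on.read.enable", "true")
      .load("/tmp/hudi/demo_mor")
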
Before this patch, when there was a partial failover within the write tasks, the write task's current instant was initialized to the latest inflight instant; the write task then waited for a new instant to write with, so it hung and failed over continuously.

For a task recovered from failover (with an attempt number greater than 0), the latest inflight instant can actually be reused, and the intermediate data files can be cleaned up with MARKER files post-commit.
Upgrade the spark3.3 profile from Spark 3.3.0 to 3.3.1 (HUDI-4871)
Upgrade the spark3.2 profile from Spark 3.2.1 to 3.2.3 (HUDI-4411)
…ig (apache#7069)

Revert to FSUtils.getAllPartitionPaths to load partitions properly. Details in apache#6016 (comment)

Only for 0.12.2 to keep behavior consistent over patch releases
* [HUDI-5007] Prevent Hudi from reading the entire timeline when performing a LATEST streaming read (apache#6920); a Flink sketch follows after this commit group.

(cherry picked from commit 6baf733)

* [HUDI-5228] Flink table service job fs view conf overwrites the one of writing job (apache#7214)

(cherry picked from commit dc5cc08)

Co-authored-by: voonhous <[email protected]>
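
A hedged Flink (Scala Table API) sketch of the streaming read touched by HUDI-5007 above; the table definition and path are placeholders. With the fix, a streaming read starting from the latest instant no longer scans the entire timeline.

    // Hypothetical sketch: Flink streaming read of a Hudi table.
    // Path and schema are placeholders, not from the PR.
    import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

    val settings = EnvironmentSettings.newInstance().inStreamingMode().build()
    val tEnv = TableEnvironment.create(settings)

    // With HUDI-5007, a streaming read that starts from the LATEST instant
    // no longer loads the whole timeline up front.
    tEnv.executeSql(
      """CREATE TABLE hudi_src (
        |  id INT,
        |  name STRING,
        |  ts BIGINT
        |) WITH (
        |  'connector' = 'hudi',
        |  'path' = 'file:///tmp/hudi/demo_mor',
        |  'table.type' = 'MERGE_ON_READ',
        |  'read.streaming.enabled' = 'true'
        |)""".stripMargin)

    tEnv.executeSql("SELECT * FROM hudi_src").print()
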
…pache#7464)

* [HUDI-5366] Closing metadata writer from within writeClient (apache#7437)

* Closing metadata writer from within writeClient

* Close metadata writer in flink client

Co-authored-by: Sagar Sumit <[email protected]>

* Fixing build failure

* Fixing flink metadata writer usages

Co-authored-by: Sagar Sumit <[email protected]>
… to HoodieROTablePathFilter (apache#7088)

* Add the feature flag back to disable HoodieFileIndex and fall back to HoodieROTablePathFilter

* Turn off hoodie.file.index.enable by default to test CI

* Add tests for Spark datasource with the fallback to HoodieROTablePathFilter
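
To illustrate the feature flag from this commit (the path is a placeholder): disabling the file index makes the Spark datasource read fall back to HoodieROTablePathFilter.

    // Hypothetical sketch: disable HoodieFileIndex so reads fall back to
    // HoodieROTablePathFilter. Path is a placeholder.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("file-index-fallback")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read.format("hudi")
      .option("hoodie.file.index.enable", "false") // feature flag from this PR
      .load("/tmp/hudi/demo_mor")
    df.show()
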
…ableFileIndex (apache#7488)

Currently, on the reader or query engine side, direct file listing on the file system is used by default, as indicated by HoodieMetadataConfig.DEFAULT_METADATA_ENABLE_FOR_READERS (= false). Without an explicit hoodie.metadata.enable config, metadata-table-based file listing is disabled. However, BaseHoodieTableFileIndex, the common File Index implementation used by the Trino Hive connector, does not respect this default behavior. This leads to a query-latency regression in the Trino Hive connector, due to the way the connector integrates the Input Format and the File Index when the metadata table is enabled.

This PR fixes the BaseHoodieTableFileIndex to respect the expected behavior defined by HoodieMetadataConfig.DEFAULT_METADATA_ENABLE_FOR_READERS, i.e., metadata-table-based file listing is disabled by default. The metadata-table-based file listing is only enabled when hoodie.metadata.enable is set to true and the files partition of the metadata table is ready for read based on the Hudi table config.

Impact
This mitigates the performance regression of query latency in Trino Hive connector and fixes the read-side behavior of the file listing.

Verified that, by default, HoodieParquetInputFormat no longer reads the metadata table for file listing (a reader-side sketch follows below).

Co-authored-by: Sagar Sumit <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>
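
A reader-side sketch of the default restored above; the path is a placeholder. Metadata-table-based file listing stays off unless the reader explicitly opts in.

    // Hypothetical sketch: reader-side file listing. By default (per
    // DEFAULT_METADATA_ENABLE_FOR_READERS = false) the file system is listed
    // directly; opting in requires the explicit config below.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("metadata-read-listing")
      .master("local[*]")
      .getOrCreate()

    // Default: direct file system listing.
    val direct = spark.read.format("hudi").load("/tmp/hudi/demo_mor")

    // Opt in to metadata-table-based listing on the read side.
    val viaMetadata = spark.read.format("hudi")
      .option("hoodie.metadata.enable", "true")
      .load("/tmp/hudi/demo_mor")
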
apache#7493)

- This PR falls back to the original code path that uses the fs view cache, as in 0.10.1 and earlier, instead of creating a file index.

- Query engines using the initial InputFormat-based integration will not use the file index; instead, they fetch file statuses directly from the fs view cache.