[SUPPORT] Loading preexisting (hudi 0.10) partitioned tables from hive metastore with hudi 0.12 #6940
Comments
Hi @matthiasdg, thanks for reporting the performance regression. We refactored the Hive query engine integration with Hudi in 0.11.0. I'll take a look at the file index implementation for Hive, which may contribute to the regression here.
We're facing the same problem; in our case it's when using Trino on EMR 6.8.0, which comes with Hudi 0.11.1, and reading data written with Hudi 0.10.0. Queries are at least 10x slower :(
Is this related to #6787 (comment)? It now does a full scan before knowing the partitions?
@matthiasdg Thanks for raising this performance issue. We've put a few performance fixes on the latest master recently to address the performance issue in the Hudi file index.

@konradwudkowski we've verified that with 0.12.2 RC1, which contains the first two fixes, queries using the Trino Hive connector should now be on par with old releases (more than 10x faster than 0.12.1).

@matthiasdg These fixes should also fix the slowness of file listing for the queries in Spark. A few community users have already verified that with the master branch. I'm going to close this issue now. @konradwudkowski @matthiasdg if you still observe the same performance problem, feel free to reopen this GitHub issue. We'll triage it again.
Just tried 0.12.2, works great again! |
Opened https://issues.apache.org/jira/browse/HUDI-6734 to revive the fix |
Describe the problem you faced
We have existing tables ingested with the default settings of hudi 0.10.
We synced them to hive metastore using the standalone sync tool. Typical partitioning scheme is by device, year, month, day.
The sync command is run with parameters like:
We load the table using a `SparkSession` with `enableHiveSupport()` and the correct metastore URI, then use `sparkSession.table(tableName)`.

When we tried to upgrade to 0.12 and use the existing Hive tables, we noticed queries take way longer than before (about 10x).
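For reference, the loading steps above can be sketched roughly as follows (the metastore URI, app name, and table name are placeholders, not the actual deployment values):

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: "thrift://metastore-host:9083" and "db.device_events"
// are placeholder values standing in for the real configuration.
val spark = SparkSession.builder()
  .appName("hudi-read")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()

// Resolve the Hudi table that was synced to the Hive metastore.
val df = spark.table("db.device_events")
df.show()
```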
What we experience is: some `Metadata table was not found at path ...` messages, and then it seems like nothing happens for e.g. 10 minutes. In the Spark UI we can see a running query in the `SQL` tab, but there are no active jobs in the `Jobs` tab. When I ran this on my local machine, I noticed pretty high upload traffic meanwhile. When I enabled DEBUG logging, messages like those in the Stacktrace section were continuously being generated (fetching FileStatuses).

This probably has something to do with the changed behavior of the metadata table? If we write using 0.12, we noticed that the metadata table is now enabled by default (`metadata` folder inside the `.hoodie` folder). It seems to work fine if we write, sync and read with Hudi 0.12 and metadata enabled (we did not do large-scale tests though).

Is it somehow possible to indicate that we are not using metadata tables in combination with the Hive metastore? Is there some configuration for this?
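One possible workaround sketch, assuming the standard `hoodie.metadata.enable` read option applies in this setup (the base path below is a placeholder, and whether this option takes effect when reading via `spark.table()` rather than the DataSource path is not confirmed here):

```scala
// Hedged sketch: hoodie.metadata.enable is the Hudi config key for
// toggling the metadata table; the ADLS path is a placeholder.
val df = spark.read
  .format("hudi")
  .option("hoodie.metadata.enable", "false") // fall back to direct file listing
  .load("abfss://container@account.dfs.core.windows.net/path/to/table")
```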
I already tried adding a `hudi-defaults.conf` with parameters like `hoodie.metadata.enable` set to false, but it seems to have no effect there.

Resyncing the existing data with Hudi 0.12's sync tool to another Hive table does not help either. (I noticed there are parameters like `META_SYNC_USE_FILE_LISTING_FROM_METADATA`, but you probably first need a metadata table for that to have an effect.)

For tables without partitions the problem is not noticeable.
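For concreteness, the kind of `hudi-defaults.conf` entry tried above might look like this (a sketch; only `hoodie.metadata.enable` is taken from the report, and the file location depends on `HUDI_CONF_DIR`):

```properties
# hudi-defaults.conf -- sketch of the attempted setting
hoodie.metadata.enable=false
```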
To Reproduce
Steps to reproduce the behavior:
We see `Metadata table was not found at path` messages; it seems like lots of FileStatuses are fetched, and eventually a Spark job gets launched (this may take 10 minutes).

Expected behavior
Ability to load existing Hive tables (or resync) without performance impact, or do we need to rewrite the data?
Environment Description
Hudi version : upgraded from 0.10 to 0.12
Spark version : 3.1.2 (initially had also upgraded to 3.2, but wanted to rule this out so reverted)
Hive version : 2.3.9 (bundled with the Hudi libs); the metastore is standalone 3.0
Hadoop version : 3.2.0 (with Spark 3.2 I had 3.3.2)
Storage (HDFS/S3/GCS..) : Azure Data Lake Gen 2
Running on Docker? (yes/no) : both Spark in k8s mode and local mode.
Additional context
Stacktrace
Log excerpt