
[SUPPORT] Loading preexisting (hudi 0.10) partitioned tables from hive metastore with hudi 0.12 #6940

Closed
matthiasdg opened this issue Oct 13, 2022 · 7 comments
Labels
hive Issues related to hive performance priority:critical production down; pipelines stalled; Need help asap. release-0.12.2 Patches targetted for 0.12.2

Comments

@matthiasdg

Describe the problem you faced

We have existing tables ingested with the default settings of hudi 0.10.
We synced them to hive metastore using the standalone sync tool. Typical partitioning scheme is by device, year, month, day.
Sync command is with parameters like:

--base-path abfss://[email protected]/.../qwe.aq_msm \
--database default --sync-mode hms --partitioned-by device,year,month,day --spark-datasource \
--table qwe_aq_msm --jdbc-url thrift://localhost:9083 --user hive --pass hive \
--partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor

We load the table using a SparkSession with enableHiveSupport() and the correct metastore uri, then use sparkSession.table(tableName).
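For reference, the load path described above can be sketched roughly as follows. This is a minimal PySpark sketch, not our actual code: the metastore URI and table name are placeholders taken from the sync command above, and `count_query` is a hypothetical helper used only to have something to time.

```python
from typing import Optional


def count_query(table: str, device: Optional[str] = None) -> str:
    """Build a simple sanity-check query to time against the table (hypothetical helper)."""
    query = f"SELECT COUNT(*) FROM {table}"
    if device is not None:
        query += f" WHERE device = '{device}'"
    return query


def main() -> None:
    # Requires a Spark environment with the hudi bundle on the classpath.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hudi-hive-read")
        .config("hive.metastore.uris", "thrift://localhost:9083")  # placeholder URI
        .enableHiveSupport()
        .getOrCreate()
    )

    # Resolve the hudi table through the hive metastore, as described above.
    df = spark.table("default.qwe_aq_msm")
    spark.sql(count_query("default.qwe_aq_msm")).show()


if __name__ == "__main__":
    main()
```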

When we tried to upgrade to 0.12 and use the existing hive tables, we noticed queries took roughly 10x longer than before.

What we experience: several `Metadata table was not found at path ...` messages, after which nothing seems to happen for e.g. 10 minutes. In the Spark UI we can see a running query in the SQL tab, but there are no active jobs in the Jobs tab. When I ran this on my local machine, I noticed high network upload in the meantime. With DEBUG logging enabled, messages like those in the Stacktrace section are generated continuously (fetching FileStatuses).

This probably has something to do with the changed behavior of the metadata table. If we write using 0.12, the metadata table is now enabled by default (a `metadata` folder inside the `.hoodie` folder). Writing, syncing and reading entirely with hudi 0.12 and metadata enabled seems to work fine (we did not run large-scale tests though).
Is it somehow possible to indicate that we are not using metadata tables in combination with the hive metastore; is there some configuration for this?
I already tried adding a hudi-defaults.conf with parameters like `hoodie.metadata.enable` set to `false`, but it seems to have no effect there.
Resyncing the existing data to another hive table with hudi 0.12's sync tool does not help either. (I noticed there are parameters like `META_SYNC_USE_FILE_LISTING_FROM_METADATA`, but you probably need a metadata table first for those to have any effect.)
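Besides hudi-defaults.conf, another thing one could try is passing the config as a per-read datasource option. This is only a sketch: `hoodie.metadata.enable` is a real Hudi config key, but whether it is honored on this read path is exactly what is in question here, and `apply_read_opts` is a hypothetical helper, not a Hudi API.

```python
# Hudi read option expected to disable metadata-table-based file listing for a
# single read. Whether this takes effect when loading via the hive metastore
# (session.table) rather than spark.read.format("hudi") is the open question.
hudi_read_opts = {
    "hoodie.metadata.enable": "false",
}


def apply_read_opts(reader, opts):
    """Apply each key/value to a DataFrameReader-like object and return it."""
    for key, value in opts.items():
        reader = reader.option(key, value)
    return reader


# Usage (requires Spark with the hudi bundle; the base path is a placeholder):
# df = apply_read_opts(spark.read.format("hudi"), hudi_read_opts) \
#          .load("abfss://[email protected]/.../qwe.aq_msm")
```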

For tables without partitions the problem is not noticeable.

To Reproduce

Steps to reproduce the behavior:

  1. Ingest data with lots of partitions, without metadata table enabled (was using hudi 0.10 in our case)
  2. Sync to hive metastore (0.10 or 0.12)
  3. Load the created table using session.table and do some query (0.12)
  4. Query seems to hang after some `Metadata table was not found at path ...` messages; lots of FileStatuses are fetched before a Spark job is eventually launched (this may take 10 minutes).

Expected behavior

Ability to load existing hive tables (or resync them) without a performance impact, or clarity on whether we need to rewrite the data.

Environment Description

  • Hudi version : upgraded from 0.10 to 0.12

  • Spark version : 3.1.2 (initially had also upgraded to 3.2, but wanted to rule this out so reverted)

  • Hive version : 2.3.9 (bundled with the hudi libs); the metastore is a standalone 3.0

  • Hadoop version : 3.2.0 (with Spark 3.2 I had 3.3.2)

  • Storage (HDFS/S3/GCS..) : Azure Data Lake Gen 2

  • Running on Docker? (yes/no) : both, with Spark in k8s mode and in local mode

Additional context

Stacktrace
Log excerpt

[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 193 Incrementing counter action_http_head_request by 1 with final value 13653
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 216 First execution of REST operation - GetPathStatus
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 216 First execution of REST operation - GetPathStatu
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 193 Incrementing counter connections_made by 1 with final value 14221
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 254 Authenticating request with OAuth2 access token
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 193 Incrementing counter connections_made by 1 with final value 14222
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 51 Request Headers
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 254 Authenticating request with OAuth2 access token
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   Accept-Charset=utf-8
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 51 Request Headers
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   x-ms-version=2019-12-12
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   Accept=application/json, application/octet-stream
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   Accept-Charset=utf-8
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   User-Agent=APN/1.0 Azure Blob FS/3.3.2 (AdoptOpenJDK JavaJRE 11.0.10; Linux 5.15.0-48-generic/amd64; SunJSSE-11.0; UNKNOWN/UNKNOWN) unknown
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   x-ms-version=2019-12-12
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   x-ms-client-request-id=:92d1dc21-0ec6-4b94-9a27-8eb38684dd69:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   Accept=application/json, application/octet-stream
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   Content-Type=
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   User-Agent=APN/1.0 Azure Blob FS/3.3.2 (AdoptOpenJDK JavaJRE 11.0.10; Linux 5.15.0-48-generic/amd64; SunJSSE-11.0; UNKNOWN/UNKNOWN) unknown
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   x-ms-client-request-id=:3ec30675-75be-4d3e-96b4-8cbdc410aad0:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0
[DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   Content-Type=
[DEBUG] [14:21:04.160] [ForkJoinPool.commonPool-worker-5] 51 Response Headers
[DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   HTTP Response=HTTP/1.1 200 OK
[DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   x-ms-group=a3fbf5af-b6f5-4372-97f3-1d5ff16faa88
[DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   x-ms-version=2019-12-12
[DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   Server=Windows-Azure-HDFS/1.0 Microsoft-HTTPAPI/2.0
[DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   Last-Modified=Fri, 05 Aug 2022 19:57:08 GMT
[DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   Date=Tue, 11 Oct 2022 12:21:03 GMT
[DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   x-ms-resource-type=file
[DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   x-ms-server-encrypted=true
[DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   ETag="0x8DA771CAD82156B"
[DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   x-ms-permissions=rwxrwx---+
[DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   Content-Length=96
[DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 64   x-ms-request-id=a750d9cf-301f-002f-806b-dd6a93000000
[DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 64   x-ms-client-request-id=:86d903e4-25df-4d06-a2fe-433fd6665477:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0
[DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 64   x-ms-owner=$superuser
[DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 193 Incrementing counter get_responses by 1 with final value 14217
[DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 193 Incrementing counter bytes_received by 0 with final value 1210
[DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 323 HttpRequest: GetPathStatus: 200,,,cid=:86d903e4-25df-4d06-a2fe-433fd6665477:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0,rid=a750d9cf-301f-002f-806b-dd6a93000000,sent=0,recv=0,HEAD,https://stsdpglasshouse.dfs.core.windows.net/dev/datalakehouse/tml.tf_msm/39877/2021/06/07/.hoodie_partition_metadata?upn=false&action=getStatus&timeout=90
[DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 193 Incrementing counter op_exists by 1 with final value 13647
[DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 604 AzureBlobFileSystem.getFileStatus path: abfss://[email protected]/datalakehouse/tml.tf_msm/39877/2021/06/08/.hoodie_partition_metadata
[DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 193 Incrementing counter op_get_file_status by 1 with final value 13650
[DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 933 getFileStatus filesystem: dev path: abfss://[email protected]/datalakehouse/tml.tf_msm/39877/2021/06/08/.hoodie_partition_metadata isNamespaceEnabled: true
[DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 193 Incrementing counter action_http_head_request by 1 with final value 13654
[DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 216 First execution of REST operation - GetPathStatu
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 193 Incrementing counter connections_made by 1 with final value 14223
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 254 Authenticating request with OAuth2 access token
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 51 Request Headers
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   Accept-Charset=utf-8
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-17] 51 Response Headers
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   x-ms-version=2019-12-12
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-7] 51 Response Headers
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-17] 64   HTTP Response=HTTP/1.1 200 OK
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   Accept=application/json, application/octet-stream
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-17] 64   x-ms-group=a3fbf5af-b6f5-4372-97f3-1d5ff16faa8
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   User-Agent=APN/1.0 Azure Blob FS/3.3.2 (AdoptOpenJDK JavaJRE 11.0.10; Linux 5.15.0-48-generic/amd64; SunJSSE-11.0; UNKNOWN/UNKNOWN) unknown
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-7] 64   HTTP Response=HTTP/1.1 200 OK
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-17] 64   x-ms-version=2019-12-12
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   x-ms-client-request-id=:65bc8c0c-541b-473b-ac05-76a762de7918:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-17] 64   Server=Windows-Azure-HDFS/1.0 Microsoft-HTTPAPI/2.0
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   Content-Type=
[DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-7] 64   x-ms-group=a3fbf5af-b6f5-4372-97f3-1d5ff16faa88
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   Last-Modified=Sat, 03 Sep 2022 00:06:12 GMT
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   Date=Tue, 11 Oct 2022 12:21:03 GMT
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   x-ms-version=2019-12-12
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   x-ms-resource-type=file
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   Server=Windows-Azure-HDFS/1.0 Microsoft-HTTPAPI/2.0
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   x-ms-server-encrypted=true
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   ETag="0x8DA8D401C66C479"
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   Last-Modified=Fri, 05 Aug 2022 20:09:16 GMT
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   x-ms-permissions=rw-r-----+
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   Content-Length=96
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   Date=Tue, 11 Oct 2022 12:21:03 GMT
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   x-ms-request-id=3754b849-801f-0048-126b-dd7a6f000000
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   x-ms-client-request-id=:b458b6e8-8bfd-4122-874d-8e1c9b58cd0a:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   x-ms-resource-type=file
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   x-ms-owner=fbab02f2-d7a0-4f78-ab47-349acf6f275
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 193 Incrementing counter get_responses by 1 with final value 14218
[DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   x-ms-server-encrypted=true
[DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-17] 193 Incrementing counter bytes_received by 0 with final value 1210
[DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-7] 64   ETag="0x8DA771E5FD67CCB"
[DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-17] 323 HttpRequest: GetPathStatus: 200,,,cid=:b458b6e8-8bfd-4122-874d-8e1c9b58cd0a:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0,rid=3754b849-801f-0048-126b-dd7a6f000000,sent=0,recv=0,HEAD,https://stsdpglasshouse.dfs.core.windows.net/dev/datalakehouse/tml.tf_msm/9000001550/2022/09/02/.hoodie_partition_metadata?upn=false&action=getStatus&timeout=90
[DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-7] 64   x-ms-permissions=rwxrwx---+
[DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-7] 64   Content-Length=96
[DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-17] 193 Incrementing counter op_exists by 1 with final value 13648
[DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-17] 604 AzureBlobFileSystem.getFileStatus path: abfss://[email protected]/datalakehouse/tml.tf_msm/9000001550/2022/09/03/.hoodie_partition_metadata
@xushiyan xushiyan added the priority:critical production down; pipelines stalled; Need help asap. label Oct 13, 2022
@xushiyan xushiyan moved this to Awaiting Triage in Hudi Issue Support Oct 13, 2022
@yihua yihua added performance hive Issues related to hive labels Oct 14, 2022
@yihua
Contributor

yihua commented Oct 14, 2022

Hi @matthiasdg , thanks for reporting the performance regression. We refactored the Hive query engine integration with Hudi in 0.11.0. I'll take a look at the file index implementation for Hive which may contribute to the regression here.

@konradwudkowski

konradwudkowski commented Nov 9, 2022

We're facing the same problem, in our case when using Trino on EMR 6.8.0 (which comes with hudi 0.11.1) to read data written with hudi 0.10.0. Queries are at least 10x slower :(

@matthiasdg
Author

Is this related to #6787 (comment)? It now does a full scan before knowing the partitions?

@nsivabalan nsivabalan added the release-0.12.2 Patches targetted for 0.12.2 label Dec 7, 2022
@yihua
Contributor

yihua commented Dec 23, 2022

@matthiasdg Thanks for raising this performance issue. We've recently landed a few performance fixes on the latest master to address the slowness in the Hudi file index:
(1) Avoid file index and use fs view cache in COW input format #7493 cherry-picked for 0.12.2 release
(2) Turn off metadata-table-based file listing in BaseHoodieTableFileIndex #7488 cherry-picked for 0.12.2 release
(3) Lazy fetching partition path & file slice for HoodieFileIndex #6680 targeted for 0.13.0 release
(4) Fixing FileIndex impls to properly batch partitions listing #7233 targeted for 0.13.0 release

@konradwudkowski we've verified that with 0.12.2 RC1, which contains the first two fixes, queries using the Trino Hive connector should now be on par with old releases (more than 10x faster than 0.12.1).

@matthiasdg These fixes should also resolve the slowness of file listing for Spark queries. A few community users have already verified this with the master branch.

I'm going to close this issue now. @konradwudkowski @matthiasdg if you still observe the same performance problem, feel free to reopen this GitHub issue and we'll triage it again.

@yihua yihua closed this as completed Dec 23, 2022
Repository owner moved this from ⏳ Awaiting Triage to ✅ Done in Hudi Issue Support Dec 23, 2022
@matthiasdg
Author

Just tried 0.12.2, works great again!

@aajisaka
Member

> (2) Turn off metadata-table-based file listing in BaseHoodieTableFileIndex #7488 cherry-picked for 0.12.2 release

Note that this was reverted by #7526, so the performance regression is back in 0.12.3 and later. Is there any option to revive this change?

@aajisaka
Member

Opened https://issues.apache.org/jira/browse/HUDI-6734 to revive the fix
