[SUPPORT] Spark structured streaming ingestion into Hudi fails after an upgrade to 0.12.2 #8890
Comments
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition
There seems to be some inconsistency between the data table and the metadata table about the file handle list.
I've tried upgrading from 0.12.1 to 0.12.2 first and hit the same error. I updated the issue title to reflect that; it seems to be caused by that minor version upgrade.
Did you have a chance to try the 0.12.3 release then? I don't believe it is caused by the version upgrade; the inconsistency should be a bug.
I've tried both 0.12.2 and 0.13.0; would you like me to also test it with 0.12.3?
Yeah, we have a bunch of fixes in 0.12.3 and 0.13.1.
I've tested it with 0.12.3 and it fails with the same error.
@danny0405 I have the same error. Is there any solution, please? It's stuck. @psendyk Excuse me, are you OK now? Is there a solution?
@ad1happy2go Is there any possibility you could reproduce this issue if you have spare time to help? That would be great.
@psendyk You could add these settings and try it; maybe it works:
    .option(HoodieWriteConfig.AVRO_EXTERNAL_SCHEMA_TRANSFORMATION_ENABLE.key(), true)
    .option(HoodieWriteConfig.AVRO_SCHEMA_VALIDATE_ENABLE.key(), true)
    .option(HoodieWriteConfig.SCHEMA_ALLOW_AUTO_EVOLUTION_COLUMN_DROP.key(), true)
    .option(HoodieCommonConfig.RECONCILE_SCHEMA.key(), true)
    .option(HoodieCommonConfig.SCHEMA_EVOLUTION_ENABLE.key(), true)
    .option(HoodieIndexConfig.INDEX_TYPE.key(), HoodieIndex.IndexType.SIMPLE.name())
    .option(HoodieLayoutConfig.LAYOUT_TYPE.key(), HoodieStorageLayout.LayoutType.DEFAULT.name())
and query with these configs:
    set hoodie.schema.on.read.enable=true;
    set hoodie.datasource.read.extract.partition.values.from.path=true;
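For context, a minimal, untested sketch of how these suggested settings might be wired into a structured streaming write; `df`, `spark`, the table name, the key/partition fields, and the S3 paths are all placeholders, not values from this issue:

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.config.HoodieCommonConfig
import org.apache.hudi.config.{HoodieIndexConfig, HoodieLayoutConfig, HoodieWriteConfig}
import org.apache.hudi.index.HoodieIndex
import org.apache.hudi.table.storage.HoodieStorageLayout

// `df` is the streaming DataFrame being ingested (placeholder).
val query = df.writeStream
  .format("hudi")
  .option(HoodieWriteConfig.TBL_NAME.key(), "my_table")                     // hypothetical table name
  .option(DataSourceWriteOptions.RECORDKEY_FIELD.key(), "id")               // hypothetical key field
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "event_date")   // hypothetical partition field
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD.key(), "event_ts")        // hypothetical precombine field
  // settings suggested in the comment above
  .option(HoodieWriteConfig.AVRO_EXTERNAL_SCHEMA_TRANSFORMATION_ENABLE.key(), "true")
  .option(HoodieWriteConfig.AVRO_SCHEMA_VALIDATE_ENABLE.key(), "true")
  .option(HoodieWriteConfig.SCHEMA_ALLOW_AUTO_EVOLUTION_COLUMN_DROP.key(), "true")
  .option(HoodieCommonConfig.RECONCILE_SCHEMA.key(), "true")
  .option(HoodieCommonConfig.SCHEMA_EVOLUTION_ENABLE.key(), "true")
  .option(HoodieIndexConfig.INDEX_TYPE.key(), HoodieIndex.IndexType.SIMPLE.name())
  .option(HoodieLayoutConfig.LAYOUT_TYPE.key(), HoodieStorageLayout.LayoutType.DEFAULT.name())
  .option("checkpointLocation", "s3://bucket/checkpoints/my_table")
  .outputMode("append")
  .start("s3://bucket/tables/my_table")

// Query-side settings from the same suggestion:
spark.sql("set hoodie.schema.on.read.enable=true")
spark.sql("set hoodie.datasource.read.extract.partition.values.from.path=true")
```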
I tested it again using the options @zyclove posted above and the job still fails with the same error. Also, this time I tested it on a fresh table to make sure there were no issues with our production table. I ingested ~1B records from Kafka to a new S3 location, written to ~18k partitions. So it should be reproducible; let me know if you need any additional details.
@psendyk Thanks. I will try reproducing it.
@psendyk Wanted to confirm: to reproduce, should we recreate the table using 0.12.1 and then upgrade? Also, to ingest into a Hudi table we suggest using DeltaStreamer. Is there any limitation due to which you are forced to use Spark structured streaming?
Yes, I created a new table and ingested ~1B records using Hudi 0.12.1. Then I restarted the job with 0.13.0 (the same issue happened with 0.12.2 and 0.12.3); the first micro-batch succeeded and the next one failed. To be honest, we haven't looked into DeltaStreamer, as all of our jobs are written in Spark. Our ingestion job does a bunch of transformations and filtering and adds computed columns. It seems this might be possible with a transformer class in DeltaStreamer, but we haven't explored it; we haven't had any issues with Spark structured streaming until now.
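For anyone weighing the DeltaStreamer route: a rough, untested sketch of what such a transformer could look like, assuming the org.apache.hudi.utilities.transform.Transformer interface; the column names and derived field are made up for illustration and are not from this job:

```scala
import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.utilities.transform.Transformer
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical transformer: filter out malformed rows and add a computed column,
// mirroring the kind of filtering/derivation done in the Spark job.
class ExampleTransformer extends Transformer {
  override def apply(jsc: JavaSparkContext,
                     sparkSession: SparkSession,
                     rowDataset: Dataset[Row],
                     properties: TypedProperties): Dataset[Row] = {
    rowDataset
      .filter(col("event_ts").isNotNull)                  // drop records without a timestamp
      .withColumn("event_date", to_date(col("event_ts"))) // computed partition column
  }
}
```

Such a class would then be passed to HoodieDeltaStreamer via its transformer-class argument; whether it covers all of the job's transformations would need to be verified case by case.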
Any update here, @ad1happy2go? We are blocked on upgrades.
@psendyk This is the gist with the code I am using to reproduce: https://gist.github.com/ad1happy2go/a2df5b11c3aff1a15b205b458b6b480a
@ad1happy2go Our initial upgrade attempt only failed for one out of four of our tables; the other three have much lower incoming data volume, so perhaps it's related to that. I just tried reproducing the error on another fresh table with less data: I ingested a single micro-batch (which also created the table) using 0.12.1, and then continued the ingestion with 0.13.0. This time the 0.13.0 job continued to make progress for a couple of micro-batches until I killed it; it didn't run into the issue.
I've narrowed down the bug to a specific commit. Also, I realize the commit I linked was released in 0.12.1, not 0.12.2... Apologies about that; we manage our JARs in our own artifact repository and it must've been mislabeled as 0.12.1 while we were actually running 0.12.0. To confirm, I downloaded the 0.12.1 version from the Maven repo and ran into the same issue, but 0.12.0 worked fine. It shouldn't have affected the reproducibility though, since (as @danny0405 already mentioned) it's not a version incompatibility issue but rather a bug; even when I create the table using 0.12.1, the job still fails after a while.
Already fixed by #9879 in master. Closing this.
Tips before filing an issue
Have you gone through our FAQs?
Join the mailing list to engage in conversations and get faster support at [email protected].
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
After an upgrade from 0.12.1 to 0.13.0, ingestion from Kafka into a Hudi table via Spark structured streaming fails on the second micro-batch. When the job is restarted, it fails on the first micro-batch. After reverting to 0.12.1, the issue goes away. Each time the upgrade is attempted, the first micro-batch succeeds and the second one fails. The issue seems to occur when Hudi attempts to expand small files that do not exist in the underlying storage.
To Reproduce
Steps to reproduce the behavior:
Use the write options provided in the section below to write data via Spark structured streaming. The job should fail when writing data in the second micro-batch.
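For illustration, a minimal, untested sketch of this kind of pipeline; the Kafka broker, topic, field names, and S3 paths are placeholders, and since the issue's actual write options were redacted, only a generic Hudi streaming sink is shown. The reproduction would run this job once on 0.12.1, then restart it on 0.13.0 from the same checkpoint.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("hudi-upgrade-repro").getOrCreate()

// Placeholder Kafka source; the real job also applies transformations and computed columns.
val source = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload", "timestamp AS event_ts")
  .withColumn("event_date", to_date(col("event_ts")))

// Generic Hudi streaming sink; substitute the redacted write options from the issue here.
source.writeStream
  .format("hudi")
  .option("hoodie.table.name", "repro_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "event_date")
  .option("hoodie.datasource.write.precombine.field", "event_ts")
  .option("checkpointLocation", "s3://bucket/checkpoints/repro_table")
  .outputMode("append")
  .start("s3://bucket/tables/repro_table")
  .awaitTermination()
```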
Expected behavior
The ingestion job should continue to ingest more micro-batches.
Environment Description
Hudi version : 0.13.0
Spark version : 3.3.0
Hive version : 3.1.3
Hadoop version : 3.3.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
Table services are set up asynchronously in separate jobs but were not running at the time; there was only one writer to the table. Below are the full write options of the streaming ingestion (some values were redacted):
Stacktrace
The exact partition values were redacted.