Integrate XTable into DataLake #369
-
Hello, I work for a company that manages an AWS data lake with 250+ datasets written in Hudi. We are exploring whether these datasets could also be exposed as Iceberg tables, which would let us create tables in Snowflake and potentially save on costs. Based on my research, XTable looks like a way to generate Iceberg metadata for the Hudi datasets.

I ran a small proof of concept (PoC) that was successful: I generated a Hudi dataset, launched a small EC2 instance, created the metadata using the JAR file, and ran an AWS Glue crawler.

My question is about how to integrate the XTable (formerly OneTable) step into our existing solution. Currently, we use a step function that launches an EKS cluster to run the Hudi Spark job. My initial thought is to add another step: once Hudi finishes writing the data, the step function would run an additional step that updates the metadata for that specific dataset, using either a container or a Lambda function.

Do you think this will work? I welcome any suggestions or insights. Thank you in advance for your assistance.
Replies: 3 comments
-
I think that would work. If you need help building a container for the project, let me know. If you get something working, it would be awesome if you could share it with the rest of the community, since others may be looking for a similar solution.
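As a rough illustration of what the post-write step could run inside a container, here is a minimal Python sketch. It assumes the XTable utilities JAR is invoked with a `--datasetConfig` YAML file, as in the project's sync instructions. The JAR path, the helper names (`render_dataset_config`, `build_sync_command`, `sync_dataset`), and the S3 path in the usage note are hypothetical, and partitioned tables may need extra fields in the config.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumption: location of the bundled XTable utilities JAR in the container.
# The real file name depends on the version you build.
XTABLE_JAR = "/opt/xtable/xtable-utilities-bundled.jar"


def render_dataset_config(table_base_path: str, table_name: str) -> str:
    """Render a dataset config for one Hudi table with Iceberg as the target."""
    return (
        "sourceFormat: HUDI\n"
        "targetFormats:\n"
        "  - ICEBERG\n"
        "datasets:\n"
        f"  - tableBasePath: {table_base_path}\n"
        f"    tableName: {table_name}\n"
    )


def build_sync_command(config_path: str, jar_path: str = XTABLE_JAR) -> List[str]:
    """Build the java command that runs the sync for the given config file."""
    return ["java", "-jar", jar_path, "--datasetConfig", config_path]


def sync_dataset(table_base_path: str, table_name: str, workdir: str = "/tmp") -> None:
    """Write the config to disk and run the sync; raises if the JAR exits non-zero."""
    config_path = Path(workdir) / f"{table_name}_xtable.yaml"
    config_path.write_text(render_dataset_config(table_base_path, table_name))
    subprocess.run(build_sync_command(str(config_path)), check=True)
```

The container entrypoint could then call something like `sync_dataset("s3://my-bucket/hudi/orders", "orders")`, with the table path and name passed in from the step function input. Because the call raises on a non-zero exit code, a failed sync would surface as a failed step.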
-
Have you seen this? https://atwong.medium.com/e-commerce-funnel-analysis-with-starrocks-87-million-records-apache-hudi-apache-iceberg-delta-ebf0923c149a
-
Here's an XTable example with Hudi converted to Iceberg and Delta on S3 with HMS: https://github.com/alberttwong/incubator-xtable/tree/main/demo-s3