[Question] pyiceberg 0.6.0 #350
Comments
Hi @gui-elastic, that said, in the next few days I will try to fill the gaps in the refactoring; I am waiting for feedback on the general code flow. Implementing the Iceberg plugin afterwards should be straightforward, because you will get the Arrow table / record batch directly in the plugin's store function. Happy to hear your feedback.
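To make that concrete, here is a rough sketch of what a plugin could look like if the refactored flow handed the Arrow data straight to store. The extra `df` argument is my assumption about the proposed interface, not the actual signature from the PR; only `BasePlugin` itself comes from dbt-duckdb today.

```python
# Hypothetical sketch: assumes the refactored dbt-duckdb passes the plugin an
# in-memory Arrow table instead of a file it has already written to disk.
import pyarrow as pa

from dbt.adapters.duckdb.plugins import BasePlugin


class MyLakehousePlugin(BasePlugin):
    # The `df` parameter below is assumed; the current store() only receives the target config.
    def store(self, target_config, df: pa.Table):
        # With the data already in memory as Arrow, the plugin can hand it to
        # any Arrow-speaking writer (pyiceberg, deltalake, ...) directly.
        print(f"would write {df.num_rows} rows for target {target_config}")
```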
Hey @milicevica23, I simply love the idea. I already think dbt-duckdb is an amazing project, but with this improvement it will be on another level, usable for Data Lakehouse architectures and for reading and writing Delta and Iceberg tables. When this refactoring is merged, please let me know; I will be glad to test it and, if needed, to help with writing custom plugins.
Yes, I think so too, and I believe this improvement will enable a bunch of new use cases, since everything that speaks Arrow can be integrated: https://georgheiler.com/2023/12/11/dagster-dbt-duckdb-as-new-local-mds/ I would encourage you to subscribe to the refactoring pull request and look into the code; I am happy to chat about it. You can find me on dbt Slack.
Thank you so much! Just to confirm, the refactoring PR is #332, correct? I will take a look at the blog post right now. Thanks!
Hey @milicevica23!
Hi @MRocholl, I have to say that I am not working on this feature right now because I am swamped privately. When I was doing the refactoring, I did not have time to go through all the options and breaking changes this pull request introduces and to guarantee that every use case would keep working as expected.
Yeah, I think the ideal here is always to rely on DuckDB + the extension to do this reading/writing itself as much as possible, vs. having dbt-duckdb do it (and in the process turn into its own sort of data-catalog-type thing, which is really not what I was going for when I started down this path, but here we are). This pattern seems to work well for e.g. Postgres and MySQL via the respective extensions.

Just like @milicevica23, I'm super busy with the actual job I am paid to do (which unfortunately doesn't involve all that much DuckDB).
Thank you both for the fast reply. As @jwills said, I believe a lot can already be done with the extensions that DuckDB ships by itself, by adding a post-hook and using COPY statements, or by using the ATTACH functionality.
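For anyone reading along, here is a minimal illustration of the kind of statement such a post-hook could run, shown through the DuckDB Python client; the table name and output path are placeholders, not from this thread.

```python
import duckdb

con = duckdb.connect()

# Stand-in for a model that dbt-duckdb has already materialized.
con.execute("CREATE TABLE my_model AS SELECT 1 AS id, 'a' AS val")

# A dbt post-hook can run plain DuckDB SQL after the model builds, e.g. exporting
# the model with COPY (an s3:// target would additionally need the httpfs extension).
con.execute("COPY my_model TO 'my_model.parquet' (FORMAT PARQUET)")
```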
Hello,
Recently, pyiceberg 0.6.0 was released, which allows writing Iceberg tables without needing tools like Spark or Trino.
I was about to write a custom plugin to implement the writing feature. However, I see that when using the external materialization with a custom plugin, the output data is first stored locally and then read back and ingested into the final target, which does not seem to be a good solution for Iceberg and Delta. Instead of storing the data on disk, it would be better to simply load an Arrow table and then write it to the final destination (e.g., S3 in Iceberg format).
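To illustrate what such a plugin could do with an in-memory Arrow table (this example is not from the issue itself), a pyiceberg 0.6.0-style write might look roughly like this; the catalog settings and table name are placeholders to adapt to your own catalog and warehouse.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Placeholder catalog configuration; point this at your own REST/Glue/SQL catalog.
catalog = load_catalog(
    "default",
    uri="http://localhost:8181",
    warehouse="s3://my-warehouse/",
)

arrow_table = pa.table({"id": [1, 2], "event": ["a", "b"]})

# pyiceberg 0.6.0 added write support: create the table and append the Arrow data,
# with no Spark or Trino involved.
iceberg_table = catalog.create_table("db.events", schema=arrow_table.schema)
iceberg_table.append(arrow_table)
```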
I saw this thread: #332 (comment), so I would like to ask whether there is any ETA for this feature. It would be an amazing feature, even for production workloads with a Data Lakehouse architecture.
This comment explains well what needs to change to use the Iceberg writer in the best way possible: #284 (comment)