Support for delta #241

Open
Yacobolo opened this issue Sep 1, 2023 · 9 comments

Comments

@Yacobolo

Yacobolo commented Sep 1, 2023

Looking forward to the support for Delta. This would enable us to run a poor man's data lakehouse! Do you need any help? What is the ETA? This year?

@jwills
Collaborator

jwills commented Sep 7, 2023

Ack, sorry for the lag here @Yacobolo, I was on the road and missed this going by. I would like to have a plugin that supports Delta, akin to the one I have for Iceberg; I'm assuming it would use the deltalake Python package, but I personally don't have access to a Delta Lake instance and tbh don't really care enough about learning how to set up a real one to do it myself "for fun."

However, if you (or anyone else!) do have a Delta Lake instance and you know how it should be configured as a dbt-duckdb plugin, I would most definitely be happy to merge it in.

@milicevica23
Contributor

Hi @jwills, I would like to try this integration. This would be my first contribution, so I would appreciate some help and guidance at the beginning.

I did a first draft of the read plugin integration here, and in parallel I am building an example project here where I showcase it.

Here is the source configuration, which loads the data as a source with file and projection pruning.

Which workflow works best for you to give feedback?

@jwills
Collaborator

jwills commented Oct 1, 2023

Hey @milicevica23, thanks so much for taking a crack at this!

The code as written makes sense to me, but I have to be honest that I don't have a great sense for how folks actually use the deltalake Python module in the real world. Like, do folks really use Delta tables w/o a catalog? https://delta-io.github.io/delta-rs/python/usage.html#loading-a-delta-table

@milicevica23
Contributor

The nice thing is that you can, but don't have to, use a catalog to know where your table is, and I thought I would implement support for both ways. Or at least try to..
You can think of it as adding a new file format for external files; not everybody who is on-prem or doing simple projects has a catalog. But I would be happy to hear feedback from others.
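
To make the "Delta as just another file format" point concrete: a Delta table is a directory of data files plus a `_delta_log` folder of numbered JSON commits, so it can be found by path alone. The following is a minimal stdlib-only sketch of that idea, not code from the draft plugin; the function name `latest_delta_version` is made up for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Delta commit files are named with a 20-digit, zero-padded version number.
_COMMIT_RE = re.compile(r"^(\d{20})\.json$")


def latest_delta_version(table_path: str) -> Optional[int]:
    """Return the newest commit version found under <table_path>/_delta_log.

    Returns None when the directory does not look like a Delta table
    (no _delta_log folder, or no commit files inside it).
    """
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return None
    versions = [
        int(match.group(1))
        for entry in log_dir.iterdir()
        if (match := _COMMIT_RE.match(entry.name))
    ]
    return max(versions) if versions else None
```

No catalog lookup is involved here: the path is the only input, which is the situation on-prem and small projects are usually in.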

@Yacobolo
Author

Yacobolo commented Oct 1, 2023

Same here, the main use case is not the catalog, but more the metadata it generates together with the ACID transactions and time travel / change history🔥
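
For anyone wondering how time travel falls out of that metadata: each commit in `_delta_log` is a file of JSON lines holding `add` and `remove` actions, and a snapshot at version N is what you get by replaying commits 0 through N. Below is a rough stdlib-only sketch of that replay. It assumes all JSON commits are still present and ignores checkpoints and log cleanup, which a real reader such as the deltalake package handles; `active_files` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Optional, Set


def active_files(table_path: str, version: Optional[int] = None) -> Set[str]:
    """Replay JSON commits up to `version` (inclusive) and return the data
    files that make up that snapshot. With version=None, replay everything."""
    log_dir = Path(table_path) / "_delta_log"
    files: Set[str] = set()
    # Zero-padded names mean lexical order equals commit order.
    for commit in sorted(log_dir.glob("*.json")):
        if not commit.stem.isdigit():
            continue  # skip anything that is not a plain numbered commit
        if version is not None and int(commit.stem) > version:
            break
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files
```

Because a commit only becomes visible once its numbered file exists, readers see either the whole commit or none of it, which is where the transactional behaviour comes from.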

@jwills
Collaborator

jwills commented Oct 1, 2023

Alright, super cool. So @milicevica23 if you would put your change together as a PR and other folks on this thread can weigh in on any additional config options we need to support those use cases, that would be great!

@milicevica23
Contributor

milicevica23 commented Oct 1, 2023

Sure, I will open a draft PR.

Things still to do:

  • check/rename the config params (feedback appreciated)
  • add support for time travel to load older versions of the tables
  • add support for at least Azure, because I have access to it; if you have some guidance for a local AWS S3 environment like LocalStack, I can try that too, but I would have to learn it first
  • (optional) try to rewrite the reading logic so that we don't have to import first and then process, but can push down the predicates instead. I posted a Stack Overflow question; maybe you can help me do/understand that, @jwills? The benefit would be that we don't have to specify filters at the config level; they would come from the first query on the view that points to / holds an instance of the Delta table
  • (optional) try to add a Databricks catalog option. I don't have much experience with that but can learn it along the way
  • do testing after we agree on the structure
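
One possible shape for the time-travel and pushdown items above, as a sketch rather than what the PR does: open the table with the deltalake package (optionally pinned to a version), turn it into a PyArrow dataset, and register that dataset with DuckDB. DuckDB scans registered Arrow datasets lazily and can push column selections and filters from later queries into the scan. The function name `delta_view` and its parameters are assumptions for illustration.

```python
from typing import Any, Optional


def delta_view(
    con: Any,
    table_path: str,
    view_name: str,
    version: Optional[int] = None,
) -> None:
    """Expose a Delta table to a DuckDB connection as a lazily scanned dataset.

    `con` is expected to be a duckdb connection; `version=None` means latest.
    """
    # Imported lazily so this module loads without the optional dependency.
    from deltalake import DeltaTable

    table = DeltaTable(table_path, version=version)
    dataset = table.to_pyarrow_dataset()
    # Nothing is materialized here; the first query against `view_name`
    # decides which columns and rows are actually read.
    con.register(view_name, dataset)
```

With a connection from `duckdb.connect()`, a call such as `delta_view(con, "path/to/table", "orders", version=3)` followed by `con.sql("SELECT id FROM orders WHERE amount > 100")` would then carry the filter in the query rather than in the source config (the table path and column names here are placeholders).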

Feel free to add new ideas and topics.

I am not used to the PR process on GitHub, so feel free to rewrite things as fits the needs and best practices.

@geoHeil

geoHeil commented Jun 28, 2024

How would the new Delta kernel (https://duckdb.org/2024/06/10/delta.html) work here to simplify, and perhaps make more performant, the access to Delta-based data?

@geoHeil

geoHeil commented Jun 28, 2024

A: simply add the extension, if the platform is supported: https://duckdb.org/docs/extensions/delta#supported-duckdb-versions-and-platforms
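
A minimal sketch of what "simply adding the extension" could look like from Python, assuming the duckdb package is installed and the platform can download the `delta` extension. `delta_scan` is the table function the extension provides; the helper names `delta_scan_sql` and `read_delta` are made up here, and `where` is passed through as trusted SQL.

```python
from typing import List, Optional, Sequence


def delta_scan_sql(
    table_path: str,
    columns: Optional[Sequence[str]] = None,
    where: Optional[str] = None,
) -> str:
    """Build a SELECT statement over DuckDB's delta_scan() table function."""
    cols = ", ".join(columns) if columns else "*"
    escaped = table_path.replace("'", "''")  # escape quotes in the SQL literal
    sql = f"SELECT {cols} FROM delta_scan('{escaped}')"
    if where:
        sql += f" WHERE {where}"
    return sql


def read_delta(
    table_path: str,
    columns: Optional[Sequence[str]] = None,
    where: Optional[str] = None,
) -> List[tuple]:
    """Install and load the delta extension, then run the scan."""
    import duckdb  # imported lazily; only needed when actually querying

    con = duckdb.connect()
    con.execute("INSTALL delta")
    con.execute("LOAD delta")
    return con.execute(delta_scan_sql(table_path, columns, where)).fetchall()
```

Compared with going through the deltalake package, this keeps the whole read inside DuckDB, so no separate Python Delta reader is needed on the read path.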
