feat: `read_iceberg` similar to `read_parquet` #6013
Comments
Thanks for the issue! These indeed would be nice features to have. Looking at the implementations: both of them read everything into memory as a PyArrow table, and while this is nice, I think it's unlikely that most heavy Iceberg users have Iceberg tables that can fit into their client machine's main memory. That said, we could probably instead pluck out the list of files backing the table and hand that to DuckDB.
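A minimal sketch of that idea, assuming pyiceberg's scan-planning API; the catalog name and table identifier are hypothetical:

```python
# Sketch only: the catalog name "default" and table "db.events" are hypothetical.
import duckdb
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("db.events")

# Pluck out the Parquet data files backing the table's current snapshot.
files = [task.file.file_path for task in table.scan().plan_files()]

# Hand the file list to DuckDB, which scans lazily instead of
# materializing the whole table in client memory.
con = duckdb.connect()
rel = con.read_parquet(files)
print(rel.limit(5))
```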
`read_delta` added for DuckDB and Polars here: #6354
Changing the issue title to reflect @lostmygithubaccount's work on getting `read_delta` in; this issue now tracks `read_iceberg` similar to `read_parquet`.
Took another look at pyiceberg, and it's still a bit too optimistic with respect to host memory (it's still reading everything into a PyArrow Table). If there are ever improvements to that part of it, or another API comes along that doesn't have the in-memory limitation, then we can revisit adding `read_iceberg`.
DuckDB now has an Iceberg extension under development at https://github.com/duckdblabs/duckdb_iceberg. We should take a look at that.
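For reference, a minimal sketch of what querying through that extension could look like, assuming its `iceberg_scan` table function and a hypothetical local warehouse path:

```python
# Sketch only: the table path is hypothetical, and the extension was
# still under development at the time of this comment.
import duckdb

con = duckdb.connect()
con.sql("INSTALL iceberg")
con.sql("LOAD iceberg")

# iceberg_scan reads an Iceberg table directly from its metadata.
print(con.sql("SELECT count(*) FROM iceberg_scan('data/warehouse/db/events')"))
```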
Per discussion, it's not very active and doesn't support writes. Once this extension (with at least some minimal docs) or the pyiceberg package supports writes, I think this will be easy to add, similar to Delta Lake tables.
Is your feature request related to a problem?
Table formats like Apache Iceberg and Delta Lake have mostly been available in Spark / the JVM, but with the Python packages pyiceberg and deltalake we can load these table formats into Arrow data without the need for the JVM. pyiceberg already has examples of importing into DuckDB. deltalake supports reading a table as a PyArrow dataset, for which DuckDB, DataFusion, and Polars support predicate and projection pushdown (various examples here).
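For instance, a minimal sketch of the deltalake-to-DuckDB path; the table path and column names are hypothetical:

```python
# Sketch only: a local Delta table at ./data/events with user_id and ts
# columns is assumed for illustration.
import duckdb
from deltalake import DeltaTable

dt = DeltaTable("./data/events")
dataset = dt.to_pyarrow_dataset()  # lazy: no data is read yet

con = duckdb.connect()
con.register("events", dataset)

# Only the referenced columns and matching row groups are read.
print(con.sql("SELECT user_id, ts FROM events WHERE ts >= DATE '2023-01-01'").limit(5))
```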
Describe the solution you'd like
It would be cool to design a `read_iceberg` API, analogous to `read_parquet`, that is uniform between Spark and the Arrow-compatible backends (DuckDB, DataFusion, and Polars) for both of these formats.
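A hypothetical sketch of what that could look like on the DuckDB backend; `read_iceberg` and its signature are invented here by analogy with `read_parquet` and do not exist in ibis:

```python
# Hypothetical API: read_iceberg does not exist; its shape mirrors read_parquet.
import ibis

con = ibis.duckdb.connect()

# Load an Iceberg table as a lazy ibis table expression.
t = con.read_iceberg("db.events", table_name="events")  # hypothetical

# The same expression API would then work unchanged on the Spark,
# DataFusion, or Polars backends.
expr = t.group_by("user_id").agg(n=t.count())
print(expr.execute())
```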
What version of ibis are you running?
NA
What backend(s) are you using, if any?
Relevant backends: Spark, DuckDB, DataFusion, Polars