
feat: read_iceberg similar to read_parquet #6013

Closed
wjones127 opened this issue Apr 14, 2023 · 6 comments
Labels
feature Features or general enhancements

Comments

@wjones127
Contributor

Is your feature request related to a problem?

Table formats like Apache Iceberg and Delta Lake have mostly been available in the Spark/JVM ecosystem, but with the Python packages pyiceberg and deltalake we can load these table formats into Arrow data without needing the JVM. pyiceberg already has examples for importing into DuckDB. deltalake supports reading a table as a PyArrow dataset, for which DuckDB, DataFusion, and Polars support predicate and projection pushdown (various examples here).

Describe the solution you'd like

It would be cool to design an API like read_parquet that is uniform between Spark and the Arrow-compatible backends (DuckDB, DataFusion, and Polars) for both of these formats.
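A hypothetical sketch of that uniform surface, assuming an ibis-style read_iceberg method that does not exist today (the method name, table path, and columns are all invented for illustration):

```python
# Hypothetical pseudocode modeled on ibis's existing read_parquet API.
import ibis

con = ibis.duckdb.connect()  # or a Spark/DataFusion/Polars backend
t = con.read_iceberg("s3://bucket/warehouse/db/events")  # does not exist yet
expr = t.filter(t.status == "ok").select("user_id", "ts")
df = expr.execute()  # pushdown handled by whichever backend is connected
```

The point of the sketch is that the same expression code would run unchanged against Spark and the Arrow-compatible backends.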

What version of ibis are you running?

NA

What backend(s) are you using, if any?

Relevant backends: Spark, DuckDB, DataFusion, Polars

Code of Conduct

  • I agree to follow this project's Code of Conduct
@wjones127 wjones127 added the feature Features or general enhancements label Apr 14, 2023
@cpcloud
Member

cpcloud commented Apr 15, 2023

Thanks for the issue!

These indeed would be nice features to have.

Looking at the pyiceberg implementation, I'm not sure it's ready for integration into ibis yet via the to_arrow and to_duckdb APIs.

Both of those implementations read everything into memory as a PyArrow table, and while that's convenient, it's unlikely that most heavy Iceberg users have tables that fit into their client machine's main memory.

That said, we could probably instead pluck out the list of files backing the table and hand that to DuckDB.

@lostmygithubaccount
Member

Added for DuckDB and Polars here: #6354

@cpcloud
Member

cpcloud commented Jun 13, 2023

Changing the issue title to reflect @lostmygithubaccount's work on getting read_delta and to_delta up and running for various backends.

@cpcloud cpcloud changed the title feat: read_iceberg and read_deltalake methods similar to read_parquet feat: read_iceberg and ~read_deltalake~ methods similar to read_parquet Jun 13, 2023
@cpcloud cpcloud changed the title feat: read_iceberg and ~read_deltalake~ methods similar to read_parquet feat: read_iceberg similar to read_parquet Jun 13, 2023
@cpcloud
Member

cpcloud commented Jun 13, 2023

Took another look at pyiceberg, and it's still a bit too optimistic with respect to host memory (it's still reading everything into a pyarrow Table).

If that part of it ever improves, or another API comes along that doesn't have the in-memory limitation, we can revisit adding read_iceberg.

@cpcloud cpcloud closed this as completed Jun 13, 2023
@ianmcook
Contributor

DuckDB now has an Iceberg extension under development at https://github.com/duckdblabs/duckdb_iceberg. We should take a look at that.
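For reference, that extension exposes an `iceberg_scan` table function. A sketch of what usage would look like (untested here; the table location is illustrative):

```sql
INSTALL iceberg;
LOAD iceberg;
SELECT count(*) FROM iceberg_scan('s3://bucket/warehouse/db/my_table');
```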

@lostmygithubaccount
Member

Per discussion, it's not very active and doesn't support writes. Once this extension (with at least some minimal docs) or the pyiceberg package supports writes, I think this will be easy to add, similar to Delta Lake tables.
