
feat: read_iceberg similar to read_parquet #6013

Closed
wjones127 opened this issue Apr 14, 2023 · 6 comments
Labels
feature Features or general enhancements

Comments

@wjones127
Contributor

Is your feature request related to a problem?

Table formats like Apache Iceberg and Delta Lake have mostly been available in the Spark/JVM ecosystem, but with the Python packages pyiceberg and deltalake we can load these table formats into Arrow data without needing the JVM. pyiceberg already has examples for importing into DuckDB. deltalake supports reading a table as a PyArrow dataset, for which DuckDB, DataFusion, and Polars support predicate and projection pushdown (various examples here).

Describe the solution you'd like

It would be cool to design an API like read_parquet that is uniform between Spark and the Arrow-compatible backends (DuckDB, DataFusion, and Polars) for both of these formats.
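A hypothetical sketch of that uniform surface, assuming an ibis-style read_iceberg method that does not exist today (the method name, table path, and columns are all invented for illustration):

```python
# Hypothetical pseudocode modeled on ibis's existing read_parquet API.
import ibis

con = ibis.duckdb.connect()  # or a Spark/DataFusion/Polars backend
t = con.read_iceberg("s3://bucket/warehouse/db/events")  # does not exist yet
expr = t.filter(t.status == "ok").select("user_id", "ts")
df = expr.execute()  # pushdown handled by whichever backend is connected
```

The point of the sketch is that the same expression code would run unchanged against Spark and the Arrow-compatible backends.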

What version of ibis are you running?

NA

What backend(s) are you using, if any?

Relevant backends: Spark, DuckDB, DataFusion, Polars

Code of Conduct

  • I agree to follow this project's Code of Conduct
@wjones127 wjones127 added the feature Features or general enhancements label Apr 14, 2023
@cpcloud
Member

cpcloud commented Apr 15, 2023

Thanks for the issue!

These indeed would be nice features to have.

Looking at the pyiceberg implementation, I'm not sure it's ready for integration into ibis yet via the to_arrow and to_duckdb APIs.

Both of those implementations read everything into memory as a PyArrow table, and while that's convenient, it's unlikely that most heavy Iceberg users have tables that fit into their client machine's main memory.

That said, we could probably instead pluck out the list of files backing the table and hand that to DuckDB.

@lostmygithubaccount
Member

Added for DuckDB and Polars here: #6354

@cpcloud
Member

cpcloud commented Jun 13, 2023

Changing the issue title to reflect @lostmygithubaccount's work on getting read_delta and to_delta up and running for various backends.

@cpcloud cpcloud changed the title feat: read_iceberg and read_deltalake methods similar to read_parquet feat: read_iceberg and ~read_deltalake~ methods similar to read_parquet Jun 13, 2023
@cpcloud cpcloud changed the title feat: read_iceberg and ~read_deltalake~ methods similar to read_parquet feat: read_iceberg similar to read_parquet Jun 13, 2023
@cpcloud
Member

cpcloud commented Jun 13, 2023

Took another look at pyiceberg, and it's still a bit too optimistic with respect to host memory (it's still reading everything into a pyarrow Table).

If that part of it ever improves, or another API comes along that doesn't have the in-memory limitation, we can revisit adding read_iceberg.

@cpcloud cpcloud closed this as completed Jun 13, 2023
@ianmcook
Contributor

DuckDB now has an Iceberg extension under development at https://github.com/duckdblabs/duckdb_iceberg. We should take a look at that.
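For reference, that extension exposes an `iceberg_scan` table function. A sketch of what usage would look like (untested here; the table location is illustrative):

```sql
INSTALL iceberg;
LOAD iceberg;
SELECT count(*) FROM iceberg_scan('s3://bucket/warehouse/db/my_table');
```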

@lostmygithubaccount
Member

Per discussion, it's not very active and doesn't support writes. Once this extension (with at least some minimal docs) or the pyiceberg package supports writes, I think this will be easy to add, similar to Delta Lake tables.
