Error with `read_iceberg` function when using an s3 warehouse location #2004

maxime-petitjean · 2024-03-12T14:27:31Z

Describe the bug
When reading a table from iceberg (with function read_iceberg), I get a weird schema error:

ValueError: DaftError::SchemaMismatch While building a Table, we found that the Schema Field and the Series Field  did not match. schema field: col#Int64 vs series field: col#Int64

This error appears only if I use an s3 warehouse location in iceberg.

To Reproduce

Create an iceberg catalog and fill it with a test table

import pandas
import pyarrow
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "default",
    **{
        "uri": "postgresql://my-postgresql-server-url",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

catalog.create_namespace("test_database")
table = pyarrow.Table.from_pandas(pandas.DataFrame({"col": [1, 2, 3]}))
catalog_table = catalog.create_table("test_database.test_table", schema=table.schema)
catalog_table.append(table)

Read iceberg catalog test table

import daft
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "default",
    **{
        "uri": "postgresql://my-postgresql-server-url",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

table = catalog.load_table("test_database.test_table")
df = daft.read_iceberg(table)
out = df.to_pandas()
print(out)

See error

Traceback (most recent call last):                                                                                                                                                
  File "/Users/mpn/work/test/iceberg/test.py", line 79, in <module>
    out = df.to_pandas()
          ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/api_annotations.py", line 31, in _wrap
    return timed_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/analytics.py", line 189, in tracked_method
    result = method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/dataframe/dataframe.py", line 1251, in to_pandas
    pd_df = result.to_pandas(
            ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/runners/partitioning.py", line 221, in to_pandas
    merged_partition = self._get_merged_vpartition()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/runners/pyrunner.py", line 45, in _get_merged_vpartition
    return MicroPartition.concat([part for id, part in ids_and_partitions])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/table/micropartition.py", line 126, in concat
    return MicroPartition._from_pymicropartition(_PyMicroPartition.concat(micropartitions))
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: DaftError::SchemaMismatch While building a Table, we found that the Schema Field and the Series Field  did not match. schema field: col#Int64 vs series field: col#Int64

Expected behavior
If I use in warehouse configuration a local path (file:///tmp/warehouse), I have the correct result:

Desktop:

OS: macOS Sonoma (14.2.1)
Version 0.2.17

The text was updated successfully, but these errors were encountered:

jaychia · 2024-03-12T18:35:11Z

Hi @maxime-petitjean!

I think I know what this might be. We're may not be handling field metadata correctly...

Could you try this code that I modified from your snippet and print the output please?

import daft
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "default",
    **{
        "uri": "postgresql://my-postgresql-server-url",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

table = catalog.load_table("test_database.test_table")

# Inspect the schema and fields
from pyiceberg.io.pyarrow import schema_to_pyarrow
from daft.logical.schema import Schema

iceberg_schema = table.schema()
arrow_schema = schema_to_pyarrow(table.schema())
daft_schema = Schema.from_pyarrow_schema(arrow_schema)

print("Iceberg Schema:\n", iceberg_schema)
print("Converted arrow schema:\n", arrow_schema)
print("Converted Daft schema:\n", daft_schema)

maxime-petitjean · 2024-03-12T18:54:18Z

Thanks for taking a look, here is the output:

Iceberg Schema:
 table {
  1: col: optional long
}
Converted arrow schema:
 col: int64
  -- field metadata --
  PARQUET:field_id: '1'
Converted Daft schema:
╭─────────────┬───────╮
│ Column Name ┆ Type  │
╞═════════════╪═══════╡
│ col         ┆ Int64 │
╰─────────────┴───────╯

jaychia · 2024-03-12T22:36:52Z

@samster25 to reproduce:

from pyiceberg.catalog.sql import SqlCatalog

warehouse_path = "s3://eventual-data-test-bucket/test-iceberg-issue-2004/"
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///pyiceberg_test_catalog.db",
        "warehouse": warehouse_path,
    },
)
catalog_table = catalog.load_table("test_database.test_table")
daft.read_iceberg(catalog_table).to_pandas()

pyiceberg_test_catalog.db.zip

Second try:
pyiceberg_test_catalog.db.zip

* Ignores metadata when comparing Fields * Adds test that reads table written by pyiceberg closes: #2004

samster25 · 2024-03-13T05:53:07Z

@maxime-petitjean Just merged in a fix to this issue! We plan to release the next version of daft tomorrow! Will ping you when it's out :)

samster25 · 2024-03-13T19:55:24Z

@maxime-petitjean we just cut getdaft==0.2.18 which includes the fix to this issue! Give it a shot :)

ghalimi · 2024-03-14T06:48:58Z

Thanks a ton for your help guys, this is awesome.

maxime-petitjean · 2024-03-14T08:28:17Z

@maxime-petitjean we just cut getdaft==0.2.18 which includes the fix to this issue! Give it a shot :)

Tested and working, thanks!

samster25 self-assigned this Mar 12, 2024

samster25 mentioned this issue Mar 13, 2024

[BUG] skip metadata check for field equality #2006

Merged

samster25 closed this as completed in #2006 Mar 13, 2024

samster25 added a commit that referenced this issue Mar 13, 2024

[BUG] skip metadata check for field equality (#2006)

3c02ab1

* Ignores metadata when comparing Fields * Adds test that reads table written by pyiceberg closes: #2004

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error with `read_iceberg` function when using an s3 warehouse location #2004

Error with `read_iceberg` function when using an s3 warehouse location #2004

maxime-petitjean commented Mar 12, 2024

jaychia commented Mar 12, 2024

maxime-petitjean commented Mar 12, 2024

jaychia commented Mar 12, 2024 •

edited

Loading

samster25 commented Mar 13, 2024

samster25 commented Mar 13, 2024

ghalimi commented Mar 14, 2024

maxime-petitjean commented Mar 14, 2024

Error with read_iceberg function when using an s3 warehouse location #2004

Error with read_iceberg function when using an s3 warehouse location #2004

Comments

maxime-petitjean commented Mar 12, 2024

jaychia commented Mar 12, 2024

maxime-petitjean commented Mar 12, 2024

jaychia commented Mar 12, 2024 • edited Loading

samster25 commented Mar 13, 2024

samster25 commented Mar 13, 2024

ghalimi commented Mar 14, 2024

maxime-petitjean commented Mar 14, 2024

Error with `read_iceberg` function when using an s3 warehouse location #2004

Error with `read_iceberg` function when using an s3 warehouse location #2004

jaychia commented Mar 12, 2024 •

edited

Loading