Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with read_iceberg function when using an s3 warehouse location #2004

Closed
maxime-petitjean opened this issue Mar 12, 2024 · 7 comments · Fixed by #2006
Closed

Error with read_iceberg function when using an s3 warehouse location #2004

maxime-petitjean opened this issue Mar 12, 2024 · 7 comments · Fixed by #2006
Assignees

Comments

@maxime-petitjean
Copy link
Contributor

Describe the bug
When reading a table from iceberg (with function read_iceberg), I get a weird schema error:

ValueError: DaftError::SchemaMismatch While building a Table, we found that the Schema Field and the Series Field  did not match. schema field: col#Int64 vs series field: col#Int64

This error appears only if I use an s3 warehouse location in iceberg.

To Reproduce

  1. Create an iceberg catalog and fill it with a test table
import pandas
import pyarrow
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "default",
    **{
        "uri": "postgresql://my-postgresql-server-url",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

catalog.create_namespace("test_database")
table = pyarrow.Table.from_pandas(pandas.DataFrame({"col": [1, 2, 3]}))
catalog_table = catalog.create_table("test_database.test_table", schema=table.schema)
catalog_table.append(table)
  1. Read iceberg catalog test table
import daft
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "default",
    **{
        "uri": "postgresql://my-postgresql-server-url",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

table = catalog.load_table("test_database.test_table")
df = daft.read_iceberg(table)
out = df.to_pandas()
print(out)
  1. See error
Traceback (most recent call last):                                                                                                                                                
  File "/Users/mpn/work/test/iceberg/test.py", line 79, in <module>
    out = df.to_pandas()
          ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/api_annotations.py", line 31, in _wrap
    return timed_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/analytics.py", line 189, in tracked_method
    result = method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/dataframe/dataframe.py", line 1251, in to_pandas
    pd_df = result.to_pandas(
            ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/runners/partitioning.py", line 221, in to_pandas
    merged_partition = self._get_merged_vpartition()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/runners/pyrunner.py", line 45, in _get_merged_vpartition
    return MicroPartition.concat([part for id, part in ids_and_partitions])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/daft/table/micropartition.py", line 126, in concat
    return MicroPartition._from_pymicropartition(_PyMicroPartition.concat(micropartitions))
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: DaftError::SchemaMismatch While building a Table, we found that the Schema Field and the Series Field  did not match. schema field: col#Int64 vs series field: col#Int64

Expected behavior
If I use in warehouse configuration a local path (file:///tmp/warehouse), I have the correct result:

   col
0    1
1    2
2    3

Desktop:

  • OS: macOS Sonoma (14.2.1)
  • Version 0.2.17
@jaychia
Copy link
Contributor

jaychia commented Mar 12, 2024

Hi @maxime-petitjean!

I think I know what this might be. We're may not be handling field metadata correctly...

Could you try this code that I modified from your snippet and print the output please?

import daft
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "default",
    **{
        "uri": "postgresql://my-postgresql-server-url",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

table = catalog.load_table("test_database.test_table")

# Inspect the schema and fields
from pyiceberg.io.pyarrow import schema_to_pyarrow
from daft.logical.schema import Schema

iceberg_schema = table.schema()
arrow_schema = schema_to_pyarrow(table.schema())
daft_schema = Schema.from_pyarrow_schema(arrow_schema)

print("Iceberg Schema:\n", iceberg_schema)
print("Converted arrow schema:\n", arrow_schema)
print("Converted Daft schema:\n", daft_schema)

@maxime-petitjean
Copy link
Contributor Author

Thanks for taking a look, here is the output:

Iceberg Schema:
 table {
  1: col: optional long
}
Converted arrow schema:
 col: int64
  -- field metadata --
  PARQUET:field_id: '1'
Converted Daft schema:
╭─────────────┬───────╮
│ Column Name ┆ Type  │
╞═════════════╪═══════╡
│ col         ┆ Int64 │
╰─────────────┴───────╯

@samster25 samster25 self-assigned this Mar 12, 2024
@jaychia
Copy link
Contributor

jaychia commented Mar 12, 2024

@samster25 to reproduce:

from pyiceberg.catalog.sql import SqlCatalog

warehouse_path = "s3://eventual-data-test-bucket/test-iceberg-issue-2004/"
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///pyiceberg_test_catalog.db",
        "warehouse": warehouse_path,
    },
)
catalog_table = catalog.load_table("test_database.test_table")
daft.read_iceberg(catalog_table).to_pandas()

pyiceberg_test_catalog.db.zip

Second try:
pyiceberg_test_catalog.db.zip

samster25 added a commit that referenced this issue Mar 13, 2024
* Ignores metadata when comparing Fields
* Adds test that reads table written by pyiceberg

closes: #2004
@samster25
Copy link
Member

@maxime-petitjean Just merged in a fix to this issue! We plan to release the next version of daft tomorrow! Will ping you when it's out :)

@samster25
Copy link
Member

@maxime-petitjean we just cut getdaft==0.2.18 which includes the fix to this issue! Give it a shot :)

@ghalimi
Copy link

ghalimi commented Mar 14, 2024

Thanks a ton for your help guys, this is awesome.

@maxime-petitjean
Copy link
Contributor Author

@maxime-petitjean we just cut getdaft==0.2.18 which includes the fix to this issue! Give it a shot :)

Tested and working, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants