Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IOException when querying table with a list<int> column. #47

Open
daturkel opened this issue Mar 28, 2024 · 4 comments
Open

IOException when querying table with a list<int> column. #47

daturkel opened this issue Mar 28, 2024 · 4 comments

Comments

@daturkel
Copy link

daturkel commented Mar 28, 2024

Hello, thanks for your great work on duckdb-iceberg!

I'm encountering an issue where attempting to query an iceberg table with iceberg_scan fails if the table has a column which is a list of ints (list<int>).

table_with_lists # a pyiceberg.table.Table object
table_without_lists # another pyiceberg.table.Table object

duckdb.sql("INSTALL ICEBERG; LOAD ICEBERG;")

# querying the table that doesn't feature any lists works fine
duckdb.sql(f"SELECT * FROM iceberg_scan('{table_without_lists.metadata_location}') LIMIT 10")

# querying the table that *does* have lists throws an error
duckdb.sql(f"SELECT * FROM iceberg_scan('{table_with_lists.metadata_location}') LIMIT 10")
# IOException: IO Error: Invalid field found while parsing field: type

This seems to be related to an error parsing the table schema. I was wondering if there maybe was no logic for complex types in the duckdb-iceberg extension, but it appears that there is.

@samansmink
Copy link
Collaborator

Hey @daturkel!

There should be a workaround here #45. It requires installing the nightly extension though!

@daturkel
Copy link
Author

daturkel commented Apr 3, 2024

(Whoops, originally tried this without adding the skip_schema_inference argument. Tried again and it works now, update below.)

Thanks for the quick response! I tried the workaround suggested in that thread:

duckdb.sql("force install iceberg from 'http://nightly-extensions.duckdb.org'")
duckdb.sql("load iceberg")
duckdb.sql(f"SELECT * FROM iceberg_scan('{metadata_location}', skip_schema_inference=True) LIMIT 10")

and it worked perfectly!

@rustyconover
Copy link
Contributor

I think I've addressed this with #50, it should fix parsing the metadata.

@rmoff
Copy link

rmoff commented Jun 6, 2024

I'm getting the same error with DuckDB v1.0.0 1f98600c2c and some Iceberg data with nested fields

🟡◗ SELECT *
    FROM iceberg_scan('s3://my-bucket/iceberg/rmoff.db/rmoff_basket02/metadata/00640-ed419044-046c-44c0-a20a-e51f0cce381f.metadata.json');
Run Time (s): real 1.685 user 0.008519 sys 0.006796
IO Error: Invalid field found while parsing field: type

It works fine if I use skip_schema_inference=True. Here's the schema as seen by duckDB if I CTAS the iceberg table into a local one:

🟡◗ describe tmp
    ;
┌─────────────────┬────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│   column_name   │                                    column_type                                     │  null   │   key   │ default │  extra  │
│     varcharvarcharvarcharvarcharvarcharvarchar │
├─────────────────┼────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ basketId        │ VARCHAR                                                                            │ YES     │         │         │         │
│ customerId      │ VARCHAR                                                                            │ YES     │         │         │         │
│ customerName    │ VARCHAR                                                                            │ YES     │         │         │         │
│ customerAddress │ VARCHAR                                                                            │ YES     │         │         │         │
│ storeId         │ VARCHAR                                                                            │ YES     │         │         │         │
│ storeName       │ VARCHAR                                                                            │ YES     │         │         │         │
│ storeLocation   │ VARCHAR                                                                            │ YES     │         │         │         │
│ products        │ STRUCT(productName VARCHAR, quantity INTEGER, unitPrice FLOAT, category VARCHAR)[] │ YES     │         │         │         │
│ timestampVARCHAR                                                                            │ YES     │         │         │         │
└─────────────────┴────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 0.002 user 0.000879 sys 0.000218

The Iceberg manifest is:

{
  "format-version" : 2,
  "table-uuid" : "59dc9799-83e7-4bec-8a84-2a74e3abeaf6",
  "location" : "s3://rmoff/iceberg-test/rmoff.db/rmoff_basket02",
  "last-sequence-number" : 646,
  "last-updated-ms" : 1717678301424,
  "last-column-id" : 14,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "basketId",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "customerId",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 3,
      "name" : "customerName",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 4,
      "name" : "customerAddress",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 5,
      "name" : "storeId",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 6,
      "name" : "storeName",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 7,
      "name" : "storeLocation",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 8,
      "name" : "products",
      "required" : false,
      "type" : {
        "type" : "list",
        "element-id" : 10,
        "element" : {
          "type" : "struct",
          "fields" : [ {
            "id" : 11,
            "name" : "productName",
            "required" : false,
            "type" : "string"
          }, {
            "id" : 12,
            "name" : "quantity",
            "required" : false,
            "type" : "int"
          }, {
            "id" : 13,
            "name" : "unitPrice",
            "required" : false,
            "type" : "float"
          }, {
            "id" : 14,
            "name" : "category",
            "required" : false,
            "type" : "string"
          } ]
        },
        "element-required" : false
      }
    }, {
      "id" : 9,
      "name" : "timestamp",
      "required" : false,
      "type" : "string"
    } ]
  } ],
[…]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants