Cast PyArrow schema to large_* types #807

Merged (6 commits) on Jun 14, 2024
Conversation

@sungwy (Collaborator) commented Jun 11, 2024

Fixes #791

For consistency, we should always cast to large types when inferring the PyArrow schema from the Iceberg schema, and when scanning using the physical_schema of the fragment.
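As a rough sketch of the idea (the helper name and traversal below are illustrative only, not the PR's implementation; the PR adds a `_pyarrow_with_large_types` helper in pyiceberg/io/pyarrow.py), such a cast only needs to walk the schema and swap each variable-length type for its large_* counterpart:

```python
import pyarrow as pa

def to_large_types(schema: pa.Schema) -> pa.Schema:
    """Illustrative sketch: swap string/binary/list types for their large_* variants."""

    def convert(dt: pa.DataType) -> pa.DataType:
        if pa.types.is_string(dt):
            return pa.large_string()
        if pa.types.is_binary(dt):
            return pa.large_binary()
        if pa.types.is_list(dt):
            return pa.large_list(convert(dt.value_type))
        return dt

    return pa.schema(
        [pa.field(f.name, convert(f.type), nullable=f.nullable) for f in schema]
    )

# Example: binary/string fields come back as large_binary/large_string.
print(to_large_types(pa.schema([("foo", pa.binary()), ("bar", pa.string())])))
```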

@HonahX (Contributor) left a comment

Thanks for working on this! It looks great. I just have two minor comments.

To summarize the discussion in #791: we could always benefit from reading data as large_* types, since the offsets are 64-bit. For Parquet, we will still write data with non-large types due to Parquet's 2 GB data size limitation.

Just to confirm my understanding: since the major difference between large_binary and binary is the offset width (64-bit versus 32-bit), there will be no significant increase in memory usage when reading data as large_binary.
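A quick way to see this (a small check added here for illustration, not part of the PR): the value buffers are identical, and only the offsets buffer widens from 4 to 8 bytes per element:

```python
import pyarrow as pa

values = [b"abc"] * 1_000

small = pa.array(values, type=pa.binary())        # 32-bit offsets
large = pa.array(values, type=pa.large_binary())  # 64-bit offsets

# Same value data; the only extra cost is ~4 bytes of offsets per element.
print(small.nbytes, large.nbytes)
```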

pyiceberg/io/pyarrow.py (outdated)
@@ -998,7 +1026,7 @@ def _task_to_table(

fragment_scanner = ds.Scanner.from_fragment(
fragment=fragment,
schema=physical_schema,
schema=_pyarrow_with_large_types(physical_schema),

It may be good to add a comment (either here or in the method body) explaining that we read data as large_* types to improve performance.

@sungwy (Collaborator, Author) commented Jun 12, 2024

To summarize the discussion in #791: we could always benefit from reading data as large_* types, since the offsets are 64-bit.

Yes, that's how I understand it too. There are benefits to using large_* types in memory, so we can decouple the motivation for storing data in memory as large types from that for writing large types, even if the only file format PyIceberg currently supports doesn't allow writing large data yet.

For Parquet, we will still write data with non-large types due to Parquet's 2 GB data size limitation.

I think it actually won't matter either way, because we will get an error either when we try to downcast to the smaller type, or when we try to write the Parquet file with genuinely large data in the table. The important thing is to choose one approach and stay consistent, even on writes, for the following reasons:

  1. The write will fail if the schema provided to the ParquetWriter does not match the table schema (see trace (1) below).
  2. We should give users a consistent error message if they attempt to write large data (see trace (2) below).

I've updated the to_requested_schema function to always cast to large types, even on write, for consistency.
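For illustration (a minimal sketch with a made-up file path, not the PR's code), casting the in-memory table to the large-typed schema keeps it consistent with the schema the ParquetWriter was created with; Parquet itself stores the bytes the same way either way:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A table whose columns were inferred with 32-bit offset types.
table = pa.table({"foo": pa.array([b"abc"], type=pa.binary())})

# Target schema with large_* types, matching what the write path now produces.
large_schema = pa.schema([pa.field("foo", pa.large_binary())])

with pq.ParquetWriter("/tmp/example.parquet", large_schema) as writer:
    # Without the cast, write_table raises the "Table schema does not match
    # schema used to create file" ValueError shown in trace (1) below.
    writer.write_table(table.cast(large_schema))
```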

Just to confirm my understanding: since the major difference between large_binary and binary is the offset width (64-bit versus 32-bit), there will be no significant increase in memory usage when reading data as large_binary.

Yes, that's right. I've added a comment as you suggested 🙂

Some relevant error traces:
(1)

ValueError: Table schema does not match schema used to create file: 
table:
foo: large_binary
bar: string
baz: string
list: list<item: int8>
  child 0, item: int8 vs. 
file:
foo: binary
bar: string
baz: string
list: list<item: int8>
  child 0, item: int8

(2)

ArrowInvalid: Parquet cannot store strings with size 2GB or more

@HonahX (Contributor) left a comment

Thanks for the detailed explanation!

pyiceberg/io/pyarrow.py (outdated)
@sungwy (Collaborator, Author) commented Jun 14, 2024

@HonahX could I ask you to merge this in? It'll help unblock me in https://github.com/apache/iceberg-python/pull/786/files

@Fokko (Contributor) left a comment

This looks good, thanks @syun64 for adding this, and @HonahX for the review 👍

One possible addition would be to make this configurable: based on configuration, it would go with either normal or large types.

Successfully merging this pull request may close these issues.

Upcasting and Downcasting inconsistencies with PyArrow Schema