Cast PyArrow schema to large_* types #807

Merged (6 commits) on Jun 14, 2024
Conversation

@sungwy (Collaborator) commented Jun 11, 2024

Fixes #791

For consistency, we should always cast to large types when inferring the PyArrow schema from the Iceberg schema, and when scanning using the physical_schema of the fragment.
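As a rough sketch of the idea (the helper name and traversal below are illustrative only, not the PR's implementation; the PR adds a `_pyarrow_with_large_types` helper in pyiceberg/io/pyarrow.py), such a cast only needs to walk the schema and swap each variable-length type for its large_* counterpart:

```python
import pyarrow as pa

def to_large_types(schema: pa.Schema) -> pa.Schema:
    """Illustrative sketch: swap string/binary/list types for their large_* variants."""

    def convert(dt: pa.DataType) -> pa.DataType:
        if pa.types.is_string(dt):
            return pa.large_string()
        if pa.types.is_binary(dt):
            return pa.large_binary()
        if pa.types.is_list(dt):
            return pa.large_list(convert(dt.value_type))
        return dt

    return pa.schema(
        [pa.field(f.name, convert(f.type), nullable=f.nullable) for f in schema]
    )

# Example: binary/string fields come back as large_binary/large_string.
print(to_large_types(pa.schema([("foo", pa.binary()), ("bar", pa.string())])))
```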

@HonahX (Contributor) left a comment

Thanks for working on this! It looks great. I just have two minor comments.

To summarize the discussion in #791: we could always benefit from reading data as large_* types, since the offsets are 64-bit. For Parquet, we will still write data with non-large types due to Parquet's 2 GB data size limitation.

Just to confirm my understanding: since the major difference between large_binary and binary is the offset width (64-bit versus 32-bit), there will be no significant increase in memory usage when reading data as large_binary.
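A quick way to see this (a small check added here for illustration, not part of the PR): the value buffers are identical, and only the offsets buffer widens from 4 to 8 bytes per element:

```python
import pyarrow as pa

values = [b"abc"] * 1_000

small = pa.array(values, type=pa.binary())        # 32-bit offsets
large = pa.array(values, type=pa.large_binary())  # 64-bit offsets

# Same value data; the only extra cost is ~4 bytes of offsets per element.
print(small.nbytes, large.nbytes)
```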

pyiceberg/io/pyarrow.py (outdated)
@@ -998,7 +1026,7 @@ def _task_to_table(

fragment_scanner = ds.Scanner.from_fragment(
fragment=fragment,
schema=physical_schema,
schema=_pyarrow_with_large_types(physical_schema),

It may be good to add a comment (either here or in the method body) explaining that we read data as large_* types to improve performance.

@sungwy (Collaborator, Author) commented Jun 12, 2024

To summarize the discussion in #791: we could always benefit from reading data as large_* types, since the offsets are 64-bit.

Yes, that's how I understand it too. There are benefits to using large_* types in memory, so we can decouple the motivation for storing data in memory as large types from that for writing large types, even if the only file format PyIceberg currently supports doesn't allow writing large data yet.

For Parquet, we will still write data with non-large types due to Parquet's 2 GB data size limitation.

I think it actually won't matter either way, because we will get an error either when we try to downcast to the smaller type, or when we try to write the Parquet file with genuinely large data in the table. The important thing is to choose one approach and stay consistent, even on writes, for the following reasons:

  1. The write will fail if the schema provided to the ParquetWriter does not match the table schema (see trace (1) below).
  2. We should give users a consistent error message if they attempt to write large data (see trace (2) below).

I've updated the to_requested_schema function to always cast to large types, even on write, for consistency.
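For illustration (a minimal sketch with a made-up file path, not the PR's code), casting the in-memory table to the large-typed schema keeps it consistent with the schema the ParquetWriter was created with; Parquet itself stores the bytes the same way either way:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A table whose columns were inferred with 32-bit offset types.
table = pa.table({"foo": pa.array([b"abc"], type=pa.binary())})

# Target schema with large_* types, matching what the write path now produces.
large_schema = pa.schema([pa.field("foo", pa.large_binary())])

with pq.ParquetWriter("/tmp/example.parquet", large_schema) as writer:
    # Without the cast, write_table raises the "Table schema does not match
    # schema used to create file" ValueError shown in trace (1) below.
    writer.write_table(table.cast(large_schema))
```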

Just to confirm my understanding: since the major difference between large_binary and binary is the offset width (64-bit versus 32-bit), there will be no significant increase in memory usage when reading data as large_binary.

Yes, that's right. I've added a comment as you suggested 🙂

Some relevant error traces:
(1)

ValueError: Table schema does not match schema used to create file: 
table:
foo: large_binary
bar: string
baz: string
list: list<item: int8>
  child 0, item: int8 vs. 
file:
foo: binary
bar: string
baz: string
list: list<item: int8>
  child 0, item: int8

(2)

ArrowInvalid: Parquet cannot store strings with size 2GB or more

@HonahX (Contributor) left a comment

Thanks for the detailed explanation!

pyiceberg/io/pyarrow.py (outdated)
@sungwy (Collaborator, Author) commented Jun 14, 2024

@HonahX could I ask you to merge this in? It'll help unblock me in https://github.com/apache/iceberg-python/pull/786/files

@Fokko (Contributor) left a comment

This looks good, thanks @syun64 for adding this, and @HonahX for the review 👍

One possible addition would be to make this configurable: based on configuration, it would go with either normal or large types.

Successfully merging this pull request may close these issues.

Upcasting and Downcasting inconsistencies with PyArrow Schema