Cast PyArrow schema to `large_*` types #807
Conversation
Thanks for working on this! It looks great. Just two minor comments.
To summarize the discussion in #791, we could always benefit from reading data as `large_*` types, since the offsets are 64-bit. For Parquet, we will still write data in the non-large types due to Parquet's 2 GB data size limitation.

Just to confirm my understanding: since the major difference between `large_binary` and `binary` is the offset type (64-bit versus 32-bit), there will be no significant increase in memory usage when reading data as `large_binary`.
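To make the offset difference concrete, here is a small PyArrow snippet (not from the PR) comparing the offsets buffers of the two types:

```python
import pyarrow as pa

# binary stores 32-bit offsets; large_binary stores 64-bit offsets.
small = pa.array([b"a", b"bc"], type=pa.binary())
large = pa.array([b"a", b"bc"], type=pa.large_binary())

# buffers() -> [validity, offsets, data]; only the offsets buffer doubles.
print(small.buffers()[1].size)  # 12 bytes: (2 values + 1) offsets * 4 bytes
print(large.buffers()[1].size)  # 24 bytes: (2 values + 1) offsets * 8 bytes
```

The value buffer itself is identical in both cases, which is why overall memory usage stays essentially flat.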
pyiceberg/io/pyarrow.py (Outdated)

```diff
@@ -998,7 +1026,7 @@ def _task_to_table(
     fragment_scanner = ds.Scanner.from_fragment(
         fragment=fragment,
-        schema=physical_schema,
+        schema=_pyarrow_with_large_types(physical_schema),
```
It may be good to add a comment (either here or in the method body) explaining that we read data as `large_*` types to improve performance.
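For readers following along, a widening helper in this spirit might look like the sketch below. It is only an illustration, not the PR's actual `_pyarrow_with_large_types`, and it handles just the common nesting cases:

```python
import pyarrow as pa


def _with_large_types(schema: pa.Schema) -> pa.Schema:
    """Return a copy of the schema with string/binary/list fields widened to large_* variants."""
    return pa.schema(
        [pa.field(f.name, _large_type(f.type), nullable=f.nullable, metadata=f.metadata) for f in schema]
    )


def _large_type(dtype: pa.DataType) -> pa.DataType:
    if pa.types.is_string(dtype):
        return pa.large_string()
    if pa.types.is_binary(dtype):
        return pa.large_binary()
    if pa.types.is_list(dtype):
        # Recurse so nested element types are widened too.
        return pa.large_list(_large_type(dtype.value_type))
    if pa.types.is_struct(dtype):
        return pa.struct(
            [pa.field(f.name, _large_type(f.type), nullable=f.nullable) for f in dtype]
        )
    return dtype
```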
Yes, that's how I understand it too. There are benefits to using `large_*` types.
I think it actually won't matter either way, because we will get an error either when we try to downcast the type to the smaller type, or when we try to write the Parquet file while the table holds actually-large data. I think the important thing is to choose one and be consistent, even on writes, for the following reasons:
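A quick sketch of that failure mode (the overflow itself needs more than 2 GiB in a single column, so it is only described in a comment rather than executed):

```python
import pyarrow as pa

# Downcasting large_binary -> binary succeeds while the data fits in 32-bit offsets...
arr = pa.array([b"small"], type=pa.large_binary())
arr.cast(pa.binary())  # fine

# ...but once a column holds more than ~2 GiB, the same cast (or writing the
# narrow type to Parquet) fails. The exact message varies by PyArrow version,
# but it surfaces as pyarrow.lib.ArrowInvalid.
```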
I've updated
Yes, that's right. I've added a comment as you've suggested 🙂 Some relevant error traces:
Thanks for the detailed explanation!
@HonahX could I ask you to merge this in? It'll help unblock me in https://github.com/apache/iceberg-python/pull/786/files
This looks good, thanks @syun64 for adding this, and @HonahX for the review 👍
One possible addition would be to make this configurable: based on configuration, it would go with either normal or large types.
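Such a switch could hang off a read property. The sketch below uses a made-up property key (`read.use-large-types`) purely for illustration, reusing the widening helper sketched earlier:

```python
import pyarrow as pa

USE_LARGE_TYPES = "read.use-large-types"  # hypothetical property key


def scan_schema(physical_schema: pa.Schema, properties: dict) -> pa.Schema:
    # Widen to large_* types unless the (hypothetical) property opts out.
    if properties.get(USE_LARGE_TYPES, "true") == "true":
        return _with_large_types(physical_schema)
    return physical_schema
```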
Fixes #791

For consistency, we should always cast to large types when inferring the PyArrow schema from an Iceberg `Schema`, and when scanning using the `physical_schema` of the fragment.
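End to end, the effect on a reader would look roughly like this (the catalog and table names are illustrative):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # illustrative catalog name
table = catalog.load_table("db.events")  # illustrative table identifier

arrow_table = table.scan().to_arrow()
# String/binary columns come back as large_string / large_binary,
# so offsets are 64-bit regardless of how big the column gets.
print(arrow_table.schema)
```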