add_files support partitioned tables #531

Merged: 5 commits on Mar 21, 2024

@sungwy sungwy commented Mar 18, 2024

As a follow-up to #506, this PR adds support for adding files as DataFiles to partitioned tables.

Rather than parsing the file path and inferring partition values from a Hive-style partitioning scheme, which is less accurate, this approach requires the partition values to be present in the Parquet files and infers them from the lower- and upper-bound statistics in the Parquet metadata footer.

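Using the lower and upper bounds spares the client from reading the entire Parquet file, because the aggregated statistics are already available in the Parquet metadata footer. This only works when the transform preserves order: if the transformed lower and upper bounds are equal, every value between them must map to the same partition value. As a result, this implementation of add_files does not support tables whose partition transforms are not order-preserving (that is, transforms for which preserves_order is false).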
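The inference described above can be sketched in plain Python. This is a minimal illustration, not the PR's implementation: `month_transform`, `truncate_transform`, and `infer_partition_value` are hypothetical names introduced here, and the lower and upper bounds are passed in directly rather than read from a Parquet footer.

```python
from datetime import date
from typing import Any, Callable


def month_transform(d: date) -> int:
    """Hypothetical order-preserving transform: months since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)


def truncate_transform(width: int) -> Callable[[int], int]:
    """Hypothetical order-preserving transform: truncate an int to a multiple of width."""
    def apply(v: int) -> int:
        return v - (v % width)
    return apply


def infer_partition_value(lower: Any, upper: Any, transform: Callable[[Any], Any]) -> Any:
    """Infer a file's single partition value from its column bounds.

    Assumes `transform` is order-preserving, so equal transformed bounds
    imply that every value in between maps to the same partition value.
    """
    lower_partition = transform(lower)
    upper_partition = transform(upper)
    if lower_partition != upper_partition:
        raise ValueError(
            f"File spans more than one partition: {lower_partition} != {upper_partition}"
        )
    return lower_partition


# A file whose dates all fall in March 2024 maps to a single month partition.
march = infer_partition_value(date(2024, 3, 1), date(2024, 3, 31), month_transform)
# A file whose ids range from 20 to 29 maps to the truncate[10] partition 20.
truncated = infer_partition_value(20, 29, truncate_transform(10))
```

A file whose bounds straddle two partitions, for example dates from 2024-03-31 to 2024-04-01 under the month transform, raises `ValueError` in this sketch, since no single partition value can be assigned to it.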

Of the existing transforms, the following are supported as partition transforms:

  • IdentityTransform
  • TruncateTransform
  • YearTransform
  • MonthTransform
  • DayTransform
  • HourTransform

The following are not:

  • VoidTransform
  • BucketTransform
  • UnknownTransform
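To see why a hash-based transform such as BucketTransform cannot be handled from the bounds alone, consider the sketch below. `toy_bucket` is a deliberately simple, hypothetical stand-in and is not the hash that BucketTransform actually uses; it only illustrates that a transform which does not preserve order can give the same result for the lower and upper bound while values in between land elsewhere.

```python
def toy_bucket(value: int, num_buckets: int = 4) -> int:
    """Toy stand-in for a hash-based bucket transform (NOT Iceberg's real hash)."""
    return (value * 7 + 3) % num_buckets


# Pretend these are the column's lower and upper bounds from the file footer.
lower, upper = 0, 4

# Both bounds fall into the same bucket...
bounds_agree = toy_bucket(lower) == toy_bucket(upper)

# ...yet the values between them are spread over several buckets.
buckets_in_file = sorted({toy_bucket(v) for v in range(lower, upper + 1)})
```

Here `bounds_agree` is true even though `buckets_in_file` contains more than one bucket, so agreeing bounds say nothing about the rest of the file.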

@sungwy sungwy requested review from Fokko and HonahX March 18, 2024 18:46

@Fokko Fokko left a comment

@syun64 Thanks for working on this, this looks great!

@sungwy sungwy requested a review from Fokko March 19, 2024 15:08

sungwy commented Mar 19, 2024

> @syun64 Thanks for working on this, this looks great!

Thank you very much for the detailed review, @Fokko. I've adopted all of your review comments 👍 and would appreciate another round of review!

@Fokko Fokko left a comment

This looks good, thanks again for the work @syun64

@Fokko Fokko merged commit 6989b92 into apache:main Mar 21, 2024
7 checks passed

sungwy commented Mar 21, 2024

> This looks good, thanks again for the work @syun64

Thank you! As always! @Fokko

@sungwy sungwy deleted the add-files-partitioned branch March 21, 2024 20:28
@sungwy sungwy added this to the PyIceberg 0.7.0 release milestone Jul 20, 2024