
Allow setting write.parquet.row-group-limit #1016

Merged: 5 commits merged on Aug 8, 2024

Conversation

@Fokko (Contributor) commented Aug 7, 2024

And update the docs

Fixes #1013

@Fokko Fokko force-pushed the fd-allow-setting-max-row-group-size branch from 5b91696 to 46afeaf on August 7, 2024 15:00
@Fokko Fokko added this to the PyIceberg 0.7.1 release milestone Aug 7, 2024
@sungwy (Collaborator) commented Aug 7, 2024

LGTM @Fokko - merging in the change from main to resolve the conflict in the docs

@Fokko (Contributor, Author) commented Aug 8, 2024

Also threw in a test here 👍

@sungwy sungwy merged commit debda66 into apache:main Aug 8, 2024
7 checks passed
@@ -32,6 +32,7 @@ Iceberg tables support table properties to configure table behavior.
| -------------------------------------- | --------------------------------- | ------- | ------------------------------------------------------------------------------------------- |
| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression codec. |
| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg |
| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group |
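As a rough illustration of what the new table property controls (this is a sketch of the arithmetic, not PyIceberg's actual implementation), the row-group limit is an upper bound on rows per Parquet row group, so a write with more rows than the limit is split across several groups:

```python
import math

# Default for write.parquet.row-group-limit, per the docs table above.
ROW_GROUP_LIMIT = 1048576

def row_group_count(num_rows: int, limit: int = ROW_GROUP_LIMIT) -> int:
    # Illustrative helper (not a PyIceberg API): assuming rows are packed
    # up to the limit, this is the number of row groups a write produces.
    return max(1, math.ceil(num_rows / limit))

print(row_group_count(1048576))  # 1: exactly one full row group
print(row_group_count(1048577))  # 2: one row spills into a second group
```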

@zhongyujiang commented:
@Fokko @sungwy Thanks, I believe this has resolved my issue #1012 as well.

However, I'd like to point out that this option already appears in the doc, right after `write.parquet.dict-size-bytes`. The UI doesn't let me comment there, so please expand the collapsed area to see it.

Also, I'm curious why the default value used this time is significantly larger than the previous one.

Collaborator reply:

Thank you for flagging this @zhongyujiang - I'll remove the second entry below, the one with the older default value.

To my understanding, the new value is the correct default: it matches the default of the PyArrow ParquetWriter: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
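For context, the new default of 1048576 is simply 2**20 rows. A minimal sketch of overriding it (the property name comes from the docs diff above; treating it here as a plain string-keyed table property is an assumption, not the verified PyIceberg API):

```python
# Hypothetical properties dict: the key is from the docs table in this PR;
# how it is passed to a real catalog/table call is not shown here.
properties = {"write.parquet.row-group-limit": str(2**20)}

limit = int(properties["write.parquet.row-group-limit"])
print(limit)  # 1048576
```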

sungwy added a commit that referenced this pull request Aug 9, 2024
* Allow setting `write.parquet.row-group-limit`

And update the docs

* Add test

* Make ruff happy

---------

Co-authored-by: Sung Yun <[email protected]>
Successfully merging this pull request may close these issues:

NotImplementedError: Parquet writer option(s) ['write.parquet.row-group-size-bytes'] not implemented
3 participants