Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Expose max_row_group_length option for Parquet writer #589

Merged
merged 2 commits into from
Nov 7, 2024

Conversation

disq
Copy link
Member

@disq disq commented Nov 6, 2024

No description provided.

@disq disq requested a review from a team as a code owner November 6, 2024 13:22
@disq disq requested review from marianogappa and removed request for a team November 6, 2024 13:22
@github-actions github-actions bot added the feat label Nov 6, 2024
@@ -9,13 +9,16 @@ import (
"github.com/invopop/jsonschema"
)

const defaultMaxRowGroupLength = 128 * 1024 * 1024
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason this was set to 128 mibi rows but it's actually a row count.

@@ -94,5 +115,8 @@ func (s *ParquetSpec) Validate() error {
if !slices.Contains(allowedRootRepetitions, s.RootRepetition) {
return fmt.Errorf("invalid rootRepetition: %s. Allowed values are %s", s.RootRepetition, strings.Join(allowedRootRepetitions, ", "))
}
if s.MaxRowGroupLength != nil && *s.MaxRowGroupLength <= 0 {
return fmt.Errorf("invalid: maxRowGroupLength: %v. Must be zero or positive", *s.MaxRowGroupLength)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is zero or a small number ok?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zero would be going back to the default of the arrow parquetwriter. Updated this check to be < 0

Copy link
Member

@erezrokah erezrokah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI changing this setting works see results with 1000 as the value on a table with more than 1000 rows
image

@erezrokah erezrokah added the automerge Add to automerge PRs once requirements are met label Nov 7, 2024
@kodiakhq kodiakhq bot merged commit 4a4b389 into main Nov 7, 2024
5 checks passed
@kodiakhq kodiakhq bot deleted the feat/expose-parquet-max-row-group-length-option branch November 7, 2024 21:02
kodiakhq bot pushed a commit that referenced this pull request Nov 7, 2024
🤖 I have created a release *beep* *boop*
---


## [4.5.0](v4.4.0...v4.5.0) (2024-11-07)


### Features

* Expose `max_row_group_length` option for Parquet writer ([#589](#589)) ([4a4b389](4a4b389))


### Bug Fixes

* **deps:** Update module github.com/cloudquery/plugin-sdk/v4 to v4.65.0 ([#581](#581)) ([136b836](136b836))
* **deps:** Update module github.com/cloudquery/plugin-sdk/v4 to v4.66.0 ([#582](#582)) ([5ca0a10](5ca0a10))
* **deps:** Update module github.com/cloudquery/plugin-sdk/v4 to v4.66.1 ([#583](#583)) ([69a6156](69a6156))
* **deps:** Update module github.com/cloudquery/plugin-sdk/v4 to v4.67.0 ([#584](#584)) ([cab4c19](cab4c19))
* **deps:** Update module github.com/cloudquery/plugin-sdk/v4 to v4.67.1 ([#585](#585)) ([d590743](d590743))
* **deps:** Update module github.com/cloudquery/plugin-sdk/v4 to v4.68.0 ([#586](#586)) ([b19c239](b19c239))
* **deps:** Update module github.com/cloudquery/plugin-sdk/v4 to v4.68.1 ([#587](#587)) ([3b9f708](3b9f708))
* **deps:** Update module github.com/cloudquery/plugin-sdk/v4 to v4.68.2 ([#588](#588)) ([f2be0a6](f2be0a6))
* Optimized CSV writing ([#579](#579)) ([d688971](d688971))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
kodiakhq bot pushed a commit to cloudquery/cloudquery that referenced this pull request Nov 8, 2024
#### Summary

The support for setting the group length was added in #19578 implicitly via cloudquery/filetypes#589.

This PR adds documentation for it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge Add to automerge PRs once requirements are met feat
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants