Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution #8847

Closed
stevenzwu opened this issue Oct 16, 2023 · 5 comments
Labels
Milestone

Comments

@stevenzwu
Copy link
Contributor

Apache Iceberg version

1.4.0 (latest release)

Query engine

Flink

Please describe the bug 🐞

see details from this comment: #7161 (comment)

@stevenzwu stevenzwu added this to the Iceberg 1.4.1 milestone Oct 16, 2023
@rdblue
Copy link
Contributor

rdblue commented Oct 16, 2023

@stevenzwu, can you help us understand what is a problem with this and why it should be removed from the 1.4.1 release?

@stevenzwu stevenzwu changed the title Flink: revert the automatic custom partitioner for bucketing column with hash distribution Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution Oct 16, 2023
@stevenzwu
Copy link
Contributor Author

stevenzwu commented Oct 16, 2023

@rdblue here is the recap from the discussions: #7161 (comment)

PR #7161 automatically apply the custom bucketing partitioner to distribute buckets to writer tasks in a balanced way. It only looks at the bucket column (ignoring other partition columns) with the assumption that the bucket column is the main thing we need to distribute.

But a user reports that they have a partition spec like date, hour, minute, bucket(8). PR #7161 imposed a new default behavior that changed the distribution from simple keyBy on tuples with all partition columns to a custom partitioner with only bucket column. To me, the partition strategy is questionable. Bucket column is used here mainly to work around the OOM issue caused by skewed data distribution across partition columns and unbalanced value distribution from simple keyBy.

In the end, I feel it is safer to revert the behavior change from PR #7161 and ask users to manually apply the customer partitioner for the bucket column. Previously, we were thinking about automatically enable it when the partition spec has a bucketing column and hash distribution is set in table property.

@rdblue
Copy link
Contributor

rdblue commented Oct 17, 2023

Thanks, @stevenzwu! I agree that reverting the behavior change makes the most sense. We should be careful about default behavior changes and rolling back the change (but not the feature) sounds reasonable.

@stevenzwu
Copy link
Contributor Author

We should be careful about default behavior changes

agree. This is on me with the wrong assumption that bucketing column is the only thing need to be distributed.

@nastra
Copy link
Contributor

nastra commented Oct 17, 2023

Closing this as #8848 has been merged to main and I backported it to 1.4.x in #8858

@nastra nastra closed this as completed Oct 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants