Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution #8847
Comments
@stevenzwu, can you help us understand what the problem is with this and why it should be removed from the 1.4.1 release?
@rdblue here is the recap from the discussions: #7161 (comment)

PR #7161 automatically applies the custom bucketing partitioner to distribute buckets to writer tasks in a balanced way. It looks only at the bucket column (ignoring other partition columns), on the assumption that the bucket column is the main thing we need to distribute.

But a user reported that they have a partition spec like date, hour, minute, bucket(8). PR #7161 imposed a new default behavior that changed the distribution from a simple keyBy on tuples of all partition columns to a custom partitioner that uses only the bucket column. To me, the partition strategy is questionable. The bucket column is used here mainly to work around the OOM issue caused by skewed data distribution across partition columns and the unbalanced value distribution from a simple keyBy.

In the end, I feel it is safer to revert the behavior change from PR #7161 and ask users to manually apply the custom partitioner for the bucket column. Previously, we were thinking about enabling it automatically when the partition spec has a bucketing column and hash distribution is set in a table property.
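To make the difference between the two routing keys concrete, here is a minimal sketch in plain Java. It is a simplified model, not the code from PR #7161 or Flink's keyBy: the class `DistributionSketch`, its methods, the modulo-based writer selection, and the sample date are all assumptions made for illustration. It only shows that one key depends on every partition column while the other depends on the bucket value alone.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Simplified model only: NOT Iceberg's or Flink's actual routing code.
// All names and the modulo-based writer selection are hypothetical.
final class DistributionSketch {

    // Models a keyBy on the full partition tuple (date, hour, minute, bucket).
    static int fullTupleWriter(String date, int hour, int minute, int bucket, int writers) {
        return Math.floorMod(Objects.hash(date, hour, minute, bucket), writers);
    }

    // Models routing that looks only at the bucket value and ignores
    // date, hour, and minute.
    static int bucketOnlyWriter(int bucket, int writers) {
        return Math.floorMod(bucket, writers);
    }

    // Counts the writers that receive data for one hour of minute-level
    // partitions under the chosen routing model.
    static int distinctWriters(boolean bucketOnly, int buckets, int writers) {
        Set<Integer> used = new HashSet<>();
        for (int minute = 0; minute < 60; minute++) {
            for (int bucket = 0; bucket < buckets; bucket++) {
                int writer = bucketOnly
                    ? bucketOnlyWriter(bucket, writers)
                    : fullTupleWriter("2023-10-16", 9, minute, bucket, writers);
                used.add(writer);
            }
        }
        return used.size();
    }

    public static void main(String[] args) {
        // Spec modeled here: date, hour, minute, bucket(8), with 16 writers.
        System.out.println("full tuple writers:  " + distinctWriters(false, 8, 16));
        System.out.println("bucket only writers: " + distinctWriters(true, 8, 16));
    }
}
```

In this model, every record with the same bucket value lands on the same writer regardless of its date, hour, or minute, whereas the full-tuple key spreads the same records according to all four columns. The real partitioner may assign writers differently (for example, when there are more writers than buckets), so treat the numbers printed here as a property of the sketch only.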
Thanks, @stevenzwu! I agree that reverting the behavior change makes the most sense. We should be careful about default behavior changes and rolling back the change (but not the feature) sounds reasonable.
Agreed. This is on me; I wrongly assumed that the bucketing column is the only thing that needs to be distributed.
Apache Iceberg version
1.4.0 (latest release)
Query engine
Flink
Please describe the bug 🐞
See details in this comment: #7161 (comment)