Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution #8847

stevenzwu · 2023-10-16T23:00:26Z

Apache Iceberg version

1.4.0 (latest release)

Query engine

Flink

Please describe the bug 🐞

see details from this comment: #7161 (comment)

rdblue · 2023-10-16T23:15:43Z

@stevenzwu, can you help us understand what is a problem with this and why it should be removed from the 1.4.1 release?

stevenzwu · 2023-10-16T23:38:06Z

@rdblue here is the recap from the discussions: #7161 (comment)

PR #7161 automatically apply the custom bucketing partitioner to distribute buckets to writer tasks in a balanced way. It only looks at the bucket column (ignoring other partition columns) with the assumption that the bucket column is the main thing we need to distribute.

But a user reports that they have a partition spec like date, hour, minute, bucket(8). PR #7161 imposed a new default behavior that changed the distribution from simple keyBy on tuples with all partition columns to a custom partitioner with only bucket column. To me, the partition strategy is questionable. Bucket column is used here mainly to work around the OOM issue caused by skewed data distribution across partition columns and unbalanced value distribution from simple keyBy.

In the end, I feel it is safer to revert the behavior change from PR #7161 and ask users to manually apply the customer partitioner for the bucket column. Previously, we were thinking about automatically enable it when the partition spec has a bucketing column and hash distribution is set in table property.

rdblue · 2023-10-17T00:14:19Z

Thanks, @stevenzwu! I agree that reverting the behavior change makes the most sense. We should be careful about default behavior changes and rolling back the change (but not the feature) sounds reasonable.

stevenzwu · 2023-10-17T04:50:03Z

We should be careful about default behavior changes

agree. This is on me with the wrong assumption that bucketing column is the only thing need to be distributed.

nastra · 2023-10-17T16:03:20Z

Closing this as #8848 has been merged to main and I backported it to 1.4.x in #8858

stevenzwu added this to the Iceberg 1.4.1 milestone Oct 16, 2023

stevenzwu added the flink label Oct 16, 2023

stevenzwu changed the title ~~Flink: revert the automatic custom partitioner for bucketing column with hash distribution~~ Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution Oct 16, 2023

stevenzwu mentioned this issue Oct 17, 2023

Flink: Reverting the default custom partitioner for bucket column #8848

Merged

nastra closed this as completed Oct 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution #8847

Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution #8847

stevenzwu commented Oct 16, 2023

rdblue commented Oct 16, 2023

stevenzwu commented Oct 16, 2023 •

edited

Loading

rdblue commented Oct 17, 2023

stevenzwu commented Oct 17, 2023

nastra commented Oct 17, 2023

Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution #8847

Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution #8847

Comments

stevenzwu commented Oct 16, 2023

Apache Iceberg version

Query engine

Please describe the bug 🐞

rdblue commented Oct 16, 2023

stevenzwu commented Oct 16, 2023 • edited Loading

rdblue commented Oct 17, 2023

stevenzwu commented Oct 17, 2023

nastra commented Oct 17, 2023

stevenzwu commented Oct 16, 2023 •

edited

Loading