
Core, Data, Spark 3.5: Support file and partition delete granularity #9384

Merged (2 commits into apache:main) on Jan 2, 2024

Conversation

aokolnychyi (Contributor):

This PR adds support for file and partition delete granularity, allowing users to pick between the two.

Under partition granularity, delete writers are allowed to group deletes for different data files into one delete file. This strategy tends to reduce the total number of delete files in the table. However, it may lead to the assignment of irrelevant deletes to some data files during job planning. All irrelevant deletes are filtered out by readers but add extra overhead, which can be mitigated via caching.

Under file granularity, delete writers always organize deletes by their target data file, creating separate delete files for each referenced data file. This strategy ensures that job planning does not assign irrelevant deletes to data files. However, it also increases the total number of delete files in the table and may require a more aggressive approach to delete file compaction.

Currently, this configuration is only applicable to position deletes.

Each granularity has its own benefits and drawbacks and should be picked based on the use case. Regardless of the chosen granularity, regular delete compaction remains necessary. It is also possible to use one granularity for ingestion and another for table maintenance.
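For illustration, here is a minimal sketch of how a table could opt into file granularity once this lands, assuming the write.delete.granularity property added in this PR and the existing Table.updateProperties() API; the lowercase value "file" assumes the enum's toString form discussed further down.

import org.apache.iceberg.Table;

// Sketch: switch an existing table from the default partition granularity to
// file granularity. Obtaining the Table handle is left to the surrounding code.
static void useFileGranularity(Table table) {
  table.updateProperties()
      .set("write.delete.granularity", "file")
      .commit();
}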

After

Benchmark                                                                                                                          Mode  Cnt           Score             Error   Units
ParquetWritersBenchmark.writeUnpartitionedClusteredPositionDeleteWriterFileGranularity                                               ss    5           2.751 ±           0.097    s/op
ParquetWritersBenchmark.writeUnpartitionedClusteredPositionDeleteWriterPartitionGranularity                                          ss    5           2.329 ±           0.114    s/op
ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriterFileGranularity                                                  ss    5           3.602 ±           0.085    s/op
ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriterPartitionGranularity                                             ss    5           3.098 ±           0.110    s/op
ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriterShuffledFileGranularity                                          ss    5           3.561 ±           0.108    s/op
ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriterShuffledPartitionGranularity                                     ss    5           3.587 ±           0.142    s/op

Before

Benchmark                                                                                                  Mode  Cnt           Score             Error   Units
ParquetWritersBenchmark.writeUnpartitionedClusteredPositionDeleteWriter                                      ss    5           2.279 ±           0.107    s/op
ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriter                                         ss    5           3.052 ±           0.075    s/op
ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriterShuffled                                 ss    5           3.645 ±           0.081    s/op

@@ -334,6 +335,9 @@ private TableProperties() {}
public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms";
public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE;

public static final String DELETE_GRANULARITY = "write.delete.granularity";
Contributor Author:

I am still debating the property name. As it stands today, it will apply only to position deletes, but I am not sure the name has to reflect that.

@aokolnychyi (Contributor, Author) commented Dec 27, 2023:

Also, we would probably want to always use file granularity for Flink position deletes to solve compaction issues. This property then becomes more of a recommendation.

Any feedback is appreciated.

Contributor:

Maybe just write.position-delete.granularity? I prefer a more precise name that limits the scope of its usage.

A while ago I ran into an issue with adjusting the row-group size of Parquet position delete files. I wanted to tune the default row-group size of Parquet position deletes for the tables I manage to speed up queries (more details in issue #9149). However, I found that write.delete.parquet.row-group-size-bytes, which controls the row-group size of Parquet position delete files, also controls the row-group size of equality delete files, and the row-group sizes appropriate for these two types of delete files are clearly not the same.

Because we also use equality deletes when the data size is small, I cannot simply set a default value of write.delete.parquet.row-group-size-bytes for new tables. I can only adjust it according to the specific use of each table, which is inconvenient.

In fact, I do not think it is appropriate for one parameter to control the row-group size of both position delete files and equality delete files, so I created #9177 to add a separate parameter for position delete files that only write the file_path and pos columns.

Back to this: IIUC, if we later add a granularity setting for equality deletes, they will most likely need a different granularity than position deletes, since the two have different characteristics. So I think we'd better make the distinction right from the start. What do you think?

Contributor Author:

To be honest, I doubt we will ever support this property for equality deletes.

In general, I do get that we may want to configure position and equality deletes differently. We can explore adding an extra namespace. I am still not sure this use case falls into that bucket.

@rdblue @RussellSpitzer @zhongyujiang, thoughts? Do we want a prefix for this config to make it explicit that it only applies to position deletes? Currently, I only note that in the docs.

Contributor:

This option makes no sense for equality deletes because they aren't targeted at a single file, so I agree that we won't support it for equality. This is also mostly advisory. It is unlikely that we will support it in Flink and will instead always use file-level granularity. Maybe we won't even want to support this in the long term, if we decide that Spark performs better with file granularity at all times.

I guess where I'm at for this is that I would probably not worry much about it -- but also not add it to documentation since we will probably not want people setting it themselves. I think I'd leave it as write.delete.granularity.

Contributor Author:

The idea not to document it for now is reasonable given that it acts like a recommendation and we are not sure we want to support it in the future. Let's keep the name as is then.

Adding a way to configure position and equality deletes separately is another discussion.

writerFactory, fileFactory, io, targetFileSizeInBytes, spec, partition);
return new SortingPositionOnlyDeleteWriter<>(delegate);
return new SortingPositionOnlyDeleteWriter<>(
() ->
Contributor Author:

I don't like how this part is formatted but I don't have a better way. Ideas welcome.

* Despite the chosen granularity, regular delete compaction remains necessary. It is also possible
* to use one granularity for ingestion and another one for table maintenance.
*/
public enum DeleteGranularity {
Contributor:

What is the granularity currently? File? What is the impact on Flink writers?

Contributor Author:

The current behavior is partition granularity. The new default will match the existing behavior.

There is no immediate impact on Flink writes. Equality deletes can only be written with partition granularity at the moment. That said, we should make Flink write position deletes with file granularity no matter what to solve the concurrent data compaction issue.

Contributor Author:

Flink uses the old writer API right now. We will follow up to change that.

* by file and position as required by the spec. If there is no external process to order the
* records, consider using {@link SortingPositionOnlyDeleteWriter} instead.
*/
public class TargetedPositionDeleteWriter<T>
Contributor:

DataFileTargetedPositionDeleteWriter?

Contributor Author:

I tend to prefer shorter names when possible, given our new 100-character line length limit. Do you think this name would be easier to understand?

Contributor:

I guess the naming needs to somehow reflect the granularity, which would make things clearer.

Contributor Author:

I agree, let me think a bit more about this. If you have more ideas, please share them as well!

@rdblue (Contributor) commented Jan 1, 2024:

ClusteredFilePosDeleteWriter?

Contributor Author:

I believe Clustered is something we use for PartitioningWriter implementations to indicate that incoming records are grouped by spec and partition. If we use that prefix in this context, it may be a bit misleading.

I renamed this class to FileScopedPositionDeleteWriter. Let me know what you think.

Member:

I think FileScoped works, or if you want a whole new name, PerFilePositionDeleteFileWriter?

@aokolnychyi changed the title from "Core, Spark 3.5: Support file and partition delete granularity" to "Core, Data, Spark 3.5: Support file and partition delete granularity" on Dec 27, 2023
return true;
}

for (int index = s1Length - 1; index >= 0; index--) {
Contributor Author:

Comparing the paths from the end, as the prefix is usually the same and long.
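For readers following along, a sketch of the reverse comparison this refers to; the signature matches the diff further below, while the length check is an assumption based on the surrounding snippet.

// Sketch: compare two paths character by character from the end, since delete and
// data file paths in the same table usually share a long common prefix.
private static boolean unequal(CharSequence s1, CharSequence s2) {
  int s1Length = s1.length();
  int s2Length = s2.length();

  if (s1Length != s2Length) {
    return true;
  }

  for (int index = s1Length - 1; index >= 0; index--) {
    if (s1.charAt(index) != s2.charAt(index)) {
      return true;
    }
  }

  return false;
}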

@@ -708,4 +709,15 @@ private long sparkAdvisoryPartitionSize() {
private double shuffleCompressionRatio(FileFormat outputFileFormat, String outputCodec) {
return SparkCompressionUtil.shuffleCompressionRatio(spark, outputFileFormat, outputCodec);
}

public DeleteGranularity deleteGranularity() {
Contributor Author:

I did not add a SQL property because I am not sure it makes sense at the session level. Thoughts?

Contributor:

Sounds reasonable to me. Config properties are more surface area to support.
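For context, a hedged sketch of how the write-option and table-property resolution could look in SparkWriteConf. The confParser chain mirrors how other options in this class are resolved; the option constant and the fromString helper are assumptions, not the exact code in the PR.

// Sketch: resolve the granularity from the per-write option first, then the table
// property, then the default. There is deliberately no SQL/session property.
public DeleteGranularity deleteGranularity() {
  String valueAsString =
      confParser
          .stringConf()
          .option(SparkWriteOptions.DELETE_GRANULARITY)       // assumed write option name
          .tableProperty(TableProperties.DELETE_GRANULARITY)  // write.delete.granularity
          .defaultValue(TableProperties.DELETE_GRANULARITY_DEFAULT)
          .parse();
  return DeleteGranularity.fromString(valueAsString);         // assumed parsing helper
}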

@@ -334,6 +335,9 @@ private TableProperties() {}
public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms";
public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE;

public static final String DELETE_GRANULARITY = "write.delete.granularity";
Contributor:

Should we add a document for this table property?

Contributor Author:

Will do, same for the write option.

Contributor Author:

Added.

Contributor Author:

Reverted based on this discussion.

@jerqi (Contributor) commented Dec 27, 2023:

One question: Iceberg has the rewritePositionDeletesAction. Will this PR affect that action?

@@ -334,6 +335,9 @@ private TableProperties() {}
public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms";
public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE;

public static final String DELETE_GRANULARITY = "write.delete.granularity";
public static final String DELETE_GRANULARITY_DEFAULT = DeleteGranularity.PARTITION.toString();
Contributor:

Nit: I'd use a string so that we are forced to continue supporting it, like the other defaults. This would technically allow someone to change PARTITION in the code without breaking anything, even though it would change the property value.

Contributor:

Ah, I see you override toString so it's probably fine.

Contributor Author:

I started with a string constant but then saw what we did for RowLevelOperationMode and decided to follow that for consistency.
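For reference, a sketch of the enum-backed default with an overridden toString, following the RowLevelOperationMode pattern mentioned above; the exact lowercase string values are assumptions.

public enum DeleteGranularity {
  FILE,
  PARTITION;

  @Override
  public String toString() {
    switch (this) {
      case FILE:
        return "file";
      case PARTITION:
        return "partition";
      default:
        throw new IllegalArgumentException("Unknown delete granularity: " + this);
    }
  }
}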

@@ -60,7 +72,7 @@ public void write(PositionDelete<T> positionDelete) {

@Override
public long length() {
return writer.length();
return result != null ? result.totalFileSizeInBytes() : 0L;
Contributor:

Can you note this behavior in the Javadoc? I think it is correct to only produce the size once it has been closed and produces a result, since that would avoid any problem from wrapping this in a RollingFileWriter. But it is still unexpected that this isn't accurate during the write.

Contributor Author:

I'll probably switch to not implementing it at all, just like we do in the other writer.

List<CharSequence> paths = Lists.newArrayList(positionsByPath.keySet());
paths.sort(Comparators.charSequences());
return paths;
private Iterable<CharSequence> sort(Iterable<CharSequence> paths) {
Contributor:

What about using a Collection for the incoming paths so you can test its size? Then you could check whether the size is 1 and avoid copying the list.
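A sketch of the suggested change, reusing the Lists and Comparators calls from the snippet above (illustrative, not necessarily the exact code that was committed).

// Sketch: accept a Collection so the size is known up front, and skip the copy
// and sort when at most one data file is referenced.
private Collection<CharSequence> sort(Collection<CharSequence> paths) {
  if (paths.size() <= 1) {
    return paths;
  } else {
    List<CharSequence> sortedPaths = Lists.newArrayList(paths);
    sortedPaths.sort(Comparators.charSequences());
    return sortedPaths;
  }
}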

}
}

private static boolean unequal(CharSequence s1, CharSequence s2) {
Contributor:

It would be nice to put this in CharSeqComparator for reuse. I think it is fine that it doesn't worry about high surrogates.

Contributor Author:

Our CharSeqComparator is private. I've put this into a new utility class.


@Override
public long length() {
throw new UnsupportedOperationException(getClass().getName() + " does not implement length");
Contributor:

If this doesn't need to be implemented, should we avoid implementing it in SortingPositionOnlyDeleteWriter?

Contributor Author:

We don't need to because this writer wraps the rolling writer, not the other way around.

this.deleteFiles = Lists.newArrayList();
this.referencedDataFiles = CharSequenceSet.empty();
}

@Override
protected FileWriter<PositionDelete<T>, DeleteWriteResult> newWriter(
PartitionSpec spec, StructLike partition) {
switch (granularity) {
case FILE:
return new TargetedPositionDeleteWriter<>(() -> newRollingWriter(spec, partition));
@rdblue (Contributor) commented Jan 1, 2024:

This doesn't need to use rolling writers, right? The sorting writer won't ever roll because its length is 0L until it is closed.

@aokolnychyi (Contributor, Author) commented Jan 2, 2024:

We can actually roll correctly here because this is the "clustered" path. We are not going to use the sorting writer and will not buffer deletes. We can also roll correctly in the "fanout" path because the sorting writer acts as a wrapper on top of the rolling writer.
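To make the two composition orders concrete, a rough sketch; the class names come from the diff, and the supplier-based constructors mirror the snippets above, but the exact signatures are assumptions.

// Clustered path: deletes arrive grouped by data file, so the file-scoped writer
// sits on top of a rolling writer supplier and files can roll while writing.
FileWriter<PositionDelete<T>, DeleteWriteResult> clustered =
    new TargetedPositionDeleteWriter<>(() -> newRollingWriter(spec, partition));

// Fanout path: deletes arrive unordered, so the sorting writer buffers and sorts
// them and only flushes to the wrapped rolling writer on close; rolling happens
// inside that wrapped writer.
FileWriter<PositionDelete<T>, DeleteWriteResult> fanout =
    new SortingPositionOnlyDeleteWriter<>(() -> newRollingWriter(spec, partition));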

@rdblue (Contributor) commented Jan 1, 2024:

Looks good overall. Thanks for adding this!

/**
* An enum that represents the granularity of deletes.
*
* <p>Under partition granularity, delete writers are allowed to group deletes for different data
Member:

"are allowed"? Perhaps we should say something like "are directed to group deletes". I think the text in this doc goes a bit back and forth between saying that the delete writers will do something and saying that they may do something.

I think it may also help to express these as (many data files -> one delete file) and (one data file -> one delete file), or something like that?

Contributor Author:

Makes sense.

*
* <p>Under partition granularity, delete writers are allowed to group deletes for different data
* files into one delete file. This strategy tends to reduce the total number of delete files in the
* table. However, it may lead to the assignment of irrelevant deletes to some data files during the
Member:

Potential rewrite? Trying to make this a bit more directly worded:

However, a scan for a single data file will require reading delete information for multiple data files in the partition even if those other files are not required for the scan. This information will be ignored during the reads but reading this extra delete information will cause overhead. The overhead can potentially be mitigated via delete file caching (link here?).

Contributor Author:

I like it, incorporated.

*
* <p>Under file granularity, delete writers always organize deletes by their target data file,
* creating separate delete files for each referenced data file. This strategy ensures the job
* planning does not assign irrelevant deletes to data files. However, it also increases the total
Member:

"to data files, which means no potentially extraneous delete information will be read, unlike with partition granularity"?

Contributor Author:

Rewrote this part as well.

* <p>Currently, this configuration is only applicable to position deletes.
*
* <p>Each granularity has its own benefits and drawbacks and should be picked based on a use case.
* Despite the chosen granularity, regular delete compaction remains necessary. It is also possible
@RussellSpitzer (Member) commented Jan 2, 2024:

Despite -> Regardless of the

or maybe
"Regular delete compaction is still required regardless of which granularity is chosen."

Contributor Author:

Switched.

@RussellSpitzer (Member) left a comment:

+1. I think we can tighten up the Javadoc a bit, but all the code and tests look good.

@aokolnychyi (Contributor, Author):

One question: Iceberg has the rewritePositionDeletesAction. Will this PR affect that action?

@jerqi, yes, it will. There is a new test in TestRewritePositionDeleteFilesAction.

@aokolnychyi merged commit e7999a1 into apache:main on Jan 2, 2024
42 checks passed
@aokolnychyi (Contributor, Author):

Thanks for reviewing, @jerqi @zinking @zhongyujiang @rdblue @RussellSpitzer!

aokolnychyi added a commit that referenced this pull request Feb 1, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024