Spark 3.4: Add write options to override the compression properties of the table #8313
Conversation
@chenjunjiedada @ConeyLiu Could you help me review this PR?
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestCompressionSettings.java
Thanks for your review. I have added the write option documentation for Spark.
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkWriteConf.java
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestCompressionSettings.java
+1
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class TestCompressionSettings {
The project is currently in the process of moving away from JUnit 4, meaning that new tests should be written purely in JUnit 5 using AssertJ assertions. See also https://iceberg.apache.org/contribute/#testing for additional info.
Thanks for your review. I got it. I have migrated the test to the JUnit 5 framework.
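For context, a minimal sketch of the JUnit 5 + AssertJ style being asked for; the class name and parameter values below are illustrative, not the actual TestCompressionSettings code.

import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class CompressionSettingsStyleSketch {

  // JUnit 5 parameterizes per method via @ParameterizedTest instead of @RunWith(Parameterized.class).
  @ParameterizedTest
  @ValueSource(strings = {"zstd", "gzip", "snappy"})
  void codecNameIsNotBlank(String codec) {
    // Stand-in assertion; the real test inspects the codec of the files it writes.
    assertThat(codec).isNotBlank();
  }
}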
I need to add some new unit tests for SparkPositionDeltaWrite and SparkPositionDeletesRewrite.java, which requires extending SparkCatalogTestBase. SparkCatalogTestBase still uses JUnit 4, so I will keep the tests on JUnit 4 to avoid conflicts.
@@ -63,4 +63,9 @@ private SparkSQLProperties() {}
// Controls the WAP branch used for write-audit-publish workflow.
// When set, new snapshots will be committed to this branch.
public static final String WAP_BRANCH = "spark.wap.branch";

// Controls write compress options
public static final String COMPRESSION_CODEC = "spark.sql.iceberg.write.compression-codec";
We rarely use extra write and read namespaces in SQL properties; the only exception is preserving grouping, as it was not clear otherwise. What about dropping write from all names?
spark.sql.iceberg.compression-codec
spark.sql.iceberg.compression-level
spark.sql.iceberg.compression-strategy
Fixed.
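For illustration, the renamed constants would presumably look like the following (a sketch matching the suggestion above, not a verbatim copy of the final SparkSQLProperties file):

// Write compression options, without the extra "write" namespace.
public static final String COMPRESSION_CODEC = "spark.sql.iceberg.compression-codec";
public static final String COMPRESSION_LEVEL = "spark.sql.iceberg.compression-level";
public static final String COMPRESSION_STRATEGY = "spark.sql.iceberg.compression-strategy";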
@@ -76,6 +79,12 @@ class SparkFileWriterFactory extends BaseFileWriterFactory<InternalRow> {
this.dataSparkType = dataSparkType;
this.equalityDeleteSparkType = equalityDeleteSparkType;
this.positionDeleteSparkType = positionDeleteSparkType;
Optional: What about inlining if it fits on one line?
...
this.equalityDeleteSparkType = equalityDeleteSparkType;
this.positionDeleteSparkType = positionDeleteSparkType;
this.writeProperties = writeProperties != null ? writeProperties : ImmutableMap.of();
Fixed.
.isEqualToIgnoringCase(properties.get(COMPRESSION_CODEC));
}

if (PARQUET.equals(format)) {
I find that we can't set a table property for the RewritePositionDeletes action. The RewritePositionDeletes action will always use Parquet as the format.
That seems like an issue worth looking into after this change.
cc @szehon-ho.
OK, I will investigate it.
I found the root cause of this case. The class BaseMetadataTable overrides the method properties:

@Override
public Map<String, String> properties() {
  return ImmutableMap.of();
}

Option 1: modify the class BaseMetadataTable. #8428
Option 2: modify the class PositionDeletesTable. #8429

I prefer option 1. I feel the properties of a metadata table should respect those of the base table, and we also don't have a way to modify the properties of a metadata table. @aokolnychyi @szehon-ho WDYT?
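A rough sketch of what option 1 could look like, assuming BaseMetadataTable exposes the underlying base table via table(); the actual change is in #8428:

// Sketch only: delegate to the base table's properties instead of returning an empty map.
@Override
public Map<String, String> properties() {
  return table().properties();
}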
Hi, yea I am ok with option 1 as well, @aokolnychyi what do you think?
@aokolnychyi I have addressed the comments. Could you take another look if you have time?
@nastra @aokolnychyi Gently ping. Sorry to bother you. Could you have another look if you have time?
Let me take another look.
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/WritePropertiesUtil.java
This change looks good to me. I am not sure about adding a new utility class vs just using an existing class.
The second solution seems more elegant. Thanks for your suggestion. I have refactored the code to follow the suggestion.
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
    writeProperties.put(ORC_COMPRESSION_STRATEGY, orcCompressionStrategy());
    break;
  default:
    throw new IllegalArgumentException(String.format("Unknown file format %s", format));
I wonder whether it is worth failing the query. Maybe, just do nothing here?
I referred to Flink's implementation in #6049. Flink chooses to fail the query. Failing the query also has benefits: we can find a wrong config option more easily.
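For reference, a hedged sketch of the pattern under discussion: build per-format write properties and fail fast on an unrecognized format. Constant and method names mirror the snippets in this thread; avroCompressionCodec() and orcCompressionCodec() are assumed here.

Map<String, String> writeProperties = Maps.newHashMap();

switch (format) {
  case PARQUET:
    writeProperties.put(PARQUET_COMPRESSION, parquetCompressionCodec());
    break;
  case AVRO:
    writeProperties.put(AVRO_COMPRESSION, avroCompressionCodec());
    break;
  case ORC:
    writeProperties.put(ORC_COMPRESSION, orcCompressionCodec());
    writeProperties.put(ORC_COMPRESSION_STRATEGY, orcCompressionStrategy());
    break;
  default:
    // Failing here surfaces a misconfigured format immediately, mirroring the Flink behavior in #6049.
    throw new IllegalArgumentException(String.format("Unknown file format %s", format));
}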
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkWriteConf.java
Looks good to me. Left a few suggestions.
@@ -104,6 +104,7 @@ class SparkPositionDeltaWrite implements DeltaWrite, RequiresDistributionAndOrde
private final Context context;

private boolean cleanupOnAbort = true;
private final Map<String, String> writeProperties;
Minor: This variable should be co-located with other final variables above. It is super minor but let's do this in a follow-up.
OK, I have raised a follow-up PR, #8421. The PR has been merged.
Thanks, @jerqi! Thanks for reviewing, @ConeyLiu @chenjunjiedada.
Thanks @aokolnychyi! Code master. Thanks @ConeyLiu @chenjunjiedada @nastra, too.
writeProperties.put(PARQUET_COMPRESSION, parquetCompressionCodec());
String parquetCompressionLevel = parquetCompressionLevel();
if (parquetCompressionLevel != null) {
  writeProperties.put(PARQUET_COMPRESSION_LEVEL, parquetCompressionLevel);
Sorry for this mistake. I will check how to validate the correctness of this case.
If we don't set the delete compression codec, we reuse the data compression codec, so the test case passed.
I have raised PR #8438 to fix this issue.
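To illustrate the fallback being described (variable names are assumed for illustration; the actual fix is in #8438):

// If no delete-specific codec is configured, fall back to the data compression codec;
// this fallback is why the original assertion still passed.
String deleteCodec = deleteCompressionCodec != null ? deleteCompressionCodec : dataCompressionCodec;
writeProperties.put(DELETE_PARQUET_COMPRESSION, deleteCodec);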
What changes were proposed in this pull request?
Flink already supports using write options to override the compression properties of the table.
Referring to pull request #6049, this PR adds write options to override the compression properties of the table for Spark 3.4.
Why are the changes needed?
First, there exist some old tables using the gzip compression codec in the production environment. If we use the zstd compression codec to rewrite the data, we can reduce the data volume and the cost of storage.
Second, this PR is also meaningful after we choose zstd as the default compression codec, because with this PR we can choose different compression levels when writing Iceberg data and when rewriting Iceberg data.
Does this PR introduce any user-facing change?
Yes.
This PR introduces the write options compression-codec, compression-level, and compression-strategy. (Documentation added.)
This PR introduces the Spark config properties spark.sql.iceberg.write.compression-codec, spark.sql.iceberg.write.compression-level, and spark.sql.iceberg.write.compression-strategy. (I will add the documentation after this PR is merged.)
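For example, a minimal usage sketch of the new write options through the DataFrameWriterV2 API; the table names here are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CompressionWriteOptionExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Placeholder source; any DataFrame matching the target table's schema works.
    Dataset<Row> df = spark.table("db.source_table");

    // The write options override the table's compression properties for this write only.
    df.writeTo("db.target_iceberg_table")
        .option("compression-codec", "zstd")
        .option("compression-level", "3")
        .append();
  }
}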
How was this patch tested?
Added a new unit test and performed manual verification. The unit test verifies that the compression codec is correct; the compression config options were verified by hand.
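As an illustration of the kind of check such a unit test can perform (a sketch, not the actual TestCompressionSettings code), the codec of a written Parquet file can be read from its footer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ParquetCodecCheck {
  // Returns the codec of the first column chunk of the first row group.
  static CompressionCodecName firstColumnCodec(String file) throws Exception {
    Configuration conf = new Configuration();
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(new Path(file), conf))) {
      return reader.getFooter().getBlocks().get(0).getColumns().get(0).getCodec();
    }
  }
}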