
Spark 3.4: Add write options to override the compression properties of the table #8313

Merged: 18 commits merged into apache:master on Aug 29, 2023

Conversation

@jerqi (Contributor) commented Aug 13, 2023:

What changes were proposed in this pull request?

Flink already supports using write options to override the compression properties of the table.
Following pull request #6049, this PR adds write options to override the compression properties of the table for Spark 3.4.

Why are the changes needed?

First, some old tables in the production environment use the gzip compression codec. If we rewrite their data with the zstd codec, we can reduce the data volume and the storage cost.
Second, this PR is also meaningful once zstd is chosen as the default compression codec, because it lets us choose different compression levels when writing and rewriting Iceberg data.

Does this PR introduce any user-facing change?

Yes.
This PR introduces the write options compression-codec, compression-level and compression-strategy (documentation added).
It also introduces the Spark config properties spark.sql.iceberg.write.compression-codec, spark.sql.iceberg.write.compression-level and spark.sql.iceberg.write.compression-strategy. (I will add the documentation after this PR is merged.)
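For illustration, a minimal sketch of how these overrides could be applied from Spark (table and query names are placeholders; the property names are the ones listed above and were later shortened during review):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CompressionOverrideExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("compression-override").getOrCreate();

    // Session-wide override via the Spark SQL properties introduced by this PR.
    spark.conf().set("spark.sql.iceberg.write.compression-codec", "zstd");
    spark.conf().set("spark.sql.iceberg.write.compression-level", "3");

    // Per-write override via write options on an existing Iceberg table
    // ("db.src" and "db.tbl" are placeholder table names).
    Dataset<Row> df = spark.sql("SELECT * FROM db.src");
    df.writeTo("db.tbl")
        .option("compression-codec", "zstd")
        .option("compression-level", "3")
        .append();

    spark.stop();
  }
}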

How was this patch tested?

Added a new unit test and verified manually.
The unit test verifies that the compression codec is correct.
The compression config options were verified by hand.

@jerqi jerqi marked this pull request as draft August 13, 2023 08:06
@github-actions github-actions bot added the spark label Aug 13, 2023
@jerqi jerqi changed the title Spark 3.4: Add write options to override the compression properties of the Table Spark 3.4: Add write options to override the compression properties of the table Aug 13, 2023
@jerqi (Contributor, Author) commented Aug 13, 2023:

@chenjunjiedada @ConeyLiu Could you help me review this PR?

@jerqi jerqi marked this pull request as ready for review August 13, 2023 08:42
@chenjunjiedada (Collaborator) commented:
@jerqi Could you add some docs like in #6049, since compression-level and compression-strategy target different formats? It would also be better to describe the default values and the available options. Others LGTM. Thanks for the contribution.

@github-actions github-actions bot added the docs label Aug 13, 2023
@jerqi (Contributor, Author) commented Aug 13, 2023:

> @jerqi Could you add some docs like in #6049, since compression-level and compression-strategy target different formats? It would also be better to describe the default values and the available options. Others LGTM. Thanks for the contribution.

Thanks for your review. I have added the write-option documentation for Spark.
I also noticed that we don't have documentation for the Spark properties; I want to raise another PR to add it after this one is merged. I created issue #8314 to track this.

@jerqi (Contributor, Author) commented Aug 13, 2023:

@pvary @nastra Could you help me review this PR?

@jerqi jerqi requested a review from ConeyLiu August 14, 2023 02:35
@ConeyLiu (Contributor) left a comment:

+1

import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class TestCompressionSettings {
Contributor:

The project is currently in the process of moving away from JUnit 4, meaning that new tests should be written purely in JUnit 5 and use AssertJ assertions. See also https://iceberg.apache.org/contribute/#testing for some additional info.
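For reference, a minimal sketch of the JUnit 5 + AssertJ style being asked for here (an illustrative test, not the actual test from this PR):

import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.Test;

class CompressionSettingsStyleExampleTest {

  @Test
  void compressionCodecIsPropagated() {
    // Illustrative assertion only: AssertJ's fluent assertThat replaces
    // JUnit 4's Assert.assertEquals, and @Test comes from org.junit.jupiter.api.
    String codec = "zstd";
    assertThat(codec).isEqualToIgnoringCase("ZSTD");
  }
}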

Contributor (Author):

Thanks for your review. I got it. I have migrated the test to the JUnit 5 framework.

Contributor (Author):

I need to add some new unit tests for SparkPositionDeltaWrite and SparkPositionDeletesRewrite.java, which requires extending SparkCatalogTestBase. SparkCatalogTestBase still uses JUnit 4, so I will change the test to JUnit 4 to avoid conflicts.

@jerqi jerqi requested a review from nastra August 14, 2023 09:02
@@ -63,4 +63,9 @@ private SparkSQLProperties() {}
// Controls the WAP branch used for write-audit-publish workflow.
// When set, new snapshots will be committed to this branch.
public static final String WAP_BRANCH = "spark.wap.branch";

// Controls write compress options
public static final String COMPRESSION_CODEC = "spark.sql.iceberg.write.compression-codec";
Contributor:

We rarely use extra write and read namespaces in SQL properties; the only exception is preserving grouping, as it was not clear otherwise. What about dropping write from all the names?

spark.sql.iceberg.compression-codec
spark.sql.iceberg.compression-level
spark.sql.iceberg.compression-strategy

Contributor (Author):

Fixed.

@@ -76,6 +79,12 @@ class SparkFileWriterFactory extends BaseFileWriterFactory<InternalRow> {
this.dataSparkType = dataSparkType;
this.equalityDeleteSparkType = equalityDeleteSparkType;
this.positionDeleteSparkType = positionDeleteSparkType;

Contributor:

Optional: What about inlining if it fits on one line?

...
this.equalityDeleteSparkType = equalityDeleteSparkType;
this.positionDeleteSparkType = positionDeleteSparkType;
this.writeProperties = writeProperties != null ? writeProperties : ImmutableMap.of();

Contributor (Author):

Fixed.

.isEqualToIgnoringCase(properties.get(COMPRESSION_CODEC));
}

if (PARQUET.equals(format)) {
Contributor (Author):

I find that we can't set table properties for the RewritePositionDeletes action; RewritePositionDeletes will always use Parquet as the format.

@jerqi (Contributor, Author) commented Aug 16, 2023:

In my unit test, I set the table properties `DEFAULT_FILE_FORMAT` and `DELETE_DEFAULT_FILE_FORMAT` to orc, but the properties of the `position_deletes` table are empty.

Contributor:

That seems like an issue worth looking into after this change.

cc @szehon-ho.

Contributor (Author):

OK, I will investigate it.

Contributor (Author):

I found the root cause of this case.
The class BaseMetadataTable overrides the properties method:

  @Override
  public Map<String, String> properties() {
    return ImmutableMap.of();
  }

Option 1:
We can modify the class BaseMetadataTable.
#8428

Option 2:
We can modify the class PositionDeletesTable.
#8429

I prefer option 1. I feel the properties of a metadata table should respect those of the base table, and we also don't have a way to modify the properties of a metadata table (a rough sketch of option 1 follows below). @aokolnychyi @szehon-ho WDYT?
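For illustration only, a rough sketch of the idea behind option 1 (the actual change is in #8428); this assumes BaseMetadataTable can reach the underlying base table, here via a table() accessor:

// Sketch: have the metadata table expose the base table's properties
// instead of an empty map. `table()` is assumed to return the base table.
@Override
public Map<String, String> properties() {
  return table().properties();
}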

Collaborator:

Hi, yeah, I am OK with option 1 as well. @aokolnychyi, what do you think?

@jerqi jerqi requested a review from aokolnychyi August 16, 2023 11:32
@jerqi (Contributor, Author) commented Aug 17, 2023:

> Do we also have to cover other places that use SparkFileWriterFactory? For instance, SparkPositionDeltaWrite?

Thanks for your review; my mistake. I will add the code change and the unit test.

@aokolnychyi I have addressed the comments. Could you take another look if you have time?

@jerqi (Contributor, Author) commented Aug 23, 2023:

@nastra @aokolnychyi Gentle ping, sorry to bother you. Could you take another look when you have time?

@aokolnychyi (Contributor) commented:
Let me take another look.

@aokolnychyi (Contributor) commented:
This change looks good to me. I am not sure about adding a new utility class vs just using SparkWriteConf.
@jerqi, what do you think?

@jerqi (Contributor, Author) commented Aug 26, 2023:

> This change looks good to me. I am not sure about adding a new utility class vs just using SparkWriteConf. @jerqi, what do you think?

The second solution seems more elegant. Thanks for your suggestion. I have refactored the code to follow the suggestion.

@jerqi jerqi requested a review from aokolnychyi August 26, 2023 10:48
writeProperties.put(ORC_COMPRESSION_STRATEGY, orcCompressionStrategy());
break;
default:
throw new IllegalArgumentException(String.format("Unknown file format %s", format));
Contributor:

I wonder whether it is worth failing the query. Maybe, just do nothing here?

@jerqi (Contributor, Author) commented Aug 29, 2023:

I referred to Flink's implementation in #6049; Flink chooses to fail the query. Failing the query also has a benefit: it makes a wrong config option easier to find.

@aokolnychyi (Contributor) left a comment:

Looks good to me. Left a few suggestions.

@@ -104,6 +104,7 @@ class SparkPositionDeltaWrite implements DeltaWrite, RequiresDistributionAndOrde
private final Context context;

private boolean cleanupOnAbort = true;
private final Map<String, String> writeProperties;
Contributor:

Minor: This variable should be co-located with other final variables above. It is super minor but let's do this in a follow-up.

@jerqi (Contributor, Author) commented Aug 30, 2023:

OK, I have raised follow-up PR #8421; it has been merged.

@aokolnychyi aokolnychyi merged commit ab0a2fd into apache:master Aug 29, 2023
31 checks passed
@aokolnychyi (Contributor) commented:
Thanks, @jerqi! Thanks for reviewing, @ConeyLiu @chenjunjiedada.

@jerqi (Contributor, Author) commented Aug 30, 2023:

Thanks, @aokolnychyi! Thanks @ConeyLiu @chenjunjiedada @nastra, too.

writeProperties.put(PARQUET_COMPRESSION, parquetCompressionCodec());
String parquetCompressionLevel = parquetCompressionLevel();
if (parquetCompressionLevel != null) {
writeProperties.put(PARQUET_COMPRESSION_LEVEL, parquetCompressionLevel);
@szehon-ho (Collaborator) commented Aug 31, 2023:

Hi guys, while working on #8299 I noticed the test failing and that this is missing the delete-file compression override. I think the TestCompression delete-file compression check passes by chance; I have a fix for it over there: 43e3cab

@jerqi (Contributor, Author) commented Aug 31, 2023:

Sorry for this mistake. I will check how to validate the correctness of this case.
If we don't set the delete compression codec, we reuse the data compression codec, so the test case passed.
I have raised PR #8438 to fix this issue.
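For illustration only, a rough sketch of the idea behind that follow-up (the real patch is in 43e3cab / #8438 and may differ): populate a delete-file compression property alongside the data-file one, falling back to the data codec when no delete-specific codec is configured. The key and helper names below are hypothetical.

// Hypothetical sketch, not the actual patch: write delete-file compression
// alongside the data-file compression in writeProperties.
writeProperties.put(PARQUET_COMPRESSION, parquetCompressionCodec());
String deleteCodec = deleteParquetCompressionCodec();  // hypothetical helper
writeProperties.put(
    DELETE_PARQUET_COMPRESSION,  // hypothetical delete-file key
    deleteCodec != null ? deleteCodec : parquetCompressionCodec());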
