Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core, Spark 3.4: Write properties of PositionDeletesTable should respect ones of BaseTable #8428

Merged
merged 6 commits into from
Sep 19, 2023

Conversation

jerqi
Copy link
Contributor

@jerqi jerqi commented Aug 30, 2023

What changes were proposed in this pull request?

Make write properties of PositionDeletesTable respect ones of BaseTable.

Why are the changes needed?

When we use PositionDeletesRewriteAction, we will use the properties of PositionDeletesTable, but the properties are empty before this pr. We will use default properties all the time. The write file format of PositionDeletesRewriteAction will be parquet although the table use orc as write format. It's unreasonable. More information you can see #8313 (comment)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT

@jerqi jerqi changed the title Spark 3.4: BaseMetadataTable should respect the base table properties [WIP] Spark 3.4: BaseMetadataTable should respect the base table properties Aug 30, 2023
@jerqi jerqi marked this pull request as draft August 30, 2023 14:55
@jerqi jerqi changed the title [WIP] Spark 3.4: BaseMetadataTable should respect the base table properties [WIP] Spark 3.4: The properties of BaseMetadataTable should respect the properties of BaseTable Aug 30, 2023
@szehon-ho
Copy link
Collaborator

I think this option works for me, @aokolnychyi for any concerns?

@jerqi jerqi marked this pull request as ready for review August 31, 2023 22:55
@jerqi jerqi changed the title [WIP] Spark 3.4: The properties of BaseMetadataTable should respect the properties of BaseTable Spark 3.4: The properties of BaseMetadataTable should respect the properties of BaseTable Aug 31, 2023
@jerqi jerqi changed the title Spark 3.4: The properties of BaseMetadataTable should respect the properties of BaseTable Spark3.4, Core: The properties of BaseMetadataTable should respect the properties of BaseTable Sep 1, 2023
@jerqi jerqi changed the title Spark3.4, Core: The properties of BaseMetadataTable should respect the properties of BaseTable Core: The properties of BaseMetadataTable should respect the properties of BaseTable Sep 1, 2023
@jerqi jerqi changed the title Core: The properties of BaseMetadataTable should respect the properties of BaseTable Core: Properties of BaseMetadataTable should respect properties of BaseTable Sep 1, 2023
@jerqi jerqi changed the title Core: Properties of BaseMetadataTable should respect properties of BaseTable Core, Spark 3.4: Properties of BaseMetadataTable should respect properties of BaseTable Sep 2, 2023
@szehon-ho
Copy link
Collaborator

@jerqi sorry about this, I was re-thinking about it and not 100% sure it makes sense, as there are some table properties that look weird on all metadata tables.

What do you think about first trying to solve it in [SparkBinPackPositionDeletesRewriter] (using ds write options) (https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackPositionDeletesRewriter.java) I think that's a bit less impact.

@jerqi
Copy link
Contributor Author

jerqi commented Sep 6, 2023

@jerqi sorry about this, I was re-thinking about it and not 100% sure it makes sense, as there are some table properties that look weird on all metadata tables.

What do you think about first trying to solve it in [SparkBinPackPositionDeletesRewriter] (using ds write options) (https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackPositionDeletesRewriter.java) I think that's a bit less impact.

  1. I can raise a this solution pr.

  2. And the properties of PositionDeletesTable is empty.
    There are many properties will be useful to write data.

write.delete.orc.compression-codec
write.delete.parquet.row-group-check-max-record-count
write.parquet.bloom-filter-enable
......

We need to set every write option for the data frame. It may be difficult for users to use.
What about my second solution? We only modify the properties of PositionDeletesTable.
Moreover, We can filter out the unreasonable table property. WDYT?

@szehon-ho
Copy link
Collaborator

@jerqi ok that sounds good with me, let's go with that approach. Mostly write properties I assume?

Writing position deletes through position_deletes table is not exposed to user, as we dont support that outside rewrite_position_deletes. But I see your point that doing it inside the action makes the code harder as there's quite a few possible properties.

@jerqi
Copy link
Contributor Author

jerqi commented Sep 7, 2023

Mostly write properties

Maybe we should include read properties and commit properties, too, let me see what properties the rewrite_position_deletes used.

@jerqi
Copy link
Contributor Author

jerqi commented Sep 12, 2023

Mostly write properties

Maybe we should include read properties and commit properties, too, let me see what properties the rewrite_position_deletes used.

Yes, you're right. We need mostly write properties.

@jerqi jerqi changed the title Core, Spark 3.4: Properties of BaseMetadataTable should respect properties of BaseTable Core, Spark 3.4: Properties of PositionDeletesTable should respect properties of BaseTable Sep 12, 2023
@jerqi jerqi changed the title Core, Spark 3.4: Properties of PositionDeletesTable should respect properties of BaseTable Core, Spark 3.4: Write properties of PositionDeletesTable should respect properties of BaseTable Sep 12, 2023
// these properties should respect the ones of BaseTable.
return Collections.unmodifiableMap(
table().properties().entrySet().stream()
.filter(entry -> entry.getKey().startsWith("write."))
Copy link
Contributor Author

@jerqi jerqi Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that all the write properties are needed for our PositionDeletesRewriteAction. So I choose to match the key prefix here instead of copying some specific entries.

@jerqi jerqi changed the title Core, Spark 3.4: Write properties of PositionDeletesTable should respect properties of BaseTable Core, Spark 3.4: Write properties of PositionDeletesTable should respect ones of BaseTable Sep 12, 2023
@jerqi
Copy link
Contributor Author

jerqi commented Sep 12, 2023

@szehon-ho Could you review this pr again if you have time?

@szehon-ho szehon-ho merged commit faab813 into apache:master Sep 19, 2023
41 checks passed
@szehon-ho
Copy link
Collaborator

Merged, thanks @jerqi . Can you please update the pr description to make it clearer what problem we are fixing? And do we need to make the fix for other Spark versions?

@jerqi
Copy link
Contributor Author

jerqi commented Sep 19, 2023

Merged, thanks @jerqi . Can you please update the pr description to make it clearer what problem we are fixing? And do we need to make the fix for other Spark versions?

  1. ok, I will update the pr description.
  2. ok, I will raise another pr to fix Spark 3.5.

@jerqi
Copy link
Contributor Author

jerqi commented Sep 19, 2023

@szehon-ho I have raised a new pr for Spark 3.5. https://github.com/apache/iceberg/pull/8584/files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants