
[Spark] DV Reads Stability Improvement in Delta by removing Broadcasting DV Information #2888

Merged
merged 2 commits into delta-io:master on Apr 23, 2024

Conversation

Contributor

@longvu-db longvu-db commented Apr 12, 2024

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

Previously, we relied on an expensive broadcast of DV files to pass the DV information to the associated Parquet files. With the support for attaching custom metadata to files introduced in Spark 3.5, we can now pass the DV through a custom metadata field, which is expected to improve the stability of DV reads in Delta.
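For context, here is a minimal sketch of the mechanism (the subclass and the `__dv_descriptor` field name are illustrative, not the exact names used in this PR):

```scala
import org.apache.spark.sql.catalyst.expressions.FileSourceConstantMetadataStructField
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.types.{StringType, StructField}

// Since Spark 3.5 a FileFormat can declare extra constant metadata columns;
// their per-file values travel with each PartitionedFile (via
// otherConstantMetadataColumnValues), so no broadcast of a file -> DV map is needed.
class DvAwareParquetFileFormat extends ParquetFileFormat {
  // Illustrative column carrying a serialized DV descriptor for each file.
  private val dvDescriptorField: StructField =
    FileSourceConstantMetadataStructField("__dv_descriptor", StringType, nullable = true)

  override def metadataSchemaFields: Seq[StructField] =
    super.metadataSchemaFields :+ dvDescriptorField
}
```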

How was this patch tested?

Adjusted the existing UTs that cover our changes.

Does this PR introduce any user-facing changes?

No.

Contributor

@xupefei xupefei left a comment


Very nice, thank you! Looking forward to seeing how much speed-up we can get with this change :)

@longvu-db longvu-db force-pushed the delta-dv-reading-perf-improvement branch 2 times, most recently from 9f1cb83 to 8c988ab on April 15, 2024 13:46
@longvu-db longvu-db requested a review from xupefei on April 15, 2024 13:47
@longvu-db longvu-db force-pushed the delta-dv-reading-perf-improvement branch 3 times, most recently from 648ddde to e049778 on April 16, 2024 22:15
super.metadataSchemaFields ++ rowTrackingFields
}

val isDVsEnabled = DeltaConfigs.ENABLE_DELETION_VECTORS_CREATION.fromMetaData(metadata)
Contributor


metadata reflects the current state of the table, i.e., whether DV is enabled in the current version, doesn't it? What about older versions that contain DVs, where the user has since disabled the feature? I think we should look at the table protocol instead: if the DV table feature is enabled, we should assume the table contains DVs.
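A sketch of the distinction being made here (Protocol.isFeatureSupported, DeletionVectorsTableFeature, and DeltaConfigs.ENABLE_DELETION_VECTORS_CREATION exist in delta-spark; the wrapper functions are only for illustration):

```scala
import org.apache.spark.sql.delta.{DeletionVectorsTableFeature, DeltaConfigs}
import org.apache.spark.sql.delta.actions.{Metadata, Protocol}

// Current table property: may new DVs be written right now?
def dvCreationEnabled(metadata: Metadata): Boolean =
  DeltaConfigs.ENABLE_DELETION_VECTORS_CREATION.fromMetaData(metadata)

// Protocol feature: could any existing file still carry a DV,
// even if the property was later turned off?
def dvsMayExist(protocol: Protocol): Boolean =
  protocol.isFeatureSupported(DeletionVectorsTableFeature)
```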

Contributor Author


@xupefei Quick question: can an existing table add a new table feature? For example, if a table does not have DV in its properties from the beginning, can DVs be enabled at some later point in time?

Contributor Author


@xupefei I think you can do this using ALTER TABLE, right? But once the feature is in, you cannot drop it at the moment.
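For reference, enabling the feature on an existing table looks like this (my_table is a placeholder); setting the property also adds the deletionVectors table feature to the protocol:

```scala
spark.sql("""
  ALTER TABLE my_table
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")
```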

Contributor

@xupefei xupefei left a comment


After looking at the code, I feel that we need a test case with the following logic:

  1. Create a table without DV. Insert some values.
  2. Turn on DV (delta.enableDeletionVectors = true).
  3. Delete some values.
  4. Turn off DV (delta.enableDeletionVectors = false).
  5. Delete some values.

Check:

  1. Version 1 does not contain the DV metadata columns.
  2. Version 2 does not contain the DV metadata columns (because there's no DV in this table - correct me if I'm wrong).
  3. Version 3 does contain the DV metadata columns.
  4. Version 4 does contain the DV metadata columns.
  5. Version 5 does contain the DV metadata columns.
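A rough sketch of that scenario in Delta's usual ScalaTest style (withTempDir and spark come from the shared test harness; hasDvMetadataColumns is a hypothetical helper, not an existing utility):

```scala
test("DV metadata columns depend on the protocol, not the current property") {
  withTempDir { dir =>
    val path = dir.getAbsolutePath
    spark.range(100).write.format("delta").save(path)          // 1. no DV
    spark.sql(s"ALTER TABLE delta.`$path` SET TBLPROPERTIES " +
      "('delta.enableDeletionVectors' = 'true')")              // 2. turn on DV
    spark.sql(s"DELETE FROM delta.`$path` WHERE id < 10")      // 3. writes DVs
    spark.sql(s"ALTER TABLE delta.`$path` SET TBLPROPERTIES " +
      "('delta.enableDeletionVectors' = 'false')")             // 4. turn off DV
    spark.sql(s"DELETE FROM delta.`$path` WHERE id > 90")      // 5. rewrites files
    // Hypothetical assertions, one per version produced above, e.g.:
    // assert(!hasDvMetadataColumns(path, version = 1))
    // assert(hasDvMetadataColumns(path, version = 3))
  }
}
```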

@longvu-db
Contributor Author

@xupefei I addressed your comments =)).

Regarding the new test case:

  1. Version 2 should contain the DV metadata columns (because having the metadata columns depends on whether the table's protocol supports DV, not on whether any DVs have been written).
  2. What is the value of doing action 5? If we disable DV in the table metadata and the DV metadata columns indeed stay, then whatever actions we do, they would always stay, right?

@longvu-db longvu-db requested a review from xupefei April 18, 2024 17:08
@longvu-db longvu-db force-pushed the delta-dv-reading-perf-improvement branch from e049778 to 9618b0e on April 18, 2024 17:13
Contributor

@xupefei xupefei left a comment


A few improvements to the tests, then we're good to go.

@xupefei
Contributor

xupefei commented Apr 19, 2024

Cc @larsk-db

@@ -104,7 +102,7 @@ case class DeltaParquetFileFormat(
   override def isSplitable(
       sparkSession: SparkSession, options: Map[String, String], path: Path): Boolean = isSplittable

-  def hasDeletionVectorMap: Boolean = broadcastDvMap.isDefined && broadcastHadoopConf.isDefined
+  def hasBroadcastHadoopConf: Boolean = broadcastHadoopConf.isDefined
Collaborator


I think you can get rid of broadcastHadoopConf and hasBroadcastHadoopConf altogether here.

The only use we have for it now is to pass it to rowIndexFilter.createInstance when creating the DV filter, but we also have hadoopConf: Configuration around (passed as an argument to buildReaderWithPartitionValues), which I believe should be sufficient here: we don't need this configuration to be broadcast first for the purpose of loading DVs from storage.

Contributor Author


@johanl-db No clue why we had it back in the day, hmm

Contributor Author

@longvu-db longvu-db Apr 22, 2024


@johanl-db Removing broadcastHadoopConf doesn't seem to work; I haven't investigated yet, but changing back to using broadcastHadoopConf makes the tests pass.

@larsk-db
Contributor

> Cc @larsk-db

I took a quick look, but it seems you and @johanl-db are on top of this. Ping me if you have something specific that needs my input.

@longvu-db longvu-db force-pushed the delta-dv-reading-perf-improvement branch 2 times, most recently from 3c52eb0 to 479ef5d on April 22, 2024 02:09
@longvu-db longvu-db force-pushed the delta-dv-reading-perf-improvement branch from 479ef5d to 63e9527 on April 22, 2024 14:57
Contributor

@tdas tdas left a comment


I highly approve this change :D
Thank you for doing this! This will improve the stability on large tables with lots of DVs significantly.

@tdas tdas merged commit be7183b into delta-io:master Apr 23, 2024
7 checks passed
@longvu-db longvu-db changed the title [Spark] DV Reads Performance Improvement in Delta by removing Broadcasting DV Information [Spark] DV Reads Stability Improvement in Delta by removing Broadcasting DV Information May 6, 2024