
[Spark] DV Reads Stability Improvement in Delta by removing Broadcasting DV Information #2888

Merged
merged 2 commits into delta-io:master on Apr 23, 2024

Conversation

Contributor

@longvu-db longvu-db commented Apr 12, 2024

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

Previously, we relied on an expensive broadcast of DV files to pass the DV information to the associated Parquet files. With the support for attaching custom metadata to files introduced in Spark 3.5, we can now pass the DV through a custom metadata field, which is expected to improve the stability of DV reads in Delta.
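For context, here is a minimal sketch of the mechanism (the subclass and the `__dv_descriptor` field name are illustrative, not the exact names used in this PR):

```scala
import org.apache.spark.sql.catalyst.expressions.FileSourceConstantMetadataStructField
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.types.{StringType, StructField}

// Since Spark 3.5 a FileFormat can declare extra constant metadata columns;
// their per-file values travel with each PartitionedFile (via
// otherConstantMetadataColumnValues), so no broadcast of a file -> DV map is needed.
class DvAwareParquetFileFormat extends ParquetFileFormat {
  // Illustrative column carrying a serialized DV descriptor for each file.
  private val dvDescriptorField: StructField =
    FileSourceConstantMetadataStructField("__dv_descriptor", StringType, nullable = true)

  override def metadataSchemaFields: Seq[StructField] =
    super.metadataSchemaFields :+ dvDescriptorField
}
```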

How was this patch tested?

Adjusted the existing UTs that cover our changes.

Does this PR introduce any user-facing changes?

No.

Contributor

@xupefei xupefei left a comment


Very nice, thank you! Looking forward to seeing how much speed-up we can get with this change :)

@longvu-db longvu-db force-pushed the delta-dv-reading-perf-improvement branch 2 times, most recently from 9f1cb83 to 8c988ab on April 15, 2024 13:46
@longvu-db longvu-db requested a review from xupefei on April 15, 2024 13:47
@longvu-db longvu-db force-pushed the delta-dv-reading-perf-improvement branch 3 times, most recently from 648ddde to e049778 on April 16, 2024 22:15
super.metadataSchemaFields ++ rowTrackingFields
}

val isDVsEnabled = DeltaConfigs.ENABLE_DELETION_VECTORS_CREATION.fromMetaData(metadata)
Contributor


metadata reflects the current state of the table, i.e., whether DV is enabled in the current version, doesn't it? What about older versions that contain DVs, where the user has since disabled the feature? I think we should look at the table protocol instead: if the DV table feature is enabled, we should assume the table contains DVs.
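A sketch of the distinction being made here (Protocol.isFeatureSupported, DeletionVectorsTableFeature, and DeltaConfigs.ENABLE_DELETION_VECTORS_CREATION exist in delta-spark; the wrapper functions are only for illustration):

```scala
import org.apache.spark.sql.delta.{DeletionVectorsTableFeature, DeltaConfigs}
import org.apache.spark.sql.delta.actions.{Metadata, Protocol}

// Current table property: may new DVs be written right now?
def dvCreationEnabled(metadata: Metadata): Boolean =
  DeltaConfigs.ENABLE_DELETION_VECTORS_CREATION.fromMetaData(metadata)

// Protocol feature: could any existing file still carry a DV,
// even if the property was later turned off?
def dvsMayExist(protocol: Protocol): Boolean =
  protocol.isFeatureSupported(DeletionVectorsTableFeature)
```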

Contributor Author


@xupefei Quick question: can an existing table add a new table feature? For example, if a table does not have DV in its properties from the beginning, can DVs be enabled at some later point in time?

Contributor Author


@xupefei I think you can do this using ALTER TABLE, right? But once the feature is in, you cannot drop it at the moment.
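For reference, enabling the feature on an existing table looks like this (my_table is a placeholder); setting the property also adds the deletionVectors table feature to the protocol:

```scala
spark.sql("""
  ALTER TABLE my_table
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")
```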

Contributor

@xupefei xupefei left a comment


After looking at the code, I feel that we need a test case with the following logic:

  1. Create a table without DV. Insert some values.
  2. Turn on DV (delta.enableDeletionVectors = true).
  3. Delete some values.
  4. Turn off DV (delta.enableDeletionVectors = false).
  5. Delete some values.

Check:

  1. Version 1 does not contain the DV metadata columns.
  2. Version 2 does not contain the DV metadata columns (because there's no DV in this table - correct me if I'm wrong).
  3. Version 3 does contain the DV metadata columns.
  4. Version 4 does contain the DV metadata columns.
  5. Version 5 does contain the DV metadata columns.
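A rough sketch of that scenario in Delta's usual ScalaTest style (withTempDir and spark come from the shared test harness; hasDvMetadataColumns is a hypothetical helper, not an existing utility):

```scala
test("DV metadata columns depend on the protocol, not the current property") {
  withTempDir { dir =>
    val path = dir.getAbsolutePath
    spark.range(100).write.format("delta").save(path)          // 1. no DV
    spark.sql(s"ALTER TABLE delta.`$path` SET TBLPROPERTIES " +
      "('delta.enableDeletionVectors' = 'true')")              // 2. turn on DV
    spark.sql(s"DELETE FROM delta.`$path` WHERE id < 10")      // 3. writes DVs
    spark.sql(s"ALTER TABLE delta.`$path` SET TBLPROPERTIES " +
      "('delta.enableDeletionVectors' = 'false')")             // 4. turn off DV
    spark.sql(s"DELETE FROM delta.`$path` WHERE id > 90")      // 5. rewrites files
    // Hypothetical assertions, one per version produced above, e.g.:
    // assert(!hasDvMetadataColumns(path, version = 1))
    // assert(hasDvMetadataColumns(path, version = 3))
  }
}
```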

@longvu-db
Contributor Author

@xupefei I addressed your comments =)).

Regarding the new test case:

  1. Version 2 should contain the DV metadata columns (because having the metadata columns depends on whether the table's protocol supports DV, not on whether any DVs have been written).
  2. What is the value of doing action 5? If we disable DV in the table metadata and the DV metadata columns indeed stay, then whatever actions we do, they would always stay, right?

@longvu-db longvu-db requested a review from xupefei April 18, 2024 17:08
@longvu-db longvu-db force-pushed the delta-dv-reading-perf-improvement branch from e049778 to 9618b0e on April 18, 2024 17:13
Contributor

@xupefei xupefei left a comment


A few improvements to the tests, then we're good to go.

@xupefei
Contributor

xupefei commented Apr 19, 2024

Cc @larsk-db

@@ -104,7 +102,7 @@ case class DeltaParquetFileFormat(
   override def isSplitable(
       sparkSession: SparkSession, options: Map[String, String], path: Path): Boolean = isSplittable

-  def hasDeletionVectorMap: Boolean = broadcastDvMap.isDefined && broadcastHadoopConf.isDefined
+  def hasBroadcastHadoopConf: Boolean = broadcastHadoopConf.isDefined
Collaborator


I think you can get rid of broadcastHadoopConf and hasBroadcastHadoopConf altogether here.

The only use we have for it now is to pass it to rowIndexFilter.createInstance when creating the DV filter, but we also have hadoopConf: Configuration around (passed as an argument to buildReaderWithPartitionValues), which I believe should be sufficient here: we don't need this configuration to be broadcast first for the purpose of loading DVs from storage.

Contributor Author


@johanl-db No clue why we had it back in the day, hmm

Contributor Author

@longvu-db longvu-db Apr 22, 2024


@johanl-db Removing broadcastHadoopConf doesn't seem to work; I haven't investigated yet, but changing back to using broadcastHadoopConf makes the tests pass.

@larsk-db
Contributor

> Cc @larsk-db

I took a quick look, but it seems you and @johanl-db are on top of this. Ping me if you have something specific that needs my input.

@longvu-db longvu-db force-pushed the delta-dv-reading-perf-improvement branch 2 times, most recently from 3c52eb0 to 479ef5d on April 22, 2024 02:09
@longvu-db longvu-db force-pushed the delta-dv-reading-perf-improvement branch from 479ef5d to 63e9527 on April 22, 2024 14:57
Contributor

@tdas tdas left a comment


I highly approve this change :D
Thank you for doing this! This will improve the stability on large tables with lots of DVs significantly.

@tdas tdas merged commit be7183b into delta-io:master Apr 23, 2024
7 checks passed
@longvu-db longvu-db changed the title [Spark] DV Reads Performance Improvement in Delta by removing Broadcasting DV Information [Spark] DV Reads Stability Improvement in Delta by removing Broadcasting DV Information May 6, 2024