Update DeltaParquetFileFormat to add isRowDeleted column populated from DV #1542

Closed
wants to merge 7 commits

Conversation

vkorukanti
Collaborator

This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

It modifies `DeltaParquetFileFormat` to append an additional column called `__delta_internal_skip_row__`. This column is populated by reading the DV associated with the Parquet file, and it assumes the rows are returned in the order they appear in the file. To guarantee that order, we disable file splitting and filter pushdown to the Parquet reader. This carries a performance penalty for Delta tables with deletion vectors until we upgrade Delta to Spark 3.4, whose Parquet reader can generate row indexes correctly even with file splitting and filter pushdown.

For now a single test is added; follow-up end-to-end tests will cover the code more thoroughly.
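For illustration, a minimal sketch of how the per-file column could be derived, assuming the DV is exposed as a set of deleted row indexes (the real reader materializes it from the serialized deletion vector; the column name and value encoding follow the description above):

```scala
// Sketch: build the skip-row column for one Parquet file.
// A row gets 1 when its row index appears in the deletion vector (the row is
// deleted) and 0 when the row should be kept, matching the description above.
def skipRowColumn(numRowsInFile: Int, deletedRowIndexes: Set[Long]): Array[Byte] =
  Array.tabulate(numRowsInFile) { rowIndex =>
    if (deletedRowIndexes.contains(rowIndex.toLong)) 1.toByte else 0.toByte
  }

// Example: a 5-row file whose DV marks rows 1 and 3 as deleted.
// skipRowColumn(5, Set(1L, 3L)) returns Array(0, 1, 0, 1, 0)
```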

GitOrigin-RevId: 2067958ffc770a89df15fd165c9999d49b2dd1c4

GitOrigin-RevId: 93382ac54f836fb4c14f23f97b28eea6e663d0be
GitOrigin-RevId: 0a3e11e1e50e852f9b5e601b46ff491bf8fe060d
GitOrigin-RevId: e03192bbb6a6bc0ce0397e7ef1f4ad4958a20f47
GitOrigin-RevId: 5724f81bb88d8e4c725aeafd49db5d5433860fd4
GitOrigin-RevId: b19019efc5b3c1beb9ce464bf7ac087f0bd01182
vkorukanti changed the title from "Update DeltaParquetFileFormat to add skip row flag column for files with DVs" to "Update DeltaParquetFileFormat to add isRowDeleted column populated from DV" on Jan 6, 2023
val broadcastHadoopConf: Option[Broadcast[SerializableConfiguration]] = None)
extends ParquetFileFormat {
// Validate that either all arguments for a DV-enabled read are provided or none of them are.
require(!(broadcastHadoopConf.isDefined ^ broadcastDvMap.isDefined ^ tablePath .isDefined ^
Collaborator

nit: `tablePath .isDefined` has a space there; is that valid syntax?

Collaborator Author

Will remove the extra space. It is valid from the compiler's perspective.
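As an aside, a minimal Scala sketch of the all-or-none intent behind the check above. The option names mirror the snippet, but the actual (truncated) expression in the PR may differ; counting the defined options avoids the parity-only semantics of chaining XOR over more than two booleans.

```scala
// Sketch: require that either every DV-read argument is provided or none is.
def validateDvReadArguments(dvReadArgs: Seq[Option[_]]): Unit = {
  val definedCount = dvReadArgs.count(_.isDefined)
  require(definedCount == 0 || definedCount == dvReadArgs.size,
    "Provide either all of the DV-read arguments or none of them")
}

// Usage with the fields from the snippet:
// validateDvReadArguments(Seq(broadcastHadoopConf, broadcastDvMap, tablePath))
```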

GitOrigin-RevId: 6730b4211535bbfac0bc410d5dbe78d568dd6e50
GitOrigin-RevId: d6204d5afcedd72f2f05da684cd001374e76b5dd
vkorukanti added a commit that referenced this pull request Jan 26, 2023
…an output

This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

Add a trait (used by `PrepareDeltaScan` to modify its output) that rewrites scans of DV-enabled tables to prune the deleted rows from the scan output.

This is a planner trait that injects a Filter just after the Delta Parquet scan. The transformer modifies the plan as follows (a minimal sketch of the resulting shape appears after this list):
 * Before rule: `<Parent Node> -> Delta Scan (key, value)`
   * Here we are reading the `key` and `value` columns from the Delta table.
 * After rule: `<Parent Node> -> Project(key, value) -> Filter (udf(__skip_row == 0)) -> Delta Scan (key, value, __skip_row)`
   * Here we insert a new column `__skip_row` into the Delta scan. The Parquet reader populates it using the DV corresponding to the Parquet file being read (refer [to the change](#1542)); it contains `0` if we want to keep the row.
   * The created scan also disables Parquet file splitting and filter pushdown, because generating `__skip_row` requires reading the rows of a file consecutively to derive the row index. This is a cost we pay until we upgrade to the latest Apache Spark, whose Parquet reader generates the row index correctly irrespective of file splitting and filter pushdown.
   * The created scan also carries a broadcast variable holding a Parquet file -> DV file map. The created Parquet reader uses this map to find the DV file corresponding to the Parquet file it is reading.
   * The created Filter keeps only the rows whose `__skip_row` equals `0`, i.e. it drops the rows marked deleted in the DV.
   * Finally, a `Project` keeps the plan node output the same as before the rule is applied.
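To illustrate the shape of the rewrite, here is a minimal sketch of the same Filter-plus-Project structure expressed with the DataFrame API (the actual trait rewrites the logical plan directly; `__skip_row` stands in for the internal column name used above):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// The scan already appends `__skip_row`; keep only the rows the DV did not
// delete, then drop the helper column so the output matches the pre-rule plan.
def pruneDeletedRows(scanWithSkipRow: DataFrame): DataFrame =
  scanWithSkipRow
    .filter(col("__skip_row") === 0)
    .drop("__skip_row")
```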

In addition, this PR:
* adds the `deletionVector` field to the DeltaLog protocol objects (`AddFile`, `RemoveFile`)
* updates `OptimizeMetadataOnlyDeltaQuery` to take the DVs into account when calculating the row count (a sketch of the adjusted count follows below)
* adds end-to-end integration of reading Delta tables with DVs in `DeletionVectorsSuite`

Follow-up PRs will add extensive tests.
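As a sketch of the metadata-only row count adjustment mentioned above (with assumed field names, not the actual Delta classes): each file's logical row count is its physical record count minus the cardinality of its deletion vector.

```scala
// Hypothetical per-file statistics; the real code derives these from AddFile
// and its deletionVector descriptor.
case class FileRowStats(numPhysicalRecords: Long, dvCardinality: Long)

// Metadata-only row count: subtract each file's deleted rows, then sum.
def logicalRowCount(files: Seq[FileRowStats]): Long =
  files.map(f => f.numPhysicalRecords - f.dvCardinality).sum
```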

Close #1560

GitOrigin-RevId: 3d67b6240865d880493f1d15a80b00cb079dacdc
tdas pushed a commit that referenced this pull request Apr 23, 2024
…sting DV Information (#2888)


#### Which Delta project/connector is this regarding?

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Previously, we relied on an [expensive broadcast of DV files](#1542) to pass the DV files to the associated Parquet files. With [custom metadata for files](apache/spark#40677) introduced in Spark 3.5, we can now pass the DV through the custom metadata field instead, which is expected to improve the performance of DV reads in Delta.
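For context, a minimal sketch of the broadcast-based lookup this PR replaces (hypothetical types; the real descriptor and map live in Delta's DV read path):

```scala
import org.apache.spark.broadcast.Broadcast

// Hypothetical descriptor standing in for Delta's real DV descriptor type.
case class DvDescriptor(dvFilePath: String, cardinality: Long)

// Earlier approach: the driver broadcasts a Parquet-file -> DV map and each
// reader task looks up the DV for the file it is reading. With Spark 3.5's
// per-file custom metadata, the descriptor can travel with the file instead.
def dvForFile(
    broadcastDvMap: Broadcast[Map[String, DvDescriptor]],
    parquetFilePath: String): Option[DvDescriptor] =
  broadcastDvMap.value.get(parquetFilePath)
```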
## How was this patch tested?

Adjusted the existing UTs that cover our changes.
## Does this PR introduce _any_ user-facing changes?
No.
vkorukanti deleted the dv8 branch May 9, 2024 02:41