Add DV table plan transformer trait to prune the deleted rows from scan output #1560

Closed · 3 commits
Conversation

@vkorukanti (Collaborator) commented on Jan 11, 2023:

This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

Adds a trait (used by PrepareDeltaScan to modify its output) that rewrites scans on DV-enabled tables to prune the deleted rows from the scan output.

The planner trait injects a Filter just above the Delta Parquet scan. The transformer modifies the plan as follows:

  • Before rule: <Parent Node> -> Delta Scan (key, value). Here we read the key and value columns from the Delta table.
  • After rule: <Parent Node> -> Project(key, value) -> Filter (udf(__skip_row == 0)) -> Delta Scan (key, value, __skip_row)
    • Here we insert a new column, __skip_row, into the Delta scan. The Parquet reader populates it using the DV corresponding to the Parquet file being read (refer to the change); it contains 0 if we want to keep the row.
    • The scan also disables Parquet file splitting and filter pushdown, because generating __skip_row requires reading the rows of a file consecutively to compute the row index. This is a cost we pay until we upgrade to the latest Apache Spark, whose Parquet reader generates the row_index automatically, regardless of file splitting and filter pushdown.
    • The scan also carries a broadcast variable holding a Parquet file -> DV file map. The Parquet reader uses this map to find the DV file corresponding to each Parquet file it reads.
    • The Filter keeps only rows where __skip_row equals 0, i.e., it filters out the deleted rows.
    • Finally, the Project keeps the plan node's output the same as before the rule is applied (see the sketch after this list).
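Below is a minimal sketch of this rewrite, assuming a scan node whose output already includes the extra __skip_row column. The helper name pruneDeletedRows and the plain EqualTo predicate are illustrative only (the actual rule wraps the comparison in a UDF so it cannot be pushed down into the scan):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, EqualTo, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

// Hypothetical helper: wrap a DV-aware scan (whose output includes __skip_row)
// with Filter(__skip_row == 0) and a Project that restores the original output.
def pruneDeletedRows(
    scanWithSkipRow: LogicalPlan,
    originalOutput: Seq[Attribute]): LogicalPlan = {
  val skipRow = scanWithSkipRow.output
    .find(_.name == "__skip_row")
    .getOrElse(throw new IllegalStateException("scan must produce __skip_row"))
  // Keep only the rows the deletion vector did not mark as deleted.
  val keepRow = EqualTo(skipRow, Literal(0))
  // The Project keeps the node's output identical to the pre-rule plan.
  Project(originalOutput, Filter(keepRow, scanWithSkipRow))
}
```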

In addition, this PR:

  • adds the deletionVector field to the DeltaLog protocol objects (AddFile, RemoveFile)
  • updates OptimizeMetadataOnlyDeltaQuery to take DVs into consideration when calculating the row count (see the sketch after this list)
  • adds an end-to-end integration test for reading Delta tables with DVs in DeletionVectorsSuite
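As a sketch of the row-count adjustment (the stats class and field names here are assumptions, not Delta's actual types): with DVs present, a file's logical row count is its physical record count minus the DV cardinality.

```scala
// Illustrative only: the field names are assumptions, not Delta's actual classes.
case class FileRowStats(numPhysicalRecords: Long, dvCardinality: Long)

// Logical row count of the table = physical records minus rows deleted via DVs.
def logicalRowCount(files: Seq[FileRowStats]): Long =
  files.map(f => f.numPhysicalRecords - f.dvCardinality).sum
```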

In follow-up PRs, we will add extensive tests.

@vkorukanti changed the title from "Add a planner rule to modify DV enabled tables to prune the deleted rows from scan output" to "Add DV table plan transformer trait to prune the deleted rows from scan output" on Jan 20, 2023.
* It requires that the given plan has already gone through [[OptimizeSubqueries]] and that the
* root node denoting a subquery has been removed and optimized appropriately.
*/
def transformWithSubqueries(plan: LogicalPlan)
A Member commented on the snippet above:
It's better to give it a different name as Spark has a transformWithSubqueries method which processes the tree in a different way. We also need to document this behavior clearly as now this becomes a utility method.

In PrepareDeltaScan, we want to scan all subqueries from the leaf nodes to the root (transformUp), but within each subquery, we want to scan from the root to the leaf nodes (transformDown). transformWithSubqueries doesn't have this behavior. (A sketch of the desired traversal follows.)
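A minimal sketch of that traversal, assuming Catalyst's SubqueryExpression.withNewPlan; the function name transformSubqueriesUpPlansDown is illustrative, not an existing Spark API:

```scala
import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical sketch: handle subquery plans bottom-up relative to each other,
// but apply the rule top-down (transformDown) within each individual plan.
def transformSubqueriesUpPlansDown(plan: LogicalPlan)(
    rule: PartialFunction[LogicalPlan, LogicalPlan]): LogicalPlan = {
  val withRewrittenSubqueries = plan.transformAllExpressions {
    case subquery: SubqueryExpression =>
      // Recurse first so nested subqueries are rewritten before their parents.
      subquery.withNewPlan(transformSubqueriesUpPlansDown(subquery.plan)(rule))
  }
  // Within this plan, apply the rule from the root down to the leaves.
  withRewrittenSubqueries.transformDown(rule)
}
```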

The Collaborator (Author) replied:

updated the docs.

@zsxwing (Member) left a review comment:
LGTM

@vkorukanti deleted the dv10 branch on October 2, 2023.