From f7177bdd82c2e182748f2337f2427372a7cd80ee Mon Sep 17 00:00:00 2001
From: "R. Tyler Croy"
Date: Sat, 19 Oct 2024 13:40:22 +0000
Subject: [PATCH] [Protocol] Clarify handling of duplicate add/remove actions

With more recent DBRs (14.x, 15.x) a table previously written by
delta-rs became unreadable due to the following:

24/09/26 01:12:43 ERROR Uncaught throwable from user code: com.databricks.sql.transaction.tahoe.DeltaRuntimeException: [DELTA_DUPLICATE_ACTIONS_FOUND] File operation 'remove' for path ds=2024-09-25/part-00631-d7048577-f7b0-3b87-9f2e-336d394e0387-c000.gz.parquet was specified several times. It conflicts with ds=2024-09-25/part-00631-d7048577-f7b0-3b87-9f2e-336d394e0387-c000.gz.parquet. It is not valid for multiple file operations with the same path to exist in a single commit.

This particular scenario resulted from an extremely rare race condition
we discovered in AWS for some delta-rs related code, but I could not
find any statements in the protocol to indicate that duplicate actions
were actually an invalid state.

I believe that they _should_ be considered an invalid state, and that
the validation error provided by newer Databricks runtimes is a
reasonable one. Therefore this change adds some wording to the protocol
stating that DBR's behavior is acceptable for such a Delta table.

[Slack thread on the topic](https://delta-users.slack.com/archives/C03FVMHT93Q/p1728321032555229)
---
 PROTOCOL.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/PROTOCOL.md b/PROTOCOL.md
index f7b4a3bff0..3df7a97d31 100644
--- a/PROTOCOL.md
+++ b/PROTOCOL.md
@@ -398,6 +398,8 @@ That means specifically that for any commit…
 The `dataChange` flag on either an `add` or a `remove` can be set to `false` to indicate that an action when combined with other actions in the same atomic version only rearranges existing data or adds new statistics.
 For example, streaming queries that are tailing the transaction log can use this flag to skip actions that would not affect the final results.
 
+A single transaction should not contain duplicate `add` or `remove` actions. Readers may treat such transactions as invalid.
+
 The schema of the `add` action is as follows:
 
 Field Name | Data Type | Description | optional/required