
[Protocol] Clarify handling of duplicate add/remove actions #3784

Open · wants to merge 3 commits into base: master
Conversation

@rtyler (Member) commented Oct 19, 2024

With more recent DBRs (14.x, 15.x), a table previously written by delta-rs became unreadable due to the following:

    24/09/26 01:12:43 ERROR Uncaught throwable from user code: com.databricks.sql.transaction.tahoe.DeltaRuntimeException: [DELTA_DUPLICATE_ACTIONS_FOUND] File operation 'remove' for path ds=2024-09-25/part-00631-d7048577-f7b0-3b87-9f2e-336d394e0387-c000.gz.parquet was specified several times.
    It conflicts with ds=2024-09-25/part-00631-d7048577-f7b0-3b87-9f2e-336d394e0387-c000.gz.parquet.
    It is not valid for multiple file operations with the same path to exist in a single commit.

This particular scenario resulted from an extremely rare race condition we discovered in AWS in some delta-rs-related code, but I could not find any statement in the protocol indicating that duplicate actions are actually an invalid state.

I believe that they should be considered an invalid state, and that the validation error raised by newer Databricks runtimes is a reasonable one. This change therefore adds language to the protocol stating that DBR's behavior is acceptable for such a Delta table.
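The rule being proposed (a single commit must not contain more than one file operation for the same path) can be sketched as a simple commit-time check. This is an illustrative validator, not the delta-rs API; the action tuples and paths below are hypothetical:

```python
def find_duplicate_actions(actions):
    """Return paths that appear in more than one add/remove action
    within a single commit. Any second file operation on the same
    path (add+remove, remove+remove, add+add) is flagged, matching
    the DELTA_DUPLICATE_ACTIONS_FOUND check in newer DBRs."""
    seen = set()
    duplicates = set()
    for action_type, path in actions:
        if action_type in ("add", "remove"):
            if path in seen:
                duplicates.add(path)
            seen.add(path)
    return duplicates

# Example commit with a duplicated remove, which should be rejected:
commit = [
    ("add", "ds=2024-09-25/part-00001.parquet"),
    ("remove", "ds=2024-09-25/part-00002.parquet"),
    ("remove", "ds=2024-09-25/part-00002.parquet"),  # invalid duplicate
]
assert find_duplicate_actions(commit) == {"ds=2024-09-25/part-00002.parquet"}
```

A writer performing such a check before publishing the commit would have avoided the unreadable-table state described above.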

[Slack thread on the topic](https://delta-users.slack.com/archives/C03FVMHT93Q/p1728321032555229)
