Right now, it's common for any two concurrent update/delete transactions to have one of them return a CommitConflict error. This makes it hard to run many updates at once. The best workaround for now is to do updates serially, but it would be nice if that weren't required.
We have a retry loop for the case where transactions are compatible. If they need resolution, however, we just return an error. What if instead we could automatically resolve the conflicts?
Resolving conflicts
In the commit loop, the set of transactions to resolve would just be a single transaction. But in other use cases, we might want to be able to consolidate all the changes into a single transaction.
This loop would need to load all the transactions committed since each pending transaction's read_version. To make this faster, we should consider the caching described in #3057.
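To make the shape of this concrete, here is a rough sketch of what such a commit loop could look like. All of the names (Store, Transaction, try_rebase, transactions_since, and so on) are hypothetical placeholders rather than the existing API; the point is only the control flow: on a conflict, load everything committed since our read_version, try to rewrite the pending transaction on top of it, advance the read version, and retry.

```rust
// Rough sketch of a commit loop with automatic resolution. Every name here
// is a hypothetical placeholder, not the existing API; only the control
// flow matters.

struct Transaction {
    read_version: u64,
    // operation payload elided
}

struct Manifest {
    version: u64,
}

enum CommitError {
    /// Another writer committed first and the changes need resolution.
    Conflict,
    /// The pending transaction could not be rewritten on top of the
    /// concurrent ones (e.g. non-nullable add column vs. append).
    Unresolvable,
    TooManyRetries,
}

trait Store {
    /// Attempt an atomic commit of `tx` against the current latest version.
    fn try_commit(&mut self, tx: &Transaction) -> Result<Manifest, CommitError>;
    /// All transactions committed after `version`; this is the step that
    /// would benefit from the caching described in #3057.
    fn transactions_since(&self, version: u64) -> Vec<Transaction>;
    fn latest_version(&self) -> u64;
    /// Rewrite `pending` so it applies cleanly after `committed`, if possible.
    fn try_rebase(&self, pending: Transaction, committed: &Transaction)
        -> Result<Transaction, CommitError>;
}

fn commit_with_resolution<S: Store>(
    store: &mut S,
    mut pending: Transaction,
    max_retries: usize,
) -> Result<Manifest, CommitError> {
    for _ in 0..max_retries {
        match store.try_commit(&pending) {
            Ok(manifest) => return Ok(manifest),
            Err(CommitError::Conflict) => {
                // Resolve against everything committed since our read_version,
                // one transaction at a time, then retry from the tip.
                for committed in store.transactions_since(pending.read_version) {
                    pending = store.try_rebase(pending, &committed)?;
                }
                pending.read_version = store.latest_version();
            }
            Err(other) => return Err(other),
        }
    }
    Err(CommitError::TooManyRetries)
}
```

A real implementation would also want to bound how much work a single retry does, which is where the caching from #3057 would help.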
A few examples illustrate what this could do:
Two concurrent deletes: each delete transaction will have its own deletion files, and potentially removed fragments. We should take the union of the removed fragment ids and write a new deletion file that is the union of the two (see the delete-merge sketch after this list).
A concurrent append and delete: should be transformed into a single transaction that contains both.
Two concurrent appends: modified fragments should be handled the same way as for deletions, and the appended fragments combined the same way.
Add column and append: if the new column is nullable, we will be able to append the subschema.
What if it is not nullable? Then it will fail?
Overwrite and append: the append can be ignored. In fact, any other concurrent transaction could be ignored.
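As a concrete illustration of the first case above, here is a minimal sketch of merging two concurrent deletes. It assumes a deliberately simplified DeleteTransaction shape (a set of removed fragment ids plus per-fragment deleted row offsets standing in for the deletion files), not the real transaction or deletion-file formats.

```rust
// Minimal sketch: merge two concurrent deletes by unioning removed fragment
// ids and unioning the per-fragment deletion sets. `DeleteTransaction` is a
// simplified stand-in, not the real transaction type.

use std::collections::{HashMap, HashSet};

#[derive(Debug, Default, Clone)]
struct DeleteTransaction {
    /// Fragments removed entirely by this delete.
    removed_fragment_ids: HashSet<u64>,
    /// For partially deleted fragments: fragment id -> deleted row offsets
    /// (stands in for the deletion file each transaction would write).
    deleted_rows: HashMap<u64, HashSet<u32>>,
}

/// Merge two concurrent deletes into one transaction with the same effect as
/// running them serially in either order.
fn merge_deletes(a: &DeleteTransaction, b: &DeleteTransaction) -> DeleteTransaction {
    // Union of the removed fragment ids.
    let removed_fragment_ids: HashSet<u64> = a
        .removed_fragment_ids
        .union(&b.removed_fragment_ids)
        .copied()
        .collect();

    // Union the per-fragment deletion sets; this models writing a new
    // deletion file that is the union of the two.
    let mut deleted_rows: HashMap<u64, HashSet<u32>> = HashMap::new();
    for (frag, rows) in a.deleted_rows.iter().chain(b.deleted_rows.iter()) {
        // A fragment that either transaction removed outright needs no
        // deletion file at all.
        if removed_fragment_ids.contains(frag) {
            continue;
        }
        deleted_rows.entry(*frag).or_default().extend(rows.iter().copied());
    }

    DeleteTransaction { removed_fragment_ids, deleted_rows }
}

fn main() {
    let a = DeleteTransaction {
        removed_fragment_ids: HashSet::from([1]),
        deleted_rows: HashMap::from([(3, HashSet::from([0, 5]))]),
    };
    let b = DeleteTransaction {
        removed_fragment_ids: HashSet::from([2]),
        deleted_rows: HashMap::from([(3, HashSet::from([5, 9]))]),
    };
    let merged = merge_deletes(&a, &b);
    assert_eq!(merged.removed_fragment_ids, HashSet::from([1, 2]));
    assert_eq!(merged.deleted_rows[&3], HashSet::from([0, 5, 9]));
}
```

The merged transaction has the same effect as running the two deletes serially in either order, which is what makes this case safe to resolve automatically.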
This makes sense. Both A and B are happening at the same time. We should consider what would happen if they were not concurrent. I think there are several situations (a sketch of the resulting decision logic follows this list):
Both A then B and B then A have the same output: We can safely merge
A then B has different output from B then A and both are valid: We should have some kind of user flag here (I think Databricks calls this the WriteSerializable isolation level). If the flag is set, then we can pick one of the outputs and merge. If not, we should raise an error.
A then B fails but B then A succeeds: We could maybe use the same flag. Or we could just force the B then A order.
A then B fails and B then A fails: We should fail in this situation
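Here is a small sketch of that decision. Commutativity and write_serializable are hypothetical names (the flag stands in for the WriteSerializable-style option above), and whether the "only one order succeeds" case should merge automatically or only under the flag is left as a comment, since both options are floated above.

```rust
// Sketch of the merge/reject decision; all names are placeholders.

/// How the outputs of "A then B" and "B then A" relate.
enum Commutativity {
    /// Both orders produce the same result.
    SameOutput,
    /// Both orders are valid but produce different results.
    OrderDependent,
    /// Only one order succeeds.
    OneOrderFails,
    /// Neither order succeeds.
    BothFail,
}

enum Resolution {
    Merge,
    Reject,
}

/// Decide whether to merge, given how the two orders relate and whether the
/// user opted into the weaker, WriteSerializable-style isolation level.
fn resolve(c: Commutativity, write_serializable: bool) -> Resolution {
    match c {
        Commutativity::SameOutput => Resolution::Merge,
        // Different-but-valid outputs: only merge if the user allowed it.
        Commutativity::OrderDependent if write_serializable => Resolution::Merge,
        Commutativity::OrderDependent => Resolution::Reject,
        // Only one order works: gate on the same flag, or alternatively
        // always force the order that succeeds.
        Commutativity::OneOrderFails if write_serializable => Resolution::Merge,
        Commutativity::OneOrderFails => Resolution::Reject,
        Commutativity::BothFail => Resolution::Reject,
    }
}
```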
Two concurrent deletes
Both orders have the same output. We can merge.
A concurrent append and delete
Append then delete would have output X
Delete then append would have output Y
We should only merge if the user flag is set to allow this kind of thing. Probably just merge to "Delete then Append" because that's the only one we can calculate (since we don't store the delete filter on the transaction).
Two concurrent appends
Both orders have the same output. We can merge.
Add column and append: nullable column
Both orders have the same output. We can merge.
Add column and append: non-nullable column
Both directions fail. We can raise an error.
Overwrite and append
Overwrite then append would fail (or at least could, if the schema was changed)
Append then overwrite would succeed
Probably don't need to check the user flag here; just assume it was "append then overwrite"
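Pulling the cases above together, a resolver could classify concrete operation pairs roughly like this. Op is a simplified stand-in for the real transaction operations, Commutativity repeats the classification from the earlier sketch so the snippet stands alone, and the catch-all arm is just a conservative default rather than an analysis of the remaining pairs.

```rust
// Sketch of classifying concrete operation pairs; names are placeholders.

/// Simplified stand-in for the real transaction operations.
enum Op {
    Delete,
    Append,
    AddColumns { nullable: bool },
    Overwrite,
}

/// Same classification as in the earlier sketch, repeated for standalone use.
enum Commutativity {
    SameOutput,
    OrderDependent,
    OneOrderFails,
    BothFail,
}

fn classify(a: &Op, b: &Op) -> Commutativity {
    use Op::*;
    match (a, b) {
        // Two deletes or two appends: same output either way, merge by
        // unioning their changes.
        (Delete, Delete) | (Append, Append) => Commutativity::SameOutput,
        // Append + delete: the orders differ; if merging is allowed, resolve
        // as "delete then append" since the delete filter is not stored on
        // the transaction.
        (Append, Delete) | (Delete, Append) => Commutativity::OrderDependent,
        // Add nullable column + append: same output either way.
        (AddColumns { nullable: true }, Append)
        | (Append, AddColumns { nullable: true }) => Commutativity::SameOutput,
        // Add non-nullable column + append: fails in both orders.
        (AddColumns { nullable: false }, Append)
        | (Append, AddColumns { nullable: false }) => Commutativity::BothFail,
        // Overwrite + append: assume "append then overwrite", the order that
        // succeeds (the append is effectively dropped).
        (Overwrite, Append) | (Append, Overwrite) => Commutativity::OneOrderFails,
        // Pairs not discussed in this issue: treat conservatively as
        // unresolvable until they get their own analysis.
        _ => Commutativity::BothFail,
    }
}
```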
What if it is not nullable? Then it will fail?
Yes, I think failure makes sense here. In the other cases, we are just restoring what would happen if the two operations ran serially without conflict. For example, if the two deletes conflict and get merged, you get the same result as you would if the second delete had the result of the first delete as its read version.
So if we apply that same logic here, we get an error in both cases: either "you ran add_columns and didn't give a value for all rows" or "you ran append after the add_columns and the append schema was missing non-nullable columns".