Automatic conflict resolution #3068

Open
wjones127 opened this issue Oct 30, 2024 · 1 comment

Comments

@wjones127
Contributor

wjones127 commented Oct 30, 2024

Right now, it's common for one of any two concurrent update/delete transactions to return a CommitConflict error. This makes it hard to make many updates. The best workaround for now is to do updates serially, but it would be nice if that weren't required.

We have a retry loop for when transactions are compatible. If they need resolution, however, we just return an error. What if we could resolve the conflicts automatically instead?

Resolving conflicts

In the commit loop, transactions would just be a single transaction. But in other use cases, we might want to be able to consolidate all the changes into a single transaction.

This loop would need to load all the previous transactions since each pending transaction's read_version. To make this faster, we should consider the caching described in #3057.

A few examples illustrate what this could do:

  1. Two concurrent deletes: each delete transaction will have its own deletion files and, potentially, removed fragments. We should take the union of the removed fragment ids and write a new deletion file that is the union of the two.
  2. A concurrent append and delete: should transform into a transaction that has both.
  3. Two concurrent appends: should handle modified fragments the same way as deletions, and then combine the appends the same way.
  4. Add column and append: if the new column is nullable, we will be able to append a subschema.
    1. What if it is not nullable? Then it will fail?
  5. Overwrite and append: the append can be ignored. In fact, any other transaction could be ignored.
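
As a rough illustration of the first example (two concurrent deletes), here is a minimal sketch in Python. The `DeleteTxn` type and its fields are hypothetical stand-ins, not Lance's actual transaction types, and deletion files are modeled as plain sets of row offsets per fragment.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DeleteTxn:
    """Hypothetical model of a delete transaction (not Lance's real type)."""
    # fragment id -> row offsets marked as deleted in that fragment
    deleted_rows: Dict[int, Set[int]] = field(default_factory=dict)
    # fragments removed entirely because all of their rows were deleted
    removed_fragments: Set[int] = field(default_factory=set)


def merge_deletes(a: DeleteTxn, b: DeleteTxn) -> DeleteTxn:
    """Combine two concurrent deletes into a single transaction."""
    # Union the removed fragment ids.
    removed = a.removed_fragments | b.removed_fragments
    merged: Dict[int, Set[int]] = {}
    for txn in (a, b):
        for frag_id, rows in txn.deleted_rows.items():
            if frag_id in removed:
                # The fragment is gone, so it needs no deletion file.
                continue
            # Union the deletion sets for fragments touched by either side.
            merged.setdefault(frag_id, set()).update(rows)
    return DeleteTxn(deleted_rows=merged, removed_fragments=removed)
```

This sketch does not detect the case where the unioned deletions happen to cover every row of a fragment; doing so would need the fragment's row count, which is left out here.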
@wjones127 wjones127 added the enhancement New feature or request label Oct 30, 2024
@westonpace
Contributor

This makes sense. Both A and B are happening at the same time. We should consider what would happen if they were not concurrent. I think there are several situations:

  • Both A then B and B then A have the same output: We can safely merge
  • A then B has different output from B then A and both are valid: We should have some kind of user flag here (I think Databricks calls this the WriteSerializable isolation level). If the flag is set, then we can pick one of the outputs and merge. If not, we should raise an error.
  • A then B fails but B then A succeeds: We could maybe use the same flag. Or we could just force the B then A order.
  • A then B fails and B then A fails: We should fail in this situation
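
The four situations above can be sketched as a small classifier. This is a toy model only: the dataset is a `frozenset` of rows, operations are plain functions that raise `ValueError` when they cannot apply, and the `Resolution` names are made up for illustration. A real implementation could not replay operations like this and would have to reason from transaction metadata.

```python
from enum import Enum
from typing import Callable, FrozenSet, Optional

State = FrozenSet[int]            # toy dataset: a set of row values
Op = Callable[[State], State]     # raises ValueError if it cannot apply


class Resolution(Enum):
    MERGE = "merge"               # both orders agree
    NEEDS_FLAG = "needs_flag"     # both valid, different output
    FORCE_ORDER = "force_order"   # only one order succeeds
    FAIL = "fail"                 # neither order succeeds


def try_apply(first: Op, second: Op, state: State) -> Optional[State]:
    """Apply first then second; return None if either step fails."""
    try:
        return second(first(state))
    except ValueError:
        return None


def classify(a: Op, b: Op, state: State) -> Resolution:
    ab = try_apply(a, b, state)
    ba = try_apply(b, a, state)
    if ab is None and ba is None:
        return Resolution.FAIL
    if ab is None or ba is None:
        return Resolution.FORCE_ORDER
    return Resolution.MERGE if ab == ba else Resolution.NEEDS_FLAG
```

For instance, two appends of different rows classify as `MERGE`, while an append of row 3 racing a "delete rows greater than 2" classifies as `NEEDS_FLAG`, because the appended row survives only in the delete-then-append order.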

Two concurrent deletes

Both orders have the same output. We can merge.

A concurrent append and delete

Append then delete would have output X
Delete then append would have output Y

We should only merge if the user flag is set to allow this kind of thing. We would probably just merge to "Delete then Append" because that's the only one we can calculate (since we don't store the delete filter on the transaction).
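
A minimal sketch of that flag-gated merge might look like the following. All names here (`Delete`, `Append`, `Merged`, `allow_reorder`, and the local `CommitConflict` exception class) are hypothetical placeholders rather than Lance's API. The point is that the deletion data only refers to fragments that existed when the delete ran, so carrying it over unchanged next to the appended fragments yields the "Delete then Append" result.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


class CommitConflict(Exception):
    """Stand-in for the conflict error; not the real Lance exception."""


@dataclass
class Delete:
    deleted_rows: Dict[int, Set[int]]   # fragment id -> deleted row offsets


@dataclass
class Append:
    new_fragments: List[int]            # ids of the appended fragments


@dataclass
class Merged:
    deleted_rows: Dict[int, Set[int]]
    new_fragments: List[int]


def merge_append_delete(append: Append, delete: Delete,
                        allow_reorder: bool = False) -> Merged:
    """Merge as "delete then append", but only if the user opted in."""
    if not allow_reorder:
        # The two orders can differ, so refuse without the flag.
        raise CommitConflict("append and delete conflict; flag not set")
    # The delete never saw the appended fragments, so they stay untouched.
    return Merged(deleted_rows=dict(delete.deleted_rows),
                  new_fragments=list(append.new_fragments))
```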

Two concurrent appends

Both orders have the same output. We can merge.

Add column and append: nullable column

Both orders have the same output. We can merge.

Add column and append: non-nullable column

Both directions fail. We can raise an error.

Overwrite and append

Overwrite then append would fail (or at least could, if schema was changed)
Append then overwrite would succeed

Probably don't need to check user flag here and just assume it was "append then overwrite"

What if it is not nullable? Then it will fail?

Yes, I think failure makes sense here. In the other cases we are just restoring what would happen if the two operations ran serially without conflict. For example, if two deletes conflict and get merged, you get the same result as you would if the second delete had used the result of the first delete as its read version.

So if we apply that same logic here, we get an error in both orders: either "you ran add_columns and didn't give a value for all rows" or "you ran append after the add columns and the append schema was missing non-nullable columns".
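
The nullable versus non-nullable distinction can be sketched as a schema check on the append. `Field` and `check_append` below are hypothetical helpers for illustration, not Lance functions; the assumption is that an append may omit a column only when that column is nullable.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Field:
    name: str
    nullable: bool


def check_append(table_schema: List[Field], append_columns: Set[str]) -> None:
    """Raise ValueError if the append omits any non-nullable column."""
    missing = [f.name for f in table_schema
               if f.name not in append_columns and not f.nullable]
    if missing:
        raise ValueError(f"append is missing non-nullable columns: {missing}")


# After add_columns introduced a nullable "note" column, an append that was
# written against the old schema (only "id") can still be merged.
check_append([Field("id", False), Field("note", True)], {"id"})
```

If the added column were non-nullable, the same call would raise, which matches the "fail in both orders" outcome described above.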

@wjones127 wjones127 self-assigned this Nov 6, 2024
@wjones127 wjones127 added this to the Lance Papercuts milestone Nov 6, 2024