Right now, it's common for any two concurrent update/delete transactions to have one of them return a CommitConflict error. This makes it hard to run many updates at once. The best workaround for now is to do updates serially, but it would be nice if that weren't required.
We have a retry loop for the case where transactions are compatible. If they need resolution, however, we just return an error. What if instead we could automatically resolve the conflicts?
Resolving conflicts
In the commit loop, the set of transactions to resolve would just be a single transaction. But in other use cases, we might want to be able to consolidate all the changes into a single transaction.
This loop would need to load all the transactions committed since each pending transaction's read_version. To make this faster, we should consider the caching described in #3057.
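To make the shape of this concrete, here is a rough sketch of what such a commit loop could look like. All of the names (Store, Transaction, try_rebase, transactions_since, and so on) are hypothetical placeholders rather than the existing API; the point is only the control flow: on a conflict, load everything committed since our read_version, try to rewrite the pending transaction on top of it, advance the read version, and retry.

```rust
// Rough sketch of a commit loop with automatic resolution. Every name here
// is a hypothetical placeholder, not the existing API; only the control
// flow matters.

struct Transaction {
    read_version: u64,
    // operation payload elided
}

struct Manifest {
    version: u64,
}

enum CommitError {
    /// Another writer committed first and the changes need resolution.
    Conflict,
    /// The pending transaction could not be rewritten on top of the
    /// concurrent ones (e.g. non-nullable add column vs. append).
    Unresolvable,
    TooManyRetries,
}

trait Store {
    /// Attempt an atomic commit of `tx` against the current latest version.
    fn try_commit(&mut self, tx: &Transaction) -> Result<Manifest, CommitError>;
    /// All transactions committed after `version`; this is the step that
    /// would benefit from the caching described in #3057.
    fn transactions_since(&self, version: u64) -> Vec<Transaction>;
    fn latest_version(&self) -> u64;
    /// Rewrite `pending` so it applies cleanly after `committed`, if possible.
    fn try_rebase(&self, pending: Transaction, committed: &Transaction)
        -> Result<Transaction, CommitError>;
}

fn commit_with_resolution<S: Store>(
    store: &mut S,
    mut pending: Transaction,
    max_retries: usize,
) -> Result<Manifest, CommitError> {
    for _ in 0..max_retries {
        match store.try_commit(&pending) {
            Ok(manifest) => return Ok(manifest),
            Err(CommitError::Conflict) => {
                // Resolve against everything committed since our read_version,
                // one transaction at a time, then retry from the tip.
                for committed in store.transactions_since(pending.read_version) {
                    pending = store.try_rebase(pending, &committed)?;
                }
                pending.read_version = store.latest_version();
            }
            Err(other) => return Err(other),
        }
    }
    Err(CommitError::TooManyRetries)
}
```

A real implementation would also want to bound how much work a single retry does, which is where the caching from #3057 would help.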
A few examples illustrate what this could do:
Two concurrent deletes: each delete transaction will have its own deletion files, and potentially removed fragments. We should take the union of the removed fragment ids and write a new deletion file that is the union of the two (see the delete-merge sketch after this list).
A concurrent append and delete: should be transformed into a single transaction that contains both.
Two concurrent appends: modified fragments should be handled the same way as for deletions, and the appended fragments combined the same way.
Add column and append: if the new column is nullable, we will be able to append the subschema.
What if it is not nullable? Then it will fail?
Overwrite and append: the append can be ignored. In fact, any other concurrent transaction could be ignored.
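As a concrete illustration of the first case above, here is a minimal sketch of merging two concurrent deletes. It assumes a deliberately simplified DeleteTransaction shape (a set of removed fragment ids plus per-fragment deleted row offsets standing in for the deletion files), not the real transaction or deletion-file formats.

```rust
// Minimal sketch: merge two concurrent deletes by unioning removed fragment
// ids and unioning the per-fragment deletion sets. `DeleteTransaction` is a
// simplified stand-in, not the real transaction type.

use std::collections::{HashMap, HashSet};

#[derive(Debug, Default, Clone)]
struct DeleteTransaction {
    /// Fragments removed entirely by this delete.
    removed_fragment_ids: HashSet<u64>,
    /// For partially deleted fragments: fragment id -> deleted row offsets
    /// (stands in for the deletion file each transaction would write).
    deleted_rows: HashMap<u64, HashSet<u32>>,
}

/// Merge two concurrent deletes into one transaction with the same effect as
/// running them serially in either order.
fn merge_deletes(a: &DeleteTransaction, b: &DeleteTransaction) -> DeleteTransaction {
    // Union of the removed fragment ids.
    let removed_fragment_ids: HashSet<u64> = a
        .removed_fragment_ids
        .union(&b.removed_fragment_ids)
        .copied()
        .collect();

    // Union the per-fragment deletion sets; this models writing a new
    // deletion file that is the union of the two.
    let mut deleted_rows: HashMap<u64, HashSet<u32>> = HashMap::new();
    for (frag, rows) in a.deleted_rows.iter().chain(b.deleted_rows.iter()) {
        // A fragment that either transaction removed outright needs no
        // deletion file at all.
        if removed_fragment_ids.contains(frag) {
            continue;
        }
        deleted_rows.entry(*frag).or_default().extend(rows.iter().copied());
    }

    DeleteTransaction { removed_fragment_ids, deleted_rows }
}

fn main() {
    let a = DeleteTransaction {
        removed_fragment_ids: HashSet::from([1]),
        deleted_rows: HashMap::from([(3, HashSet::from([0, 5]))]),
    };
    let b = DeleteTransaction {
        removed_fragment_ids: HashSet::from([2]),
        deleted_rows: HashMap::from([(3, HashSet::from([5, 9]))]),
    };
    let merged = merge_deletes(&a, &b);
    assert_eq!(merged.removed_fragment_ids, HashSet::from([1, 2]));
    assert_eq!(merged.deleted_rows[&3], HashSet::from([0, 5, 9]));
}
```

The merged transaction has the same effect as running the two deletes serially in either order, which is what makes this case safe to resolve automatically.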
This makes sense. Both A and B are happening at the same time. We should consider what would happen if they were not concurrent. I think there are several situations (a sketch of the resulting decision logic follows this list):
Both A then B and B then A have the same output: We can safely merge
A then B has different output from B then A and both are valid: We should have some kind of user flag here (I think Databricks calls this the WriteSerializable isolation level). If the flag is set, then we can pick one of the outputs and merge. If not, we should raise an error.
A then B fails but B then A succeeds: We could maybe use the same flag. Or we could just force the B then A order.
A then B fails and B then A fails: We should fail in this situation
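Here is a small sketch of that decision. Commutativity and write_serializable are hypothetical names (the flag stands in for the WriteSerializable-style option above), and whether the "only one order succeeds" case should merge automatically or only under the flag is left as a comment, since both options are floated above.

```rust
// Sketch of the merge/reject decision; all names are placeholders.

/// How the outputs of "A then B" and "B then A" relate.
enum Commutativity {
    /// Both orders produce the same result.
    SameOutput,
    /// Both orders are valid but produce different results.
    OrderDependent,
    /// Only one order succeeds.
    OneOrderFails,
    /// Neither order succeeds.
    BothFail,
}

enum Resolution {
    Merge,
    Reject,
}

/// Decide whether to merge, given how the two orders relate and whether the
/// user opted into the weaker, WriteSerializable-style isolation level.
fn resolve(c: Commutativity, write_serializable: bool) -> Resolution {
    match c {
        Commutativity::SameOutput => Resolution::Merge,
        // Different-but-valid outputs: only merge if the user allowed it.
        Commutativity::OrderDependent if write_serializable => Resolution::Merge,
        Commutativity::OrderDependent => Resolution::Reject,
        // Only one order works: gate on the same flag, or alternatively
        // always force the order that succeeds.
        Commutativity::OneOrderFails if write_serializable => Resolution::Merge,
        Commutativity::OneOrderFails => Resolution::Reject,
        Commutativity::BothFail => Resolution::Reject,
    }
}
```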
Two concurrent deletes
Both orders have the same output. We can merge.
A concurrent append and delete
Append then delete would have output X
Delete then append would have output Y
We should only merge if the user flag is set to allow this kind of thing. Probably just merge to "Delete then Append" because that's the only one we can calculate (since we don't store the delete filter on the transaction).
Two concurrent appends
Both orders have the same output. We can merge.
Add column and append: nullable column
Both orders have the same output. We can merge.
Add column and append: non-nullable column
Both directions fail. We can raise an error.
Overwrite and append
Overwrite then append would fail (or at least could, if the schema was changed)
Append then overwrite would succeed
Probably don't need to check the user flag here; just assume it was "append then overwrite"
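Pulling the cases above together, a resolver could classify concrete operation pairs roughly like this. Op is a simplified stand-in for the real transaction operations, Commutativity repeats the classification from the earlier sketch so the snippet stands alone, and the catch-all arm is just a conservative default rather than an analysis of the remaining pairs.

```rust
// Sketch of classifying concrete operation pairs; names are placeholders.

/// Simplified stand-in for the real transaction operations.
enum Op {
    Delete,
    Append,
    AddColumns { nullable: bool },
    Overwrite,
}

/// Same classification as in the earlier sketch, repeated for standalone use.
enum Commutativity {
    SameOutput,
    OrderDependent,
    OneOrderFails,
    BothFail,
}

fn classify(a: &Op, b: &Op) -> Commutativity {
    use Op::*;
    match (a, b) {
        // Two deletes or two appends: same output either way, merge by
        // unioning their changes.
        (Delete, Delete) | (Append, Append) => Commutativity::SameOutput,
        // Append + delete: the orders differ; if merging is allowed, resolve
        // as "delete then append" since the delete filter is not stored on
        // the transaction.
        (Append, Delete) | (Delete, Append) => Commutativity::OrderDependent,
        // Add nullable column + append: same output either way.
        (AddColumns { nullable: true }, Append)
        | (Append, AddColumns { nullable: true }) => Commutativity::SameOutput,
        // Add non-nullable column + append: fails in both orders.
        (AddColumns { nullable: false }, Append)
        | (Append, AddColumns { nullable: false }) => Commutativity::BothFail,
        // Overwrite + append: assume "append then overwrite", the order that
        // succeeds (the append is effectively dropped).
        (Overwrite, Append) | (Append, Overwrite) => Commutativity::OneOrderFails,
        // Pairs not discussed in this issue: treat conservatively as
        // unresolvable until they get their own analysis.
        _ => Commutativity::BothFail,
    }
}
```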
What if it is not nullable? Then it will fail?
Yes, I think failure makes sense here. In the other cases, we are just restoring what would happen if the two operations ran serially without conflict. For example, if the two deletes conflict and get merged, you get the same result as you would if the second delete had the result of the first delete as its read version.
So if we apply that same logic here, we get an error in both cases: either "you ran add_columns and didn't give a value for all rows" or "you ran append after the add_columns and the append schema was missing non-nullable columns".