Retried commit after ODistributedRecordLockedException fails and leaks distributed transaction slot #10289
Comments
This appears to do what I want:
I think the same patch would need to be applied to
Hi @timw,

For the stalled transaction sequential, there should be a reset when the transaction fails, which frees the sequential position; if you want to track it, these are the calls to

For the problem with the id reset, I do remember fixing a similar problem in the 3.2.x series a while ago, but the only commits I can find are 4ef8c44 and b1375cb, which seem to be already part of your changes.

Removing all the mapping in

The process that happens on commit will be transforming an id from

When reverting this id assignment because of concurrent assignment of

So if before resetting in the

after reset the map should look something like

so need to remove the

the id reset should not reset the clusterId (in the

Yes

(P.S. I know that many of the things I wrote here you already know; I'm making it clear to everyone.)
Thanks @tglman - your explanation was very helpful in understanding the intent here (although the updated rids map works in the opposite direction to your examples, tracking back references, so it's a bit confusing).

Having another look at 3.2/develop, it looks like you already fixed this in a667b24, which I should have noticed earlier, so this looks like it's only a 3.1.x bug. That patch appears to work for me.

With the patch applied, the net effect is to set the id to
Backport of a667b24. This fixes a transaction failure when a transaction is retried because of an ODistributedRecordLockedException. Fixes orientechnologies#10289
OrientDB Version: 3.1.21 (also present in 3.2)
Java Version: 1.8
OS: Linux/Mac
We have experienced production outages where all commit activity in a distributed database stalls, which by analysing heap dumps we have isolated to all of the `promisedSequential` slots in `OTransactionSequenceManager` being occupied by transactions that have long since completed.

We've reproduced the effect locally by running two client threads that rapidly (i.e. with no pauses) query and update a shared record, and create a separate record at the same time.
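As a rough illustration of how such a stall can build up, the sketch below models a fixed pool of slots in which a failed commit never gives its slot back. `Outcome`, `SlotPool`, and `SlotLeakDemo` are hypothetical names invented for this example and are not OrientDB classes; the real `OTransactionSequenceManager` is more involved, and the pool size of 2 is an arbitrary assumption.

```java
import java.util.Optional;

/** Possible results of one simulated commit. */
enum Outcome { COMMITTED, FAILED, STALLED }

/** Hypothetical fixed-size slot pool; an illustration only, not OrientDB code. */
final class SlotPool {
    private final boolean[] occupied;

    SlotPool(int size) {
        this.occupied = new boolean[size];
    }

    /** Claims the first free slot, or returns empty when every slot is taken. */
    synchronized Optional<Integer> acquire() {
        for (int i = 0; i < occupied.length; i++) {
            if (!occupied[i]) {
                occupied[i] = true;
                return Optional.of(i);
            }
        }
        return Optional.empty();
    }

    /** Frees a slot so that a later transaction can claim it. */
    synchronized void release(int slot) {
        occupied[slot] = false;
    }

    /** Counts the slots that are currently free. */
    synchronized int freeSlots() {
        int free = 0;
        for (boolean taken : occupied) {
            if (!taken) {
                free++;
            }
        }
        return free;
    }
}

final class SlotLeakDemo {
    /**
     * Simulates one commit. When leakOnFailure is true, the failure path
     * skips the release, which is the assumed shape of the leak.
     */
    static Outcome commit(SlotPool pool, boolean fail, boolean leakOnFailure) {
        Optional<Integer> slot = pool.acquire();
        if (!slot.isPresent()) {
            return Outcome.STALLED; // no free slot: the commit cannot proceed
        }
        if (fail) {
            if (!leakOnFailure) {
                pool.release(slot.get());
            }
            return Outcome.FAILED;
        }
        pool.release(slot.get());
        return Outcome.COMMITTED;
    }

    public static void main(String[] args) {
        SlotPool pool = new SlotPool(2);
        commit(pool, true, true); // first failure leaks a slot
        commit(pool, true, true); // second failure leaks the last slot
        System.out.println(commit(pool, false, true)); // prints STALLED
    }
}
```

In this model, every later commit stalls once the number of leaked failures equals the pool size, even though the failed transactions finished long ago.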
By instrumenting with logging, we isolated the slot leak to the case where the transaction fails with an `OClusterDoesNotExistException: Cluster id '-1' is outside the of range of configured`. When this error occurs, the transaction in the sequence manager is never cleared (either by `notifySuccess` or `notifyFailure`).

The root cause of the `OClusterDoesNotExistException` appears to be:

1. `ONewDistributedTransactionManager.retriedCommit` with an `ODistributedRecordLockedException`
2. `OTransactionRealAbstract.resetAllocatedIds`, which resets the RID for the object creation (to `-1,-2`) and invokes `updateIdentityAfterCommit`.
3. `ONewDistributedTransactionManager.getInvolvedClusters` fails, because one of the records has a `-1` cluster id.

It looks like a call to `ODatabaseDocumentAbstract.assignAndCheckCluster` is missing during the reset (which is what happens in the original assignment in `OTransactionOptimisticServer.assignClusters`), and introducing that call does appear to fix the problem.

What I'm not clear on is whether the `ODatabaseDocumentAbstract.assignAndCheckCluster` call should be added to `OTransactionRealAbstract.updateIdentityAfterCommit` or just to `OTransactionRealAbstract.resetAllocatedIds`, and how exactly `updatedRids` should be corrected when that happens.