Truncate record failing when attempting to recover broken record #10286

timw · 2024-08-26T05:12:35Z

OrientDB Version: 3.1.21-SNAPSHOT (3.1.20 + our HA/replicated stability patches)

Java Version: 1.8

OS: Linux/arm64

We have a database in production with 9 broken records: #24:5773252 through #24:5773260
We suspect they were created by an unclean shutdown of the database server.

When these records are accessed (e.g. a SELECT or a DELETE), they fail with a BufferUnderflowException.
Reading some other similar issues, the solution recommended in those is to use TRUNCATE RECORD, however this is also failing.

Results for the first broken record:

orientdb {db=index}> select from #24:5773252

!ERROR:com.orientechnologies.orient.core.exception.ODatabaseException: Error on retrieving record #24:5773252 (cluster: searchdata)
        DB name="index"
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.executeReadRecord(ODatabaseDocumentEmbedded.java:1546)
        at com.orientechnologies.orient.core.tx.OTransactionNoTx.loadRecord(OTransactionNoTx.java:123)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.load(ODatabaseDocumentAbstract.java:1036)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.load(ODatabaseDocumentAbstract.java:136)
        at com.orientechnologies.orient.core.sql.executor.FetchFromRidsStep$1.fetchNext(FetchFromRidsStep.java:51)
        at com.orientechnologies.orient.core.sql.executor.FetchFromRidsStep$1.hasNext(FetchFromRidsStep.java:68)
        at com.orientechnologies.orient.core.sql.parser.OLocalResultSet.fetchNext(OLocalResultSet.java:37)
        at com.orientechnologies.orient.core.sql.parser.OLocalResultSet.<init>(OLocalResultSet.java:27)
        at com.orientechnologies.orient.core.sql.parser.OSelectStatement.execute(OSelectStatement.java:276)
        at com.orientechnologies.orient.core.sql.parser.OStatement.execute(OStatement.java:72)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.query(ODatabaseDocumentEmbedded.java:585)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.select(OConsoleDatabaseApp.java:1170)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.orientechnologies.common.console.OConsoleApplication.execute(OConsoleApplication.java:445)
        at com.orientechnologies.common.console.OConsoleApplication.executeCommands(OConsoleApplication.java:283)
        at com.orientechnologies.common.console.OConsoleApplication.run(OConsoleApplication.java:144)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.main(OConsoleDatabaseApp.java:159)
Caused by: java.nio.BufferUnderflowException
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:270)
        at java.nio.ByteBuffer.get(ByteBuffer.java:723)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.base.ODurablePage.getBinaryValue(ODurablePage.java:204)
        at com.orientechnologies.orient.core.storage.cluster.OClusterPage.getRecordBinaryValue(OClusterPage.java:545)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readFullEntry(OPaginatedClusterV2.java:1391)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.internalReadRecord(OPaginatedClusterV2.java:700)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readRecord(OPaginatedClusterV2.java:657)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readRecord(OPaginatedClusterV2.java:640)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.doReadRecord(OAbstractPaginatedStorage.java:5431)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:5141)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:1892)
        at com.orientechnologies.orient.core.db.document.SimpleRecordReader.readRecord(SimpleRecordReader.java:24)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.executeReadRecord(ODatabaseDocumentEmbedded.java:1493)
        ... 19 more

orientdb {db=index}> truncate record #24:5773252


!ERROR:com.orientechnologies.orient.core.exception.OCommandExecutionException: Error on executing command
        DB name="index"
        at com.orientechnologies.orient.core.sql.parser.OTruncateRecordStatement.executeSimple(OTruncateRecordStatement.java:57)
        at com.orientechnologies.orient.core.sql.executor.OSingleOpExecutionPlan.executeInternal(OSingleOpExecutionPlan.java:95)
        at com.orientechnologies.orient.core.sql.parser.OSimpleExecStatement.execute(OSimpleExecStatement.java:45)
        at com.orientechnologies.orient.core.sql.parser.OStatement.execute(OStatement.java:72)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.command(ODatabaseDocumentEmbedded.java:647)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.sqlCommand(OConsoleDatabaseApp.java:3469)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.truncateRecord(OConsoleDatabaseApp.java:642)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.orientechnologies.common.console.OConsoleApplication.execute(OConsoleApplication.java:445)
        at com.orientechnologies.common.console.OConsoleApplication.executeCommands(OConsoleApplication.java:283)
        at com.orientechnologies.common.console.OConsoleApplication.run(OConsoleApplication.java:144)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.main(OConsoleDatabaseApp.java:159)
Caused by: com.orientechnologies.orient.core.exception.OStorageException: Exception during execution of atomic operation inside of storage index
        DB name="index"
        at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.calculateInsideAtomicOperation(OAtomicOperationsManager.java:144)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.deleteRecord(OAbstractPaginatedStorage.java:2002)
        at com.orientechnologies.orient.core.sql.parser.OTruncateRecordStatement.executeSimple(OTruncateRecordStatement.java:47)
        ... 14 more
Caused by: java.lang.IndexOutOfBoundsException: 1330403569
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:252)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.base.ODurablePage.getByteValue(ODurablePage.java:245)
        at com.orientechnologies.orient.core.storage.cluster.OClusterPage.getRecordByteValue(OClusterPage.java:568)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.getPhysicalPosition(OPaginatedClusterV2.java:1010)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.doDeleteRecord(OAbstractPaginatedStorage.java:5386)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.lambda$deleteRecord$8(OAbstractPaginatedStorage.java:2003)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.calculateInsideAtomicOperation(OAtomicOperationsManager.java:138)
        ... 16 more

Subsequent broken records all fail similarly, but with distinct indexes:

orientdb {db=index}> truncate record #24:5773253                             


!ERROR:com.orientechnologies.orient.core.exception.OCommandExecutionException: Error on executing command
        DB name="index"
        at com.orientechnologies.orient.core.sql.parser.OTruncateRecordStatement.executeSimple(OTruncateRecordStatement.java:57)
        at com.orientechnologies.orient.core.sql.executor.OSingleOpExecutionPlan.executeInternal(OSingleOpExecutionPlan.java:95)
        at com.orientechnologies.orient.core.sql.parser.OSimpleExecStatement.execute(OSimpleExecStatement.java:45)
        at com.orientechnologies.orient.core.sql.parser.OStatement.execute(OStatement.java:72)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.command(ODatabaseDocumentEmbedded.java:647)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.sqlCommand(OConsoleDatabaseApp.java:3469)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.truncateRecord(OConsoleDatabaseApp.java:642)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.orientechnologies.common.console.OConsoleApplication.execute(OConsoleApplication.java:445)
        at com.orientechnologies.common.console.OConsoleApplication.executeCommands(OConsoleApplication.java:283)
        at com.orientechnologies.common.console.OConsoleApplication.run(OConsoleApplication.java:144)
        at com.orientechnologies.orient.console.OConsoleDatabaseApp.main(OConsoleDatabaseApp.java:159)
Caused by: com.orientechnologies.orient.core.exception.OStorageException: Exception during execution of atomic operation inside of storage index
        DB name="index"
        at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.calculateInsideAtomicOperation(OAtomicOperationsManager.java:144)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.deleteRecord(OAbstractPaginatedStorage.java:2002)
        at com.orientechnologies.orient.core.sql.parser.OTruncateRecordStatement.executeSimple(OTruncateRecordStatement.java:47)
        ... 14 more
Caused by: java.lang.IndexOutOfBoundsException: 1752187594
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:252)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.base.ODurablePage.getByteValue(ODurablePage.java:245)
        at com.orientechnologies.orient.core.storage.cluster.OClusterPage.getRecordByteValue(OClusterPage.java:568)
        at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.getPhysicalPosition(OPaginatedClusterV2.java:1010)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.doDeleteRecord(OAbstractPaginatedStorage.java:5386)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.lambda$deleteRecord$8(OAbstractPaginatedStorage.java:2003)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.calculateInsideAtomicOperation(OAtomicOperationsManager.java:138)
        ... 16 more

Short of getting a fix for this, are there any other steps (short of exporting/importing the data) that can be taken to recover this database?
(The records in question are derived, so simply removing them is acceptable in this situation).
Unfortunately the data in the databases is confidential information, so we can't provide a copy for examination, but the engineering team supporting it is quite experienced with btree structures, so we can probably manage detailed file investigation/patching if we can get guidance on what is going on.

The text was updated successfully, but these errors were encountered:

timw · 2024-08-26T10:16:13Z

On the same database, we're observing a different persistent error (same cluster, different records):

java.lang.NullPointerException
at com.orientechnologies.orient.core.storage.cache.chm.AsyncReadCache.releaseFromRead(AsyncReadCache.java:310)
at com.orientechnologies.orient.core.storage.impl.local.paginated.base.ODurableComponent.releasePageFromRead(ODurableComponent.java:167)
at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readFullEntry(OPaginatedClusterV2.java:1409)
at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.internalReadRecord(OPaginatedClusterV2.java:700)
at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readRecord(OPaginatedClusterV2.java:657)
at com.orientechnologies.orient.core.storage.cluster.v2.OPaginatedClusterV2.readRecord(OPaginatedClusterV2.java:640)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.doReadRecord(OAbstractPaginatedStorage.java:5431)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:5141)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:1892)
at com.orientechnologies.orient.server.distributed.impl.ODistributedStorage.readRecord(ODistributedStorage.java:173)
at com.orientechnologies.orient.core.db.document.SimpleRecordReader.readRecord(SimpleRecordReader.java:24)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.executeReadRecord(ODatabaseDocumentEmbedded.java:1493)
at com.orientechnologies.orient.core.tx.OTransactionOptimistic.loadRecord(OTransactionOptimistic.java:208)
at com.orientechnologies.orient.core.tx.OTransactionOptimistic.loadRecord(OTransactionOptimistic.java:174)
at com.orientechnologies.orient.core.tx.OTransactionOptimistic.loadRecord(OTransactionOptimistic.java:335)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.load(ODatabaseDocumentAbstract.java:1036)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.load(ODatabaseDocumentAbstract.java:136)
at com.orientechnologies.orient.core.id.ORecordId.getRecord(ORecordId.java:328)
at com.orientechnologies.orient.core.sql.executor.OResultInternal.setElement(OResultInternal.java:416)
at com.orientechnologies.orient.core.sql.executor.FetchFromIndexedFunctionStep$1.next(FetchFromIndexedFunctionStep.java:68)
at com.orientechnologies.orient.core.sql.executor.FilterByClustersStep$1.fetchNextItem(FilterByClustersStep.java:72)
at com.orientechnologies.orient.core.sql.executor.FilterByClustersStep$1.hasNext(FilterByClustersStep.java:94)
at com.orientechnologies.orient.core.sql.executor.LetExpressionStep$1.hasNext(LetExpressionStep.java:35)
at com.orientechnologies.orient.core.sql.executor.FilterStep$1.fetchNextItem(FilterStep.java:54)
at com.orientechnologies.orient.core.sql.executor.FilterStep$1.hasNext(FilterStep.java:87)
at com.orientechnologies.orient.core.sql.executor.FilterByClassStep$1.fetchNextItem(FilterByClassStep.java:50)
at com.orientechnologies.orient.core.sql.executor.FilterByClassStep$1.hasNext(FilterByClassStep.java:82)
at com.orientechnologies.orient.core.sql.executor.LetExpressionStep$1.hasNext(LetExpressionStep.java:35)
at com.orientechnologies.orient.core.sql.executor.ProjectionCalculationStep$1.hasNext(ProjectionCalculationStep.java:31)
at com.orientechnologies.orient.core.sql.executor.DistinctExecutionStep.fetchNext(DistinctExecutionStep.java:96)
at com.orientechnologies.orient.core.sql.executor.DistinctExecutionStep.access$100(DistinctExecutionStep.java:15)
at com.orientechnologies.orient.core.sql.executor.DistinctExecutionStep$1.hasNext(DistinctExecutionStep.java:53)
at com.orientechnologies.orient.core.sql.executor.ProjectionCalculationStep$1.hasNext(ProjectionCalculationStep.java:31)
at com.orientechnologies.orient.core.sql.parser.OLocalResultSet.hasNext(OLocalResultSet.java:53)
at com.orientechnologies.orient.core.sql.parser.OLocalResultSetLifecycleDecorator.hasNext(OLocalResultSetLifecycleDecorator.java:41)
at com.orientechnologies.orient.core.sql.executor.OResultSet.tryAdvance(OResultSet.java:38)
at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at com.orientechnologies.orient.server.OConnectionBinaryExecutor.executeQuery(OConnectionBinaryExecutor.java:1513)
at com.orientechnologies.orient.client.remote.message.OQueryRequest.execute(OQueryRequest.java:143)
at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.sessionRequest(ONetworkProtocolBinary.java:355)
at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.execute(ONetworkProtocolBinary.java:239)
at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:67)

timw · 2024-08-29T10:04:26Z

@tglman - any chance of getting some pointers on this one?
Our only path to resolving this at present is migrating some of the data to another database and rebuilding the rest, which is a lot of effort.

tglman · 2024-08-29T11:41:25Z

So, going strait to how to recover:

OrientDB store the cluster data in two files one with is the id file that keep the position where the data is stored and some minimal metadata, and the other where the data is stored, the one that is failing is the one that keep the data. So should be possible to mark the record as delete and create a "leak" in the data file.

If you are happy to hack a bit in the code, this:
https://github.com/orientechnologies/orientdb/blob/3.1.x/core/src/main/java/com/orientechnologies/orient/core/storage/cluster/v2/OPaginatedClusterV2.java#L790-L843

is the part that remove the data from the data file, so is the part that is failing on truncate record. If you comment that out and make sure to keep this call: https://github.com/orientechnologies/orientdb/blob/3.1.x/core/src/main/java/com/orientechnologies/orient/core/storage/cluster/v2/OPaginatedClusterV2.java#L847

you should be able to run a truncate record that logically delete the data without actually remove that from the data file and triggering the issue.

Obviously if you do this hack use that code just to remove the failed records, do not run in production will cause record to leak data, even if you change the code to handle the previous exception without deleting it it may cause some other issues, so better to not deploy hack in production.

Going to the origin of the bug, also in this case it feel something I already saw, checking the history of the file in 3.2.x https://github.com/orientechnologies/orientdb/commits/3.2.x/core/src/main/java/com/orientechnologies/orient/core/storage/cluster/v2/OPaginatedClusterV2.java I do have a commit with a fix that is not present in 3.1.x in the specific: 21c49f9

by the way, is there anything that is holding you in 3.1.x ? it may be a good to try to switch to 3.2.x

Bye

timw · 2024-08-30T00:19:51Z

Thanks @tglman

When you say do not run in production, I'm assuming you mean we could use this offline with a hacked OrientDB to fix these records, but not let that hacked delete/truncate code run in production as it would cause other issues?

Would it make sense to make this a FORCE mode on TRUNCATE RECORD?

Similarly, would this be something the check/repair database should be able to detect and rectify, or is the DB structure too broken at that point? The problem is that you can't do a complete database export with these broken records in place (thankfully the affected records in our case are generated, so we can do a partial export/import + regenerate in this case), so having a repair tool that could patch them up enough to get back to a workable state (e.g. that we could manually fix the graph on or run a REPAIR -fix-graph to completion) and then allow the user to export/import to recover the leaked record space if required would add a lot of reassurance.

We're planning to get to 3.2.x, but we need to find time to stress test the distributed code in 3.2 (which has changed a fair amount since 3.1). Given we're running into a lot of things fixed in 3.2 now, we might look to bring that forward.

tglman · 2024-08-30T13:24:18Z

Hi,

When you say do not run in production, I'm assuming you mean we could use this offline with a hacked OrientDB to fix these records, but not let that hacked delete/truncate code run in production as it would cause other issues?

Yes

Would it make sense to make this a FORCE mode on TRUNCATE RECORD?

Yes, but this require some proper development time, as today the low level code is shared between the normal delete and the truncate, so there is need of some work to spit the logic and make sure to have the option of "leak the data"

And yes as well as soon a truncate is done it could be used in a check/ repair tool.

Anyway as today there is something that can be used already in the database export that is -excludeClass and -excludeCluster that can be used to ignore all the data of a specific class/cluster, which in the case of specific data that can be rebuild will allow you to export all the rest.

Bye

tglman added this to the 3.1.x milestone Aug 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Truncate record failing when attempting to recover broken record #10286

Truncate record failing when attempting to recover broken record #10286

timw commented Aug 26, 2024 •

edited

Loading

timw commented Aug 26, 2024

timw commented Aug 29, 2024

tglman commented Aug 29, 2024

timw commented Aug 30, 2024 •

edited

Loading

tglman commented Aug 30, 2024 •

edited

Loading

Truncate record failing when attempting to recover broken record #10286

Truncate record failing when attempting to recover broken record #10286

Comments

timw commented Aug 26, 2024 • edited Loading

OrientDB Version: 3.1.21-SNAPSHOT (3.1.20 + our HA/replicated stability patches)

Java Version: 1.8

OS: Linux/arm64

timw commented Aug 26, 2024

timw commented Aug 29, 2024

tglman commented Aug 29, 2024

timw commented Aug 30, 2024 • edited Loading

tglman commented Aug 30, 2024 • edited Loading

timw commented Aug 26, 2024 •

edited

Loading

timw commented Aug 30, 2024 •

edited

Loading

tglman commented Aug 30, 2024 •

edited

Loading