Sync fail due to possible DB corruption #4036

serejandmyself · 2019-10-04T09:10:43Z

Current Behavior

I came across an issue while running a validator node on a Tendermint-based chain.
Every so often, the node finds a checksum mismatch in a block and crashes with the error:
"Corruption on data-block checksum mismatch error".
All the obvious things, such as deleting the DB, re-syncing, starting a new validator, creating new accounts, and reinstalling dependencies, have been tried.
The error keeps recurring.
The affected blocks are different each time, and the head block that the chain has synced up to is much higher than the block with the mismatch.
In fact, the validator works perfectly for a while before failing.
NOTE: Often the chain resumes syncing (6 to 12 hours later) if I leave it running, but it then crashes again.

Expected Behavior

The chain should sync stably and continuously.

Reproduction

Not sure if it's possible to reproduce on purpose.

However, similar problems have been mentioned in one form or another for the databases of other chains, e.g. BTC and ETH.

Log

This is what the error itself looks like when the chain crashes, although the block number differs from time to time:

CONSENSUS FAILURE!!!                         module=consensus err="leveldb/table: corruption on data-block (pos=399680): checksum mismatch, want=0xcf6de1ec got=0x99ba8252 [file=97839418.ldb]" stack="goroutine 1022538 [running]:\nruntime/debug.Stack(0xc0f3301870, 0xfd53c0, 0xc0578403c0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9d\ngithub.com/tendermint/tendermint/consensus.
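Since the failure moves around between crashes, it can help to pull the file name and offset out of each crash line and compare them across runs. The following is a minimal sketch in Go using only the standard library; the `CorruptionInfo` type and `ParseCorruption` function are hypothetical helper names, and the regular expression assumes the message layout shown in the log line above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// CorruptionInfo holds the fields found in a checksum-mismatch message.
// The type and its field names are illustrative, not part of any library.
type CorruptionInfo struct {
	Pos  int64  // byte offset of the data block inside the table file
	Want uint32 // checksum recorded when the block was written
	Got  uint32 // checksum computed when the block was read back
	File string // table file that failed verification
}

// corruptionRe assumes the message layout seen in the crash log above.
var corruptionRe = regexp.MustCompile(
	`corruption on data-block \(pos=(\d+)\): checksum mismatch, ` +
		`want=0x([0-9a-f]+) got=0x([0-9a-f]+) \[file=([^\]]+)\]`)

// ParseCorruption extracts the corruption details from a log line.
// It returns false if the line does not contain such a message.
func ParseCorruption(line string) (CorruptionInfo, bool) {
	m := corruptionRe.FindStringSubmatch(line)
	if m == nil {
		return CorruptionInfo{}, false
	}
	pos, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return CorruptionInfo{}, false
	}
	want, err := strconv.ParseUint(m[2], 16, 32)
	if err != nil {
		return CorruptionInfo{}, false
	}
	got, err := strconv.ParseUint(m[3], 16, 32)
	if err != nil {
		return CorruptionInfo{}, false
	}
	return CorruptionInfo{
		Pos:  pos,
		Want: uint32(want),
		Got:  uint32(got),
		File: m[4],
	}, true
}

func main() {
	line := `err="leveldb/table: corruption on data-block (pos=399680): ` +
		`checksum mismatch, want=0xcf6de1ec got=0x99ba8252 [file=97839418.ldb]"`
	if info, ok := ParseCorruption(line); ok {
		fmt.Printf("file=%s pos=%d want=%#x got=%#x\n",
			info.File, info.Pos, info.Want, info.Got)
	}
}
```

If the same file and offset show up in every crash, that points toward one damaged file on disk; different files and offsets each time are more consistent with a memory or disk-subsystem fault.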

This is what the log looks like when the node tries to sync with the mismatch already in place:
(This is from a different crash than the one above, but it looks exactly the same)

E[2019-10-01|07:01:45.455] Connection failed @ sendRoutine              module=p2p peer=561ac562a79db5c7aebc4dbefd2d728836ce412e@93.125.26.210:26656 conn=MConn{93.125.26.210:26656} err="pong timeout"
E[2019-10-01|07:01:45.455] Stopping peer for error                      module=p2p peer="Peer{MConn{93.125.26.210:26656} 561ac562a79db5c7aebc4dbefd2d728836ce412e out}" err="pong timeout"
E[2019-10-01|07:01:45.539] Connection failed @ sendRoutine              module=p2p peer=b34bcaa7536d0f7e09f775d56ceced3c29ba62c0@95.216.244.235:46656 conn=MConn{95.216.244.235:46656} err="pong timeout"
E[2019-10-01|07:01:45.539] Stopping peer for error                      module=p2p peer="Peer{MConn{95.216.244.235:46656} b34bcaa7536d0f7e09f775d56ceced3c29ba62c0 out}" err="pong timeout"
E[2019-10-01|07:02:00.651] Failed Sanity Check! Cant add old address to new bucket module=p2p book=/root/.cyberd/config/addrbook.json ka="&{Addr:[email protected]:46656 Src:[email protected]:26656 Attempts:0 LastAttempt:2019-09-30 16:23:23.102398865 +0000 UTC m=+13558.026448007 LastSuccess:2019-09-30 16:23:23.102398865 +0000 UTC m=+13558.026448007 BucketType:2 Buckets:[50]}" bucket=102
E[2019-10-01|07:02:05.332] Error on broadcastTxCommit                   module=rpc err="Timed out waiting for tx to be included in a block

Additional Information

System (local machine):

Ubuntu 18.04 64bits
X570 aorus elite MB
32gb ram (3200 MHz)
Ryzen 5 3600 6 core

Some suggestions from Tendermint users (no one actually has a solution; I will open a similar issue on the Tendermint git):

Possible issue with nondeterminism in the state machine (i.e. the tendermint app)
Possible issue with tendermint blocks database, not the abci app
Possible LevelDB corruption
Possible faulty memory (hardware) or a problem with the disk subsystem
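To illustrate why the last two causes produce this particular error, here is a minimal sketch in Go of how a per-block checksum exposes a single flipped bit. It assumes LevelDB's scheme of a CRC32C (Castagnoli) value that is masked before being stored; the block contents and the helper names `maskedCRC` and `verifyBlock` are hypothetical, and this is not goleveldb's actual code.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC32C polynomial table (assumed to match LevelDB's choice).
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// maskedCRC computes CRC32C over data and applies the rotate-and-add
// masking that LevelDB uses before storing a checksum.
func maskedCRC(data []byte) uint32 {
	c := crc32.Checksum(data, castagnoli)
	return (c>>15 | c<<17) + 0xa282ead8
}

// verifyBlock recomputes the checksum of block and compares it with the
// value recorded at write time. It returns the recomputed value as well.
func verifyBlock(block []byte, want uint32) (got uint32, ok bool) {
	got = maskedCRC(block)
	return got, got == want
}

func main() {
	// Hypothetical block contents; real blocks hold key/value entries.
	block := []byte("hypothetical data-block contents")
	want := maskedCRC(block) // checksum stored when the block was written

	// Simulate faulty RAM or disk by flipping one bit in a copy.
	corrupted := append([]byte(nil), block...)
	corrupted[5] ^= 0x01

	if got, ok := verifyBlock(corrupted, want); !ok {
		fmt.Printf("checksum mismatch, want=%#x got=%#x\n", want, got)
	}
}
```

The checksum only says that the bytes read back differ from the bytes written. It cannot tell whether the damage happened in memory, on disk, or inside the database code, which is why a hardware memory test and a disk check are worth running alongside any software investigation.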

melekes · 2019-12-11T09:14:47Z

Another corruption bug #4220

melekes · 2020-04-13T08:38:05Z

LevelDB is known for its corruption issues (1, 2).

Have you tried opening an issue in github.com/syndtr/goleveldb? I am afraid there's not much we can do except this and maybe considering switching to another database (1).

serejandmyself · 2020-04-13T09:18:43Z

Hey @melekes, I haven't really tried since then. Currently, I don't have access to a powerful enough server to try again, so I will only be able to try out your suggestions in a few weeks.

From my own investigation, I also suspected that the issue is either in LevelDB or memory-related.

melekes · 2020-04-13T10:14:22Z

#4630 should help too

melekes · 2020-04-17T07:11:23Z

I'm going to close this. Please open an issue in https://github.com/syndtr/goleveldb/issues. Thank you!

serejandmyself mentioned this issue Oct 4, 2019

Sync fail due to possible DB corruption cybercongress/go-cyber#397

Closed

melekes added T:bug Type Bug (Confirmed) user labels Oct 7, 2019

serejandmyself mentioned this issue Feb 4, 2020

Wrong app hash calculated after node restart. cybercongress/go-cyber#453

Closed

ebuchman added the C:sync Component: Fast Sync, State Sync label Feb 9, 2020

tac0turtle added T:ux Type: Issue or Pull Request related to developer experience and removed user labels Feb 25, 2020

melekes closed this as completed Apr 17, 2020
