Sync fail due to possible DB corruption #4036

Closed
serejandmyself opened this issue Oct 4, 2019 · 5 comments
Labels
C:sync Component: Fast Sync, State Sync T:bug Type Bug (Confirmed) T:ux Type: Issue or Pull Request related to developer experience

Comments


serejandmyself commented Oct 4, 2019

Reference to a Tendermint related issue

Current Behavior

I came across an issue while running a validator node on a Tendermint-based chain.
Every so often, the node finds a checksum mismatch in a data block and crashes with the error
"corruption on data-block: checksum mismatch".
All the obvious things have been tried: deleting the DB, re-syncing, starting a new validator, creating new accounts, reinstalling dependencies, etc.
The error keeps recurring.
The affected block is different each time, and the head block the chain has synced to is much higher than the block with the mismatch.
In fact, the validator works perfectly for a while before failing.
NOTE: If I leave it, the chain OFTEN resumes syncing (6 to 12 hours later), but of course it crashes again after that.
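
For context on what a "checksum mismatch" means here, the following is a minimal sketch, not the goleveldb code itself. It assumes, as I understand the LevelDB table format, that each block is stored with a masked CRC-32C of its contents, and that on read the stored value ("want") is compared with a freshly computed one ("got"). The helpers `sealBlock` and `verifyBlock` are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC-32C table; LevelDB-style tables are assumed to use it.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// maskedCRC computes CRC-32C over data and applies a LevelDB-style mask
// (rotate right by 15 bits, then add a constant).
func maskedCRC(data []byte) uint32 {
	c := crc32.Checksum(data, castagnoli)
	return (c>>15 | c<<17) + 0xa282ead8
}

// sealBlock (hypothetical helper) appends a 4-byte little-endian checksum.
func sealBlock(payload []byte) []byte {
	out := make([]byte, len(payload)+4)
	copy(out, payload)
	binary.LittleEndian.PutUint32(out[len(payload):], maskedCRC(payload))
	return out
}

// verifyBlock (hypothetical helper) returns the stored checksum (want),
// the recomputed checksum (got), and whether they match.
func verifyBlock(block []byte) (want, got uint32, ok bool) {
	if len(block) < 4 {
		return 0, 0, false
	}
	n := len(block) - 4
	want = binary.LittleEndian.Uint32(block[n:])
	got = maskedCRC(block[:n])
	return want, got, want == got
}

func main() {
	block := sealBlock([]byte("example block contents"))
	_, _, ok := verifyBlock(block)
	fmt.Println("intact block ok:", ok)

	block[3] ^= 0x01 // simulate a single flipped bit on disk or in memory
	want, got, ok := verifyBlock(block)
	fmt.Printf("after bit flip ok: %v want=%#x got=%#x\n", ok, want, got)
}
```

The point of the sketch is that a single flipped bit anywhere in the block is enough to trigger the error, which is why both a software bug and faulty hardware remain plausible causes.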

Expected Behavior

The chain should sync stably and continuously.

Reproduction

Not sure if it's possible to reproduce on purpose.

But similar errors have been reported, in one form or another, for the databases of other chains, e.g. BTC and ETH.

Log

This is what the error looks like when the chain crashes, although the block number differs from time to time:

CONSENSUS FAILURE!!!                         module=consensus err="leveldb/table: corruption on data-block (pos=399680): checksum mismatch, want=0xcf6de1ec got=0x99ba8252 [file=97839418.ldb]" stack="goroutine 1022538 [running]:\nruntime/debug.Stack(0xc0f3301870, 0xfd53c0, 0xc0578403c0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9d\ngithub.com/tendermint/tendermint/consensus.
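
Since the affected block differs from crash to crash, it may help to collect the file name and position from each failure and compare them. Below is a rough sketch that extracts those fields from a line like the one above using only the Go standard library. The regular expression is written against this single sample, so the exact message format in other versions is an assumption, and `parseCorruption` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// corruption holds the fields of one checksum-mismatch message.
type corruption struct {
	Pos  int64  // byte offset of the data block inside the table file
	Want uint32 // checksum reported as "want"
	Got  uint32 // checksum reported as "got"
	File string // table file name, e.g. 97839418.ldb
}

// corruptionRE matches the message format seen in the sample log line.
var corruptionRE = regexp.MustCompile(
	`corruption on data-block \(pos=(\d+)\): checksum mismatch, ` +
		`want=0x([0-9a-fA-F]+) got=0x([0-9a-fA-F]+) \[file=([^\]]+)\]`)

// parseCorruption returns the parsed fields, or false if the line does not match.
func parseCorruption(line string) (corruption, bool) {
	m := corruptionRE.FindStringSubmatch(line)
	if m == nil {
		return corruption{}, false
	}
	pos, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return corruption{}, false
	}
	want, err := strconv.ParseUint(m[2], 16, 32)
	if err != nil {
		return corruption{}, false
	}
	got, err := strconv.ParseUint(m[3], 16, 32)
	if err != nil {
		return corruption{}, false
	}
	return corruption{Pos: pos, Want: uint32(want), Got: uint32(got), File: m[4]}, true
}

func main() {
	line := `err="leveldb/table: corruption on data-block (pos=399680): ` +
		`checksum mismatch, want=0xcf6de1ec got=0x99ba8252 [file=97839418.ldb]"`
	if c, ok := parseCorruption(line); ok {
		fmt.Printf("file=%s pos=%d want=%#x got=%#x\n", c.File, c.Pos, c.Want, c.Got)
	}
}
```

If the same file and position show up repeatedly, that would point towards a damaged file on disk; if they are different every time, an intermittent fault looks more likely.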

This is what the log looks like after the node tries to sync with the mismatch already in place:
(This is from a different crash than the one above, but it looks exactly the same)

E[2019-10-01|07:01:45.455] Connection failed @ sendRoutine              module=p2p peer=561ac562a79db5c7aebc4dbefd2d728836ce412e@93.125.26.210:26656 conn=MConn{93.125.26.210:26656} err="pong timeout"
E[2019-10-01|07:01:45.455] Stopping peer for error                      module=p2p peer="Peer{MConn{93.125.26.210:26656} 561ac562a79db5c7aebc4dbefd2d728836ce412e out}" err="pong timeout"
E[2019-10-01|07:01:45.539] Connection failed @ sendRoutine              module=p2p peer=b34bcaa7536d0f7e09f775d56ceced3c29ba62c0@95.216.244.235:46656 conn=MConn{95.216.244.235:46656} err="pong timeout"
E[2019-10-01|07:01:45.539] Stopping peer for error                      module=p2p peer="Peer{MConn{95.216.244.235:46656} b34bcaa7536d0f7e09f775d56ceced3c29ba62c0 out}" err="pong timeout"
E[2019-10-01|07:02:00.651] Failed Sanity Check! Cant add old address to new bucket module=p2p book=/root/.cyberd/config/addrbook.json ka="&{Addr:[email protected]:46656 Src:[email protected]:26656 Attempts:0 LastAttempt:2019-09-30 16:23:23.102398865 +0000 UTC m=+13558.026448007 LastSuccess:2019-09-30 16:23:23.102398865 +0000 UTC m=+13558.026448007 BucketType:2 Buckets:[50]}" bucket=102
E[2019-10-01|07:02:05.332] Error on broadcastTxCommit                   module=rpc err="Timed out waiting for tx to be included in a block

Additional Information

System (local machine):

  • Ubuntu 18.04 64bits
  • X570 aorus elite MB
  • 32gb ram (3200 MHz)
  • Ryzen 5 3600 6 core

Some suggestions from Tendermint users (no one actually has a solution; I will open a similar issue in the Tendermint repository):

  • Possible issue with nondeterminism in the state machine (i.e. the tendermint app)
  • Possible issue with tendermint blocks database, not the abci app
  • Possible LevelDB corruption
  • Possible faulty memory (hardware) or with disk subsystem

melekes commented Dec 11, 2019

Another corruption bug #4220

@ebuchman ebuchman added the C:sync Component: Fast Sync, State Sync label Feb 9, 2020
@tac0turtle tac0turtle added T:ux Type: Issue or Pull Request related to developer experience and removed user labels Feb 25, 2020

melekes commented Apr 13, 2020

LevelDB is known for its corruption issues (1, 2).

Have you tried opening an issue in github.com/syndtr/goleveldb? I am afraid there's not much we can do except this and maybe considering switching to another database (1).

serejandmyself (Author) commented

Hey @melekes, I haven't really tried since then. Currently, I don't have access to a powerful enough server to try again, so I will only be able to try out your suggestions in a few weeks.

From my own investigation, I also suspected that the issue is either in LevelDB or memory-related.


melekes commented Apr 13, 2020

#4630 should help too


melekes commented Apr 17, 2020

I'm going to close this. Please open an issue in https://github.com/syndtr/goleveldb/issues. Thank you!

@melekes melekes closed this as completed Apr 17, 2020