Invalid merkle root leads to skip duplicated bad block for a block that is actually valid. #23531
Comments
I'm having a similar problem which started after upgrading to v1.10.8-stable. I'm not saying the upgrade is the cause, just a fact. The problem has not recurred since I rebooted this morning. System Info:
This is on the goerli/prater testnet, upgraded to v1.10.8 on Aug 28:
I searched my logs and found nearly 3500 "Error: invalid merkle .." messages:
Here are the first and last entries in eth1.service.invalid-merkle.logs:
And here is the first from the full logs:
And here is the last one just before I rebooted:
Here is a link to the 900M of logs spanning 2021-07-19 to 2021-09-03. If you only want a subset, or you'd like me to upload them elsewhere, I can do that. |
Yes, restarting geth seems to clear the issue and allow it to continue syncing. So it seems to be a caching-related problem. |
In my case it wasn't syncing; it had already synced and had been running "fine" since July. But on reboot it said "Block synchronisation started":
And seemed to get back in sync about 5 minutes after boot:
|
Note: in my case it took a full reboot, just doing a |
In the cases I have seen, it had also not been syncing prior to getting into the stuck state; however, restarting geth did get things going again. |
Looks like this issue was also reported here: #23442 |
In our case, nodes were previously running 1.10.6. We have seen an increase in this issue occurring after upgrading from 1.10.6 to 1.10.8. Looking over the commits between those two releases that could have impacted caching-related state, I see this change, #23299, which may or may not be related. |
Looks like another possible duplicate issue of this was opened: #23546 |
@jsolman have you solved this problem? |
geth 1.10.6 also had a similar issue trying to resync following the Ropsten fork issue. Provided details here: I also notice your txs are reverse-ordered compared to etherscan. Or does etherscan show txs in reverse order? |
@sanjue The problem still persists. One theory is that something is going wrong with the snapshot acceleration support's in-memory data structures, since the problem is resolved by restarting geth. |
Yeah, maybe a cache or some in-memory data structures. Thank you! |
I think this issue is a mix of several things.
This particular output is not important, i.e. it does not signify "I won't process this block because I've already seen it, and it's bad". It only means "when persisting this 'bad block' to the db, I found that it is already stored, so I won't store it again". So it just means that geth processed a block, found it bad, and tried to store it. So the problem is the
These can have different causes. For @winksaville, I think the database is corrupt.
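To make the log semantics above concrete, here is a minimal toy sketch of the "Skip duplicated bad block" path. The badBlockStore type and its write method are invented stand-ins for geth's internal bad-block bookkeeping, not the actual source:

```go
package main

import "fmt"

// badBlockStore is an invented stand-in for geth's bad-block bookkeeping.
// It records blocks that failed validation so they can be inspected later.
type badBlockStore struct {
	seen map[string]bool
}

// write persists a bad block, unless it is already stored. The skip branch
// is what the "Skip duplicated bad block" log line corresponds to: it says
// nothing about refusing to process the block, only about not storing it twice.
func (s *badBlockStore) write(hash string, number uint64) {
	if s.seen[hash] {
		fmt.Printf("Skip duplicated bad block number=%d hash=%s\n", number, hash)
		return
	}
	s.seen[hash] = true
	fmt.Printf("Stored bad block number=%d hash=%s\n", number, hash)
}

func main() {
	s := &badBlockStore{seen: map[string]bool{}}
	s.write("0xabcd", 42)
	s.write("0xabcd", 42) // a second failure of the same block hits the skip path
}
```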
-- Furthermore, for @winksaville, we have this:
It appears that the process is run as a daemon. When geth is shut down, the operator should give it ample time to store the caches to disk -- up to a couple of minutes is recommended. Not doing so may cause issues (but it should not be the cause of the leveldb corruptions). If we instead focus on the original report, the odd thing is this:
And yes, if this were just a "the trie data is inconsistent", then it is very odd that it manages to import a block after a restart. @jsolman Do you have the logs for this case:
I'd be curious to take a look at what those logs say. For example, I'm curious whether the snapshots are in use during the second run. Otherwise, it may be that the snapshot data is corrupt, and the restart then flips it back to running in trie-only mode. If that is the case, then we could run a verification of snap versus trie to check it. |
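As a rough illustration of what a snap-versus-trie verification means, here is a toy sketch (the maps and the verify function are invented for the example; geth ships a real check along these lines as the `geth snapshot verify-state` subcommand): walk every snapshot entry and confirm the trie agrees.

```go
package main

import "fmt"

// verify walks every entry in the (toy) snapshot and confirms the (toy)
// trie holds the same value; any disagreement means one of them is stale
// or corrupt.
func verify(snap, trie map[string]string) error {
	for key, snapVal := range snap {
		trieVal, ok := trie[key]
		if !ok || trieVal != snapVal {
			return fmt.Errorf("mismatch at %s: snap=%q trie=%q", key, snapVal, trieVal)
		}
	}
	return nil
}

func main() {
	trie := map[string]string{"acct1": "v2_value", "acct2": "v2_value"}
	snap := map[string]string{"acct1": "v2_value", "acct2": "v1_value"} // stale entry
	if err := verify(snap, trie); err != nil {
		fmt.Println("verification failed:", err)
	}
}
```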
I have also found the same issue after updating geth from 1.10.6 to 1.10.8. The log looks like what other people mentioned above. After a restart the problem disappeared; however, I cannot reproduce it now.
I think it might be some concurrency problem that leads to bad cached data, because not all the nodes had the issue at the same block, and they became normal after a restart. |
@holiman I'll take an action item to get the logs from after restarting the node. |
When I upgraded from 1.10.6 to 1.10.8, I enabled the snapshot feature for the first time,
Yes, let me describe the conditions a bit more
|
@zzyalbert that's good info, thanks! |
There are some clues about this issue:
But when restarting the node, it becomes normal, so the incorrect old value may just be in the cache. Could the following situation happen?

```go
// Push all updated accounts into the database
for hash, data := range bottom.accountData {
// Skip any account not covered yet by the snapshot
if base.genMarker != nil && bytes.Compare(hash[:], base.genMarker) > 0 {
continue
}
// Push the account to disk
rawdb.WriteAccountSnapshot(batch, hash, data)
base.cache.Set(hash[:], data)
snapshotCleanAccountWriteMeter.Mark(int64(len(data)))
snapshotFlushAccountItemMeter.Mark(1)
snapshotFlushAccountSizeMeter.Mark(int64(len(data)))
// Ensure we don't write too much data blindly. It's ok to flush, the
// root will go missing in case of a crash and we'll detect and regen
// the snapshot.
if batch.ValueSize() > ethdb.IdealBatchSize {
if err := batch.Write(); err != nil {
log.Crit("Failed to write storage deletions", "err", err)
}
batch.Reset()
}
}
// Push all the storage slots into the database
for accountHash, storage := range bottom.storageData {
// Skip any account not covered yet by the snapshot
if base.genMarker != nil && bytes.Compare(accountHash[:], base.genMarker) > 0 {
continue
}
// Generation might be mid-account, track that case too
midAccount := base.genMarker != nil && bytes.Equal(accountHash[:], base.genMarker[:common.HashLength])
for storageHash, data := range storage {
// Skip any slot not covered yet by the snapshot
if midAccount && bytes.Compare(storageHash[:], base.genMarker[common.HashLength:]) > 0 {
continue
}
if len(data) > 0 {
rawdb.WriteStorageSnapshot(batch, accountHash, storageHash, data)
base.cache.Set(append(accountHash[:], storageHash[:]...), data)
snapshotCleanStorageWriteMeter.Mark(int64(len(data)))
} else {
rawdb.DeleteStorageSnapshot(batch, accountHash, storageHash)
base.cache.Set(append(accountHash[:], storageHash[:]...), nil)
}
snapshotFlushStorageItemMeter.Mark(1)
snapshotFlushStorageSizeMeter.Mark(int64(len(data)))
}
}
```

Maybe at the time the value should be updated from v1_value to v2_value, it ends up as v1_value instead. Could this happen? Hope this helps. |
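A minimal toy illustration of the race being hypothesized here (this is not geth code; cleanCache and the v1_value/v2_value payloads are invented to mirror the comment): if a flush of an older layer and an update from a newer layer race on a shared cache, the stale value can land last.

```go
package main

import (
	"fmt"
	"sync"
)

// cleanCache is an invented stand-in for a shared clean cache protected by
// a lock; each individual set is safe, but the ordering of sets is not.
type cleanCache struct {
	mu sync.Mutex
	m  map[string][]byte
}

func (c *cleanCache) set(k string, v []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[k] = v
}

func (c *cleanCache) get(k string) []byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.m[k]
}

func main() {
	cache := &cleanCache{m: map[string][]byte{}}
	var wg sync.WaitGroup
	wg.Add(2)
	// A flush of an older layer and an update from a newer layer race on the
	// same key; whichever set lands last wins.
	go func() { defer wg.Done(); cache.set("acct", []byte("v1_value")) }()
	go func() { defer wg.Done(); cache.set("acct", []byte("v2_value")) }()
	wg.Wait()
	// Depending on scheduling, the cache may now hold v1_value even though
	// v2_value was the intended final state; a restart rebuilds the cache
	// from disk, which matches the "fixed after restart" observation.
	fmt.Printf("cache now holds %s\n", cache.get("acct"))
}
```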
Here are the last logs before shutting down, and the first logs after restarting on an instance that had the problem. Last logs before shutting down:
First logs after restarting, including getting past the block where we were stuck previously.
|
So there appears to be data missing in the trie:
It's odd that it manages to import the block (and a few more) after a restart. |
After verifying that v1.10.11 fixes this (it includes #23635), I will report back here and we can close this. |
For what it's worth, I ran into this on 1.10.11-stable. I haven't tried restarting it yet. Here's roughly where I noticed it began:
Here's the version of geth:
I run the client as a service; here's the info from ps:
|
Same problem. I had a random reboot and it appears to have corrupted the database, or whatever this error means. Restarting didn't seem to help, so it looks like I'm back to resyncing. |
I have had this problem too. Originally, my node was running on an Intel NUC (i5 or i7, I think) when I first experienced this, and then I switched hardware completely, step by step: first the disk, then the RAM, and then the entire NUC to an 11th-gen i3. The same error still happened. Then I moved to yet another machine with an AMD Ryzen CPU. This was running with the same disk but other RAM that was already installed. This error never happened on the AMD machine. Recently I decided to try it on the previous NUC once again, and today it failed after a couple of weeks of working fine. I'm not saying it has to do with the CPU, but it really looks like a hardware-related issue to me. Let me know if I should provide more details. These are the logs from the first occurrence after going back to the NUC:
|
Yeah, an i3 seems too weak to run a node. I have an i5 running a test node and that's maxing out the CPU more than I would like, and it crashes/overheats if it doesn't get enough airflow. In my case I had a number of compounding problems throwing arbitrary data-corruption errors. First, I had an Asus AIProtection firewall on my router that was blocking anything related to crypto. There was no way to disable the rule in the firewall, so I ended up disabling it entirely and swapping it for a pfSense. I spent weeks syncing/resyncing, working through the state corrupting every few hours. Nothing in the logs/metrics indicated that anything was wrong. CPU, RAM, and temperature were all reasonable. I scanned my drives and memory; they were perfect. Just by chance, I noticed a log showing that ClamAV was kicking in right before the node corrupted itself, so I disabled it, and I haven't had a corrupted database since. I dug a little deeper and it sounds like someone uploaded something to the database that looks like a virus, which some AVs immediately quarantine during the sync. I also ran into this issue once because my swap space was misconfigured, and again because I shut down geth without giving the script the obligatory 5 minutes to exit. Pretty much any gust of wind will result in a corrupt database. It seems like it would be irresponsible to use this with real money without a backup system to fall back on. |
Right, so if you run an AV which places database files into quarantine, then the database is more or less by definition corrupt. And if the swap space is unreliable, then I can definitely see how data corruption can result. A misbehaving router should not lead to database corruption (but it can definitely lead to failure to establish peer connections, and failure to sync). A too-quick shutdown should only result in "reprocessing" thousands of blocks, and should not corrupt the database, but unfortunately we've seen lots of reports where db corruption does occur when this happens.
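A toy model of why shutdown grace time matters (not geth code; the sleep stands in for the real cache flush): the node traps the termination signal and flushes in-memory state to disk before exiting, so a hard kill during that window leaves work unpersisted.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Trap SIGINT/SIGTERM instead of dying immediately.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	fmt.Println("node running; send SIGINT/SIGTERM to stop")
	<-sigs

	// On a real node this flush can take minutes; a SIGKILL here would skip
	// it and leave the on-disk state behind the in-memory state.
	fmt.Println("flushing caches to disk...")
	time.Sleep(2 * time.Second) // stands in for the real flush work
	fmt.Println("clean shutdown complete")
}
```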
Geth 1.10.11 (https://github.com/ethereum/go-ethereum/releases/tag/v1.10.11) is over a year old. This ticket is becoming unwieldy, since the original report was for 1.10.8. I'm going to close this ticket, not because "everything is fixed", but because we need to distinguish between individual faults and not let tickets grow into "catch-all" tickets. |
I wonder if the takeaway from this is that there could be some simple work done on the logs to make them a little more helpful. It's not obvious how to interpret this and other logs, which is probably why we pile into tickets like this. It would be nice if there was another line like "the database might be corrupt", perhaps pointing to a corrupt-database troubleshooting list. Otherwise, any time there's an issue that corrupts the database, we're going to get 10 different unhelpful solutions. |
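A minimal sketch of that suggestion, purely illustrative (reportBadBlock and the messages are hypothetical, not geth's logging API): pair the low-level error with an actionable hint.

```go
package main

import (
	"errors"
	"log"
)

// reportBadBlock is a hypothetical helper: it emits the low-level error and
// a follow-up hint pointing the operator at likely causes and next steps.
func reportBadBlock(err error) {
	log.Printf("ERROR Invalid merkle root: %v", err)
	log.Printf("HINT  The database might be corrupt; see the corrupt-database troubleshooting guide")
}

func main() {
	reportBadBlock(errors.New("invalid merkle root (remote: aaaa... local: bbbb...)"))
}
```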
System information
Geth version: 1.10.8 (potentially a new issue at this release)
OS & Version: Linux (Recent release of AL2 x64)
Expected behaviour
Actual behaviour
The block number mentioned here with the given hash is actually a valid block.
Here we see that the node is actually 1 block before the height of the duplicated bad block mentioned above.
For this instance, we also see these logs a bit before the others:
Steps to reproduce the behaviour
This problem seems to have begun occurring, or increased its rate of occurrence very significantly, since the hotfix release 1.10.8. The problem is very hard to reproduce; if you leave a number of nodes you are managing running for a while, it may happen seemingly randomly. The rate of occurrence is too high to be due to cosmic radiation. Based on empirical data, a node seems to have somewhere on the order of a 1/100 chance per day of stalling due to this issue (so a fleet of 100 nodes would see roughly one stall per day).
Here is an example of the full logs of another instance of this same problem:
Once the problem occurs, we see the BAD BLOCK message repeated and peers being dropped repeatedly:
If necessary to help debug, complete logs for the node can be provided.
Suspicions
It appears that the block with the bad merkle root is getting cached even though it is invalid, and that subsequent good blocks are being rejected as bad.
Note that the node will continue syncing and stop having the issue after just restarting geth, so it seems there is some sort of cache issue.
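A toy model of this suspicion (not geth code; stateRoot and the v1_value/v2_value strings are invented for illustration): once a stale entry sits in the state cache, every block built on it produces a local state root that disagrees with the network, so valid blocks look bad until a restart drops the cache.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// stateRoot is an invented stand-in for state-root computation: hash the
// parent state together with the block's transactions.
func stateRoot(parentState, txs string) string {
	sum := sha256.Sum256([]byte(parentState + txs))
	return fmt.Sprintf("%x", sum[:4])
}

func main() {
	goodParent := "v2_value"  // what the rest of the network has
	staleParent := "v1_value" // a poisoned/stale cache entry on this node

	remote := stateRoot(goodParent, "block N txs")
	local := stateRoot(staleParent, "block N txs")

	// The node's local root disagrees with the remote root, so a perfectly
	// valid block is reported as a BAD BLOCK. Restarting drops the poisoned
	// cache and rebuilds from disk, matching the observed behaviour.
	fmt.Printf("remote=%s local=%s mismatch=%v\n", remote, local, remote != local)
}
```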