Chain crashed with consensus failure, about 8 hrs after a hard fork. Delegator staking error. #7506

Closed
njmurarka opened this issue Oct 10, 2020 · 24 comments

@njmurarka
Contributor

njmurarka commented Oct 10, 2020

Summary of Bug

I recently exported the genesis from my old network, as per ticket #7505. I was able to "successfully" launch a new network of two nodes, using this exported genesis.

I ran the new network for a while, and then, after about 6,000 blocks, I got the following error:

CONSENSUS FAILURE: Calculated final stake for delegator... greater than current stake final stake.

[Screenshot: cosmos crash console output]

The code has not really changed all that much, and I am in fact still using the same version of the Cosmos SDK, so this is quite puzzling. Furthermore, it is worrisome that the old network ran for months without an issue, yet after forking it I got a crash within about 8 hours.

I can provide trace and logs if needed. I kept a copy of everything.

Some other similar mentions:

#4088
https://www.gitmemory.com/issue/cosmos/cosmos-sdk/4012/480477596

Version

cosmos-sdk v0.39.1

Steps to Reproduce

Above.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@alessio
Contributor

alessio commented Oct 10, 2020

This seems to affect 0.39.1 @clevinson @ethanfrey @alexanderbez

@tac0turtle
Member

tac0turtle commented Oct 10, 2020

I can provide trace and logs if needed. I kept a copy of everything.

Can you please! https://pastebin.com/ is a good place to paste large chunks. Your genesis file would be helpful as well.

@alexanderbez
Contributor

This is virtually impossible to debug unless we can detect this in simulations with a specific seed (which we haven't). My suspicion is that something in the export/upgrade process may have corrupted state somehow.

@ethanfrey
Contributor

ethanfrey commented Oct 10, 2020

Does it happen every time you restart with the same db? Does it happen to a new node that syncs from genesis?

Please save a complete backup of the data and config dirs and see how well you can reproduce it.

@njmurarka
Contributor Author

I have a total backup of my .blzd folders on the ONLY two validators for this network. If I start either validator up with this .blzd state, I get this error, very deterministically. I reset the network, but I DID back up .blzd first, to ensure we could investigate.

.blzd is not large. Please let me know what to do. I could upload both copies of .blzd (for the two validators) someplace for you, if that helps to resolve this.

@alexanderbez I do not know how I could have done something wrong in the export or upgrade. I am no expert on this particular process, but I followed these instructions (which are very brief):

https://github.com/cosmos/cosmos-sdk/blob/d7df1ef868e480dc2bcdf852c11a7121c6afab67/NEWS.md#chain-hard-fork-also-know-as-the-tested-upgrade-path

Note that I had to start two validators even to get the new network running, which in itself was counter to my understanding, given that one of the validators alone has enough power to create blocks. But that might be irrelevant.

@njmurarka
Contributor Author

@ethanfrey It happens when I start either of the two validators I have. They literally crash on start.

As mentioned in the previous comment, please let me know if I should just give you guys a tarball of my two .blzd folders. You can then duplicate the behaviour (well, at least, I can).

@njmurarka
Contributor Author

Also, this CANNOT be a coincidence. The crash happened at block 6031, both times. By both times, I mean that it happened when I initially launched the new upgraded chain. Then, that crashed and burned. I reset both validators, started again, and it died again at the same block!

Here is another dump from the first crash:

I[2020-10-09|20:22:21.511] Executed block                               module=state height=6029 validTxs=0 invalidTxs=0
I[2020-10-09|20:22:21.514] Committed state                              module=state height=6029 txs=0 appHash=D72DB9B94B6DBFD651DA194B8FD81DE182F7E398A27818D7AD449F315343A96B
I[2020-10-09|20:22:26.531] Executed block                               module=state height=6030 validTxs=0 invalidTxs=0
I[2020-10-09|20:22:26.543] Committed state                              module=state height=6030 txs=0 appHash=B4872F90111553A5AD08805D23F21A6073A959803B821EFDF2361E8D0EB49460
E[2020-10-09|20:22:35.564] CONSENSUS FAILURE!!!                         module=consensus err="calculated final stake for delegator bluzelle1ak32uvt5ta38kcz97m6g9ydvhesvnysk228hgw greater than current stake\n\tfinal stake:\t10000000.000000000000000000\n\tcurrent stake:\t9999000.000000000000000000" stack="goroutine 145 [running]:\nruntime/debug.Stack(0xc000e8aea0, 0x1199040, 0xc00276cdb0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9d\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine.func2(0xc00005f880, 0x152ba10)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:617 +0x57\npanic(0x1199040, 0xc00276cdb0)\n\t/usr/local/go/src/runtime/panic.go:967 +0x15d\ngithub.com/cosmos/cosmos-sdk/x/distribution/keeper.Keeper.calculateDelegationRewards(0x173cc20, 0xc000231f80, 0xc00017a1c0, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, 0x173cc60, 0xc0000b0020, 0xc000211f40, 0xc, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/distribution/keeper/delegation.go:127 +0x873\ngithub.com/cosmos/cosmos-sdk/x/distribution/keeper.Keeper.withdrawDelegationRewards(0x173cc20, 0xc000231f80, 0xc00017a1c0, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, 0x173cc60, 0xc0000b0020, 0xc000211f40, 0xc, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/distribution/keeper/delegation.go:147 +0x29f\ngithub.com/cosmos/cosmos-sdk/x/distribution/keeper.Hooks.BeforeDelegationSharesModified(0x173cc20, 0xc000231f80, 0xc00017a1c0, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, 0x173cc60, 0xc0000b0020, 0xc000211f40, 0xc, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/distribution/keeper/hooks.go:88 +0x1ea\ngithub.com/cosmos/cosmos-sdk/x/staking/types.MultiStakingHooks.BeforeDelegationSharesModified(0xc00000eae0, 0x2, 0x2, 0x174eb60, 0xc00012a010, 0x1763da0, 0xc00389dd00, 0xa, 0x0, 0x0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/staking/types/hooks.go:47 +0xe0\ngithub.com/cosmos/cosmos-sdk/x/staking/keeper.Keeper.BeforeDelegationSharesModified(...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/staking/keeper/hooks.go:56\ngithub.com/cosmos/cosmos-sdk/x/staking/keeper.Keeper.unbond(0x173cc20, 0xc000231f60, 0xc00017a1c0, 0x1760a20, 0xc0002ad000, 0x1763f20, 0xc00000eb20, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/staking/keeper/delegation.go:555 +0xa31\ngithub.com/cosmos/cosmos-sdk/x/staking/keeper.Keeper.slashRedelegation(0x173cc20, 0xc000231f60, 0xc00017a1c0, 0x1760a20, 0xc0002ad000, 0x1763f20, 0xc00000eb20, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/staking/keeper/slash.go:257 +0x744\ngithub.com/cosmos/cosmos-sdk/x/staking/keeper.Keeper.Slash(0x173cc20, 0xc000231f60, 0xc00017a1c0, 0x1760a20, 0xc0002ad000, 0x1763f20, 0xc00000eb20, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/staking/keeper/slash.go:95 +0x795\ngithub.com/cosmos/cosmos-sdk/x/slashing/internal/keeper.Keeper.HandleValidatorSignature(0x173cc20, 0xc000231f90, 0xc00017a1c0, 0x17602a0, 0xc000200c60, 0x174fe60, 0xc0002e06e0, 0x174eb60, 0xc00012a010, 0x1763da0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/slashing/internal/keeper/infractions.go:94 +0xf40\ngithub.com/cosmos/cosmos-sdk/x/slashing.BeginBlocker(0x174eb60, 0xc00012a010, 0x1763da0, 0xc00389dd00, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 
...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/slashing/abci.go:16 +0x17d\ngithub.com/cosmos/cosmos-sdk/x/slashing.AppModule.BeginBlock(...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/slashing/module.go:144\ngithub.com/cosmos/cosmos-sdk/types/module.(*Manager).BeginBlock(0xc00017b260, 0x174eb60, 0xc00012a010, 0x1763da0, 0xc00389dd00, 0xa, 0x0, 0x0, 0x0, 0x0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/types/module/module.go:297 +0x1ca\ngithub.com/bluzelle/curium.(*CRUDApp).BeginBlocker(...)\n\t/home/ubuntu/go/src/github.com/bluzelle/curium/app.go:345\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).BeginBlock(0xc000e20000, 0xc00376b160, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/baseapp/abci.go:136 +0x469\ngithub.com/tendermint/tendermint/abci/client.(*localClient).BeginBlockSync(0xc000131680, 0xc00376b160, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/abci/client/local_client.go:231 +0xf8\ngithub.com/tendermint/tendermint/proxy.(*appConnConsensus).BeginBlockSync(0xc000231c40, 0xc00376b160, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/proxy/app_conn.go:69 +0x6b\ngithub.com/tendermint/tendermint/state.execBlockOnProxyApp(0x174f5a0, 0xc000de7180, 0x175df00, 0xc000231c40, 0xc003d9b880, 0x1766720, 0xc000010978, 0x6, 0xc000e6dcc0, 0x1e)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/state/execution.go:284 +0x3e1\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc0002fe310, 0xa, 0x0, 0xc0001879a0, 0x6, 0xc000e6dcc0, 0x1e, 0x178e, 0xc0037da660, 0x20, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/state/execution.go:135 +0x17a\ngithub.com/tendermint/tendermint/consensus.(*State).finalizeCommit(0xc00005f880, 0x178f)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1459 +0x8f5\ngithub.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit(0xc00005f880, 0x178f)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1377 +0x383\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit.func1(0xc00005f880, 0x1, 0x178f)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1312 +0x90\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit(0xc00005f880, 0x178f, 0x1)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1349 +0x61e\ngithub.com/tendermint/tendermint/consensus.(*State).addVote(0xc00005f880, 0xc003981860, 0x0, 0x0, 0x4528d6e1, 0x84ed06f31edcdb80, 0xc0033abaa0)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1891 +0xa39\ngithub.com/tendermint/tendermint/consensus.(*State).tryAddVote(0xc00005f880, 0xc003981860, 0x0, 0x0, 0xc000e97b90, 0x212a6d0, 0x152dab8)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1707 +0x67\ngithub.com/tendermint/tendermint/consensus.(*State).handleMsg(0xc00005f880, 0x172ba40, 0xc003f0a268, 0x0, 0x0)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:713 +0x525\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc00005f880, 0x0)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:664 +0x597\ncreated by 
github.com/tendermint/tendermint/consensus.(*State).OnStart\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:335 +0x13a\n"
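
(Editor's note, for readers following the trace: the panic is raised at the end of the distribution keeper's calculateDelegationRewards, reached from the slashing BeginBlocker via HandleValidatorSignature → staking Slash → slashRedelegation → unbond → the BeforeDelegationSharesModified hook → withdrawDelegationRewards. Below is a rough, self-contained paraphrase of that final sanity check. It is not the v0.39.1 source, just an illustration of why the message fires when the recorded starting stake ends up larger than what the staking module currently accounts for.)

```go
package main

import (
	"fmt"

	sdk "github.com/cosmos/cosmos-sdk/types"
)

// finalStakeSanityCheck paraphrases the check at the end of the distribution
// keeper's calculateDelegationRewards (delegation.go:127 in the trace above).
// It is an illustration, not the SDK source.
//
// startingStake: stake recorded in the delegator's starting info.
// slashFractions: slash events applied to the validator since that record.
// currentStake: the delegation's shares converted to tokens now.
func finalStakeSanityCheck(delegator string, startingStake, currentStake sdk.Dec, slashFractions []sdk.Dec) sdk.Dec {
	finalStake := startingStake
	for _, f := range slashFractions {
		// Stake is truncated at each slash event, so the recomputed final
		// stake should come out less than or equal to the current stake.
		finalStake = finalStake.MulTruncate(sdk.OneDec().Sub(f))
	}
	if finalStake.GT(currentStake) {
		// This is the consensus-failure message from the logs: the rewards
		// calculation believes the delegator holds more stake than the staking
		// module currently records, i.e. distribution and staking state disagree.
		panic(fmt.Sprintf(
			"calculated final stake for delegator %s greater than current stake"+
				"\n\tfinal stake:\t%s\n\tcurrent stake:\t%s",
			delegator, finalStake, currentStake))
	}
	return finalStake
}

func main() {
	// Numbers from the crash above: final stake 10,000,000 vs current stake
	// 9,999,000 -- running this reproduces the same panic text.
	finalStakeSanityCheck(
		"bluzelle1ak32uvt5ta38kcz97m6g9ydvhesvnysk228hgw",
		sdk.NewDec(10000000),
		sdk.NewDec(9999000),
		nil,
	)
}
```

Because the slash is applied in BeginBlock at the same height on every node, every node hits the same inconsistent record, which is consistent with the crash occurring deterministically at the same block.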

@njmurarka
Contributor Author

Here you can see BigDipper running against the SECOND instance of the network, which I restarted after the first crash; it crashed at the same block.

[Screenshot: BigDipper, 2020-10-10 13:30]

@ethanfrey
Contributor

I am not an expert on the staking module at all... I have had nothing to do with it. I am just trying to figure out a reproducible test case for someone to debug. I think those .blzd dirs as tarballs would be good to share (link from Dropbox or other such?)

Also, can you take a copy of one of those dirs, delete data/application.db (which is the application state), and then start a node on that dir? This should try to resync all issued blocks from genesis and recreate the app state. It would be good to see if this hits the same bug before halting on "waiting for next block".

@ethanfrey
Contributor

Also, pointing to the exact code commit you are running is very helpful. I wonder if you have any code that interacts with the staking or distribution system somehow? Moving coins in the fee collector or community pool, maybe?
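
(Editor's note: a hypothetical sketch of the kind of custom code being asked about here; it is not from the curium repo. In SDK v0.39 such a transfer would typically go through the supply keeper's module-account methods, and the module names below are the v0.39 defaults.)

```go
// Hypothetical example only -- custom code that "moves coins in the fee
// collector or community pool", which could leave distribution accounting
// inconsistent with actual balances.
package custom

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
	"github.com/cosmos/cosmos-sdk/x/supply"
)

func moveFees(ctx sdk.Context, sk supply.Keeper, amt sdk.Coins) error {
	// Draining "fee_collector" behind the distribution module's back is the
	// sort of interaction being asked about.
	return sk.SendCoinsFromModuleToModule(ctx, "fee_collector", "distribution", amt)
}
```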

@njmurarka
Contributor Author

njmurarka commented Oct 11, 2020

@ethanfrey Here are links to the .blzd folder (click the download icon on the top right... not always obvious):

https://drive.google.com/file/d/1xYJPJCjUG0lGNwozsoIx5mZCoz1KeBy-/view?usp=sharing
https://drive.google.com/file/d/1nOYBRaHH7lNdQIH0sK6A1p6kQ4PQh5Lj/view?usp=sharing

Please untar and try two different daemons. You will obviously need to build the daemon with our code and point the peers at each other (config.toml)... obvious stuff, I am sure.

Link to the code (please use the test-cm-409 branch):

https://github.com/bluzelle/curium

We were using version 0.39.1 of the Cosmos SDK.

No code that directly talks to the staking or distribution system. Nothing quite that sophisticated, yet.

@njmurarka
Contributor Author

Anything else I can provide to help resolve this?

@alexanderbez
Contributor

@njmurarka this is virtually impossible to debug :-/ but the best suggestion I can give is that it's most likely operator error, as we've executed many upgrades (both halted and live) w/o any issues.

@njmurarka
Contributor Author

njmurarka commented Oct 14, 2020

@alexanderbez I am not too sure how I could have made an error. Of course, I am not denying it is possible.

But what are the instructions then to do an upgrade? Right now, I have only got the following link (that I posted above):

https://github.com/cosmos/cosmos-sdk/blob/d7df1ef868e480dc2bcdf852c11a7121c6afab67/NEWS.md#chain-hard-fork-also-know-as-the-tested-upgrade-path

For convenience, I have posted the instructions I followed below:

Chain Hard Fork (also know as The Tested Upgrade Path)

This strategy entails performing a hard fork of your chain. It takes time, coordination and a few technical steps that the validators of the network must follow.

In preparation of the upgrade, you need to export the current state. This operation should be performed on one node:

Stop the node and export the current state, e.g.: appd export --for-zero-height > export_genesis.json.
Manually replace the chain id and genesis time fields in export_genesis.json with the values that the network had agreed upon.

Follow these steps to perform the upgrade:

Make a backup copy of the old genesis.json file in your server application's config directory (e.g. $HOME/.appd/config/genesis.json) and replace it with export_genesis.json. Note: do rename export_genesis.json to genesis.json.

Replace the old binary with the new one and restart the service using the new binary.

I did not deviate from this (I did not even whitelist jailed validators, partly because I could not tell what the syntax was for the whitelist command), but do note that I had to start two validators just to get the new network to start. I don't know if and how this fact is possibly related to the crash, but I opened a separate bug for this "need to start two validators" requirement.

Would really appreciate if you could provide me with the instructions you follow to do an upgrade.

On the matter of this bug I filed, shall I assume then that the .blzd folder tarballs I uploaded earlier as per @ethanfrey's request were not helpful?

Thanks.

@alexanderbez
Contributor

You'll need as many validators as you need to get enough power online -- could be two, could be 80. I recall you manually modified power or something? Did you manually modify anything at all?

@njmurarka
Contributor Author

I specifically ensured that the one validator I was bringing up in the new network had far more than enough power (> 70%) before I did the export. So with the new network, this validator should have enough power to start it alone, right?

Manually modified power?

@alexanderbez
Contributor

How did you "ensure" this w/o tweaking anything?

@njmurarka
Contributor Author

I stopped one of the nodes. I exported the genesis file. I did not change its contents, and then I followed the directions listed above, to start a new network with that exported genesis file.

I did ensure, before exporting (many blocks before), that I had a validator (let's call it validatorBob) that had over 70% voting power. The reason was that when I started the new network, I ensured the first validator was the very same validatorBob node (same validator private key, etc...). So ok, I had validatorBob stake a lot more, to ensure this validator singularly had supermajority voting power. The rationale was so that I could bring up just one validator on the network to test the new network.

Unrelated, but I also discovered that this validator alone would not start to create blocks, despite having that voting power. I had to start a second, "token" validator (the power of this second validator was irrelevant) to get the new network going. Odd, but it does not in any obvious way seem related to the crash.

@njmurarka
Contributor Author

Does this help?

I really would love some guidance on what I could have done wrong. It is really difficult to know what to do, as the instructions to upgrade a network are pretty short, so I can't see where I might have gone wrong.

Thank you.

@njmurarka
Contributor Author

Update: I did the same as before but with a newer "export", using the same process.

No crash yet at block 9,000.

Still, isn't anyone here interested in finding out how and why it crashed? I did not "fix" anything.

I can reproduce the issue readily... so @alexanderbez @ethanfrey I am under the impression we don't need to reproduce this in a simulator. Am I wrong?

@alexanderbez
Contributor

Wdym by a "newer" export? Just at a later height?

@njmurarka
Contributor Author

Yes. Later height. No other difference.

I am assuming I am not the only one bothered by the fact that this "problem" happened. I am delighted it has not happened again, but it raises the question of why it happened.

Like I said, I did not do anything outside of the scope of the instructions for an upgrade.

Also, if you were to grab the two tarballs I provided above and try to deploy two quick validators pointing at each other, you will immediately see the crash, in the flesh.

Unless there is evidence I did something wrong, I have to rationalize this could happen again to me or someone else.

@alexanderbez
Contributor

Well, we haven't seen this problem before, and we're doing a handful of upgrades for Stargate. But let's leave the issue for now. I don't have any suggestions for how to proceed atm.

@njmurarka
Contributor Author

Let's keep it open then.

While I am SOMEWHAT ok with the problem not recurring, as a rational person it bothers me that it has not specifically been replicated and fixed.

I have to take the safe stance that it could happen again, even if I am the only person unfortunate enough to have run into it.

Let me know your thoughts.
