Chain crashed with consensus failure, about 8 hrs after a hard fork. Delegator staking error. #7506

Closed
njmurarka opened this issue Oct 10, 2020 · 24 comments

@njmurarka
Contributor

njmurarka commented Oct 10, 2020

Summary of Bug

I recently exported the genesis from my old network, as per ticket #7505. I was able to "successfully" launch a new network of two nodes, using this exported genesis.

I ran the new network for a while, and then, after about 6,000 blocks, I got the following error:

CONSENSUS FAILURE: Calculated final stake for delegator... greater than current stake final stake.

[Screenshot: cosmos crash console output]

The code has not really changed all that much, and I am in fact still using the same version of the Cosmos SDK, so this is quite puzzling. Furthermore, it is worrisome that the old network ran for months without an issue, yet after forking it I got a crash within about 8 hours.

I can provide trace and logs if needed. I kept a copy of everything.

Some other similar mentions:

#4088
https://www.gitmemory.com/issue/cosmos/cosmos-sdk/4012/480477596

Version

cosmos-sdk v0.39.1

Steps to Reproduce

Above.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@alessio
Contributor

alessio commented Oct 10, 2020

This seems to affect 0.39.1 @clevinson @ethanfrey @alexanderbez

@tac0turtle
Member

tac0turtle commented Oct 10, 2020

I can provide trace and logs if needed. I kept a copy of everything.

Can you please! https://pastebin.com/ is a good place to paste large chunks. Your genesis file would be helpful as well.

@alexanderbez
Contributor

This is virtually impossible to debug unless we can detect this in simulations with a specific seed (which we haven't). My suspicion is that something in the export/upgrade process may have corrupted state somehow.

@ethanfrey
Contributor

ethanfrey commented Oct 10, 2020

Does it happen every time you restart with the same db? Does it happen to a new node that syncs from genesis?

Please save a complete backup of the data and config dirs and see how well you can reproduce it.

@njmurarka
Contributor Author

I have a total backup of my .blzd folders on the ONLY two validators for this network. If I start either validator up with this .blzd state, I get this error, very deterministically. I reset the network, but I DID back up .blzd first, to ensure we could investigate.

.blzd is not large. Please let me know what to do. I could upload both copies of .blzd (for the two validators) someplace for you, if that helps to resolve this.

@alexanderbez I do not know how I could have done something wrong in the export or upgrade. I am no expert on this particular process, but I followed these instructions (which are very brief):

https://github.com/cosmos/cosmos-sdk/blob/d7df1ef868e480dc2bcdf852c11a7121c6afab67/NEWS.md#chain-hard-fork-also-know-as-the-tested-upgrade-path

Note that I had to start two validators even to get the new network running, which in itself was counter to my understanding, given that one of the validators alone has enough power to create blocks. But that might be irrelevant.

@njmurarka
Contributor Author

@ethanfrey It happens when I start either of the two validators I have. They literally crash on start.

As mentioned in the previous comment, please let me know if I should just give you guys a tarball of my two .blzd folders. You can then duplicate the behaviour (well, at least, I can).

@njmurarka
Contributor Author

Also, this CANNOT be a coincidence. The crash happened at block 6031, both times. By both times, I mean that it happened when I initially launched the new upgraded chain. Then, that crashed and burned. I reset both validators, started again, and it died again at the same block!

Here is another dump from the first crash:

I[2020-10-09|20:22:21.511] Executed block                               module=state height=6029 validTxs=0 invalidTxs=0
I[2020-10-09|20:22:21.514] Committed state                              module=state height=6029 txs=0 appHash=D72DB9B94B6DBFD651DA194B8FD81DE182F7E398A27818D7AD449F315343A96B
I[2020-10-09|20:22:26.531] Executed block                               module=state height=6030 validTxs=0 invalidTxs=0
I[2020-10-09|20:22:26.543] Committed state                              module=state height=6030 txs=0 appHash=B4872F90111553A5AD08805D23F21A6073A959803B821EFDF2361E8D0EB49460
E[2020-10-09|20:22:35.564] CONSENSUS FAILURE!!!                         module=consensus err="calculated final stake for delegator bluzelle1ak32uvt5ta38kcz97m6g9ydvhesvnysk228hgw greater than current stake\n\tfinal stake:\t10000000.000000000000000000\n\tcurrent stake:\t9999000.000000000000000000" stack="goroutine 145 [running]:\nruntime/debug.Stack(0xc000e8aea0, 0x1199040, 0xc00276cdb0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9d\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine.func2(0xc00005f880, 0x152ba10)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:617 +0x57\npanic(0x1199040, 0xc00276cdb0)\n\t/usr/local/go/src/runtime/panic.go:967 +0x15d\ngithub.com/cosmos/cosmos-sdk/x/distribution/keeper.Keeper.calculateDelegationRewards(0x173cc20, 0xc000231f80, 0xc00017a1c0, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, 0x173cc60, 0xc0000b0020, 0xc000211f40, 0xc, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/distribution/keeper/delegation.go:127 +0x873\ngithub.com/cosmos/cosmos-sdk/x/distribution/keeper.Keeper.withdrawDelegationRewards(0x173cc20, 0xc000231f80, 0xc00017a1c0, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, 0x173cc60, 0xc0000b0020, 0xc000211f40, 0xc, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/distribution/keeper/delegation.go:147 +0x29f\ngithub.com/cosmos/cosmos-sdk/x/distribution/keeper.Hooks.BeforeDelegationSharesModified(0x173cc20, 0xc000231f80, 0xc00017a1c0, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, 0x173cc60, 0xc0000b0020, 0xc000211f40, 0xc, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/distribution/keeper/hooks.go:88 +0x1ea\ngithub.com/cosmos/cosmos-sdk/x/staking/types.MultiStakingHooks.BeforeDelegationSharesModified(0xc00000eae0, 0x2, 0x2, 0x174eb60, 0xc00012a010, 0x1763da0, 0xc00389dd00, 0xa, 0x0, 0x0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/staking/types/hooks.go:47 +0xe0\ngithub.com/cosmos/cosmos-sdk/x/staking/keeper.Keeper.BeforeDelegationSharesModified(...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/staking/keeper/hooks.go:56\ngithub.com/cosmos/cosmos-sdk/x/staking/keeper.Keeper.unbond(0x173cc20, 0xc000231f60, 0xc00017a1c0, 0x1760a20, 0xc0002ad000, 0x1763f20, 0xc00000eb20, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/staking/keeper/delegation.go:555 +0xa31\ngithub.com/cosmos/cosmos-sdk/x/staking/keeper.Keeper.slashRedelegation(0x173cc20, 0xc000231f60, 0xc00017a1c0, 0x1760a20, 0xc0002ad000, 0x1763f20, 0xc00000eb20, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/staking/keeper/slash.go:257 +0x744\ngithub.com/cosmos/cosmos-sdk/x/staking/keeper.Keeper.Slash(0x173cc20, 0xc000231f60, 0xc00017a1c0, 0x1760a20, 0xc0002ad000, 0x1763f20, 0xc00000eb20, 0xc00017a1c0, 0x173cc20, 0xc000231fd0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/staking/keeper/slash.go:95 +0x795\ngithub.com/cosmos/cosmos-sdk/x/slashing/internal/keeper.Keeper.HandleValidatorSignature(0x173cc20, 0xc000231f90, 0xc00017a1c0, 0x17602a0, 0xc000200c60, 0x174fe60, 0xc0002e06e0, 0x174eb60, 0xc00012a010, 0x1763da0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/slashing/internal/keeper/infractions.go:94 +0xf40\ngithub.com/cosmos/cosmos-sdk/x/slashing.BeginBlocker(0x174eb60, 0xc00012a010, 0x1763da0, 0xc00389dd00, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 
...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/slashing/abci.go:16 +0x17d\ngithub.com/cosmos/cosmos-sdk/x/slashing.AppModule.BeginBlock(...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/x/slashing/module.go:144\ngithub.com/cosmos/cosmos-sdk/types/module.(*Manager).BeginBlock(0xc00017b260, 0x174eb60, 0xc00012a010, 0x1763da0, 0xc00389dd00, 0xa, 0x0, 0x0, 0x0, 0x0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/types/module/module.go:297 +0x1ca\ngithub.com/bluzelle/curium.(*CRUDApp).BeginBlocker(...)\n\t/home/ubuntu/go/src/github.com/bluzelle/curium/app.go:345\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).BeginBlock(0xc000e20000, 0xc00376b160, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/cosmos/[email protected]/baseapp/abci.go:136 +0x469\ngithub.com/tendermint/tendermint/abci/client.(*localClient).BeginBlockSync(0xc000131680, 0xc00376b160, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/abci/client/local_client.go:231 +0xf8\ngithub.com/tendermint/tendermint/proxy.(*appConnConsensus).BeginBlockSync(0xc000231c40, 0xc00376b160, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/proxy/app_conn.go:69 +0x6b\ngithub.com/tendermint/tendermint/state.execBlockOnProxyApp(0x174f5a0, 0xc000de7180, 0x175df00, 0xc000231c40, 0xc003d9b880, 0x1766720, 0xc000010978, 0x6, 0xc000e6dcc0, 0x1e)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/state/execution.go:284 +0x3e1\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc0002fe310, 0xa, 0x0, 0xc0001879a0, 0x6, 0xc000e6dcc0, 0x1e, 0x178e, 0xc0037da660, 0x20, ...)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/state/execution.go:135 +0x17a\ngithub.com/tendermint/tendermint/consensus.(*State).finalizeCommit(0xc00005f880, 0x178f)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1459 +0x8f5\ngithub.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit(0xc00005f880, 0x178f)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1377 +0x383\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit.func1(0xc00005f880, 0x1, 0x178f)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1312 +0x90\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit(0xc00005f880, 0x178f, 0x1)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1349 +0x61e\ngithub.com/tendermint/tendermint/consensus.(*State).addVote(0xc00005f880, 0xc003981860, 0x0, 0x0, 0x4528d6e1, 0x84ed06f31edcdb80, 0xc0033abaa0)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1891 +0xa39\ngithub.com/tendermint/tendermint/consensus.(*State).tryAddVote(0xc00005f880, 0xc003981860, 0x0, 0x0, 0xc000e97b90, 0x212a6d0, 0x152dab8)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:1707 +0x67\ngithub.com/tendermint/tendermint/consensus.(*State).handleMsg(0xc00005f880, 0x172ba40, 0xc003f0a268, 0x0, 0x0)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:713 +0x525\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc00005f880, 0x0)\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:664 +0x597\ncreated by 
github.com/tendermint/tendermint/consensus.(*State).OnStart\n\t/home/ubuntu/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:335 +0x13a\n"
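
(Editor's note, for readers following the trace: the panic is raised at the end of the distribution keeper's calculateDelegationRewards, reached from the slashing BeginBlocker via HandleValidatorSignature → staking Slash → slashRedelegation → unbond → the BeforeDelegationSharesModified hook → withdrawDelegationRewards. Below is a rough, self-contained paraphrase of that final sanity check. It is not the v0.39.1 source, just an illustration of why the message fires when the recorded starting stake ends up larger than what the staking module currently accounts for.)

```go
package main

import (
	"fmt"

	sdk "github.com/cosmos/cosmos-sdk/types"
)

// finalStakeSanityCheck paraphrases the check at the end of the distribution
// keeper's calculateDelegationRewards (delegation.go:127 in the trace above).
// It is an illustration, not the SDK source.
//
// startingStake: stake recorded in the delegator's starting info.
// slashFractions: slash events applied to the validator since that record.
// currentStake: the delegation's shares converted to tokens now.
func finalStakeSanityCheck(delegator string, startingStake, currentStake sdk.Dec, slashFractions []sdk.Dec) sdk.Dec {
	finalStake := startingStake
	for _, f := range slashFractions {
		// Stake is truncated at each slash event, so the recomputed final
		// stake should come out less than or equal to the current stake.
		finalStake = finalStake.MulTruncate(sdk.OneDec().Sub(f))
	}
	if finalStake.GT(currentStake) {
		// This is the consensus-failure message from the logs: the rewards
		// calculation believes the delegator holds more stake than the staking
		// module currently records, i.e. distribution and staking state disagree.
		panic(fmt.Sprintf(
			"calculated final stake for delegator %s greater than current stake"+
				"\n\tfinal stake:\t%s\n\tcurrent stake:\t%s",
			delegator, finalStake, currentStake))
	}
	return finalStake
}

func main() {
	// Numbers from the crash above: final stake 10,000,000 vs current stake
	// 9,999,000 -- running this reproduces the same panic text.
	finalStakeSanityCheck(
		"bluzelle1ak32uvt5ta38kcz97m6g9ydvhesvnysk228hgw",
		sdk.NewDec(10000000),
		sdk.NewDec(9999000),
		nil,
	)
}
```

Because the slash is applied in BeginBlock at the same height on every node, every node hits the same inconsistent record, which is consistent with the crash occurring deterministically at the same block.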

@njmurarka
Contributor Author

Here you can see BigDipper running against the SECOND instance of the network, which I restarted after the first crash; it crashed at the same block.

[Screenshot: BigDipper, 2020-10-10 13:30]

@ethanfrey
Contributor

I am not an expert on the staking module at all... I have had nothing to do with it. I am just trying to figure out a reproducible test case for someone to debug. I think those .blzd dirs as tarballs would be good to share (link from Dropbox or other such?)

Also, can you take a copy of one of those dirs, delete data/application.db (which is the application state), and then start a node on that dir? This should try to resync all issued blocks from genesis and recreate the app state. It would be good to see if this hits the same bug before halting on "waiting for next block".

@ethanfrey
Contributor

Also, pointing to the exact code commit you are running is very helpful. I wonder if you have any code that interacts with the staking or distribution system somehow? Moving coins in the fee collector or community pool, maybe?
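
(Editor's note: a hypothetical sketch of the kind of custom code being asked about here; it is not from the curium repo. In SDK v0.39 such a transfer would typically go through the supply keeper's module-account methods, and the module names below are the v0.39 defaults.)

```go
// Hypothetical example only -- custom code that "moves coins in the fee
// collector or community pool", which could leave distribution accounting
// inconsistent with actual balances.
package custom

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
	"github.com/cosmos/cosmos-sdk/x/supply"
)

func moveFees(ctx sdk.Context, sk supply.Keeper, amt sdk.Coins) error {
	// Draining "fee_collector" behind the distribution module's back is the
	// sort of interaction being asked about.
	return sk.SendCoinsFromModuleToModule(ctx, "fee_collector", "distribution", amt)
}
```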

@njmurarka
Contributor Author

njmurarka commented Oct 11, 2020

@ethanfrey Here are links to the .blzd folder (click the download icon on the top right... not always obvious):

https://drive.google.com/file/d/1xYJPJCjUG0lGNwozsoIx5mZCoz1KeBy-/view?usp=sharing
https://drive.google.com/file/d/1nOYBRaHH7lNdQIH0sK6A1p6kQ4PQh5Lj/view?usp=sharing

Please untar and try two different daemons. You will obviously need to build the daemon with our code and point the peers at each other (config.toml)... obvious stuff, I am sure.

Link to the code (please use the test-cm-409 branch):

https://github.com/bluzelle/curium

We were using version 0.39.1 of the Cosmos SDK.

No code that directly talks to the staking or distribution system. Nothing quite that sophisticated, yet.

@njmurarka
Contributor Author

Anything else I can provide to help resolve this?

@alexanderbez
Contributor

@njmurarka this is virtually impossible to debug :-/ but the best suggestion I can give is that it's most likely operator error, as we've executed many upgrades (both halted and live) w/o any issues.

@njmurarka
Contributor Author

njmurarka commented Oct 14, 2020

@alexanderbez I am not too sure how I could have made an error. Of course, I am not denying it is possible.

But what are the instructions then to do an upgrade? Right now, I have only got the following link (that I posted above):

https://github.com/cosmos/cosmos-sdk/blob/d7df1ef868e480dc2bcdf852c11a7121c6afab67/NEWS.md#chain-hard-fork-also-know-as-the-tested-upgrade-path

For convenience, I have posted the instructions I followed below:

Chain Hard Fork (also know as The Tested Upgrade Path)

This strategy entails performing a hard fork of your chain. It takes time, coordination and a few technical steps that the validators of the network must follow.

In preparation of the upgrade, you need to export the current state. This operation should be performed on one node:

Stop the node and export the current state, e.g.: appd export --for-zero-height > export_genesis.json.
Manually replace the chain id and genesis time fields in export_genesis.json with the values that the network had agreed upon.

Follow these steps to perform the upgrade:

Make a backup copy of the old genesis.json file in your server application's config directory (e.g. $HOME/.appd/config/genesis.json) and replace it with export_genesis.json. Note: do rename export_genesis.json to genesis.json.

Replace the old binary with the new one and restart the service using the new binary.

I did not deviate from this (I did not even whitelist jailed validators, partly because I could not tell what the syntax was for the whitelist command), but do note that I had to start two validators just to get the new network to start. I don't know if and how this fact is possibly related to the crash, but I opened a separate bug for this "need to start two validators" requirement.

Would really appreciate if you could provide me with the instructions you follow to do an upgrade.

On the matter of this bug I filed, shall I assume then that the .blzd folder tarballs I uploaded earlier as per @ethanfrey's request were not helpful?

Thanks.

@alexanderbez
Contributor

You'll need as many validators as you need to get enough power online -- could be two, could be 80. I recall you manually modified power or something? Did you manually modify anything at all?

@njmurarka
Contributor Author

I specifically ensured that the one validator I was bringing up in the new network had far more than enough power (> 70%) before I did the export. So with the new network, this validator should have enough power to start it alone, right?

Manually modified power?

@alexanderbez
Contributor

How did you "ensure" this w/o tweaking anything?

@njmurarka
Contributor Author

I stopped one of the nodes. I exported the genesis file. I did not change its contents, and then I followed the directions listed above, to start a new network with that exported genesis file.

I did ensure, before exporting (many blocks before), that I had a validator (let's call it validatorBob) that had over 70% voting power. The reason was that when I started the new network, I ensured the first validator was the very same validatorBob node (same validator private key, etc...). So ok, I had validatorBob stake a lot more, to ensure this validator singularly had supermajority voting power. The rationale was so that I could bring up just one validator on the network to test the new network.

Unrelated, but I also discovered that this validator alone would not start to create blocks, despite having that voting power. I had to start a second, "token" validator (the power of this second validator was irrelevant) to get the new network going. Odd, but it does not in any obvious way seem related to the crash.

@njmurarka
Contributor Author

Does this help?

I really would love some guidance on what I could have done wrong. It is really difficult to know what to do, as the instructions to upgrade a network are pretty short, so I can't see where I might have gone wrong.

Thank you.

@njmurarka
Contributor Author

Update: I did the same as before but with a newer "export", using the same process.

No crash yet at block 9,000.

Still, isn't anyone here interested in finding out how and why it crashed? I did not "fix" anything.

I can reproduce the issue readily... so @alexanderbez @ethanfrey I am under the impression we don't need to reproduce this in a simulator. Am I wrong?

@alexanderbez
Contributor

Wdym by a "newer" export? Just at a later height?

@njmurarka
Contributor Author

Yes. Later height. No other difference.

I am assuming I am not the only one bothered by the fact that this "problem" happened. I am delighted it has not happened again, but it raises the question of why it happened.

Like I said, I did not do anything outside of the scope of the instructions for an upgrade.

Also, if you were to grab the two tarballs I provided above and try to deploy two quick validators pointing at each other, you will immediately see the crash, in the flesh.

Unless there is evidence I did something wrong, I have to rationalize this could happen again to me or someone else.

@alexanderbez
Contributor

Well, we haven't seen this problem before, and we're doing a handful of upgrades for Stargate. But let's leave the issue for now. I don't have any suggestions for how to proceed atm.

@njmurarka
Contributor Author

Let's keep it open then.

While I am SOMEWHAT ok with the problem not recurring, as a rational person it bothers me that it has not specifically been replicated and fixed.

I have to take the safe stance that it could happen again, even if I am the only person unfortunate enough to have run into it.

Let me know your thoughts.
