Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: snapshot recover from exporter error #13935

Conversation

chillyvee
Copy link
Contributor

@chillyvee chillyvee commented Nov 20, 2022

Description

During snapshots, it is difficult to tell which store was not able to export properly.

In the case of Dig chain, one store was not fully configured (no nodedb) so the returned exporter should be nil.

Cosmos-sdk can then decide whether to panic or skip the possibly unused store.

Added print when each store begins and completes an export.

This related update in cosmos/iavl properly returns a nil exporter on error:
https://github.com/cosmos/iavl/pull/622/files#diff-9d9ca52939a19f26358e2b5b35ca938679b381be35f4c54379df94c4fa4d1233R44

Logging for rootmulti was previous hardcoded to NewNopLogger which did NOT log.

Now developers/users can set an environment variable to turn on logging for this package:
LOG_ROOTMULTI=1 daemond start

Author Checklist

  • included the correct type prefix in the PR title
  • targeted the correct branch (see PR Targeting)
  • provided a link to the relevant issue or specification
  • followed the guidelines for building modules
  • included the necessary unit and integration tests
  • added a changelog entry to CHANGELOG.md
  • included comments for documenting Go code
  • updated the relevant documentation or specification
  • reviewed "Files changed" and left comments if necessary
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed ! in the type prefix if API or client breaking change
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic
  • reviewed API design and naming
  • reviewed documentation is accurate
  • reviewed tests and test coverage
  • manually tested (if applicable)

@chillyvee chillyvee requested a review from a team as a code owner November 20, 2022 17:06
@chillyvee chillyvee changed the title fix: snapshot recover from exporter error master fix: snapshot recover from exporter error Nov 20, 2022
@chillyvee
Copy link
Contributor Author

chillyvee commented Nov 20, 2022

Example output which is very helpful for debugging snapshots:

12:00PM INF creating state snapshot height=4412550
12:00PM INF indexed block exents height=4412550 module=txindex
Snapshot.Export Store Key acc
Snapshot.Export Store Key Done acc
Snapshot.Export Store Key authz
Snapshot.Export Store Key Done authz
Snapshot.Export Store Key bank
Snapshot.Export Store Key Done bank
Snapshot.Export Store Key capability
Snapshot.Export Store Key Done capability
Snapshot.Export Store Key distribution
Snapshot.Export Store Key Done distribution
Snapshot.Export Store Key evidence
Snapshot.Export Store Key Done evidence
Snapshot.Export Store Key feegrant
Snapshot.Export Store Key Done feegrant
Snapshot.Export Store Key gov
Snapshot.Export Store Key Done gov
Snapshot.Export Store Key ibc
Snapshot.Export Store Key Done ibc
Snapshot.Export Store Key icahost
iavl/export newExporter failed to create because tree.ndb is nil  <-- this is helpful
Snapshot.Export Store Key icahost exporter nil
Snapshot.Export Store Key mint
Snapshot.Export Store Key Done mint
Snapshot.Export Store Key params
Snapshot.Export Store Key Done params
Snapshot.Export Store Key slashing
Snapshot.Export Store Key Done slashing
Snapshot.Export Store Key staking
Snapshot.Export Store Key Done staking
Snapshot.Export Store Key transfer
Snapshot.Export Store Key Done transfer
Snapshot.Export Store Key upgrade
Snapshot.Export Store Key Done upgrade
Snapshot.Export Store Key wasm
Snapshot.Export Store Key Done wasm
12:05PM INF completed state snapshot format=2 height=4412550

It could be argued that a program should crash when a store doesn't have a nodedb since that could indicate a deeper problem. If that is desired let me know.

The idea here is to make it clear exactly which misconfigured store is crashing the daemon rather than a panic in a location which is called many times with different conditions.

In the case of Dig Chain, snapshots without the icahost store restored fine and validator nodes are signing with the proper apphash values. So it is likely that is an unused store that is mistakenly added, or mistakenly unconfigured.

@@ -757,8 +757,17 @@ func (rs *Store) Snapshot(height uint64, protoWriter protoio.Writer) error {
// and the following messages contain a SnapshotNode (i.e. an ExportNode). Store changes
// are demarcated by new SnapshotStore items.
for _, store := range stores {
fmt.Printf("Snapshot.Export Store Key %s\n", store.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use the logger and place this in debug mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix tomorrow. didn't see the logger earlier, but I see it now

store/iavl/store.go Outdated Show resolved Hide resolved
@chillyvee
Copy link
Contributor Author

chillyvee commented Nov 21, 2022

Changed all fmt.Printf to rs.logger

Found out that the original logger was a NopLogger which DID NOT respect the cli --log_level

The code path does not make it obvious how to attach to the global app logger, nor does it make sense to turn on consensus debugging when trying to debug rootmulti.

Thus we allow a runtime setting of environment variable LOG_ROOTMULTI to turn on logging for this package.

Example usage:

LOG_ROOTMULTI=1 daemond start

store/store.go Outdated
@@ -10,7 +12,15 @@ import (
)

func NewCommitMultiStore(db dbm.DB) types.CommitMultiStore {
return rootmulti.NewStore(db, log.NewNopLogger())
// Developers may enable rootmulti logging by setting env LOG_ROOTMULTI=1
if len(os.Getenv("LOG_ROOTMULTI")) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may be cleaner in the app.toml. But not sure, lets wait to see what others think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which of these makes the most sense?

  • passing the app the NewCommitMultiStore
  • passing the logging level desired to NewCommitMultiStore
  • configuring the logger in the baseapp and supplying it to NewCommitMultiStore

Also not sure where such a setting would go in app.toml (other than root?) since store spans quite a number of functions/reactors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this change to eliminate complexity within this PR

@faddat
Copy link
Contributor

faddat commented Nov 22, 2022

#12228

@chillyvee
Copy link
Contributor Author

Do not merge, found a bug (noted above)

store/iavl/store.go Outdated Show resolved Hide resolved
if err != nil {
rs.logger.Error("Snapshot Failed", "store", store.name, "err", err)
return err
}
defer exporter.Close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code has a bug, defers don't run until the function is returning, this assumes that they'll be closed per iteration of the loop. I've mailed PR #14027

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting note is that exporter.Close() is called at the end of this loop.

How about we remove the defer and call expoter.Close() manually only on error conditions (of which there are only a few)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your rewrite in #14027, that will work also of course.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in my PR I just put that code into a closure then kept the defer.

@chillyvee chillyvee force-pushed the v0466_snapshot_recover_from_exporter_error_master branch from ce26122 to fd0e4a7 Compare December 2, 2022 23:14
@chillyvee
Copy link
Contributor Author

chillyvee commented Dec 2, 2022

Here's an update on this PR.

Previously the effort was focused on preventing panic/nil/crashes during a snapshot.

After review of chains with the current crashing problem, we have determined that chains introducing a new store without a keeper and a call to SetStoreLoader(upgradetypes.UpgradeStoreLoader(upgradeInfo.Height will introduce the new store at version = 0.

This new store will continue to increment versions by 1 as enforced by cosmos/iavl on each block.

The crashing problem appear then when a store within rootmulti must be selected for the current block height for all stores.

However the new improperly introduced store, does not have a "version" matching the current height, and thus any effort to snapshot will fail.

For example, if your rootmulti has stores at height 1234

bank - version 1234
slashing - version 1234

Then a chain adds a module such as authz

This results in commit with stores in rootmulti like:

bank - version 1234
slashing - version 1234
authz - version 0

A proper snapshot now needs to consider the actual referenced version of each store during the commit. This PR now loads the commit, determines the actual version of each store, and places that information into the snapshot.

The snapshot protobuf format had to be modified because the restore function could no longer assume that all height match. Since IAVL checksums include the store version, nodes will AppHash error on the first modification to that store since the new checksum will include the mismatched version.

For example, all other nodes on the network 100 blocks after the upgrade will have:

bank - version 1334
slashing - version 1334
authz - version 100

But the restored node which assumes all restored stores have the same height will have

bank - version 1334
slashing - version 1334
authz - version 1334

To avoid this, we utilize the newly added Version to ensure that authz is restored to version 100 to match all other validator nodes.

This avoid the situation where a team deploying an upgrade in an imperfect function immediately penalizes all validators by breaking snapshots creation and statesync restore. The fact that a team is not perfect should not manifest as a panic especially since the the chain works otherwise.

Please note, we encountered issues using the protobuf generation functions and would appreciate if another member could run the protobuf generation and see if the results match or if additional corrections are needed.

@chillyvee chillyvee force-pushed the v0466_snapshot_recover_from_exporter_error_master branch from fd0e4a7 to 626a01d Compare December 2, 2022 23:36
@chillyvee
Copy link
Contributor Author

chillyvee commented Dec 2, 2022

The below documents the reason help is necessary to properly regenerate the snapshot protobuf pb.go files.

This error is current as of cosmos/cosmos-sdk main branch at commit 25449b5

 % make proto-format 
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: clang-format -i ./tx/textual/internal/testpb/1.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: clang-format -i ./proto/cosmos/base/store/snapshots/v1beta1/snapshot.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: clang-format -i ./proto/cosmos/base/store/v1beta1/listening.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: clang-format -i ./proto/cosmos/evidence/v1beta1/tx.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: clang-format -i ./proto/cosmos/evidence/v1beta1/evidence.proto


 % make proto-gen
Generating Protobuf files
Generating gogo proto code
cp: can't stat 'github.com/cosmos/cosmos-sdk/*': No such file or directory
make: *** [Makefile:400: proto-gen] Error 1

With a manual run inside the docker image:

/workspace/proto $ buf generate --template buf.gen.gogo.yaml ./cosmos/base/store/snapshots/v1beta1/snapshot.proto

Does not produce an output file. No errors are reported.

/workspace/proto $ buf --version
1.9.0

CHANGELOG.md Outdated Show resolved Hide resolved

for key, store := range rs.stores {
rs.logger.Debug("prune store", "key", key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the store as a field to the logger too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saw the suggestion to log "name', store.Name, but store.name is a private lower case variable.

store/rootmulti/store.go Outdated Show resolved Hide resolved
store/rootmulti/store.go Outdated Show resolved Hide resolved
store/rootmulti/store.go Outdated Show resolved Hide resolved
store/rootmulti/store.go Outdated Show resolved Hide resolved
store/rootmulti/store.go Outdated Show resolved Hide resolved
store/rootmulti/store.go Outdated Show resolved Hide resolved
if item.Store.Version == 0 {
fmt.Printf("!!!!!!!!!!!!!!!!! Restore Error, Snapshot Missing Version, Use Another Peer !!!!!!!!!!!!!!!!")
rs.logger.Error("Restore Error, Snapshot Missing Version, Use Another Peer", "store", item.Store.Name, "version", item.Store.Version)
return snapshottypes.SnapshotItem{}, sdkerrors.Wrap(sdkerrors.ErrLogic, "Snapshot Without Store Version Not Supported, Use Another Peer")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please do not capitalize or punctuate errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion noted. This code is no longer relevant to the scope of this PR

store/rootmulti/store.go Outdated Show resolved Hide resolved
store/rootmulti/store.go Fixed Show fixed Hide fixed
store/rootmulti/store.go Fixed Show fixed Hide fixed
store/rootmulti/store.go Fixed Show fixed Hide fixed
store/rootmulti/store.go Fixed Show fixed Hide fixed
store/rootmulti/store.go Fixed Show fixed Hide fixed
@chillyvee
Copy link
Contributor Author

Thank you for the review @alexanderbez

Noted above is that protobuf to go code generation appears broken. Is the code generation working for your team? Would be best to have matching .proto and .go code in the final commit.

@tac0turtle
Copy link
Member

The below documents the reason help is necessary to properly regenerate the snapshot protobuf pb.go files.

This error is current as of cosmos/cosmos-sdk main branch at commit 25449b5

 % make proto-format 
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: clang-format -i ./tx/textual/internal/testpb/1.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: clang-format -i ./proto/cosmos/base/store/snapshots/v1beta1/snapshot.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: clang-format -i ./proto/cosmos/base/store/v1beta1/listening.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: clang-format -i ./proto/cosmos/evidence/v1beta1/tx.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: clang-format -i ./proto/cosmos/evidence/v1beta1/evidence.proto


 % make proto-gen
Generating Protobuf files
Generating gogo proto code
cp: can't stat 'github.com/cosmos/cosmos-sdk/*': No such file or directory
make: *** [Makefile:400: proto-gen] Error 1

With a manual run inside the docker image:

/workspace/proto $ buf generate --template buf.gen.gogo.yaml ./cosmos/base/store/snapshots/v1beta1/snapshot.proto

Does not produce an output file. No errors are reported.

/workspace/proto $ buf --version
1.9.0

@julienrbrt any idea why this is happening, I cant reproduce it

store/rootmulti/store.go Fixed Show fixed Hide fixed
@julienrbrt
Copy link
Member

julienrbrt commented Dec 8, 2022

@julienrbrt any idea why this is happening, I cant reproduce it

I cannot reproduce it either. Works for me in GitHub Codespaces as well. @chillyvee can you retry in codespaces too and what is your OS?
@likhita-809 I believe this you had something similar, have you discovered why?

store/rootmulti/store.go Fixed Show fixed Hide fixed
@chillyvee
Copy link
Contributor Author

chillyvee commented Dec 8, 2022

Works in github codespaces but not Ubuntu 20.04 x64

This is the OS I'm using for the x64 dev environment

NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"

REPOSITORY                     TAG           IMAGE ID       CREATED        SIZE
ghcr.io/cosmos/proto-builder   0.11.2        32668103e373   4 weeks ago    1.24GB

I will delete the docker image I have and pull it again to see what happens - does not work

# docker rmi ghcr.io/cosmos/proto-builder:0.11.2
# docker pull ghcr.io/cosmos/proto-builder:0.11.2

0.11.2: Pulling from cosmos/proto-builder
213ec9aee27d: Already exists 
4583459ba037: Pull complete 
93c1e223e6f2: Pull complete 
cbef1b99e503: Pull complete 
f99825e86aac: Pull complete 
be55fa1995db: Pull complete 
b05bebbcb2df: Pull complete 
54962337969b: Pull complete 
5aff3b65794d: Pull complete 
530c7ee0eccf: Pull complete 
d68ed598193d: Pull complete 
4e4099510e6a: Pull complete 
94369a9c35c8: Pull complete 
Digest: sha256:538f334c1069f2dd544e9f9997ada75c7fb03da21ae2816d8029de172d9c2365
Status: Downloaded newer image for ghcr.io/cosmos/proto-builder:0.11.2
ghcr.io/cosmos/proto-builder:0.11.2

 % make proto-gen
Generating Protobuf files
Generating gogo proto code
cp: can't stat 'github.com/cosmos/cosmos-sdk/*': No such file or directory
make: *** [Makefile:403: proto-gen] Error 1

REPOSITORY                     TAG           IMAGE ID       CREATED        SIZE
ghcr.io/cosmos/proto-builder   0.11.2        32668103e373   4 weeks ago    1.24GB

In codespaces

 $ docker images
REPOSITORY                     TAG       IMAGE ID       CREATED       SIZE
ghcr.io/cosmos/proto-builder   0.11.2    32668103e373   4 weeks ago   1.24GB

And the codespaces output is much more verbose

make proto-gen
Generating Protobuf files
Generating gogo proto code
go: downloading github.com/cosmos/gogoproto v1.4.3
go: downloading github.com/grpc-ecosystem/go-grpc-middleware v1.3.0
...

Looks like something is preventing the go build/mod process from completing properly.

Same docker image. Must just be something very strange with that particular release of Ubuntu+Docker?

Will set up alternate dev environment/use codespaces if that's all it takes to fix. Might not be worth the time to debug.

@likhita-809
Copy link
Contributor

@julienrbrt any idea why this is happening, I cant reproduce it

I cannot reproduce it either. Works for me in GitHub Codespaces as well. @chillyvee can you retry in codespaces too and what is your OS? @likhita-809 I believe this you had something similar, have you discovered why?

I believe it has something to do with Ubuntu+Docker release as mentioned by @chillyvee because I had tried debugging in several different ways but nothing seems to work locally, but it works fine in codespaces with the same docker image.

@chillyvee
Copy link
Contributor Author

FYI got my node synced up with the latest juno testnet so working on testing now.

@chillyvee chillyvee force-pushed the v0466_snapshot_recover_from_exporter_error_master branch from 52fe0c9 to 8027524 Compare March 3, 2023 22:36
@chillyvee
Copy link
Contributor Author

Previous branch renamed to v0466_snapshot_recover_from_exporter_error_master_preclean

Removed enhanced snapshot format related code

Kept debugging code which can assist validators with real-time monitoring.

@chillyvee chillyvee force-pushed the v0466_snapshot_recover_from_exporter_error_master branch 2 times, most recently from 92e3955 to aae329a Compare March 3, 2023 22:48
@chillyvee chillyvee force-pushed the v0466_snapshot_recover_from_exporter_error_master branch from aae329a to 267c64e Compare March 3, 2023 22:50
Copy link
Collaborator

@odeke-em odeke-em left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating these changes @chillyvee. Just one more change then good to go. Please also update the base branch to the latest.

store/rootmulti/store.go Outdated Show resolved Hide resolved
store/rootmulti/store.go Outdated Show resolved Hide resolved
@chillyvee
Copy link
Contributor Author

chillyvee commented Mar 5, 2023

Thank you for the feedback. encoding/hex dependency removed.

One more item:

@tac0turtle @odeke-em - Earlier someone asked to log store.Name, but store.name is a private variable. Okay to delete the comment from this PR? (code reference here). If there was a clean way to access the private .name, please recommend it.

https://github.com/cosmos/cosmos-sdk/pull/13935/files#diff-2bc7a86169d4c2f891d7ae74b21ca4bcc5787d10114e2eee215fe1ddba5c999eR614

Copy link
Collaborator

@odeke-em odeke-em left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, thank you @chillyvee!

store/rootmulti/store.go Outdated Show resolved Hide resolved
@tac0turtle tac0turtle enabled auto-merge (squash) March 6, 2023 16:23
@tac0turtle tac0turtle added the A:automerge Automatically merge PR once all prerequisites pass. label Mar 6, 2023
@tac0turtle tac0turtle merged commit 183dde8 into cosmos:main Mar 6, 2023
@julienrbrt
Copy link
Member

Given it ended up only adding logs and improving UX, we should get this in v0.47 final imho.

@julienrbrt
Copy link
Member

@Mergifyio backport release/v0.47.x

@mergify
Copy link
Contributor

mergify bot commented Mar 6, 2023

backport release/v0.47.x

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Mar 6, 2023
Co-authored-by: Marko <[email protected]>
Co-authored-by: Emmanuel T Odeke <[email protected]>
Co-authored-by: Marko <[email protected]>
Co-authored-by: Aleksandr Bezobchuk <[email protected]>
(cherry picked from commit 183dde8)

# Conflicts:
#	store/rootmulti/store.go
exporter, err := store.Export(int64(height))
if exporter == nil {
rs.logger.Error("snapshot failed; exporter is nil", "store", store.name)
Copy link
Member

@julienrbrt julienrbrt Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at store.Export this will always trigger when there is an error, and the error check below is now not so useful. Was it intended?

func (st *Store) Export(version int64) (*iavl.Exporter, error) {
istore, err := st.GetImmutable(version)
if err != nil {
return nil, errorsmod.Wrapf(err, "iavl export failed for version %v", version)
}
tree, ok := istore.tree.(*immutableTree)
if !ok || tree == nil {
return nil, fmt.Errorf("iavl export failed: unable to fetch tree for version %v", version)
}
export, err := tree.Export()
if err != nil {
return nil, err
}
return export, nil

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we modified the exporter function after the fact and probably never cleaned this up

julienrbrt added a commit that referenced this pull request Mar 6, 2023
Co-authored-by: Chill Validation <[email protected]>
Co-authored-by: marbar3778 <[email protected]>
Co-authored-by: Julien Robert <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A:automerge Automatically merge PR once all prerequisites pass. C:Store
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants