
[POC] heal the worldstate #4972

Closed · wants to merge 24 commits

Conversation


@matkt (Contributor) commented Jan 22, 2023

Signed-off-by: Karim TAAM <[email protected]>

PR description

This PR adds a heal mechanism for the worldstate in case of inconsistency (e.g. unable to load a trie node). To fix this, we start a quick heal of the worldstate automatically and, once the fix is done, we restart the block import.

After the detection of an invalid path (a sketch of the full flow follows this list):

  • we delete this path to force the healing of this part,
  • then we delete the trie logs,
  • we select a pivot block (before or after the heal),
  • we move the blockchain to this pivot block (rewind, or download the missing blocks),
  • then we launch a worldstate heal.
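A minimal sketch of this recovery flow in Java. All names here are illustrative, not the actual Besu API; only rewindToBlock and clearTrieLog mirror methods that appear in the diffs further down this thread.

  // Hypothetical sketch of the heal flow described above. worldStateStorage,
  // blockchain, and the helper methods are assumed fields/methods, named for
  // illustration only.
  public void healWorldState(final Bytes invalidPath) {
    // 1. Delete the invalid path so the heal is forced to re-download it.
    worldStateStorage.deletePath(invalidPath);

    // 2. Delete the trie logs: they may be corrupted too, and stale entries
    //    could make the worldstate look available when it is not.
    worldStateStorage.clearTrieLog();

    // 3. Select a new pivot block (it may be before or after the current head).
    final BlockHeader pivot = selectPivotBlock();

    // 4. Move the blockchain to the pivot: rewind if we already have the
    //    block, otherwise download the missing blocks.
    if (blockchain.contains(pivot.getHash())) {
      blockchain.rewindToBlock(pivot.getHash());
    } else {
      downloadMissingBlocksUpTo(pivot);
    }

    // 5. Launch the worldstate heal against the pivot, then resume import.
    startWorldStateHeal(pivot);
    resumeBlockImport();
  }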

This feature can also heal a node that has been inconsistent for a long time, but it will take longer because there will be more nodes to heal. With this PR the healing is done as soon as the problem is detected, so there will not be a lot to heal and it will be fast.

Performed tests

  • Trigger multiple inconsistencies to fix (passed)
  • Fixed a node that has been inconsistent for a long time (passed)
  • Run a snapsync from scratch on goerli (passed)
  • Run a checkpoint sync from scratch on goerli (passed)
  • Run a checkpoint sync from scratch on mainnet (in progress)
  • Run a validator teku+besu on goerli (passed)
  • Profile performance on this node to avoid perf regression (passed)

Fixed Issue(s)

#4379
#4785
#4768

Documentation

  • I thought about documentation and added the doc-change-required label to this PR if
    updates are required.

Changelog

@matkt force-pushed the feature/heal-worldstate branch 2 times, most recently from f173297 to d440264 on January 22, 2023 10:53
@matkt force-pushed the feature/heal-worldstate branch 4 times, most recently from 750c53b to fd31a2d on January 25, 2023 09:02
Signed-off-by: Karim TAAM <[email protected]>
@matkt force-pushed the feature/heal-worldstate branch 2 times, most recently from 6593ccd to 1f1f96b on January 30, 2023 18:59
Signed-off-by: Karim TAAM <[email protected]>
Signed-off-by: Karim TAAM <[email protected]>
Signed-off-by: Karim TAAM <[email protected]>
Signed-off-by: Karim TAAM <[email protected]>
Signed-off-by: Karim TAAM <[email protected]>
Signed-off-by: Matt Nelson <[email protected]>
@garyschulte marked this pull request as ready for review on February 3, 2023 20:54
Signed-off-by: Karim TAAM <[email protected]>
  public boolean isInitialSyncPhaseDone() {
    return isInitialSyncPhaseDone;
  }

  public boolean isResyncNeeded() {
    return isResyncNeeded;
  }
Contributor:

We never reset this value AFAICT. Once we need a resync, we always need a resync ;)

Contributor Author:

Fixed. I now reset this flag in markInitialSyncPhaseAsDone.
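A minimal sketch of that fix, assuming markInitialSyncPhaseAsDone lives on the same class as the two getters quoted above:

  public void markInitialSyncPhaseAsDone() {
    isInitialSyncPhaseDone = true;
    // Reset the resync flag here so that, per the review comment, a resync
    // is not requested forever once it has been needed once.
    isResyncNeeded = false;
  }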

         && protocolContext.getBlockchain().getChainHeadBlockNumber()
             != BlockHeader.GENESIS_BLOCK_NUMBER) {
       LOG.info(
-          "Checkpoint sync was requested, but cannot be enabled because the local blockchain is not empty.");
+          "Snap sync was requested, but cannot be enabled because the local blockchain is not empty.");
Contributor:

nit: should be "checkpoint"

Contributor Author:

Done, thanks.

-        trieLogStorage,
-        fallbackNodeFinder);
+      final KeyValueStorage trieLogStorage) {
+    super(accountStorage, codeStorage, storageStorage, trieBranchStorage, trieLogStorage);
Contributor:

👍

@@ -44,7 +44,8 @@ static FinalBlockConfirmation genesisConfirmation(final MutableBlockchain blockc
   }

   static FinalBlockConfirmation ancestorConfirmation(final MutableBlockchain blockchain) {
-    return firstHeader -> blockchain.contains(firstHeader.getParentHash());
+    return firstHeader ->
+        blockchain.getChainHeadHeader().getHash().equals(firstHeader.getParentHash());
Contributor:

This might break backward sync for a reorg, e.g. blockchain.contains() will return true if we have block A' but it isn't the chain head. This check requires it to be the chain head...

Contributor Author:

I'm using the block number now. Should be fine, but please verify if you can.

Contributor:

When I pull this branch I see:
  return firstHeader -> blockchain.getChainHeadBlockNumber() + 1 >= firstHeader.getNumber();
I might be missing something, but I read the firstAncestor block as being the earliest block fetched by backward sync, and finalBlockConfirmation.ancestorHeaderReached as the termination condition for fetching further back in BackwardSyncAlgorithm.pickNextStep.

If we are backward syncing because of a fork from 5 or 10 blocks behind head, I think this condition could terminate early. It is almost logically identical to the conditional above it:
  return firstHeader -> blockchain.getChainHeadBlockNumber() + 1 >= firstHeader.getNumber();

The first condition makes sense to me b/c we really just need to ensure this backward path resolves to something we have, rather than a particular height.

Contributor Author (@matkt) commented Feb 6, 2023:

OK, I think I understand your remark. On the other hand, I have to keep this check because it makes backward sync work after the heal in case we rewind: we will already have the blocks, but I still want to go back and import them again. So I think I should add the initial condition plus this one. In the case of a reorg, the first condition will make sure that it does not stop too soon.
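A sketch of how the combined check could look (composing the two predicates with && is my assumption; the exact code in the follow-up PR may differ):

  static FinalBlockConfirmation ancestorConfirmation(final MutableBlockchain blockchain) {
    // Terminate backward sync only when the fetched ancestor links to a block
    // we already have (so a reorg does not stop too soon) AND we are within
    // one block of the chain head (so blocks are re-imported after a heal
    // rewind even though we already have them).
    return firstHeader ->
        blockchain.contains(firstHeader.getParentHash())
            && blockchain.getChainHeadBlockNumber() + 1 >= firstHeader.getNumber();
  }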


@garyschulte (Contributor) left a comment:

Solid, some of the backward sync changes are concerning.

@siladu (Contributor) left a comment:

Should ethereum/referencetests/src/reference-test/external-resources
be changing as part of this PR?

@@ -30,13 +31,16 @@ public class ProtocolContext {
   private final WorldStateArchive worldStateArchive;
   private final ConsensusContext consensusContext;

+  private Optional<Synchronizer> synchronizer;
Contributor:

Think we should consider alternatives to bloating ProtocolContext, a global object.

Contributor Author:

Yes, this is indeed a problem. I tried hard to do it otherwise, but this was the cleanest approach I found with the current code structure.
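For reference, a sketch of the shape under discussion; the setter/getter names are illustrative, only the Optional<Synchronizer> field appears in the diff above:

  public class ProtocolContext {
    private final WorldStateArchive worldStateArchive;
    private final ConsensusContext consensusContext;

    // Mutable and optional because the synchronizer is constructed after the
    // ProtocolContext, which is what makes this feel like bloat on a global.
    private Optional<Synchronizer> synchronizer = Optional.empty();

    public void setSynchronizer(final Synchronizer synchronizer) {
      this.synchronizer = Optional.of(synchronizer);
    }

    public Optional<Synchronizer> getSynchronizer() {
      return synchronizer;
    }
  }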

@@ -85,6 +85,7 @@ protected CompletableFuture<FastSyncState> start(final FastSyncState fastSyncSta
   if (worldStateStorage instanceof BonsaiWorldStateKeyValueStorage) {
     LOG.info("Clearing bonsai flat account db");
     worldStateStorage.clearFlatDatabase();
+    worldStateStorage.clearTrieLog();
Contributor:

why do we need to do this?

Contributor Author:

We will not be able to go back before this pivot block. In addition, we could have bad trie logs, so it is preferable to heal them too. Finally, keeping old trie logs could cause a false positive in isWorldStateAvailable.
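A hypothetical illustration of the false-positive concern (not the real isWorldStateAvailable implementation):

  // If availability is judged by the presence of a trie log for a block hash,
  // a stale trie log left over from before the heal makes the worldstate look
  // reachable even though rolling to it would fail.
  boolean isWorldStateAvailable(final Hash blockHash) {
    return getTrieLogLayer(blockHash).isPresent(); // stale entry => false positive
  }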

          protocolContext.getBlockchain().getChainHeadHash().equals(pivotBlockHeader.getHash());
      if (!isValidChainHead) {
        if (protocolContext.getBlockchain().contains(pivotBlockHeader.getHash())) {
          protocolContext.getBlockchain().rewindToBlock(pivotBlockHeader.getHash());
Contributor:

Could the pivot block be > the max allowable bonsai trie layers away? If so, could this rewind take an unfeasible amount of time or maybe error? If so, then what state is the user left in and is it recoverable?

Contributor Author:

This part rewinds only the blockchain, not the worldstate.
If the pivot block is before the current head, we do a rewind; otherwise we do nothing. In any case the worldstate will be healed against the new pivot block without doing a rollback or roll-forward, because we cannot do that if it is corrupted.

It is important to choose a new pivot block and not to heal the current head, because we are not sure that block will remain on the canonical chain.
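An annotated sketch of the snippet quoted earlier in this thread, reflecting this explanation (comments added for illustration):

  final boolean isValidChainHead =
      protocolContext.getBlockchain().getChainHeadHash().equals(pivotBlockHeader.getHash());
  if (!isValidChainHead) {
    if (protocolContext.getBlockchain().contains(pivotBlockHeader.getHash())) {
      // Pivot is behind the current head: rewind the blockchain only. The
      // worldstate itself is healed separately against this pivot, since a
      // corrupted worldstate cannot be rolled back or forward.
      protocolContext.getBlockchain().rewindToBlock(pivotBlockHeader.getHash());
    }
    // Otherwise the pivot is ahead of the head: nothing to rewind, the
    // missing blocks will be downloaded instead.
  }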

@@ -82,9 +81,8 @@ public static Optional<FastSyncDownloader<?>> create(
   final FastSyncState fastSyncState =
       fastSyncStateStorage.loadState(ScheduleBasedBlockHeaderFunctions.create(protocolSchedule));

   if (isResync) {
-    worldStateStorage.clear();
Contributor:

why no need to clear the worldStateStorage now?

Contributor Author:

For a resync there is already a clear in the worldstate downloader (https://github.com/hyperledger/besu/pull/4972/files#diff-28d8c7a62dddded52099a727bf58aa864d58b055795b95ebfeee6ac8e2a8cea2R173),
so it is redundant to do it again @garyschulte.

@matkt (Contributor Author) commented Feb 5, 2023:

> Should ethereum/referencetests/src/reference-test/external-resources be changing as part of this PR?

Yes, good catch. I fixed that.

Signed-off-by: Karim TAAM <[email protected]>
@matkt (Contributor Author) commented Feb 6, 2023:

Because of some release difficulties with this PR, I created another one @garyschulte @siladu -> #5059

@matkt closed this Feb 6, 2023
@matkt mentioned this pull request Feb 6, 2023
@matkt deleted the feature/heal-worldstate branch July 5, 2023 07:23