Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Algod: support state proofs recoverability #4803

Merged
merged 125 commits into from
Mar 30, 2023

Conversation

id-ms
Copy link
Contributor

@id-ms id-ms commented Nov 16, 2022

Summary

The goal of this PR is to change the way nodes aggregate state proof information. In case of a problem with the state proof chain, nodes will not lose old state proof data so it will be possible to recover the stalled state proof chain.

High-level changes:

  • Persist proving data in the state proof database.
  • Create a new tracker that will help nodes to verify state proofs for old rounds.
  • Change the state proof creation policy in case of a stalled state proof chain.

Test Plan

Aharonee and others added 30 commits October 23, 2022 15:20
* [WIP] Migrate stateproof database to store builders and add PersistentBuilder type

* [WIP] Refactor builder and remove PersistentBuilder and dependency on VotersForRound

* [WIP] Add persistBuilders flag to stateproof.Worker, and delete old builders from database

* [WIP] .

* Finalize the builder struct, fix deleteOldBuilders and make msgpack

* Finalize implementation and fix tests

* Add and update unit tests to verify database is consistent with builders in memory

* rename stateproof top voters alloc bound name

* fx: removed comment

* refactor: fetchBuilder to treat errors from decoding.

* using sql.ErrNoRows

* rm: unused fields in stateproof/Worker

* fx: avoiding sql access for sigs in the future

* refactor: moving persisted fields into BuilderPersistingFields struct

* fx: stating that the worker is not a persistentBuilder in numerous unit-tests

* inspect sig is in db prior to signing in func signStateProof

* improved exists claus

* reviewdog-fix

* refactor: getBuilder func to not fill a ptr

* fx: fetchBuilderForRound behaviour

* fix: shutdown to worker after db-init in test

* fx: warning instead of error log

* rm: error var

* fx: separate SP DB schema upgrade into two functions

* adding unit test for db schema upgrade

* rm: close db inside makeStateProofDB

* improved unit-test

* renamed b as var name

* fx: comment

* fix: unneeded if statement in the sigExistsInDB

* fix: small bug in sigExistsInDv verify

* adding unit-test to ensure fetching from disk is performed

* adding unit-tests

* adding unit-test that catchs an issue

* fx: when fetching builder from DB - fill with sigs

* fx: fetchBuilderFromDB refactor

* fx:race

* reusing code

* fx: race

* fx: watch-dog issue

* fx: comment

* rename insertBuilder func

* improvement: reloading sigs-per-round

* removing unneeded field from func fillBuilder

* fix CR

* fixing edge cases with when no signatures are in the database

* support old databases having only signatures

* fix bug + work on tests

* adjusting tests

* close the query

* verify that builder can be loaded with no signatures

* remove the const flag for now

* fmt

Co-authored-by: Jonathan Weiss <[email protected]>
Co-authored-by: algoidan <[email protected]>
* added state proof verification tracker skeleton

* now saving data regarding state proofs on new block

* added test for state proof verification tracker

* Removed comments from interface functions

* added sp verification tracker to the ledger

* added license, returning the appropriate value from produceCommittingTask

* now handling 0 in StateProofInterval

* tracker test now test removal from tracker

* removal now works

* added db version and created sp verification table

* moved verification data type to ledgercode and generated encode functions

* added state proof round to verification data for db indexing

* added code to prune old state proofs

* added code to retrieve verification data from the tracker

* commented out some code, now also saving generated round in verification data

* now testing addition and committing

* import order fix

* now using deltas to decide whether a state proof has occurred in a block

* basic addition and removal work and are tested

* some test refactoring

* added test for chain in happy flow

* renamed and refactored a lot of code

* fixed emergent index usage bug

* added test with disabled state proofs

* changed variable usage for more readable code

* changed error reporting in tracker lookup

* changed lookup, added more commit tests

* increased confidence in produceCommittingTask

* added db pruning tests

* reordered tests, finished pruning tests

* handled initialization slice size

* fixed v6 migration test to not impact global state

* added state proof verification table creation to TestLedgerReloadTxTailHistoryAccess

* added missing table in ledger test

* all tests except interval change work

* fixed bug in receiving deletion data, added test for interval change

* changed tracker and refactored tests to not touch the dcr

* generated round no longer saved in db

* renamed delete data, split newBlock to functions

* reordered functions, now locking to protect internal data structures

* some lock reordering

* added necessary comment for compilation

* added comments where relevant

* closed todos in ledger

* some renaming

* fixed bug in deletion index check

* added tracker ledger test

* removed todo that is now a ticket

* wrapped verification data insert with db.retry

* added nil check on close

* updated dump_genesis.sh

* added lookup tests

* removed redundant passing of slice by reference

* CR: now testing data len to return

* CR: Renamed ProvenWeight to OnlineTotalWeight

* CR: separated imports

* CR: avoiding looking into future protocols in loadfromdisk

* CR: Now breaking to avoid redundant comparisons in an already sorted array

* CR: Removed redundant optimization from test.

* CR: added checks to verify that new blocks indeed arrive in order

* now panicking after logging invalid new block

* CR: No longer passing entire data structure where round would suffice

* removed usage of proto in loadfromdisk

* CR: now wrapping errors in lookup

* some lookup refactor

* improved error message

* missing parantheses

* error check fix

* CR: some refactors around verification data lookup

* CR: removed accountdb prefix from creation function

* CR: error refactoring

* ledger test now asserts more specific situations

* CR: added unit test for panic

* CR: moved locking for performance improvement

* CR: added load from disk test

* CR: moved reload test to ledger_tests

* CR: Removed redundant error check
* fix: stateprf deletes keys after stateproof advances.

* fixed unit test to match the current code of key deletion

* fix: all unit tests pass, added TODO

* fx: comment

* fx: deletion of previous rounds

* fx: unit-test

* fx: prevStateproof init value

* fx: golangcli

* fix merge issues.

* refactoring

* add test

* fix deletion bug

* fix tests

* refactoring tests

* reuse code on tests

* fix keys removal to use the next state proof

* builder will not remove signatures and builder to support recoverability

* refactoring + avoid remove if stateproof was not changed

* remove double test

* checking that the deletion round is not higher than expected

* minor refactor

* we now remove the last ephemeral key

* nits

* add test for deleting all keys in account manger

* move warning message to error

Co-authored-by: algoidan <[email protected]>
* add version to the verification structure and new consensus param

* refactor state proof verifcaion

* apply state proof using trakcer

* export stateproof verification data in LedgerForCowBase interface

* using tracker on tests.

* fix ledger test

* fix borken interface

* choose to use the tracker based on atRound proto version

* fix tests

* testing v34

* refactoring

* fix comments
* adding one memory slot for latest sp-verification-data

* refactor: changed if-else to one if clause

* loadFromDisk should fill lastSeenStateProofVerificationData from the disk

* added unit tests

* fix: caching latt lookup for future uses

* sql-fx

* fix: returning copy of the lastLookedUpVerificationData

* fix: checks cache location moved
removed unneeded sql lookup on reload from disk
fixed comments
fix: zero-cache is ignored

* removed comment from function
Comment on lines +255 to +260
// Since the map is small (Usually 0 - 2 elements) we decided to keep the code simple
// and check for deletion in every round.
func (vt *votersTracker) removeOldVoters(hdr bookkeeping.BlockHeader) {
lowestStateProofRound := stateproof.GetOldestExpectedStateProof(&hdr)

vt.votersMu.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetOldestExpectedStateProof is based on StateProofMaxRecoveryIntervals which is outdated.
Some of StateProofMaxRecoveryIntervals dependency mush be cleaned, maybe as a future work.

But, as part of the current consensus upgrade, perhaps the value of StateProofMaxRecoveryIntervals should be updated to something like 0, and tested.

Copy link
Contributor

@algonautshant algonautshant Mar 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@algonautshant will create a ticket for this.
Cleanup the code for holding the blocks, and check any limitations on the number of accounts held because of this.

Aharonee and others added 3 commits March 16, 2023 18:14
Fix for issue from nightly tests where tryBroadcast is holding the lock and preventing invokeBuilder from running.
algorandskiy
algorandskiy previously approved these changes Mar 20, 2023
@id-ms id-ms force-pushed the feature/stateproofs-recoverability branch from 85fc808 to 8bf459b Compare March 29, 2023 09:02
algonautshant
algonautshant previously approved these changes Mar 29, 2023
Copy link
Contributor

@algonautshant algonautshant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
I cannot find any actionable problems at this time.
Some pending tasks will be addressed in future PRs.

The complexity of this feature remains high, despite many simplifications.
These are mainly due to architectural interdependencies, such as the ldeger/ledgercore dependency, but this is out of the scope of this work.

algorandskiy
algorandskiy previously approved these changes Mar 29, 2023
@algorandskiy
Copy link
Contributor

@algoidan please remerge master. Someone just merged internal -> eval renaming :(

@id-ms id-ms dismissed stale reviews from algorandskiy and algonautshant via cd7dbcc March 30, 2023 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants