Add missing data to block_header_state and change verify_qc_claim to use block_header_state data rather than only looking up a block_state in the fork database #694
Problem
The final validations of verify_proper_block_exts actually verify that the attached QC is valid (see spring/libraries/chain/controller.cpp, lines 3973 to 3981 at commit a8159fe).

But notice that it assumes the claimed block can be found in the fork database. This is not always going to be true depending on how the node started up. If it started up from a snapshot, it is possible that a later block received from the network advances the QC claim relative to the claim made by the block from the snapshot (which is also the root of the fork database on startup), but to a block that is still an ancestor of that block from the snapshot.
For example, consider the following case:
The node processes the following blockchain, where each block is the child of the one before it:

B1 <-- B2 <-- B3 <-- B4 <-- B5 <-- B6
where:
B2 claims a strong QC on B1.
B3 claims a strong QC on B1.
B4 claims a strong QC on B2. (B4 makes B1 final.)
B5 claims a strong QC on B4. (B5 makes B2 final.)
B6 claims a strong QC on B5. (B6 makes B4 final.)
Let's say the node operator decided to take a snapshot on B3. After their node receives B6, B4 becomes final and the snapshot on B3 becomes available to the node operator as a valid snapshot.
Then the operator shuts down nodeos and decides to restart from the snapshot on B3.
After starting up from the snapshot, their node receives block B4 from the P2P network. Since B4 advances the QC claim relative to its parent (from a strong QC claimed on B1 to a strong QC claimed on B2), it must include a QC attached to justify its claim. It does in fact contain the strong QC on block B2, but how does this node verify the QC? It started with B3 as the root block of its fork database, so block B2 does not exist in the fork database.
The only information needed to validate a QC on a particular block, beyond the data included in the QC itself, is the following state associated with that particular block: its finality digest (which is also the digest to sign for strong votes and from which the digest to sign for weak votes is trivially derived); its active finalizer policy; and its pending finalizer policy (if present for that block).
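To make that concrete, here is a small sketch of the per-block state a QC validator needs, using simplified stand-in types rather than the actual Spring types:

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Simplified stand-ins for the real Spring types; shown only to make the
// required per-block state explicit.
using digest_sketch = std::string;   // placeholder for a 256-bit digest type

struct finalizer_policy_sketch {
   uint32_t generation = 0;          // finalizer policy generation (valid values start at 1)
   // ... finalizer public keys, weights, and threshold would go here ...
};

// Everything needed, besides the QC itself, to validate a QC claimed on a block:
struct qc_validation_context {
   digest_sketch                          finality_digest;  // digest signed by strong votes; the
                                                            // weak-vote digest is derived from it
   finalizer_policy_sketch                active_policy;    // active finalizer policy of that block
   std::optional<finalizer_policy_sketch> pending_policy;   // pending policy, if the block had one
};
```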
Consider a block B_new that has a QC in its QC block extension. This block makes a QC claim on an ancestor block B_claimed, and that QC claim must be better than the QC claim of B_new's parent block B_parent. B_parent makes a QC claim on its own ancestor block, which we can denote B_parent_claimed. So we know that B_parent_claimed is either B_claimed or an ancestor of B_claimed; note that it is possible for B_parent_claimed to be B_claimed in the case where B_parent makes a claim of a weak QC on B_parent_claimed (aka B_claimed) while B_new makes a claim of a strong QC on B_claimed (aka B_parent_claimed).
Fortunately, we know that the finality_core in the block_header_state of B_parent has the finality digests for all of the blocks that are ancestors of B_parent up to and including B_parent_claimed. And of course the finality digest of B_parent is known if the block_header_state of B_parent is available. So no matter which block B_claimed happens to be, we know that its finality digest is available using only information in the block_header_state of B_parent, which should of course be available when evaluating block B_new.

Unfortunately, the active (and possibly pending) finalizer policies as of block B_claimed are also needed to validate the QC for B_claimed (the one attached in the QC block extension of B_new). And it is possible that these policies for B_claimed are different from the ones for B_parent.

For example, consider the following case:
The node processes the following blockchain, where each block is the child of the one before it:

B1 <-- B2 <-- B3 <-- B4 <-- B5 <-- B6 <-- B7 <-- B8 <-- B9
where:
B1 has active finalizer policy P1 and no pending finalizer policy.
B1 proposes finalizer policy P2.
B2 claims a strong QC on B1.
B2 has active finalizer policy P1 and no pending finalizer policy.
B3 claims a strong QC on B2. (B3 makes B1 final.)
B3 has active finalizer policy P1 and has pending finalizer policy P2.
B4 claims a strong QC on B3. (B4 makes B2 final.)
B4 has active finalizer policy P1 and has pending finalizer policy P2.
B5 claims a strong QC on B3.
B5 has active finalizer policy P1 and has pending finalizer policy P2.
B6 claims a strong QC on B4. (B6 makes B3 final.)
B6 has active finalizer policy P2 and no pending finalizer policy.
(At this point, in the current implementation, policy P1 is lost from the block_header_state of B6, which is the source of the problem.)
B7 claims a strong QC on B5.
B7 has active finalizer policy P2 and no pending finalizer policy.
B8 claims a strong QC on B6. (B8 makes B4 final.)
B8 has active finalizer policy P2 and no pending finalizer policy.
B9 claims a strong QC on B8. (B9 makes B6 final.)
B9 has active finalizer policy P2 and no pending finalizer policy.
The node operator decided to take a snapshot on B6. After their node receives B9, B6 becomes final and the snapshot on B6 becomes available to the node operator as a valid snapshot.
Then the operator shuts down nodeos and decides to restart from the snapshot on B6.
After starting up from the snapshot, their node receives block B7 from the P2P network. Since B7 advances the QC claim relative to its parent (from a strong QC claimed on B4 to a strong QC claimed on B5), it must include a QC attached to justify its claim. It does in fact contain the strong QC on block B5, but how does this node verify the QC? It started with B6 as the root block of its fork database, so block B5 does not exist in the fork database.
Yes, the finality digest for B5 can be retrieved from the finality_core in the block_header_state for B6. But the block_header_state of B6 contains an active_finalizer_policy of policy P2 and it contains no pending_finalizer_policy. Not only does it not know the generation numbers of the active and pending (if present) finalizer policies of B5; even if it did know the generation numbers, it simply would no longer have policy P1, which it needs to validate the QC for block B5.

Solution
The solution is to augment the state tracked in block_header_state.

First, the block_ref in finality_core needs to be expanded to also track the generation number of the active finalizer policy and the generation number of the pending finalizer policy (if it is present). If the pending finalizer policy is not present, then a generation number of 0 can be used to represent the lack of presence (valid generation numbers for finalizer policies start from 1).
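For illustration, a sketch of what the expanded block_ref could look like, using simplified stand-in types (the real Spring struct has different field types and may differ in layout):

```cpp
#include <cstdint>
#include <string>

// Simplified stand-in for the expanded block_ref. The two generation fields are
// the addition proposed above; everything else is a placeholder for the fields
// block_ref already carries.
struct block_ref_sketch {
   std::string block_id;                        // placeholder for block_id_type
   std::string finality_digest;                 // placeholder for digest_type
   uint64_t    timestamp = 0;                   // placeholder for block_timestamp_type

   uint32_t    active_policy_generation  = 0;   // generation of the active finalizer policy at this block
   uint32_t    pending_policy_generation = 0;   // 0 means "no pending finalizer policy at this block"
};
```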
Second, all finalizer policies tracked in the finality_core starting with the latest QC claim block (that is, the block with block number latest_qc_claim().block_num) must be kept around in the block_header_state. However, due to how promotion of finalizer policies works, we know that the only finalizer policy that we would need to track that is not already tracked with the active_finalizer_policy or pending_finalizer_policy is the active finalizer policy of the latest QC claim block.
So we just need to store one extra member, called latest_qc_claim_block_active_finalizer_policy, in the block_header_state to track the active finalizer policy of the latest QC claim block. However, if the latest QC claim block's active finalizer policy is the same as the current block's active finalizer policy (as determined by comparing their generation numbers for equality), then we should not store the redundant finalizer policy (it would be null) so that we do not serialize it out to disk.
We should add the following member function to block_header_state:
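A sketch of one plausible shape for such an accessor, written with simplified stand-in types and a hypothetical name (get_policies_for_qc_claim); the actual function may differ. It looks up a policy by generation number among the policies the block_header_state already tracks after this change:

```cpp
#include <cstdint>
#include <memory>
#include <stdexcept>

// Same simplified stand-in as in the earlier sketch.
struct finalizer_policy_sketch {
   uint32_t generation = 0;   // valid generations start at 1
   // ... finalizer public keys, weights, threshold, etc. ...
};

using policy_ptr = std::shared_ptr<const finalizer_policy_sketch>;

struct policies_for_qc_claim {
   policy_ptr active;    // active policy of the claimed block (never null on success)
   policy_ptr pending;   // pending policy of the claimed block, or null if it had none
};

// Hypothetical block_header_state accessor, written here as a free function over
// the three policies the block_header_state tracks after this change:
//   - the current active policy,
//   - the current pending policy (may be null),
//   - latest_qc_claim_block_active_finalizer_policy (null when redundant).
policies_for_qc_claim get_policies_for_qc_claim(uint32_t   claimed_active_gen,
                                                uint32_t   claimed_pending_gen, // 0 => none
                                                policy_ptr active,
                                                policy_ptr pending,
                                                policy_ptr latest_qc_claim_block_active)
{
   auto find = [&](uint32_t gen) -> policy_ptr {
      if (active && active->generation == gen) return active;
      if (pending && pending->generation == gen) return pending;
      if (latest_qc_claim_block_active && latest_qc_claim_block_active->generation == gen)
         return latest_qc_claim_block_active;
      return nullptr;
   };

   policies_for_qc_claim result;
   result.active = find(claimed_active_gen);
   if (!result.active)
      throw std::runtime_error("active finalizer policy of claimed block is not tracked");
   if (claimed_pending_gen != 0) {
      result.pending = find(claimed_pending_gen);
      if (!result.pending)
         throw std::runtime_error("pending finalizer policy of claimed block is not tracked");
   }
   return result;
}
```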
And we should add the following member function to block_state:
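One plausible shape for such a helper, continuing the simplified stand-ins from the previous sketches and using a hypothetical name (get_qc_verification_data): given the block number claimed by a QC, it returns that block's finality digest together with its finalizer policies, using only the finality_core and the policies tracked in the block_header_state:

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Continues the stand-ins from the earlier sketches:
//   block_ref_sketch        (finality digest + policy generation numbers per block)
//   policies_for_qc_claim   (active + optional pending policy of the claimed block)

struct qc_verification_data {
   std::string           finality_digest;  // of the claimed ancestor block
   policies_for_qc_claim policies;         // its active (and pending, if any) finalizer policies
};

// Hypothetical block_state helper (shown as a free function). `refs` stands in for
// the block_refs kept by the finality_core, ordered by block number starting at
// `first_block_num` (the latest QC claim block of the parent state).
qc_verification_data get_qc_verification_data(
   const std::vector<block_ref_sketch>& refs,
   uint32_t                             first_block_num,
   uint32_t                             claimed_block_num,
   const std::function<policies_for_qc_claim(uint32_t, uint32_t)>& lookup_policies)
{
   if (claimed_block_num < first_block_num ||
       claimed_block_num >= first_block_num + refs.size())
      throw std::runtime_error("claimed block is not covered by the finality_core");

   const block_ref_sketch& ref = refs[claimed_block_num - first_block_num];

   return qc_verification_data{
      ref.finality_digest,
      lookup_policies(ref.active_policy_generation, ref.pending_policy_generation)
   };
}
```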
When constructing the next block_header_state in finish_next, we need to make sure the new latest_qc_claim_block_active_finalizer_policy member is properly set. This can be done with the following code:
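One way the finish_next logic could look, as a sketch over the stand-in types from the earlier snippets; the field and function names here are hypothetical, not the actual Spring code:

```cpp
#include <cstdint>
#include <initializer_list>
#include <memory>
#include <stdexcept>

// Sketch of the finish_next logic for the new member, using policy_ptr from the
// earlier sketch. `next` is the block_header_state being built; `parent` is the
// previous one. Field names are hypothetical stand-ins.
struct header_state_sketch {
   policy_ptr active_finalizer_policy;
   policy_ptr pending_finalizer_policy;                       // may be null
   policy_ptr latest_qc_claim_block_active_finalizer_policy;  // may be null (redundant case)
   uint32_t   latest_qc_claim_block_active_generation = 0;    // from the core's block_ref
};

void set_latest_qc_claim_block_active_policy(header_state_sketch& next,
                                             const header_state_sketch& parent)
{
   const uint32_t claim_gen = next.latest_qc_claim_block_active_generation;

   if (next.active_finalizer_policy && next.active_finalizer_policy->generation == claim_gen) {
      // Same generation as the new block's active policy: do not store it redundantly.
      next.latest_qc_claim_block_active_finalizer_policy = nullptr;
      return;
   }

   // Otherwise the policy must already be tracked somewhere in the parent's state.
   for (const policy_ptr& p : { parent.active_finalizer_policy,
                                parent.pending_finalizer_policy,
                                parent.latest_qc_claim_block_active_finalizer_policy }) {
      if (p && p->generation == claim_gen) {
         next.latest_qc_claim_block_active_finalizer_policy = p;
         return;
      }
   }
   throw std::runtime_error("active policy of latest QC claim block is not tracked by parent state");
}
```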
With the appropriate information now in the new block_header_state, the code at the end of verify_proper_block_exts that verifies the attached QC should be modified to grab the necessary data from the block_header_state of the parent block rather than looking up the claimed block in the fork database. This may require computing the necessary weak digest of the claimed block on the fly from the finality digest found within the finality_core.
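A sketch of that modified verification flow under the same assumptions; make_weak_digest and check_signatures stand for primitives assumed to exist in some form, and the types continue the earlier stand-ins. This is not the actual controller code:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins: a minimal QC, plus two assumed primitives for deriving
// the weak-vote digest and checking a QC's signatures against a finalizer policy.
struct qc_sketch {
   uint32_t block_num = 0;   // the claimed block
   bool     is_strong = true;
   // ... aggregate signature and vote bitsets ...
};

std::string make_weak_digest(const std::string& finality_digest);               // assumed helper
bool check_signatures(const qc_sketch& qc, const finalizer_policy_sketch& policy,
                      const std::string& digest);                               // assumed helper

// Everything comes from the parent's block_header_state (via the
// get_qc_verification_data sketch above), not from the fork database.
// Simplified: a real QC may mix strong and weak votes, each checked against its
// own digest; here a single digest is chosen for the whole QC.
void verify_qc_claim_sketch(const qc_sketch& qc, const qc_verification_data& data)
{
   const std::string digest = qc.is_strong ? data.finality_digest
                                           : make_weak_digest(data.finality_digest);

   if (!check_signatures(qc, *data.policies.active, digest))
      throw std::runtime_error("QC signatures do not satisfy the claimed block's active policy");

   // If the claimed block had a pending policy, the QC must satisfy it as well.
   if (data.policies.pending && !check_signatures(qc, *data.policies.pending, digest))
      throw std::runtime_error("QC signatures do not satisfy the claimed block's pending policy");
}
```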
The two test cases described in the Problem section above should also be captured in tests to verify these changes work.
Compatibility concerns
Breaking change
Spring v1.0.0 has already been released with the existing structure of block_header_state. So changing the serialization of block_header_state would be a breaking change which impacts the serialized data format in the fork_db.dat and chain_head.dat files as well as in the v7 snapshots generated by Spring v1.0.0.

Normally, we would not introduce such breaking changes in a patch release (or even a minor release), but the nature of this bug forces us to get the fix out sooner as a patch release. So we will be forced to break semver standards on this patch release and introduce the breaking change.
This means that the next release of Spring that includes this bug fix, assumed to be Spring v1.0.1, will not be able to start up normally using the state in the data directory processed by a Spring v1.0.0 (or earlier) node. Starting from a snapshot will be required (or, of course, syncing or replaying from genesis is always an option).
So we should bump the version of the fork database with this change so that we can distinguish on startup between the new supported fork database and the old unsupported fork database. We can fail to start up with a meaningful error message if the fork database is on the old unsupported version.
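A minimal sketch of such a startup check; the version constants and error text are illustrative assumptions, not the actual fork database code:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Illustrative version constants; the real values live in the fork database code.
constexpr uint32_t fork_db_version_spring_1_0_0 = 2;   // assumed old version number
constexpr uint32_t fork_db_version_with_fix     = 3;   // assumed bumped version number

void check_fork_db_version(uint32_t version_read_from_file)
{
   if (version_read_from_file == fork_db_version_with_fix)
      return;   // supported format

   throw std::runtime_error(
      "fork_db.dat was written by an unsupported version (" + std::to_string(version_read_from_file) +
      "); restart from a snapshot or replay from genesis with this release");
}
```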
There is still a possibility that the data directory contains no fork database but it includes a chain_head.dat file. This would be the case after starting from a snapshot but shutting down nodeos before it finishes replaying all of the irreversible blocks in the block log. So in that scenario, the chain_head.dat file could also be the old version. We do not have an official versioning scheme for the chain_head.dat file, but a variant is serialized to distinguish between the legacy (tag of 0) and Savanna (tag of 1) versions of the block state. So we could use that to detect that this is an old (unsupported) version of the chain_head.dat file and fail to start up with a meaningful error message. Additionally, we should use this opportunity to serialize the new chain_head.dat with version information for the future. However, we should pick a version scheme that makes it unambiguous whether we are dealing with the new format or the old (v1.0.0) format.

There is no issue with starting up the Spring v1.0.1 node from a v6 (or earlier) snapshot. But there is a catch with trying to start it with a v7 snapshot.
Problems with v7 snapshots
The problem with the v7 snapshot is that the block_header_state (not block_header_state_legacy) contained within may not have the necessary information expected by the v1.0.1 (or later) software. In particular, it would be missing the active and pending generation numbers for past blocks (back to the block of the latest QC claim), and it would be missing any of the relevant finalizer policies other than the active and pending ones.

There are only limited circumstances in which this information can be reliably recovered despite not being in the snapshot:
1. The snapshot was taken on a Legacy block (prior to the Savanna transition), so there is no block_header_state to worry about at all.
2. The snapshot was taken on a block during the transition to Savanna, which does have a block_header_state, and furthermore all the transition blocks have no pending policy and have the genesis finalizer policy as their active finalizer policy.
3. The snapshot was taken on a Savanna block after the transition but before the first proposed finalizer policy became pending, in which case the old block_header_state tracks all the necessary information.

Note that the first case of Legacy blocks means that it is always safe to start up with a v6 (or earlier) snapshot in v1.0.1 because v6 (or earlier) snapshots can only be taken on Legacy blocks.
Also note that some cases may appear to provide the necessary information to enable reliable recovery of the missing data in the new block_header_state, but actually they do not.

For example, consider the case where the snapshot is of a block that has an active finalizer policy with a generation of 1 but also had a pending finalizer policy. All finalizer policies that could be relevant to validating QCs going forward are in fact being tracked by the old block_header_state in that case. However, what is missing is whether a past block had that pending finalizer policy or had no pending finalizer policy. That single bit of information per past block is unavailable in the old block_header_state and cannot be reliably recovered.
In cases where all information for the new block_header_state can be reliably recovered given a v7 snapshot, nodeos could allow loading from the v7 snapshot and automatically reconstruct the new block_header_state from the old block_header_state in the snapshot by filling in the missing information. But then that would mean that Spring v1.0.1 and later would only support loading v7 snapshots taken on blocks before the transition to Savanna, or taken on blocks after the transition to Savanna but before the first proposed finalizer policy became pending (which would probably be not very long after the transition to Savanna happened).

To simplify both the changes in this issue and the explanation to users, we prefer to just disallow loading v7 snapshots going forward. Other than syncing/replaying from genesis, users will be forced to start a Spring v1.0.1 node using either a new v8 snapshot or the old v6 (or earlier) snapshots. But v7 snapshots would not be supported going forward.
Spring v1.0.1 would generate the new v8 snapshots that include the new block_header_state with all the necessary information.

Consensus protocol compatible
The solution requires making changes to the structure of block_header_state. But we intend to not change the consensus protocol with the changes required to fix the bug in this issue.
So it is important that the changes to block_header_state are done in a way that does not accidentally change how the base_digest or reversible_blocks_mroot are computed, or in general anything else that could break consensus protocol compatibility with Spring 1.0.0.
For reversible_blocks_mroot, it should be easier to guarantee there is no change because finality_core::get_reversible_blocks_mroot constructs the block_ref_digest_data structs needed to compute the digest of each leaf node in the Merkle tree. So the structure of block_ref_digest_data should not change even though the block_ref struct will have more information in it.
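A sketch of why this holds, continuing the block_ref_sketch stand-in from earlier: the leaf-digest struct copies only the original fields, so the new generation numbers never enter the Merkle tree computation. Field names are illustrative, not the exact Spring definitions:

```cpp
#include <cstdint>
#include <string>

// Illustrative only: block_ref_digest_data keeps exactly the fields it had in
// Spring v1.0.0, so digests of Merkle tree leaves are unchanged even though
// block_ref (block_ref_sketch here) gained two generation-number fields.
struct block_ref_digest_data_sketch {
   std::string block_id;          // placeholder for block_id_type
   uint64_t    timestamp = 0;     // placeholder for block_timestamp_type
   std::string finality_digest;   // placeholder for digest_type
   // NOTE: deliberately no active/pending policy generation fields.
};

block_ref_digest_data_sketch make_leaf_data(const block_ref_sketch& ref)
{
   // Copy only the original subset of fields.
   return block_ref_digest_data_sketch{ ref.block_id, ref.timestamp, ref.finality_digest };
}
```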
But to guarantee base_digest does not change, specific changes are needed in compute_base_digest. Specifically, it is no longer okay to simply pack the core using the default serialization supported by the fc macros. A new serialization of core is needed specifically for use in compute_base_digest that preserves the serialization used in Spring 1.0.0.
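A sketch of the idea, using a stand-in byte buffer rather than fc's real packing machinery (the length-prefix encoding below is a simplification, not fc's format): serialize only the fields that existed in Spring v1.0.0, in their original order, and skip the new fields.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in byte sink; in the real code this would be fc's datastream/pack machinery.
using byte_buffer = std::vector<char>;

void append_bytes(byte_buffer& buf, const void* data, std::size_t len)
{
   const char* p = static_cast<const char*>(data);
   buf.insert(buf.end(), p, p + len);
}

void append_string(byte_buffer& buf, const std::string& s)
{
   uint64_t len = s.size();                    // simplified length prefix (fc uses a varint)
   append_bytes(buf, &len, sizeof(len));
   append_bytes(buf, s.data(), s.size());
}

// Legacy-compatible serialization of one block_ref for base_digest purposes:
// pack only the Spring v1.0.0 fields, in their original order, and omit the new
// generation-number fields. (block_ref_sketch is the stand-in from earlier.)
void pack_block_ref_v1_0_0(byte_buffer& buf, const block_ref_sketch& ref)
{
   append_string(buf, ref.block_id);
   append_bytes(buf, &ref.timestamp, sizeof(ref.timestamp));
   append_string(buf, ref.finality_digest);
   // active_policy_generation and pending_policy_generation are intentionally omitted.
}
```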
To be confident that these changes have not accidentally changed the consensus protocol, a new test should be added to check for regression in consensus compatibility.

The test could store a reference blockchain (with QCs) generated and serialized (and committed to source control) at some earlier point in time. Then it could read that blockchain into a nodeos (replay it) and use it as a source of blocks from which another nodeos instance can sync those blocks. The test should involve both a replay and a sync to test both of those code paths in controller.

We would not normally commit that reference blockchain to source control. The test/tool could allow us to manually regenerate it if necessary, but we would really only need to do it once (or whenever we want to change the nature of the reference blockchain for the test case). Then the test, when it normally runs, just ensures it can validate that blockchain. However, it would be good for the test to generate the reference blockchain anyway, even though it is going to throw it away under typical use, just as an ongoing check in our CI to ensure the reference blockchain generation portion of the test does not break or regress.
This test should ideally be done with the Boost testing framework so that the generation of the reference blockchain is deterministic, but an integration test is also acceptable. However, if we can guarantee the generation of the reference blockchain is deterministic, then it provides other advantages. First, anytime we need to change the reference blockchain (which remember is binary data committed to source control), a reviewer could easily validate the changed data in a pull request by regenerating it themself locally. Second, it allows for the test to include a final check at the end where it compares the block ID of the reference chain it generates to the one that was in the source control to ensure they are the same.
With such a test implemented, to make sure we do not have an accidental consensus change as part of the work for this issue, it would be critical to generate the reference blockchain with the Spring implementation prior to the changes of this issue (e.g. using a branch built off of the Spring v1.0.0 tag with only this consensus regression test added on top). Then the test can act as a check that no accidental consensus change is introduced as one works on the changes required for this issue.