Add missing data to block_header_state and change verify_qc_claim to use block_header_state data rather than only looking up a block_state in the fork database #694

Closed
arhag opened this issue Sep 4, 2024 · 2 comments · Fixed by #714, #711, #719 or #752

arhag (Member) commented Sep 4, 2024

Problem

The final validations of verify_proper_block_exts actually verify that the attached QC is valid:

// find the claimed block's block state on branch of id
auto bsp = fork_db_fetch_bsp_on_branch_by_num( prev.id(), new_qc_claim.block_num );
EOS_ASSERT( bsp,
            invalid_qc_claim,
            "Block state was not found in forkdb for block_num ${q}. Block number: ${b}",
            ("q", new_qc_claim.block_num)("b", block_num) );
// verify the QC proof against the claimed block
bsp->verify_qc(qc_proof);

But notice that it assumes the claimed block can be found in the fork database. This is not always going to be true depending on how the node started up. If it started up from a snapshot, it is possible that a later block received from the network advances the QC claim relative to the claim made by the block from the snapshot (which is also the root of the fork database on startup) but to a block that is still an ancestor of that block from the snapshot.

For example, consider the following case:

The node processes the following blockchain:

<- B1 <- B2 <- B3 <- B4 <- B5 <- B6

where:
B2 claims a strong QC on B1.
B3 claims a strong QC on B1.
B4 claims a strong QC on B2. (B4 makes B1 final.)
B5 claims a strong QC on B4. (B5 makes B2 final.)
B6 claims a strong QC on B5. (B6 makes B4 final.)

Let's say the node operator decided to take a snapshot on B3. After their node receives B6, B4 becomes final and the snapshot on B3 becomes available to the node operator as a valid snapshot.

Then the operator shuts down nodeos and decides to restart from the snapshot on B3.

After starting up from the snapshot, their node receives block B4 from the P2P network. Since B4 advances the QC claim relative to its parent (from a strong QC claimed on B1 to a strong QC claimed on B2), it must include a QC attached to justify its claim. It does in fact contain the strong QC on block B2, but how does this node verify the QC? It started with B3 as the root block of its fork database, so block B2 does not exist in the fork database.

The only information needed to validate a QC on a particular block, beyond the data included in the QC itself, is the following state associated with that particular block: its finality digest (which is also the digest to sign for strong votes and it trivially derives the digest to sign for weak votes); its active finalizer policy; and its pending finalizer policy (if present for that block).

Consider a block B_new that has a QC in its QC block extension. This block makes a QC claim on an ancestor block B_claimed. And that QC claim must be better than the QC claim of B_new's parent block B_parent. B_parent makes a QC claim on its ancestor block which we can denote B_parent_claimed. So we know that B_parent_claimed is either B_claimed or an ancestor of B_claimed; note that it is possible for B_parent_claimed to be B_claimed in the case where B_parent claims a weak QC on B_parent_claimed (aka B_claimed) while B_new claims a strong QC on B_claimed (aka B_parent_claimed).

Fortunately, we know that the finality_core in the block_header_state of B_parent has the finality digests for all of the blocks that are ancestors of B_parent up to and including B_parent_claimed. And of course the finality digest of B_parent itself is known if the block_header_state of B_parent is available. So no matter which block B_claimed happens to be, its finality digest is known using only information available in the block_header_state of B_parent, which should of course be available when evaluating block B_new.

Unfortunately, the active (and possibly pending) finalizer policies as of block B_claimed are also needed to validate the QC for B_claimed (the one attached in the QC block extension of B_new). And it might be possible that these policies for B_claimed are different than the ones for B_parent.

For example, consider the following case:

The node processes the following blockchain:

<- B1 <- B2 <- B3 <- B4 <- B5 <- B6 <- B7 <- B8 <- B9

where:

B1 has active finalizer policy P1 and no pending finalizer policy.
B1 proposes finalizer policy P2.

B2 claims a strong QC on B1.
B2 has active finalizer policy P1 and no pending finalizer policy.

B3 claims a strong QC on B2. (B3 makes B1 final.)
B3 has active finalizer policy P1 and has pending finalizer policy P2.

B4 claims a strong QC on B3. (B4 makes B2 final.)
B4 has active finalizer policy P1 and has pending finalizer policy P2.

B5 claims a strong QC on B3.
B5 has active finalizer policy P1 and has pending finalizer policy P2.

B6 claims a strong QC on B4. (B6 makes B3 final.)
B6 has active finalizer policy P1 and no pending finalizer policy.
(At this point, in the current implementation, policy P1 is lost from the block_header_state of B6, which is the source of the problem.)

B7 claims a strong QC on B5.
B7 has active finalizer policy P2 and no pending finalizer policy.

B8 claims a strong QC on B6. (B8 makes B4 final.)
B8 has active finalizer policy P2 and no pending finalizer policy.

B9 claims a strong QC on B8. (B9 makes B6 final.)
B9 has active finalizer policy P2 and no pending finalizer policy.

The node operator decided to take a snapshot on B6. After their node receives B9, B6 becomes final and the snapshot on B6 becomes available to the node operator as a valid snapshot.

Then the operator shuts down nodeos and decides to restart from the snapshot on B6.

After starting up from the snapshot, their node receives block B7 from the P2P network. Since B7 advances the QC claim relative to its parent (from a strong QC claimed on B4 to a strong QC claimed on B5), it must include a QC attached to justify its claim. It does in fact contain the strong QC on block B5, but how does this node verify the QC? It started with B6 as the root block of its fork database, so block B5 does not exist in the fork database.

Yes, the finality digest for B5 can be retrieved from the finality_core in the block_header_state for B6. But the block_header_state of B6 contains an active_finalizer_policy of policy P2 and it contains no pending_finalizer_policy. Not only does it not know the generation numbers for the active and pending (if present) finalizer policies of B5, but even if it did know the generation numbers, it simply no longer has policy P1, which it needs to validate the QC for block B5.

Solution

The solution is to augment the state tracked in block_header_state.

First, the block_ref in finality_core needs to be expanded to also track the generation number of the active finalizer policy and the generation number of the pending finalizer policy (if it is present). If the pending finalizer policy is not present, then a generation number of 0 can be used to represent the lack of presence (valid generation numbers for finalizer policies start from 1).
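
For illustration, the expanded block_ref could look something like the following (assuming, for illustration, that block_ref currently carries the block id, timestamp, and finality digest; the new field names are placeholders, not a final design):

struct block_ref {
   block_id_type        block_id;
   block_timestamp_type timestamp;
   digest_type          finality_digest;

   // New in this issue: finalizer policy generation numbers as of this block.
   uint32_t             active_policy_generation  = 0;
   uint32_t             pending_policy_generation = 0; // 0 means no pending finalizer policy
};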

Second, all finalizer policies tracked in the finality_core starting with the latest QC claim block (that is, the block with block number latest_qc_claim().block_num) must be kept around in the block_header_state. However, due to how promotion of finalizer policies works, we know that the only finalizer policy we would need to track that is not already tracked with the active_finalizer_policy or pending_finalizer_policy is the active finalizer policy of the latest QC claim block.

So we just need to store one extra member, called latest_qc_claim_block_active_finalizer_policy, in the block_header_state to track the active finalizer policy of the latest QC claim block. However, if the latest QC claim block's active finalizer policy is the same as the current block's active finalizer policy (as determined by comparing their generation numbers for equality), then we should not store the redundant finalizer policy (it would be null) so that we do not serialize it out to disk.
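
That is, block_header_state would gain one new, possibly null, member along these lines (a sketch; existing members elided):

struct block_header_state {
   // ... existing members (core, active_finalizer_policy, pending_finalizer_policy, etc.) ...

   // Active finalizer policy of the latest QC claim block. Null whenever its
   // generation equals active_finalizer_policy->generation, so that no
   // redundant policy is serialized to disk.
   finalizer_policy_ptr latest_qc_claim_block_active_finalizer_policy;
};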

We should add the following member function to block_header_state:

struct finalizer_policies_t {
   digest_type          finality_digest;
   finalizer_policy_ptr active_finalizer_policy; // Never null
   finalizer_policy_ptr pending_finalizer_policy; // Only null if the block has no pending finalizer policy
};

// Only defined for core.latest_qc_claim().block_num <= num <= core.current_block_num()
finalizer_policies_t get_finalizer_policies(block_num_type num);
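
A minimal sketch of how get_finalizer_policies could be implemented with the augmented state (the accessor core.get_block_reference, the block_ref field names, and compute_finality_digest are assumptions here, and pending_finalizer_policy is treated as a plain, possibly null, finalizer_policy_ptr for simplicity):

finalizer_policies_t block_header_state::get_finalizer_policies(block_num_type num) const {
   assert(core.latest_qc_claim().block_num <= num && num <= core.current_block_num());

   if (num == core.current_block_num()) {
      // The current block's own digest and policies are directly available.
      return { compute_finality_digest(), active_finalizer_policy, pending_finalizer_policy };
   }

   const block_ref& ref = core.get_block_reference(num); // assumed accessor into core's refs

   finalizer_policies_t res;
   res.finality_digest = ref.finality_digest;

   // Per the argument above, the active policy at num is either the current
   // active policy or the one kept for the latest QC claim block.
   if (ref.active_policy_generation == active_finalizer_policy->generation) {
      res.active_finalizer_policy = active_finalizer_policy;
   } else {
      res.active_finalizer_policy = latest_qc_claim_block_active_finalizer_policy;
   }

   // The pending policy at num, if any, is either still pending now or has
   // since been promoted and is now the current active policy.
   if (ref.pending_policy_generation == 0) {
      res.pending_finalizer_policy = nullptr;
   } else if (pending_finalizer_policy &&
              ref.pending_policy_generation == pending_finalizer_policy->generation) {
      res.pending_finalizer_policy = pending_finalizer_policy;
   } else {
      res.pending_finalizer_policy = active_finalizer_policy;
   }
   return res;
}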

And we should add the following member function to block_state:

// Defined for core.last_final_block_num() <= num <= core.current_block_num()
uint32_t get_active_finalizer_policy_generation(block_num_type num);
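
With the generation numbers now recorded in each block_ref, this can simply read them back (same assumed names as the sketch above):

uint32_t block_state::get_active_finalizer_policy_generation(block_num_type num) const {
   assert(core.last_final_block_num() <= num && num <= core.current_block_num());
   if (num == core.current_block_num())
      return active_finalizer_policy->generation;
   return core.get_block_reference(num).active_policy_generation;
}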

When constructing the next block_header_state in finish_next, we need to make sure the new latest_qc_claim_block_active_finalizer_policy member is properly set. This can be done with the following code:

const auto latest_qc_claim_block_num = next_header_state.core.latest_qc_claim().block_num;
const auto active_generation_num = next_header_state.active_finalizer_policy->generation;
if (prev.get_active_finalizer_policy_generation(latest_qc_claim_block_num) != active_generation_num) {
   next_header_state.latest_qc_claim_block_active_finalizer_policy = prev.get_finalizer_policies(latest_qc_claim_block_num).active_finalizer_policy;
} else {
   next_header_state.latest_qc_claim_block_active_finalizer_policy = nullptr;
}

With the appropriate information now in the new block_header_state, the code at the end of verify_proper_block_exts that verifies the attached QC should be modified to grab the necessary data from the block_header_state of the parent block rather than looking up the claimed block in the fork database. This may require computing the necessary weak digest of the claimed block on the fly from the finality digest found within the finality_core.
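
A rough sketch of what the end of verify_proper_block_exts could look like after the change (the verification call below is a simplified stand-in for what block_state::verify_qc does today against the active and, if present, pending policies; the weak digest helper name is an assumption):

// Resolve the data needed to verify the attached QC from the parent's
// block_header_state instead of looking the claimed block up in the fork database.
const auto policies = prev.get_finalizer_policies(new_qc_claim.block_num);

// The strong digest is the claimed block's finality digest; the weak digest is
// derived from it on the fly.
const auto strong_digest = policies.finality_digest;
const auto weak_digest   = create_weak_digest(strong_digest);

// Simplified stand-in for the existing QC verification logic.
verify_qc(qc_proof, policies.active_finalizer_policy, policies.pending_finalizer_policy,
          strong_digest, weak_digest);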

The two test cases described in the Problem section above should also be captured in tests to verify that these changes work.

Compatibility concerns

Breaking change

Spring v1.0.0 has already been released with the existing structure of block_header_state. So changing the serialization of block_header_state would be a breaking change which impacts the serialized data format in the fork_db.dat and chain_head.dat files as well as in the v7 snapshots generated by Spring v1.0.0.

Normally, we would not introduce such breaking changes in a patch release (or even a minor release), but the nature of this bug forces us to get the fix out sooner as a patch release. So we will be forced to break semver standards on this patch release and introduce the breaking change.

This means that the next release of Spring that includes this bug fix, assumed to be Spring v1.0.1, will not be able to start up normally using the state in a data directory produced by a Spring v1.0.0 (or earlier) node. Starting from a snapshot will be required (or, of course, syncing / replaying from genesis is always an option).

So we should bump the version of the fork database with this change so that we can distinguish on startup between the new supported fork database and the old unsupported fork database. We can fail to start up with a meaningful error message if the fork database is on the old unsupported version.
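
For example, the startup path could check the on-disk version and refuse to proceed with a clear message (exception type and wording here are illustrative):

// Reject fork_db.dat files written by Spring v1.0.0 or earlier with a meaningful error.
EOS_ASSERT( version >= min_supported_version && version <= max_supported_version,
            fork_database_exception,
            "Unsupported version ${v} of fork database file ${f}; "
            "please remove it and restart from a snapshot or replay from genesis",
            ("v", version)("f", fork_db_file) );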

There is still a possibility that the data directory contains no fork database but does include a chain_head.dat file. This would be the case after starting from a snapshot but shutting down nodeos before it finishes replaying all of the irreversible blocks in the block log. In that scenario, the chain_head.dat file could also be the old version. We do not have an official versioning scheme for the chain_head.dat file, but a variant is serialized to distinguish between the legacy (tag of 0) and Savanna (tag of 1) versions of the block state. So we could use that to detect that this is an old (unsupported) version of the chain_head.dat file and fail to start up with a meaningful error message. Additionally, we should use this opportunity to serialize the new chain_head.dat with version information for the future. However, we should pick a version scheme that makes it unambiguous whether we are dealing with the new format or the old (v1.0.0) format.
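
One way to make the distinction unambiguous (purely illustrative; the actual scheme is left to the implementation) is to prefix the new chain_head.dat with a magic value whose first serialized byte can never be the 0 or 1 variant tag that begins the old format:

// New chain_head.dat layout (sketch): magic and version, then the block state variant.
constexpr uint32_t chain_head_magic   = 0xACE0ACE0; // assumed value; first serialized byte is 0xE0
constexpr uint32_t chain_head_version = 1;

fc::raw::pack(ds, chain_head_magic);   // ds: the output stream for chain_head.dat
fc::raw::pack(ds, chain_head_version);
fc::raw::pack(ds, chain_head);         // the block state variant, as before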

There is no issue with starting up a Spring v1.0.1 node from a v6 (or earlier) snapshot. But there is a catch with trying to start it with a v7 snapshot.

Problems with v7 snapshots

The problem with the v7 snapshot is that the block_header_state (not block_header_state_legacy) contained within it may not have the necessary information expected by the v1.0.1 (or later) software. In particular, it would be missing the active and pending generation numbers for past blocks (back to the block of the latest QC claim), and it would be missing any relevant finalizer policies other than the active and pending ones.

There are only limited circumstances in which this information can be reliably recovered despite not being in the snapshot:

  • If the snapshot is of a Legacy block, meaning a block prior to the start of the transition to Savanna, then there is no block_header_state to worry about at all.
  • If the snapshot is of a Transition block, then given how the consensus algorithm works, we know that the only finalizer policy that exists at that point is the genesis finalizer policy, which necessarily has a generation number of 1 and is necessarily tracked in the block_header_state. Furthermore, all transition blocks have no pending policy and have the genesis finalizer policy as their active finalizer policy.
  • If the snapshot is of a Proper Savanna block which still has an active finalizer policy with a generation of 1 and no pending finalizer policy, then we also know that block_header_state tracks all the necessary information.

Note that the first case of Legacy blocks means that it is always safe to start up v1.0.1 with a v6 (or earlier) snapshot because v6 (or earlier) snapshots can only be taken on Legacy blocks.

Also note that some cases may appear to provide the necessary information to enable reliable recovery of the missing data in the new block_header_state, but they actually do not.

For example, consider the case where the snapshot is of a block that has an active finalizer policy with a generation of 1 but also has a pending finalizer policy. All finalizer policies that could be relevant to validating QCs going forward are in fact tracked by the old block_header_state in that case. However, what is missing is whether a past block had that pending finalizer policy or had no pending finalizer policy. That single bit of information per past block is unavailable in the old block_header_state and cannot be reliably recovered.

In cases where all information for the new block_header_state can be reliably recovered from a v7 snapshot, nodeos could allow loading the v7 snapshot and automatically reconstruct the new block_header_state from the old block_header_state in the snapshot by filling in the missing information. But that would mean Spring v1.0.1 and later would only support loading v7 snapshots taken on blocks before the transition to Savanna, or taken on blocks after the transition to Savanna but before the first proposed finalizer policy became pending (which would probably be not very long after the transition to Savanna happened).

To simplify both the changes in this issue and the explanation to users, we prefer to just disallow loading v7 snapshots going forward. Other than syncing/replaying from genesis, users will be forced to start a Spring v1.0.1 node using either a new v8 snapshot or the old v6 (or earlier) snapshots. But v7 snapshots would not be supported going forward.

Spring v1.0.1 would generate the new v8 snapshots that include the new block_header_state with all the necessary information.

Consensus protocol compatible

The solution requires making changes to the structure of block_header_state. But we intend to not change the consensus protocol with the changes required to fix the bug in this issue.

So it is important that the changes to block_header_state are done in a way that does not accidentally change how the base_digest or reversible_blocks_mroot are computed, or in general anything else that could break consensus protocol compatibility with Spring 1.0.0.

For reversible_blocks_mroot it should be easier to guarantee there is no change because finality_core::get_reversible_blocks_mroot constructs the block_ref_digest_data structs needed to compute the digest of each leaf node in the Merkle tree. So the structure of block_ref_digest_data should not change even though the block_ref struct will have more information in it.

But to guarantee that base_digest does not change, specific changes are needed in compute_base_digest. Specifically, it is no longer okay to simply pack the core using the default serialization supported by the fc macros. A new serialization of core is needed specifically for use in compute_base_digest that preserves the serialization used in Spring 1.0.0.
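
For example, compute_base_digest could feed the core through a small helper that packs exactly the fields Spring v1.0.0 packed and skips the new ones (a sketch, assuming the v1.0.0 block_ref consisted of the block id, timestamp, and finality digest):

// Pack a block_ref exactly as Spring v1.0.0 serialized it so base_digest is unchanged.
void pack_block_ref_v1_0_0(fc::sha256::encoder& enc, const block_ref& ref) {
   fc::raw::pack(enc, ref.block_id);
   fc::raw::pack(enc, ref.timestamp);
   fc::raw::pack(enc, ref.finality_digest);
   // the new generation-number fields are intentionally not packed here
}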

To be confident that these changes have not accidentally changed the consensus protocol, a new test should be added to check for regression in consensus compatibility.

The test could store a reference blockchain (with QCs) generated and serialized (and committed to the source control) at some earlier point in time. Then it could read that blockchain into a nodeos (replay it) and use it as a source of blocks from which another nodeos instance can sync those blocks. The test should involve both a replay and sync to test both of those code paths in controller.

We would not normally need to regenerate and re-commit that reference blockchain to source control. The test/tool could allow us to manually regenerate it if necessary, but we would really only need to do that once (or whenever we want to change the nature of the reference blockchain for the test case). Then the test, when it normally runs, just ensures it can validate that blockchain. However, it would be good for the test to generate the reference blockchain anyway, even though it is going to throw it away under typical use, just as an ongoing check in our CI to ensure the reference blockchain generation portion of the test does not break or regress.

This test should ideally be done with the Boost testing framework so that the generation of the reference blockchain is deterministic, but an integration test is also acceptable. If we can guarantee the generation of the reference blockchain is deterministic, then it provides other advantages. First, anytime we need to change the reference blockchain (which, remember, is binary data committed to source control), a reviewer could easily validate the changed data in a pull request by regenerating it themselves locally. Second, it allows the test to include a final check at the end where it compares the block ID of the reference chain it generates to the one committed to source control to ensure they are the same.

With such a test implemented, to make sure we do not have an accidental consensus change as part of the work for this issue, it would be critical to generate the reference blockchain with the Spring implementation prior to the changes of this issue (e.g. using a branch built off of the Spring v1.0.0 tag with only this consensus regression test added on top). Then the test can act as a check that no accidental consensus change is introduced as one works on the changes required for this issue.

@arhag arhag added bug The product is not working as was intended. 👍 lgtm labels Sep 4, 2024
@arhag arhag added this to the Spring v1.0.0 milestone Sep 4, 2024
@arhag arhag removed the triage label Sep 4, 2024
@arhag arhag modified the milestones: Spring v1.0.0, Spring v1.0.1 Sep 4, 2024
@arhag arhag removed the 👍 lgtm label Sep 4, 2024
@greg7mdp greg7mdp self-assigned this Sep 5, 2024
@greg7mdp greg7mdp moved this from Todo to In Progress in Team Backlog Sep 5, 2024
greg7mdp (Contributor) commented Sep 5, 2024

(At this point, in the current implementation policy P2 is lost from the block_header_state of B6, which is the source of the problem.)

P2 is not really lost from the block_header_state of B6, as it is the currently active policy.
But we have lost P1 which was the active policy on block B5. And also the information that P2 was the currently pending policy for block B5.

greg7mdp added a commit that referenced this issue Sep 6, 2024
[1.0.1] Implement the two tests required to validate issue #694
greg7mdp added a commit that referenced this issue Sep 6, 2024
[1.0 -> main] Implement the two tests required to validate issue #694
@greg7mdp greg7mdp reopened this Sep 6, 2024
arhag (Member, Author) commented Sep 6, 2024

P2 is not really lost from the block_header_state of B6, as it is the currently active policy.
But we have lost P1 which was the active policy on block B5.

Thanks @greg7mdp. That was a typo. I corrected it now.
