Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add feature flag for LastIndex and Erasure duplicate proofs #34360

Merged
merged 5 commits into from
Dec 20, 2023

Conversation

AshwinSekar
Copy link
Contributor

@AshwinSekar AshwinSekar commented Dec 7, 2023

Problem

These cases introduced in 1.17 #32965 need to be feature flagged as

// Notify duplicate consensus state machine
duplicate_slots_sender.send(shred_slot)?;

will update fork choice.

Summary of Changes

Add feature flag to ensure that fork choice is consistent along the cluster when upgrading to v1.17.

Copy link

codecov bot commented Dec 7, 2023

Codecov Report

Merging #34360 (c83719d) into master (65e10ae) will increase coverage by 0.0%.
Report is 9 commits behind head on master.
The diff coverage is 93.3%.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #34360    +/-   ##
========================================
  Coverage    81.8%    81.8%            
========================================
  Files         820      820            
  Lines      220869   221087   +218     
========================================
+ Hits       180791   180987   +196     
- Misses      40078    40100    +22     

@AshwinSekar AshwinSekar added the v1.17 PRs that should be backported to v1.17 label Dec 7, 2023
Copy link
Contributor

mergify bot commented Dec 7, 2023

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

Comment on lines 147 to 148
feature_set: &FeatureSet,
epoch_schedule: &EpochSchedule,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to send root_bank: &Bank here instead of these 2 args.
The reason I was using feature_set and epoch_schedule in the other code was because I didn't want to keep a reference to bank in order to avoid reintroducing: #33105

Copy link
Contributor

@behzadnouri behzadnouri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about LastIndex and Erasure duplicate proofs received from other nodes from gossip?

e.g. 50% of the cluster have upgraded to v1.17, and 50% have not; and then one node which does not have this patch starts sending LastIndex and Erasure duplicate proofs over gossip. Wouldn't that cause an issue because 50% of the cluster recognize that as valid duplicate proof and the other 50% don't?!

fn run_check_duplicate(
cluster_info: &ClusterInfo,
blockstore: &Blockstore,
shred_receiver: &Receiver<PossibleDuplicateShred>,
duplicate_slots_sender: &DuplicateSlotSender,
bank_forks: &Arc<RwLock<BankForks>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&Arc is redundant. Either just & or if you really need Arc then just Arc.

@AshwinSekar AshwinSekar force-pushed the dup-proof-ff branch 2 times, most recently from 6a76edc to 2798583 Compare December 8, 2023 00:09
@AshwinSekar
Copy link
Contributor Author

AshwinSekar commented Dec 8, 2023

How about LastIndex and Erasure duplicate proofs received from other nodes from gossip?

e.g. 50% of the cluster have upgraded to v1.17, and 50% have not; and then one node which does not have this patch starts sending LastIndex and Erasure duplicate proofs over gossip. Wouldn't that cause an issue because 50% of the cluster recognize that as valid duplicate proof and the other 50% don't?!

As stated here #34292 (comment) we don't consume duplicate proofs from gossip in fork choice. #32963 is the change which accomplishes that, and is planned to be enabled alongside the new vote tx change.

@AshwinSekar AshwinSekar added the feature-gate Pull Request adds or modifies a runtime feature gate label Dec 8, 2023
Copy link
Contributor

@behzadnouri behzadnouri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like feature gating the consensus part, but I am wondering if we should feature gate earlier in the pipeline in the ingestion/blockstore code which identifies these duplicates.

Like the new code wouldn't store the shreds. Wouldn't that cause issues if the duplicate is confirmed and the node has go with it anyways?

@@ -137,17 +141,50 @@ impl WindowServiceMetrics {
}
}

fn should_send_index_and_erasure_conflicts(shred_slot: Slot, root_bank: &Arc<Bank>) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be just &Bank. not &Arc<Bank>.

Comment on lines 145 to 159
match root_bank
.feature_set
.activated_slot(&feature_set::index_erasure_conflict_duplicate_proofs::id())
{
None => false,
Some(feature_slot) => {
let epoch_schedule = root_bank.epoch_schedule();
let feature_epoch = epoch_schedule.get_epoch(feature_slot);
let shred_epoch = epoch_schedule.get_epoch(shred_slot);
// Has a 1 epoch delay, as we don't have enough information
// on the epoch boundary of the feature activation
feature_epoch < shred_epoch
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

) -> Result<()> {
let mut root_bank = bank_forks.read().unwrap().root_bank().clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if last_updated.elapsed().as_millis() as u64 > DEFAULT_MS_PER_SLOT {
// Grabs bank forks lock once a slot
last_updated = Instant::now();
root_bank = bank_forks.read().unwrap().root_bank().clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clone is redundant

Comment on lines +182 to +174
if send_index_and_erasure_conflicts {
(shred, conflict)
} else {
return Ok(());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we actually feature-gate earlier in the blockstore code which identifies these conflicts?

@behzadnouri
Copy link
Contributor

This seems like feature gating the consensus part, but I am wondering if we should feature gate earlier in the pipeline in the ingestion/blockstore code which identifies these duplicates.

Like the new code wouldn't store the shreds. Wouldn't that cause issues if the duplicate is confirmed and the node has go with it anyways?

One other issue is that I believe if blockstore already has a duplicate/conflict for a slot, future duplicates/conflicts are not processed the same ways as before (like they are not inserted into blockstore anymore, not sure if they are still channeled to replay or not).
So seems like if the blockstore code is not feature gated there is already some change of behavior. Not sure if it impacts consensus or not.

@AshwinSekar
Copy link
Contributor Author

AshwinSekar commented Dec 11, 2023

from a consensus perspective it doesn't matter which scenario a duplicate block proof is constructed for. they all result in the slot being marked as invalid in fork choice.

One other issue is that I believe if blockstore already has a duplicate/conflict for a slot, future duplicates/conflicts are not processed the same ways as before (like they are not inserted into blockstore anymore, not sure if they are still channeled to replay or not).

this is correct, if there already is a duplicate proof no further action will be taken. replay/gossip will not be notified.
however the code for detecting last index and erasure conflict in blockstore has been around for a long time #13992, so I don't believe it needs to be feature gated.

@AshwinSekar
Copy link
Contributor Author

I believe #29227 should have been feature gated for the scenario you're describing.

Receive duplicate proof from gossip -> store in blockstore -> no notification to replay.
Later we receive shreds for the slot and try to construct the duplicate proof by ourselves but there is already a duplicate proof in blockstore, so no action is taken.

It seems #29227 was already in 1.16, so I believe it is too late to feature gate it.

carllin
carllin previously approved these changes Dec 13, 2023
@behzadnouri
Copy link
Contributor

behzadnouri commented Dec 13, 2023

this is correct, if there already is a duplicate proof no further action will be taken. replay/gossip will not be notified.
however the code for detecting last index and erasure conflict in blockstore has been around for a long time #13992, so I don't believe it needs to be feature gated.

does this mean in current v1.16 code if it sees a last index or erasure conflict it will not mark the slot as dead but also will no longer recognize any kind of duplicates in the block?

separately, didn't you submit a change which would extend erasure conflicts detection?

@AshwinSekar
Copy link
Contributor Author

does this mean in current v1.16 code if it sees a last index or erasure conflict it will not mark the slot as dead but also will no longer recognize any kind of duplicates in the block?

yes that's correct. We will add the proof to the blockstore column but not invoke handle_duplicate, so the results of the column are unused:

// TODO: handle_duplicate is not invoked and so duplicate shreds are
// not gossiped to the rest of cluster.
if !erasure_meta.check_coding_shred(&shred) {

however because we have stored it in the column
if self
.store_duplicate_if_not_existing(
slot,
ending_shred.into_owned(),

for any future simple duplicate shred case (2 shreds with the same index), we will no longer be able to take any action, as there is already a proof in the column:
let check_duplicate = |shred: Shred| -> Result<()> {
let shred_slot = shred.slot();
if !blockstore.has_duplicate_shreds_in_slot(shred_slot) {
if let Some(existing_shred_payload) =
blockstore.is_shred_duplicate(shred.id(), shred.payload().clone())
{
cluster_info.push_duplicate_shred(&shred, &existing_shred_payload)?;
blockstore.store_duplicate_slot(
shred_slot,
existing_shred_payload,
shred.into_payload(),
)?;
duplicate_slots_sender.send(shred_slot)?;
}
}

@AshwinSekar
Copy link
Contributor Author

separately, didn't you submit a change which would extend erasure conflicts detection?

If you're referring to #33037, it is still in development.

@AshwinSekar AshwinSekar merged commit def3bc4 into solana-labs:master Dec 20, 2023
44 checks passed
mergify bot pushed a commit that referenced this pull request Dec 20, 2023
* Add feature flag for LastIndex and Erasure duplicate proofs

* pr feedback: use root bank instead of 2 params

* pr feedback: & instead of &Arc

* pr feedback: reuse fn, remove redundant clones

* rebase: fix feature set conflict

(cherry picked from commit def3bc4)

# Conflicts:
#	sdk/src/feature_set.rs
AshwinSekar added a commit that referenced this pull request Dec 20, 2023
…ackport of #34360) (#34541)

* Add feature flag for LastIndex and Erasure duplicate proofs (#34360)

* Add feature flag for LastIndex and Erasure duplicate proofs

* pr feedback: use root bank instead of 2 params

* pr feedback: & instead of &Arc

* pr feedback: reuse fn, remove redundant clones

* rebase: fix feature set conflict

(cherry picked from commit def3bc4)

# Conflicts:
#	sdk/src/feature_set.rs

* fix feature set conflict

---------

Co-authored-by: Ashwin Sekar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-gate Pull Request adds or modifies a runtime feature gate v1.17 PRs that should be backported to v1.17
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants