Skip to content

Commit

Permalink
Introduce NW BatchV2 in Protocol Version 12 & Pipe ProtocolConfig int…
Browse files Browse the repository at this point in the history
…o NW (#12178)

## Description 

This is attempt 2 of getting protocol config into narwhal. Previous
attempt ([PR#11519](#11519)) had
to be reverted because mismatched batch versions were causing the
validators to panic. The issue was that we were not handling the
protocol upgrade correctly. `ProtocolConfig` was passed in when
`NarwhalManager` was created but on epoch change
`monitor_reconfiguration` does not recreate `NarwhalManager` if it still
has a handle on the `ValidatorComponents` but rather just calls `start`
from the existing `NarwhalManager`. This is not a problem for
`NarwhalConfiguration` parameters which is also only passed in on
`NarwhalManager` creation because the moment the node is restarted for a
binary update the parameters take effect. However in the case of
protocol upgrades `ProtocolConfig` is only updated on the following
epoch.

For example if the validator restarts to update its binary from version
`N` to version `N+1`, `NarwhalManager` would be constructed with
`ProtocolConfig` at version `N` (not `N+1`) because we need to ensure we
have a majority quorum before actually going to version `N+1` which
happens at epoch change. On epoch change because the node still has a
handle on `ValidatorComponents` it just starts Narwhal with the existing
`NarwhalManager` and ends up using `ProtocolConfig` at version `N` which
is the root cause of the issues.

To resolve this the following changes were added to the last PR to fix
the issue and make it more robust
- Filter the received or fetched batches to ensure the supported batch
version is received.
- Pass `ProtocolConfig` to `NarwhalManager` on start and not on creation

## Test Plan 

Added unit tests & tested protocol upgrade in labnet from mainnet
release branch to v12

---
### Type of Change (Check all that apply)

- [X] protocol change
- [ ] user-visible impact
- [ ] breaking change for a client SDKs
- [ ] breaking change for FNs (FN binary must upgrade)
- [ ] breaking change for validators or node operators (must upgrade
binaries)
- [ ] breaking change for on-chain data layout
- [ ] necessitate either a data wipe or data migration

### Release notes
Start using `BatchV2` in Narwhal which introduces `VersionedMetadata`
that allows for more granular tracking of NW batch execution latency.
  • Loading branch information
arun-koshy authored May 30, 2023
1 parent 35a815b commit 59e0f09
Show file tree
Hide file tree
Showing 62 changed files with 1,847 additions and 355 deletions.
8 changes: 8 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

58 changes: 52 additions & 6 deletions crates/sui-core/src/consensus_validator.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,12 @@ use eyre::WrapErr;
use mysten_metrics::monitored_scope;
use prometheus::{register_int_counter_with_registry, IntCounter, Registry};
use std::sync::Arc;
use sui_protocol_config::ProtocolConfig;

use crate::authority::authority_per_epoch_store::AuthorityPerEpochStore;
use crate::transaction_manager::TransactionManager;
use async_trait::async_trait;
use narwhal_types::BatchAPI;
use narwhal_types::{validate_batch_version, BatchAPI};
use narwhal_worker::TransactionValidator;
use sui_types::messages_consensus::{ConsensusTransaction, ConsensusTransactionKind};
use tap::TapFallible;
Expand Down Expand Up @@ -56,8 +57,17 @@ impl TransactionValidator for SuiTxValidator {
Ok(())
}

async fn validate_batch(&self, b: &narwhal_types::Batch) -> Result<(), Self::Error> {
async fn validate_batch(
&self,
b: &narwhal_types::Batch,
protocol_config: &ProtocolConfig,
) -> Result<(), Self::Error> {
let _scope = monitored_scope("ValidateBatch");

// TODO: Remove once we have upgraded to protocol version 12.
validate_batch_version(b, protocol_config)
.map_err(|err| eyre::eyre!(format!("Invalid Batch: {err}")))?;

let txs = b
.transactions()
.iter()
Expand Down Expand Up @@ -147,6 +157,8 @@ mod tests {
consensus_adapter::consensus_tests::{test_certificates, test_gas_objects},
consensus_validator::{SuiTxValidator, SuiTxValidatorMetrics},
};

use narwhal_test_utils::{get_protocol_config, latest_protocol_version};
use narwhal_types::Batch;
use narwhal_worker::TransactionValidator;
use sui_types::signature::GenericSignature;
Expand All @@ -164,6 +176,8 @@ mod tests {
let mut objects = test_gas_objects();
objects.push(Object::shared_for_testing());

let latest_protocol_config = &latest_protocol_version();

let network_config =
sui_swarm_config::network_config_builder::ConfigBuilder::new_with_temp_dir()
.with_objects(objects.clone())
Expand Down Expand Up @@ -199,8 +213,10 @@ mod tests {
})
.collect();

let batch = Batch::new(transaction_bytes);
let res_batch = validator.validate_batch(&batch).await;
let batch = Batch::new(transaction_bytes, latest_protocol_config);
let res_batch = validator
.validate_batch(&batch, latest_protocol_config)
.await;
assert!(res_batch.is_ok(), "{res_batch:?}");

let bogus_transaction_bytes: Vec<_> = certificates
Expand All @@ -215,8 +231,38 @@ mod tests {
})
.collect();

let batch = Batch::new(bogus_transaction_bytes);
let res_batch = validator.validate_batch(&batch).await;
let batch = Batch::new(bogus_transaction_bytes, latest_protocol_config);
let res_batch = validator
.validate_batch(&batch, latest_protocol_config)
.await;
assert!(res_batch.is_err());

// TODO: Remove once we have upgraded to protocol version 12.
// protocol version 11 should only support BatchV1
let protocol_config_v11 = &get_protocol_config(11);
let batch_v1 = Batch::new(vec![], protocol_config_v11);

// Case #1: Receive BatchV1 and network has not upgraded to 12 so we are okay
let res_batch = validator
.validate_batch(&batch_v1, protocol_config_v11)
.await;
assert!(res_batch.is_ok());
// Case #2: Receive BatchV1 but network has upgraded to 12 so we fail because we expect BatchV2
let res_batch = validator
.validate_batch(&batch_v1, latest_protocol_config)
.await;
assert!(res_batch.is_err());

let batch_v2 = Batch::new(vec![], latest_protocol_config);
// Case #3: Receive BatchV2 but network is still in v11 so we fail because we expect BatchV1
let res_batch = validator
.validate_batch(&batch_v2, protocol_config_v11)
.await;
assert!(res_batch.is_err());
// Case #4: Receive BatchV2 and network is upgraded to 12 so we are okay
let res_batch = validator
.validate_batch(&batch_v2, latest_protocol_config)
.await;
assert!(res_batch.is_ok());
}
}
38 changes: 27 additions & 11 deletions crates/sui-core/src/narwhal_manager/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -18,12 +18,13 @@ use prometheus::{register_int_gauge_with_registry, IntGauge, Registry};
use std::path::PathBuf;
use std::sync::Arc;
use std::time::Instant;
use sui_protocol_config::{ProtocolConfig, ProtocolVersion};
use sui_types::crypto::{AuthorityKeyPair, NetworkKeyPair};
use tokio::sync::Mutex;

#[derive(PartialEq)]
enum Running {
True(Epoch),
True(Epoch, ProtocolVersion),
False,
}

Expand Down Expand Up @@ -117,9 +118,17 @@ impl NarwhalManager {
}

// Starts the Narwhal (primary & worker(s)) - if not already running.
// Note: After a binary is updated with the new protocol version and the node
// is restarted, the protocol config does not take effect until we have a quorum
// of validators have updated the binary. Because of this the protocol upgrade
// will happen in the following epoch after quorum is reached. In this case NarwhalManager
// is not recreated which is why we pass protocol config in at start and not at creation.
// To ensure correct behavior an updated protocol config must be passed in at the
// start of EACH epoch.
pub async fn start<State, TxValidator: TransactionValidator>(
&self,
committee: Committee,
protocol_config: ProtocolConfig,
worker_cache: WorkerCache,
execution_state: Arc<State>,
tx_validator: TxValidator,
Expand All @@ -128,10 +137,9 @@ impl NarwhalManager {
{
let mut running = self.running.lock().await;

if let Running::True(epoch) = *running {
if let Running::True(epoch, version) = *running {
tracing::warn!(
"Narwhal node is already Running at epoch {:?} - shutdown first before starting",
epoch
"Narwhal node is already Running for epoch {epoch:?} & protocol version {version:?} - shutdown first before starting",
);
return;
}
Expand All @@ -147,7 +155,11 @@ impl NarwhalManager {

let name = self.primary_keypair.public().clone();

tracing::info!("Starting up Narwhal for epoch {}", committee.epoch());
tracing::info!(
"Starting up Narwhal for epoch {} & protocol version {:?}",
committee.epoch(),
protocol_config.version
);

// start primary
const MAX_PRIMARY_RETRIES: u32 = 2;
Expand All @@ -159,6 +171,7 @@ impl NarwhalManager {
self.primary_keypair.copy(),
self.network_keypair.copy(),
committee.clone(),
protocol_config.clone(),
worker_cache.clone(),
network_client.clone(),
&store,
Expand Down Expand Up @@ -197,6 +210,7 @@ impl NarwhalManager {
name.clone(),
id_keypair_copy,
committee.clone(),
protocol_config.clone(),
worker_cache.clone(),
network_client.clone(),
&store,
Expand All @@ -219,8 +233,9 @@ impl NarwhalManager {
}

tracing::info!(
"Starting up Narwhal for epoch {} is complete - took {} seconds",
"Starting up Narwhal for epoch {} & protocol version {:?} is complete - took {} seconds",
committee.epoch(),
protocol_config.version,
now.elapsed().as_secs_f64()
);

Expand All @@ -233,7 +248,7 @@ impl NarwhalManager {
.set(primary_retries as i64);
self.metrics.start_worker_retries.set(worker_retries as i64);

*running = Running::True(committee.epoch());
*running = Running::True(committee.epoch(), protocol_config.version);
}

// Shuts down whole Narwhal (primary & worker(s)) and waits until nodes
Expand All @@ -242,16 +257,17 @@ impl NarwhalManager {
let mut running = self.running.lock().await;

match *running {
Running::True(epoch) => {
Running::True(epoch, version) => {
let now = Instant::now();
tracing::info!("Shutting down Narwhal epoch {:?}", epoch);
tracing::info!(
"Shutting down Narwhal for epoch {epoch:?} & protocol version {version:?}"
);

self.primary_node.shutdown().await;
self.worker_nodes.shutdown().await;

tracing::info!(
"Narwhal shutdown for epoch {:?} is complete - took {} seconds",
epoch,
"Narwhal shutdown for epoch {epoch:?} & protocol version {version:?} is complete - took {} seconds",
now.elapsed().as_secs_f64()
);

Expand Down
3 changes: 3 additions & 0 deletions crates/sui-core/src/unit_tests/narwhal_manager_tests.rs
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ use fastcrypto::traits::KeyPair;
use mysten_metrics::RegistryService;
use narwhal_config::{Epoch, WorkerCache};
use narwhal_executor::ExecutionState;
use narwhal_test_utils::latest_protocol_version;
use narwhal_types::{BatchAPI, ConsensusOutput, TransactionProto, TransactionsClient};
use narwhal_worker::TrivialTransactionValidator;
use prometheus::Registry;
Expand Down Expand Up @@ -132,6 +133,7 @@ async fn test_narwhal_manager() {
narwhal_manager
.start(
narwhal_committee.clone(),
latest_protocol_version(),
worker_cache.clone(),
Arc::new(execution_state.clone()),
TrivialTransactionValidator::default(),
Expand Down Expand Up @@ -193,6 +195,7 @@ async fn test_narwhal_manager() {
narwhal_manager
.start(
narwhal_committee.clone(),
latest_protocol_version(),
worker_cache.clone(),
Arc::new(execution_state.clone()),
TrivialTransactionValidator::default(),
Expand Down
1 change: 1 addition & 0 deletions crates/sui-node/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -772,6 +772,7 @@ impl SuiNode {
narwhal_manager
.start(
new_epoch_start_state.get_narwhal_committee(),
epoch_store.protocol_config().clone(),
worker_cache,
consensus_handler,
SuiTxValidator::new(
Expand Down
1 change: 1 addition & 0 deletions crates/sui-open-rpc/spec/openrpc.json
Original file line number Diff line number Diff line change
Expand Up @@ -1350,6 +1350,7 @@
"disallow_change_struct_type_params_on_upgrade": false,
"loaded_child_objects_fixed": true,
"missing_type_is_compatibility_error": true,
"narwhal_versioned_metadata": false,
"no_extraneous_module_bytes": false,
"package_digest_hash_module": false,
"package_upgrades": true,
Expand Down
14 changes: 13 additions & 1 deletion crates/sui-protocol-config/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ const MAX_PROTOCOL_VERSION: u64 = 12;
// framework changes.
// Version 11: Introduce `std::type_name::get_with_original_ids` to the system frameworks.
// Version 12: Changes to deepbook in framework to add API for querying marketplace.
// Change NW Batch to use versioned metadata field.

#[derive(Copy, Clone, Debug, Hash, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord)]
pub struct ProtocolVersion(u64);
Expand Down Expand Up @@ -183,6 +184,9 @@ struct FeatureFlags {
// If true, checks no extra bytes in a compiled module
#[serde(skip_serializing_if = "is_false")]
no_extraneous_module_bytes: bool,
// If true, then use the versioned metadata format in narwhal entities.
#[serde(skip_serializing_if = "is_false")]
narwhal_versioned_metadata: bool,
}

fn is_false(b: &bool) -> bool {
Expand Down Expand Up @@ -672,6 +676,10 @@ impl ProtocolConfig {
self.feature_flags.scoring_decision_with_validity_cutoff
}

pub fn narwhal_versioned_metadata(&self) -> bool {
self.feature_flags.narwhal_versioned_metadata
}

pub fn consensus_order_end_of_epoch_last(&self) -> bool {
self.feature_flags.consensus_order_end_of_epoch_last
}
Expand Down Expand Up @@ -1135,7 +1143,11 @@ impl ProtocolConfig {
cfg
}
11 => Self::get_for_version_impl(version - 1),
12 => Self::get_for_version_impl(version - 1),
12 => {
let mut cfg = Self::get_for_version_impl(version - 1);
cfg.feature_flags.narwhal_versioned_metadata = true;
cfg
}
// Use this template when making changes:
//
// // modify an existing constant.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ feature_flags:
package_digest_hash_module: true
disallow_change_struct_type_params_on_upgrade: true
no_extraneous_module_bytes: true
narwhal_versioned_metadata: true
max_tx_size_bytes: 131072
max_input_objects: 2048
max_size_written_objects: 5000000
Expand Down
1 change: 1 addition & 0 deletions narwhal/config/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ mysten-util-mem.workspace = true
crypto = { path = "../crypto", package = "narwhal-crypto" }
mysten-network.workspace = true
workspace-hack = { version = "0.1", path = "../../crates/workspace-hack" }
sui-protocol-config = { path = "../../crates/sui-protocol-config" }
rand = "0.8.5"

[dev-dependencies]
Expand Down
Loading

1 comment on commit 59e0f09

@vercel
Copy link

@vercel vercel bot commented on 59e0f09 May 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Successfully deployed to the following URLs:

offline-signer-helper – ./dapps/offline-signer-helper

offline-signer-helper-git-main-mysten-labs.vercel.app
offline-signer-helper-mysten-labs.vercel.app
offline-signer-helper.vercel.app

Please sign in to comment.