Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

inherent disputes: remove per block initializer and disputes timeout event #6937

Merged
merged 15 commits into from
Mar 24, 2023

Conversation

sandreim
Copy link
Contributor

@sandreim sandreim commented Mar 22, 2023

Fixes #6862

TODO:

  • remove timeout related configuration
  • Final versi burnin

https://github.com/paritytech/polkadot/blob/master/runtime/parachains/src/disputes.rs#L918 is the culprit for the heavy weight blocks. During each block execution, but before paras inherent enter, this piece of code iterates all active/inactive disputes to mark them as timed out if that is the case. We keep disputes for 6 sessions, then we prune them. For 60k disputes as seen on Polkadot, this means 1,200000s (20usec * 60000) of weight not accounted for.

My fix is to remove the timeout event and the code relevant to it, and allow those disputes to be pruned as session advances. As discussed with @eskimor we don't really need the timeout event. However I don't have a good understanding if we would want to slash timed out disputes in the future, or how that should work. cc @ordian . If these are still needed I have already explored some options like:

  • separate concluded disputes storage
  • per block expire entries and multi-block processing if its really needed

Reducing the maximum allowed dispute weight to just 70% of the total block weight is to reduce heavy block import times during high load. I would personally go even lower to 50%, but with current value plus backing and bitfield processing weight should be 90% of a block. Initially I have wrongly assumed that without pushing enough disputes on chain it would slow finality, but this is wrong as finality happens offchain and we just need the votes for slashing.

I've tested this fix on Versi (without reducing the dispute weight) and metrics below confirm it works. By time index in the chart we start before the fix with disputes ongoing. Then disputes are stopped and you can see the stair pattern (10m sessions) as disputes are pruned on each new session. After the fix is applied, the block times have different distribution as we are no longer paying that cost of iterating every dispute and also get back to normal as soon as disputes are ended.

Screenshot 2023-03-22 at 15 51 17

Signed-off-by: Andrei Sandu <[email protected]>
Signed-off-by: Andrei Sandu <[email protected]>
@sandreim sandreim added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. labels Mar 22, 2023
{
Self::deposit_event(Event::DisputeTimedOut(candidate_hash));

dispute.concluded_at = Some(now);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed with @eskimor we don't really need the timeout event

Could you post the gist of the discussion here? This code not only removes timeout events, but the notion of timing out for disputes. I'm lacking the context here. cc @burdges

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes:

  1. Disputes that hit the runtime should actually never time out - they have been confirmed, so they should conclude.
  2. We don't have a timeout on the node side. So timing them out in the runtime is kind of pointless and out of sync with what the node would assume about the dispute.
  3. We drop disputes after 6 sessions anyway.
  4. We can not and should not slash anybody on a timed out dispute - the people who voted are the good guys (most likely).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We drop disputes after 6 sessions anyway.

Yes, but maybe we should be reverting a block if the dispute hasn't concluded instead of finalizing it after some time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, what we could and probably should do: Revert immediately once f+1 validators voted against a candidate.

runtime/parachains/src/paras_inherent/mod.rs Outdated Show resolved Hide resolved
{
Self::deposit_event(Event::DisputeTimedOut(candidate_hash));

dispute.concluded_at = Some(now);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes:

  1. Disputes that hit the runtime should actually never time out - they have been confirmed, so they should conclude.
  2. We don't have a timeout on the node side. So timing them out in the runtime is kind of pointless and out of sync with what the node would assume about the dispute.
  3. We drop disputes after 6 sessions anyway.
  4. We can not and should not slash anybody on a timed out dispute - the people who voted are the good guys (most likely).

Signed-off-by: Andrei Sandu <[email protected]>
@burdges
Copy link
Contributor

burdges commented Mar 22, 2023

separate concluded disputes storage

We'd dump all pruned disputes here whenever this happens? Or just the ones where we lost money? (excepting some special config or whatever)

@sandreim
Copy link
Contributor Author

separate concluded disputes storage

We'd dump all pruned disputes here whenever this happens? Or just the ones where we lost money? (excepting some special config or whatever)

Most of the disputes are concluded and there is no point in iterating them for everyblock in each session until they are pruned. We only have a limited number of active disputes, because they conclude fast. Iterating them should be fast.

@sandreim sandreim changed the title inherent disputes: remove per block initializer and reduce total allowed dispute weight inherent disputes: remove per block initializer Mar 23, 2023
Signed-off-by: Andrei Sandu <[email protected]>
Signed-off-by: Andrei Sandu <[email protected]>
@paritytech-ci paritytech-ci requested review from a team March 23, 2023 11:30
@sandreim sandreim added the E0-runtime_migration PR introduces code that might require downstream chains to run a runtime upgrade. label Mar 23, 2023
@sandreim sandreim changed the title inherent disputes: remove per block initializer inherent disputes: remove per block initializer and disputes timeout event Mar 23, 2023
Copy link
Member

@ordian ordian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need some other changes to the guide, at least: https://paritytech.github.io/polkadot/book/runtime/disputes.html

Other than that, looks good.

Signed-off-by: Andrei Sandu <[email protected]>
Signed-off-by: Andrei Sandu <[email protected]>
Signed-off-by: Andrei Sandu <[email protected]>
@sandreim sandreim added B1-note_worthy Changes should be noted in the release notes and removed B0-silent Changes should not be mentioned in any release notes labels Mar 24, 2023
Signed-off-by: Andrei Sandu <[email protected]>
@sandreim sandreim added the T1-runtime This PR/Issue is related to the topic “runtime”. label Mar 24, 2023
Signed-off-by: Andrei Sandu <[email protected]>
@sandreim sandreim enabled auto-merge (squash) March 24, 2023 11:41
@sandreim sandreim merged commit 2375a62 into master Mar 24, 2023
@sandreim sandreim deleted the sandreim/fix_disputes_initializer branch March 24, 2023 11:48
@@ -148,15 +148,15 @@ pub mod v4 {
}

fn on_runtime_upgrade() -> Weight {
if StorageVersion::get::<Pallet<T>>() == 3 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In line 146 above it is checking for version 3.
No idea why the CI is not catching that, with this recent change the pre- and post checks should run. cc @kianenigma

Copy link
Contributor Author

@sandreim sandreim Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, looks like I missed changing this and indeed CI did not fail ....

ordian added a commit that referenced this pull request Mar 29, 2023
* master:
  bump timestamp script to v0.2 (#6954)
  Subsystem channel tweaks (#6905)
  Companion for #13683 (#6944)
  inherent disputes: remove per block initializer and disputes timeout event (#6937)
ordian added a commit that referenced this pull request Mar 29, 2023
* master:
  Companion: wasm-builder support stable Rust (#6967)
  Fix feature (#6966)
  configuration: backport async backing parameters from the feature branch (#6961)
  Histogram support in runtime metrics (#6935)
  Bump openssl from 0.10.38 to 0.10.48 (#6955)
  Proxy for Nomination Pools (#6846)
  Nomination Pools migration v5: RewardPool fix (#6957)
  bump timestamp script to v0.2 (#6954)
  Subsystem channel tweaks (#6905)
  Companion for #13683 (#6944)
  inherent disputes: remove per block initializer and disputes timeout event (#6937)
@Polkadot-Forum
Copy link

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-dispute-storm-the-postmortem/2550/1

mattheworris added a commit to frequency-chain/frequency that referenced this pull request Jun 15, 2023
# Goal
The goal of this PR is upgrade Polkadot to v0.9.42

Polkadot Release Notes:
https://github.com/paritytech/polkadot/releases/tag/v0.9.42
https://github.com/paritytech/polkadot/releases/tag/v0.9.41
https://github.com/paritytech/polkadot/releases/tag/v0.9.40

Polkadot Release Analysis:

https://forum.polkadot.network/t/polkadot-release-analysis-v0-9-41-v0-9-42/2828/1

Closes #1472 
Closes #1270 
Closes #1332 

# Discussion
- v0.9.40 was not working and there was evidence that changes in v0.9.42
were related, so we decided to jump ahead to v0.9.42
<!-- List discussion items -->
# Runtime Migrations Included
- [x] paritytech/polkadot#6937
- [x] paritytech/substrate#13715
- [x] paritytech/substrate#13936
- [x] paritytech/polkadot#7114
- [x] Further all migrations from 9.38+ are included:
paritytech/polkadot#7162)
# Checklist
- [x] Chain spec updated
- [ ] Custom RPC OR Runtime API added/changed? Updated js/api-augment.
- [ ] Design doc(s) updated
- [ ] Tests added
- [ ] Benchmarks added
- [x] Weights updated
# Tests Performed
- [x] `make ci-local` -- Passing (includes lint, docs, unit-test and
integration-test)
- [x] `make start-native` -- Successfully attached debugger when
creating MSA with Polkadot UI
- [x] `make start-relay` -- Docker Containers successfully started, but
too slow in emulation on Apple Silicon M2.
- [x] `make upgrade-local` -- Successfully started local relay network
and upgraded to a newer test runtime.
- [x] `make upgrade-local` -- Successfully updated a node running the
current version on branch `main`

---------

Co-authored-by: Frequency CI [bot] <[email protected]>
Co-authored-by: Matthew Orris <--help>
Co-authored-by: Robert La Ferla <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B1-note_worthy Changes should be noted in the release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. E0-runtime_migration PR introduces code that might require downstream chains to run a runtime upgrade. T1-runtime This PR/Issue is related to the topic “runtime”.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Polkadot Incident 11.03.2023
8 participants