Sync 2.0 #534

Open
3 of 4 tasks
arkpar opened this issue Jan 26, 2022 · 11 comments
Labels
D2-substantial Can be fixed by an experienced coder with a working knowledge of the codebase. I4-refactor Code needs refactoring.

Comments

@arkpar
Member

arkpar commented Jan 26, 2022

This is a tracking issue for the new implementation of the syncing protocol and algorithm in Substrate. It mostly concerns full/fast sync, not warp sync.

Current implementation issues.

  1. Reliance on the longest chain fork rule.
  2. No clear separation between initial and keep-up sync.
  3. Does not consider finality in full sync mode.
  4. Complicated common ancestry search.

Proposed new protocol.

On connection, peers exchange their best finalized block and all leaf blocks that are not yet finalized or discarded.
On importing a block, the node sends an announcement to all connected peers. The announcement includes the newly imported block information along with the sender's best finalized block.
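
A minimal sketch of what the exchanged messages could look like; the type and field names below are illustrative placeholders, not part of the actual Substrate networking code:

```rust
/// Hypothetical handshake sent when two peers connect: the sender's best
/// finalized block plus every leaf that is neither finalized nor discarded.
struct Handshake<Number, Hash> {
    best_finalized: (Number, Hash),
    leaves: Vec<(Number, Hash)>,
}

/// Hypothetical announcement broadcast after each block import: the newly
/// imported header plus the sender's best finalized block at that moment.
struct BlockAnnouncement<Header, Number, Hash> {
    header: Header,
    best_finalized: (Number, Hash),
}
```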

Proposed syncing algorithm.

For each connected peer we maintain the set of leaves they have and the common finalized block number. Both are updated on each announcement.

There are two major syncing modes:

Initial sync. In this mode the node syncs a single finalized chain. It is triggered on startup or when the majority of peers start announcing blocks with unknown parents (i.e. after an offline period). Since this is the canonical chain, series of blocks may be requested by number, allowing for parallel block download from multiple peers. There is no need for an ancestry search, as we can simply assume that the common block is the one with the minimum number in the finalized chain.

Keep-up sync. The node switches to this mode when it finalizes a recent block. In this mode, sync simply requests the chains that peers have, starting from each leaf and going back to the best finalized block (or to an already-known parent block).
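
A rough sketch of the per-peer state and the two modes; the type aliases and names are placeholders for the real Substrate types, not the actual implementation:

```rust
use std::collections::{HashMap, HashSet};

// Placeholder aliases; the real code would use the runtime's block types
// and the libp2p peer identifier.
type Hash = [u8; 32];
type BlockNumber = u64;
type PeerId = u64;

/// What we remember about each connected peer, updated on every announcement.
struct PeerSyncState {
    /// Leaves the peer has announced that we have not finalized or discarded.
    leaves: HashSet<Hash>,
    /// Highest finalized block number known to be common with this peer.
    common_finalized: BlockNumber,
}

/// The two major syncing modes described above.
enum SyncMode {
    /// Sync a single finalized chain; blocks are requested by number,
    /// so ranges can be downloaded from several peers in parallel.
    Initial,
    /// Request each announced leaf back to the best finalized block
    /// (or to the first already-known parent).
    KeepUp,
}

struct Sync {
    peers: HashMap<PeerId, PeerSyncState>,
    mode: SyncMode,
}
```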

Tasks

Four tracked sub-issues, all labeled I4-refactor and assigned to dmitry-markin (3 of 4 completed).
@arkpar
Member Author

arkpar commented Jan 26, 2022

cc @bkchr

@nazar-pc
Contributor

nazar-pc commented Aug 9, 2022

Is this supposed to work with chains that don't have finality? I'm worried about the growing number of places where finality is implicitly assumed to exist in Substrate, making it less generic.

Also, I'm interested in the clearer separation between initial and keep-up sync here.

@arkpar
Member Author

arkpar commented Aug 9, 2022

It may not be finality in the same sense as the actual finality engine used. Substrate already has a limit on the maximum possible re-org, so in the simplest case sync may consider "final" any block past that threshold. But yes, this strategy uses the notion of finality to make things simpler. We may make the sync strategy pluggable to support other cases, though.
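
As a rough illustration of that simplest case (treating anything buried deeper than the re-org limit as final); the constant value and function below are made up for illustration, not actual Substrate configuration or API:

```rust
// Illustrative only: the depth value is not a real Substrate constant.
const MAX_REORG_DEPTH: u64 = 4096;

/// For chains without a real finality gadget, sync could consider a block
/// "final" once it is deeper than the maximum possible re-org.
fn is_effectively_final(block_number: u64, best_number: u64) -> bool {
    best_number.saturating_sub(block_number) > MAX_REORG_DEPTH
}
```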

@burdges

burdges commented Aug 10, 2022

We must support chains without their own finality, since this describes parachains. We do not afaik need chains without any notion of finality, because afaik they're not really secure anyways.

A priori, I think this proposal makes sync easier to understand, which maybe helps us do the async time work by Handan (and Peter Czaban) and my slashing reform idea. I've not really thought about the details though. At least slashing reform cares most about sync ahead of finality, which is not really the topic here I guess, so maybe it's just orthogonal; not sure about async time.

@nazar-pc
Contributor

We do not afaik need chains without any notion of finality, because afaik they're not really secure anyways.

Are you suggesting that Substrate should not support PoW and similar consensus mechanisms that don't have deterministic finality at all?

@burdges

burdges commented Aug 13, 2022

Yes, exactly.

Proof-of-work is not secure. It's especially bad on smaller chains (see https://www.crypto51.app/), but Bitcoin will eventually be double-spent too (see https://economics.princeton.edu/working-papers/on-the-instability-of-bitcoin-without-the-block-reward/).

We'll never fully support proof-of-work parachains anyway. We need pre-collation proofs in too many places, like parachain block submission, XCMP message transport, etc., so attackers could halt proof-of-work parachains fairly easily. And proof-of-work parachains would de facto pay 10x more for their parachain slot, assuming they miss roughly 9/10 relay chain slots.

Intuitively, if you have a properly engineered protocol then you have somewhat tight tolerances everywhere, so proof-of-work anywhere typically enables DoS attacks on your larger protocol.

@timorl

timorl commented Jun 23, 2023

We are working on a similar rewrite in aleph-node for probably very similar reasons. Some crucial differences in our approach:

  1. We want sync to use the network as an external resource rather than being intertwined with it.
    1. I'm not completely clear on how much this intertwining will be the case for you – the handshake will still be the way in which nodes connect to each other at all, right?
  2. We only want to sync/import forks that are in some way interesting, while if I understand correctly you want to get all of them.
    1. Aren't you worried that this will somehow cause a problem due to the limit on forks in the DB?
    2. Because of that, we have a data structure for gathering information about known blocks and whether they are of interest. (Sorry for linking to code; we have quite a detailed design doc, but it's not cleaned up enough to be made public yet. :/) This might be of interest, as it's designed, among other goals, to optimize syncing forks (perhaps related to Fork sync targets are handled inefficiently #525).
  3. We are currently not planning to differentiate between initial sync and keep-up sync.
    1. We might do that if it turns out getting respectable performance otherwise is unfeasible.
    2. If we do, we likely won't be switching to the other mode as aggressively as you (if I'm reading your design correctly), but rather only if the node is significantly (~hundreds of blocks) behind.
  4. We want to be able to synchronize justifications by themselves, without always dragging a block along with them. This is a relatively minor requirement though.

Some of these differences are due to us using our own finality gadget with somewhat different base assumptions than GRANDPA (hence 2 and 4), some are motivated by a much lower block production time (hence 2.1 and 3.2), and some are probably just the result of different people approaching the design differently. I'm mentioning this in case we can avoid some duplicated effort in this area, or at least share some insights.

@altonen
Contributor

altonen commented Jun 23, 2023

We would also like syncing to be decoupled from networking as much as possible, but given that syncing is the default set protocol, it has some implications we can't get rid of, such as other protocols expecting it to exist and syncing acting as the "handshake protocol", so it will always be treated as a sort of different protocol.

As far as keep-up/initial sync is concerned, the switch doesn't have to be as aggressive as it is in the document right now. It can be adjusted to any threshold we find suitable. Obviously, if the history download takes a long time, some announced blocks can spill over into the initial sync, and the node switches to keep-up sync only after finalizing a very recent block. One reason to keep the syncing modes separate is to keep the code more clearly separated, and I fail to see why this would be unfeasible for your version. Could you elaborate on that?
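
For example, the switch could be driven by configurable thresholds with some hysteresis; the names and numbers below are made up for illustration, not proposed values or actual code:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Mode {
    Initial,
    KeepUp,
}

// Hypothetical thresholds; real values would be tuned per chain.
const KEEP_UP_GAP: u64 = 32; // close enough to the target to keep up
const INITIAL_GAP: u64 = 256; // far enough behind to fall back to initial sync

/// Decide the next mode from how far our best block lags behind the best
/// block seen from peers; the two distinct thresholds avoid rapid flapping.
fn next_mode(current: Mode, best_seen: u64, best_ours: u64) -> Mode {
    let gap = best_seen.saturating_sub(best_ours);
    match current {
        Mode::Initial if gap <= KEEP_UP_GAP => Mode::KeepUp,
        Mode::KeepUp if gap >= INITIAL_GAP => Mode::Initial,
        other => other,
    }
}
```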

@bkchr
Member

bkchr commented Jun 24, 2023

but given that syncing is the default set protocol,

Why can we not change this?

such as other protocols expecting it to exist and syncing acting as the "handshake protocol"

Other protocols that want to follow sync (like GRANDPA, transaction sync, etc.) should use a generic stream of PeerConnected/PeerDisconnected events. This could be provided by every protocol. AFAIR we already started doing this instead of relying on events from the network service telling us that a sync peer was connected.
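
A minimal sketch of such a generic event stream; the names below (`SyncEvent`, `SyncEventStreams`) follow the idea described here but are illustrative, not the actual Substrate API:

```rust
use futures::channel::mpsc;

/// Events other protocols (GRANDPA, transaction propagation, ...) would
/// subscribe to instead of listening to raw network-service notifications.
#[derive(Clone, Debug)]
enum SyncEvent<PeerId> {
    PeerConnected(PeerId),
    PeerDisconnected(PeerId),
}

/// Kept by the syncing engine; one sender per subscriber.
struct SyncEventStreams<PeerId> {
    subscribers: Vec<mpsc::UnboundedSender<SyncEvent<PeerId>>>,
}

impl<PeerId: Clone> SyncEventStreams<PeerId> {
    /// A protocol calls this once to get its own stream of peer events.
    fn subscribe(&mut self) -> mpsc::UnboundedReceiver<SyncEvent<PeerId>> {
        let (tx, rx) = mpsc::unbounded();
        self.subscribers.push(tx);
        rx
    }

    /// Called whenever a sync peer connects or disconnects; subscribers
    /// whose receiving end was dropped are removed.
    fn notify(&mut self, event: SyncEvent<PeerId>) {
        self.subscribers
            .retain(|tx| tx.unbounded_send(event.clone()).is_ok());
    }
}
```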

@altonen
Contributor

altonen commented Jun 26, 2023

Why can we not change this?

The main limitation has to do with handshaking (basically the assumption that the Role of the node is learned in the syncing handshake), but that is fixable. That's also in the Polkadot spec. It's doable for sure, I think, but it's work that we may do in the future if we find it useful.

Right now the only protocol that is being followed is the sync protocol, because it's considered the default, but that already happens through the mechanism you're describing. It's only implemented for syncing because we only need it for the sync protocol, but it can be implemented for every protocol if needed.

@timorl

timorl commented Jun 26, 2023

Why can we not change this?

In our case we were planning to replace the normal sync/default set protocol with a near no-op that essentially only performs the handshake.

I fail to see why this would be unfeasible

I meant the other way around – we are trying to have a single mode that handles syncing both when a node is up to date and when it's behind. This way we don't need any switching mechanism between modes, and we don't need separate logic at all. The thing that might be unfeasible is to have both that and the ability to "catch up" sufficiently fast from being far behind – we suspect we should be able to do that, but in case it's not possible we will introduce a separate "initial sync" mode.

The switching between modes being too aggressive is not hypothetical for us, btw – we had some problems with nodes "flapping" between major and normal sync when under load or with some network disruptions (in sync 1.0 with a 1s block production time and some other modifications), which sometimes resulted in the node ending up in a broken state.

@altonen altonen transferred this issue from paritytech/substrate Aug 24, 2023
@the-right-joyce the-right-joyce added the I4-refactor and D2-substantial labels and removed the I7-refactor label Aug 25, 2023
dmitry-markin added a commit that referenced this issue Sep 27, 2023
…e` (#1650)

Move request-response handling from `ChainSync` to `SyncingEngine` as
part of [Sync
2.0](#534) refactoring
aimed at making `ChainSync` a pure state machine.

Resolves #502.

---------

Co-authored-by: Aaro Altonen <[email protected]>
Co-authored-by: Bastian Köcher <[email protected]>
dmitry-markin added a commit that referenced this issue Sep 29, 2023
This PR is part of [Sync
2.0](#534) refactoring
aimed at making `ChainSync` a pure state machine.

Resolves #501.
claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023