
Light sync state needs a proper format #60

Open
tomaka opened this issue Apr 7, 2022 · 24 comments

Comments

@tomaka
Contributor

tomaka commented Apr 7, 2022

The sync_state_genSyncSpec RPC call can be used in order to dump the current finalized state of the chain. This state is commonly called a "checkpoint". This is currently used by smoldot in order to have a starting point for syncing, so that it doesn't have to sync from scratch.

Unfortunately, this was sloppily implemented.
Code can be found here: https://github.com/paritytech/substrate/blob/b9927cdca03830d07bc1cdd8b8d80a3d32f64bd1/client/sync-state-rpc/src/lib.rs#L132-L178

We literally just copy a SCALE-encoded blob into the chain spec. The SCALE-encoded blob uses internal data structures of Substrate, such as EpochChanges.
This means that any modification to the EpochChanges struct or its child structs changes the format of the checkpoints and requires an update on the smoldot side (or in any other piece of code that wants to decode checkpoints).
This happened for example in paritytech/substrate#9284, which modified this struct. Since nothing hints that modifying this struct will break things, it is easy to do so accidentally.

An additional problem is that this blob contains a lot of data that is completely unnecessary. Checkpoints should only contain the minimum amount of information necessary to sync from the current finalized block, whereas right now it also contains all the upcoming epoch changes in all non-finalized forks.

Instead, we should properly define what information is found in these checkpoints, and simplify them.

@tomaka
Contributor Author

tomaka commented Apr 7, 2022

we should properly define what information is found in these checkpoints

To give an idea, these are the data structures that smoldot uses for a checkpoint:

https://github.com/paritytech/smoldot/blob/4da02f21a6278f100153d2878756a19bdfb216f2/src/chain/chain_information.rs#L121-L333

It also has an equivalent JSON format, which could more or less be standardized and copy-pasted into Substrate:

https://github.com/paritytech/smoldot/blob/073348483b61ad951dd3d04f8a74cbdaf6f0810c/src/database/finalized_serialize/defs.rs

Note that this JSON format is a superset of a checkpoint; for example, finalized_storage is part of the format but not of the checkpoint.
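To give a concrete (and entirely hypothetical — these are not smoldot's actual field names) picture, a checkpoint in this spirit might contain roughly:

```python
import json

# Hypothetical checkpoint layout, loosely modeled on smoldot's
# ChainInformation: the finalized block header plus the consensus
# (Babe) and finality (GrandPa) state needed to verify its children.
checkpoint = {
    "finalizedBlockHeader": "0x...",  # SCALE-encoded header, hex
    "babe": {
        "slotsPerEpoch": 600,
        "currentEpoch": {"index": 1000, "authorities": ["0x..."]},
        "nextEpoch": {"index": 1001, "authorities": ["0x..."]},
    },
    "grandpa": {
        "afterFinalizedBlockAuthoritiesSetId": 42,
        "finalizedTriggeredAuthorities": ["0x..."],
    },
}

# Unlike smoldot's full database format, there is no finalized_storage
# key: the checkpoint carries only what is needed to start syncing.
print(json.dumps(checkpoint, indent=2))
```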

@cheme
Contributor

cheme commented Apr 8, 2022

This sounds like a very interesting definition to me, in the sense that it would be nice to have a single encoding output and a way to associate a unique hash with such a state.
Then, of course, being able to provide Merkle proofs for the different elements would be nice (I don't know of a serialization format that supports this, but it sounds like quite an interesting thing to do; attaching Merkle proofs to SCALE would surely be fun, but realistically something similar to a chain block (a hash of hashes of a few bigger pieces of content) would work).

The smoldot checkpoint looks good (three components: block production, block finalization, and state).

Also, I think this could be interesting for full clients too, so maybe something a bit generic (an array of extensions/subsystems) would help; that makes me think a binary serialization with a schema would be good, but having versioning for the different components should be enough.

IIRC, implementing this in Substrate may be a bit of work (I remember that previously a bit of GrandPa history was needed due to the implementation; that may have been lifted by recent changes), but it is probably worth doing anyway.

@tomaka
Contributor Author

tomaka commented Apr 8, 2022

a binary serialization with schema would be good

This is a bit nit-picky, but JSON makes sense to me. Parsing performance doesn't matter here, and our chain specs are already in JSON. There's no need to make the format harder to read than it already is.

@cheme
Contributor

cheme commented Apr 8, 2022

My objection to JSON is not so much about performance (even if I enjoy compactness), but more about having a unique hash for the same set of info.
Certainly, good formatting rules could work here, or converting from the JSON to another format could achieve this goal.
I like this idea of a unique hash to be able to build some trust around the snapshot (a bit like srtool for binaries).
It might also be a bit optimistic on my part to assume that for a finalized block the snapshot content can be restricted to a unique information set.
Edit: actually, the format is not that relevant; getting the right set of content (as minimal as possible) is the difficult part.
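The "formatting rules" mentioned here amount to canonical serialization: if every producer emits the same bytes for the same set of info (sorted keys, no insignificant whitespace), hashing those bytes yields a unique snapshot hash. A minimal sketch, assuming SHA-256 and JSON (neither is mandated by anything above):

```python
import hashlib
import json

def snapshot_hash(snapshot: dict) -> str:
    """Hash a snapshot under a canonical JSON encoding:
    sorted keys, no insignificant whitespace, UTF-8 bytes."""
    canonical = json.dumps(
        snapshot, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Two logically identical snapshots with different key order
# hash to the same value under the canonical encoding...
a = {"grandpa": {"setId": 42}, "babe": {"epoch": 1000}}
b = {"babe": {"epoch": 1000}, "grandpa": {"setId": 42}}
assert snapshot_hash(a) == snapshot_hash(b)

# ...whereas hashing an arbitrary rendering would not agree:
pretty = json.dumps(a, indent=2).encode("utf-8")
compact = json.dumps(a, sort_keys=True, separators=(",", ":")).encode("utf-8")
assert hashlib.sha256(pretty).digest() != hashlib.sha256(compact).digest()
```

Note that this only makes the encoding unique; whether the set of info at a given finalized block is itself unique is the separate, harder question raised above.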

@bkchr
Member

bkchr commented Apr 11, 2022

I cannot really follow what you are talking about here, @cheme. I mean, you can just hash the file and get a unique hash? Or what kind of unique hash do you want?

@cheme
Contributor

cheme commented Apr 11, 2022

I cannot really follow what you are talking about here, @cheme. I mean, you can just hash the file and get a unique hash? Or what kind of unique hash do you want?

I was targeting a unique snapshot hash for a state (not a unique hash for an encoded snapshot): having a single valid snapshot hash for a given block. So: a unique set of info for a given state (which may not be achievable), and a unique encoding for that set of info (so the encoding would not allow different orderings of the info, would not allow additional whitespace, and so on).

But in fact it may be better abstracted away from the file format (for state, there is already header → state root, which does the job).

@tomaka
Contributor Author

tomaka commented Apr 11, 2022

What would be the purpose of that hash?

Slightly unrelated: a maybe controversial question is which information exactly to include in the checkpoint.

Strictly speaking, you could retrieve the GrandPa and Babe information from the chain, and the checkpoint would then just be a finalized block header (or even a hash). The problem with doing that is that you need to find a node that hasn't pruned the storage of that finalized block; otherwise you can't retrieve this information.

If you use warp syncing, then you need the GrandPa info of the starting point in order to verify the warp sync proof, but you don't need the Babe information because the warp sync will bring you to a block that hasn't been pruned and you can retrieve this Babe info from the chain.

Including the Babe information, however, makes it possible to immediately start verifying blocks on top of the checkpoint, without first having to do some networking requests.

@cheme
Contributor

cheme commented Apr 11, 2022

What would be the purpose of that hash?

The only use case I have in mind is trusted users signing the hash (of course you can sign even if there are multiple hashes, but that creates possible issues, or less convergence if different signers sign different hashes; generally, with a single snapshot builder, everyone can end up with the same hash, but that sounds less good).
It could also be a way to skip the usual proof checks (one idea could be to attach the warp sync proof to the snapshot and check the Babe content on import); beyond the efficiency argument, some may have more confidence in trusted peers than in the past history of validator signatures, for more or less rational reasons.

In other words, if you have two snapshot providers (trusted through HTTPS), you know both snapshots contain the same content without having to look inside.

The Babe question sounds tricky; I would attach it.
It would not make sense to me to first download a trusted snapshot and then have to download additional content on top of it.

Still, one use case could be downloading a snapshot in order to warp-sync from it afterwards; then there is no need for the Babe information.
I would normally propose making it an optional snapshot subcomponent (either attached, or just its hash (if the snapshot hash is the hash of all subcomponent hashes), or simply absent), but I have gone down that road before and concluded it may be the wrong direction :)
The Babe state is probably not big enough to be worth skipping.

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/community-bootnodes-and-snapshot-providers/630/1

@tomaka
Contributor Author

tomaka commented Sep 25, 2023

Another idea: maybe the checkpoint could be just a block header and a Merkle proof containing all the information necessary to construct the state of GrandPa and/or the consensus?

The Merkle proof would contain the runtime code. The user of the checkpoint builds the runtime, then performs a few runtime calls to get the state of GrandPa/Babe, and these runtime calls use the Merkle proof as well.

This would have the advantage of being simple to define, future proof (we can upgrade to sassafras/others without modifying the format), and can be made deterministic if desired.
It would also be possible to provide either a proof with just GrandPa or a proof with GrandPa+Babe. In other words, the question above would be resolved by giving the choice to the generator of the checkpoint. EDIT: to be clear, I'm not advocating for adding a parameter to the JSON-RPC function or anything like that, but rather that we could easily switch from one to the other, or re-use the format, if we realize that one is more appropriate.

Having to provide the runtime code bothers me a bit, as it inflates checkpoints quite a lot and is necessary just for the idea to work rather than because you fundamentally need that runtime code.
However, this is actually beneficial if the runtime at the head of the chain is still the same as the one in the checkpoint, as the user of the checkpoint can avoid downloading the runtime code if the hashes compare equal.
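That last optimization can be sketched as a plain hash comparison (the function and parameter names below are illustrative, not smoldot's API; Substrate hashes storage values with blake2b-256):

```python
import hashlib

def runtime_code_to_use(checkpoint_code: bytes, head_code_hash: bytes,
                        download) -> bytes:
    """Reuse the runtime embedded in the checkpoint proof when the
    chain head still runs the same code; otherwise fetch it."""
    local_hash = hashlib.blake2b(checkpoint_code, digest_size=32).digest()
    if local_hash == head_code_hash:
        return checkpoint_code  # no network download needed
    return download()           # runtime upgraded since the checkpoint

# Illustration with dummy bytes:
code = b"\x00wasm-runtime"
same_hash = hashlib.blake2b(code, digest_size=32).digest()
assert runtime_code_to_use(code, same_hash, download=lambda: b"new") == code
assert runtime_code_to_use(code, b"\x00" * 32, download=lambda: b"new") == b"new"
```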

@tomaka
Contributor Author

tomaka commented Sep 25, 2023

It would also have the advantage that it's possible to build a checkpoint through the JSON-RPC API without having to introduce a separate JSON-RPC function just for checkpoints.

@tomaka
Contributor Author

tomaka commented Sep 25, 2023

If you agree, what I would suggest is to add a new optional field to the chain spec, something like this:

"checkpoint": {
    "header": "...",
    "callProof": "..."
}

The two ... are hexadecimal values.

I'm suggesting a new field rather than put the data inside lightSyncState, because the word "lightSyncState" is weird and really doesn't mean anything.

The proof must contain a storage proof of :code and of :heappages, and call proofs for BabeApi_current_epoch, BabeApi_next_epoch, BabeApi_configuration, GrandpaApi_grandpa_authorities, and GrandpaApi_current_set_id, all merged together into one.
(none of these runtime functions have any parameter)

Once implemented in both Substrate and smoldot, and once everything works, we could immediately remove the old format in Substrate. Only smoldot would need to continue supporting the old one until everyone updates their chain specs.

Seems simple enough, no?
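Putting the pieces together, generating and sanity-checking the proposed field could look like the following sketch (the helper is hypothetical; the exact encoding of the merged proof is whatever the implementation settles on):

```python
import json

# The five parameterless runtime calls whose call proofs, merged with
# the storage proof of :code and :heappages, make up the checkpoint.
REQUIRED_CALLS = [
    "BabeApi_current_epoch",
    "BabeApi_next_epoch",
    "BabeApi_configuration",
    "GrandpaApi_grandpa_authorities",
    "GrandpaApi_current_set_id",
]

def make_checkpoint_field(header: bytes, merged_proof: bytes) -> dict:
    """Build the proposed optional `checkpoint` chain spec field,
    with both values hex-encoded like the rest of the chain spec."""
    return {
        "checkpoint": {
            "header": "0x" + header.hex(),
            "callProof": "0x" + merged_proof.hex(),
        }
    }

# Dummy bytes stand in for a real SCALE-encoded header and proof:
field = make_checkpoint_field(b"\x01\x02", b"\x03\x04")
print(json.dumps(field))
```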

@tomaka
Contributor Author

tomaka commented Oct 17, 2023

If you agree, what I would suggest is to add a new optional field to the chain spec, something like this:

"checkpoint": {
    "header": "...",
    "callProof": "..."
}

The two ... are hexadecimal values.

I'm suggesting a new field rather than put the data inside lightSyncState, because the word "lightSyncState" is weird and really doesn't mean anything.

The proof must contain a storage proof of :code and of :heappages, and call proofs for BabeApi_current_epoch, BabeApi_next_epoch, BabeApi_configuration, GrandpaApi_grandpa_authorities, and GrandpaApi_current_set_id, all merged together into one. (none of these runtime functions have any parameter)

Once implemented in both Substrate and smoldot, and once everything works, we could immediately remove the old format in Substrate. Only smoldot would need to continue supporting the old one until everyone updates their chain specs.

Seems simple enough, no?

If nobody is opposed, I'm going to implement this in smoldot (smol-dot/smoldot#1186), and submit to Substrate a PR that adds a dummy checkpoint field to the chain specification format (and, if I were to do that, print a warning if this field is set).

@lexnv
Contributor

lexnv commented Jan 23, 2024

To make sure I understood this correctly:

  • The server will build a storage proof from the said entries that are provided as checkpoint (from the latest finalized block)
  • The storage proof validates the wasm runtime code, together with the heappages (needed for building the runtime), and the values of a few runtime API calls. This then means that the output produced by the wasm runtime code wrt the runtime APIs (BabeApi_current_epoch and others) can be trusted.
  • After verifying the proof, the client builds the runtime to gain access to the runtime APIs (BabeApi_current_epoch and others)
  • Users make runtime calls to the functions included in the proof to reconstruct the state needed for warp syncing (from smoldot)

I'm wondering: if the storage proof contains the :code section, could we at this point trust the output of AuraApi_slot_duration even though it is not included in the checkpoint storage proof? Let me know if this is right, thanks 🙏

@tomaka
Contributor Author

tomaka commented Jan 23, 2024

I'm confused by your comment because your vocabulary is very weird.
A storage proof doesn't "validate" anything, it doesn't include "values of a few runtime API calls", and doesn't "include functions".

Also, there's no server and client in this context, as this is about the chain spec format, though I understand "server" means "the thing that generates a chain spec" and "client" means "the thing that reads a chain spec".

@lexnv
Contributor

lexnv commented Feb 12, 2024

I looked around a bit and I think I have a better understanding of this now.

I've created this PoC PR in substrate which adds the checkpoint to the chainSpec of substrate and polkadot as an extension. It merges the execution proofs of the mentioned runtime APIs, together with the read proof of :code and :heap. The PR exposes the checkpoint via the sync_state_genSyncSpec RPC call, but we could change that.

I was wondering if we should also add the execution proof of:

@tomaka
Contributor Author

tomaka commented Feb 12, 2024

I was wondering if we should also add the execution proof of:

You can find here all the runtime calls that smoldot might do in order to determine the state of the chain: https://github.com/paritytech/smoldot/blob/938055a638ec201c022f680c8e8cbd0349e70ed1/src/chain/chain_information/build.rs#L122-L146


As a side note, one issue I've realized with that idea is that smoldot needs to ask a full node for a proof in order to be able to generate a checkpoint.

Smoldot reads the initial list of GrandPa and Babe authorities and parameters from the state or a call proof, then later follows updates to these GrandPa/Babe parameters by reading block headers. Unfortunately, it can't convert the up-to-date parameters back into a checkpoint.

@lexnv
Contributor

lexnv commented Apr 30, 2024

Would it be possible to have a different format that could easily be built by smoldot without involving full nodes? 🙏

If not, is there anything we could add to increase our confidence in the data provided by full nodes and trust the checkpoint smoldot builds?

Ideally, we'd build the checkpoint in an agnostic way which doesn't specifically mention the consensus and allows us to include more information in the format, without breaking changes. However, we could always resort to having specific types that the checkpoint needs (https://github.com/paritytech/smoldot/blob/4da02f21a6278f100153d2878756a19bdfb216f2/src/chain/chain_information.rs#L121-L333). I think having this format specified in the spec would still be beneficial for both users and developers.

@tomaka
Contributor Author

tomaka commented Apr 30, 2024

I think having this format specified in the spec would still be beneficial for both users and developers.

The issue is that Babe and GrandPa are themselves not properly specced. There are ambiguous corner cases (which fortunately can never happen in practice, but remain a theoretical possibility in case of a bug somewhere). Also, for example, Babe's disabled authorities are, I believe, not used at all and could be entirely removed. Babe and GrandPa are in a state of limbo where they could get clean-ups but nobody wants to clean them up.

In this state of things, properly defining a format that relies on specific Babe/GrandPa concepts (instead of an abstraction over the consensus/finality engine like I suggested above in this issue) is kind of complicated and would likely add more to the "things that need to be cleaned up later".

@lexnv
Contributor

lexnv commented Sep 11, 2024

I've had a go at trying to spec out babe and grandpa in our rpc-spec-v2 repo: https://github.com/paritytech/json-rpc-interface-spec/pull/124/files.

The PR feels a bit too complex, and the main downside is that we'd have to maintain quite a few fields only to deprecate them later (once we allocate resources for cleanup work). At the moment, the checkpoint idea looks the cleanest to me.

As a side note, one issue I've realized with that idea is that smoldot needs to ask a full node for a proof in order to be able to generate a checkpoint.

Could we do anything to make the checkpoint work for an unstable_chainSpec method?
Can smoldot request the proof from the full node, and if that operation fails, fall back to using the latest chainSpec it has stored? If unavailable, could it then use the chainSpec provided by the user as a fallback?

Would love to get your thoughts on this @tomaka @bkchr

@tomaka
Contributor Author

tomaka commented Sep 11, 2024

Can smoldot request the proof from the full node, and if that operation fails, fall back to using the latest chainSpec it has stored? If unavailable, could it then use the chainSpec provided by the user as a fallback?

Smoldot doesn't "store" any chain spec other than the one that the user provides when connecting to a chain.
I don't really see the point in sending to the user the checkpoint stored in the chain spec that it has provided, given that, well, the user already has that checkpoint.

@tomaka
Contributor Author

tomaka commented Sep 11, 2024

There's been a lot of confusion. To summarize:

While it is connected to a chain, smoldot stores strongly-typed consensus data, which contains mainly the list of Babe and GrandPa authorities plus a few other things. That's what it uses to verify blocks.

Right now, the checkpoint format found in chain specs directly contains this strongly-typed consensus data. When you pass a checkpoint to smoldot, it simply loads this data in memory and can immediately verify descendants of the block of the checkpoint.

This strongly-typed consensus data can also be calculated by doing a runtime call. This runtime call requires accessing the state of the chain, which requires querying a full node. That's what smoldot does as part of the GrandPa warp sync process.

The somewhat elegant solution that I suggested above is to store in the chain spec the fragment of the state of the chain necessary in order to compute this strongly-typed consensus data, instead of the strongly-typed consensus data directly.

The problem however is that it is not possible to convert from the strongly-typed consensus data back to the state of the chain (you can't do the inverse of a runtime call). While smoldot stores this strongly-typed consensus data in memory, it can't generate the state of the chain without querying a full node.

@bkchr
Member

bkchr commented Sep 11, 2024

The somewhat elegant solution that I suggested above is to store in the chain spec the fragment of the state of the chain necessary in order to compute this strongly-typed consensus data, instead of the strongly-typed consensus data directly.

I like this idea.

@lexnv
Contributor

lexnv commented Sep 19, 2024

Thanks for the info @tomaka 🙏

it can't generate the state of the chain without querying a full node

Would it be acceptable from the light-client perspective to not generate a chainSpec if querying a full node fails? Are there other concerns (i.e. the full node could give us wrong/malicious information 🤔)?

From the user perspective, not receiving the chainSpec sometimes might indeed be a bit misleading. I'm wondering if we could have an unstable method, implemented by full nodes, that may produce the result most of the time for light clients as well 🤔

I think specing out Babe/Grandpa consensus should be placed at https://github.com/w3f/polkadot-spec, instead of the rpc-v2 spec repo
