-
The second implementation sounds better in the context of the trie already being in memory.
Encoding a node from its in-memory value should be faster than reading it from the database (unless cached).
If by merkle value we mean the merkle hash, those are not really needed at the level of a single node. But when considering the whole state, it is very important to have them, meaning the sibling hashes of the updated nodes (without them, how would you calculate the new trie root without recalculating the hashes of all the nodes?).
-
As an aside: we know dense tries handle Merkle proofs poorly. Afaik the correct solution is radix-2 hashing, with perhaps radix-16 caching or whatever. This saves 4x the space on Merkle proofs in dense tries. We'd pay some cost in sparse tries for radix-2 hashing, but this can be fixed by using a hash function with a metadata field that permits "fast forwarding" over null copath elements, so morally the radix-2 hash function looks like
We'd obviously implement
-
@cheme Would you mind letting me know a bit more about when you flush the write changes to the database? I guess you don't write trie changes to the database on every key-value pair insertion, since AFAIK it's too CPU-expensive to scale-encode each node along the trie path to the inserted node, right?
-
Sure. Note that technically the new trie nodes are cached on every call to
-
I'm currently working on the alternate Go implementation of the Polkadot host, more specifically on the v1 state trie code and offloading the in-memory trie data to a database (our memory usage is too high since we keep all trie nodes fully in memory).
The following is about storing/loading trie nodes from the database in the in-memory trie.
I am suggesting an alternate implementation and would like to hear your thoughts on it, as well as any critique of my understanding of how Substrate currently works. This is not necessarily something to adopt in Substrate, but it's been on my mind for our Go implementation. I also focus on fat nodes (my own wording), which are trie nodes with an encoding and subvalue both larger than 32 bytes.
### Substrate implementation

#### Writing a key value in the trie

So for fat nodes, it's `1` encoding + `2` hashes + `2` database writes.

#### Read a key value from the trie
So for fat nodes, it's `1` decoding + `2` database reads.

### Suggested alternate implementation
#### Writing a key value in the trie

So for fat nodes, it's `1` hash + `1` database write, but each node uses `2` more bytes. The boolean could also be the last bit of the variant byte, so that could be optimized to be only `1` more byte. For example:

- `0100 0001` => leaf with 32B hashed subvalue
- `0100 0000` => leaf with 32B inlined subvalue

#### Read a key value from the trie
Here it's `1` database read.

### Other considerations
#### Sum up table

| | Substrate | Suggested |
|---|---|---|
| Write | `1` encoding + `2` hashes + `2` database writes | `1` hash + `1` database write |
| Read | `1` decoding + `2` database reads | `1` database read |

Thank you all!