mmr proof generation failures on rococo #12864
Comments
This is definitely the right way to do it. In this case we consciously decided to not do it since MMR pallet was not yet stable and we didn't want to carry around old versions of the API which were never used in production. Since we've made multiple breaking changes to it lately, I am definitely leaning towards resetting the pallet once these changes are all deployed and thus, start fresh on a clean & stable MMR.
With the new MMR client gadget I'm not sure if we can reset the pallet (if resetting the pallet = setting NUM_MMR_LEAVES to 0). The MMR gadget will take the first block with MMR and compute leaf numbers based on it. After that, if we set NUM_MMR_LEAVES to 0 later, I think we'll get errors for canonicalization and pruning.
We should then coordinate to reset the pallet in the same runtime upgrade that brings in the new API that the gadget relies on.
The new API was already released. As discussed offline, we need to see if we can support resets in the gadget.
The idea I'm considering to make the MMR client gadget support pallet resets is to change the way we save MMR nodes in the offchain DB. Currently we save them node by node, keyed by their position in the MMR. The option I'm working on is to save each subtree keyed by the block that introduced it. For example, for the tree:
we are now saving the data for each node at the following keys:
I'm trying to save:
And then these get canonicalized by the client gadget. If we store subtrees by the block number that introduced them, we don't care about the position of the nodes in the mmr tree anymore, so when the pallet resets, we can continue to canonicalize by block number. The client gadget won't be impacted by the fact that at block However the disadvantage would be that when we need to retrieve a node in order to generate a proof, we would retrieve the entire subtree introduced by a block, which could be less efficient. In the meanwhile I'm still thinking if there could be other simpler or more efficient solutions. The easiest solution would probably be to call |
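To make the difference between the two keying schemes concrete, here is a minimal sketch of which node positions each block would introduce. It is not the pallet's storage code: the function names, the starting block number, and the assumption of one leaf per block are illustrative only. It relies on the standard MMR layout, where appending leaf `n` (0-based) adds the leaf plus one parent for every trailing 1 bit in `n`.

```rust
use std::collections::BTreeMap;
use std::ops::RangeInclusive;

/// Number of nodes in an MMR holding `leaves` leaves.
fn mmr_size(leaves: u64) -> u64 {
    2 * leaves - u64::from(leaves.count_ones())
}

/// Positions (0-based) of the nodes appended together with leaf `leaf_index`:
/// the leaf itself plus one parent per completed merge.
fn nodes_added_by_leaf(leaf_index: u64) -> RangeInclusive<u64> {
    let first = mmr_size(leaf_index);
    let merges = u64::from(leaf_index.trailing_ones());
    first..=first + merges
}

fn main() {
    // Hypothetical chain: the first leaf is added at block 100,
    // and every block adds exactly one leaf (an assumption of this sketch).
    let first_mmr_block = 100u64;

    // "Subtree by block" keying: block number -> node positions it introduced.
    let mut by_block: BTreeMap<u64, Vec<u64>> = BTreeMap::new();
    for leaf in 0..4u64 {
        by_block.insert(first_mmr_block + leaf, nodes_added_by_leaf(leaf).collect());
    }
    for (block, nodes) in &by_block {
        println!("block {}: node positions {:?}", block, nodes);
    }
}
```

With per-position keys, the consumer has to know the first MMR block to translate a block into positions. With per-block keys, that translation is only needed inside the subtree, which is why a reset of the leaf counter would not disturb canonicalization by block number.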
Efficiency is much more important for on-chain activity than off-chain activity. Making the above change would add more complexity to the process of "adding a node to the MMR", which happens multiple times per block during block execution.
Because of onchain db cache, I think the second option is actually faster than what we currently have as it does everything in memory, and hopefully doesn't even touch the underlying onchain storage db since by the end of the process, nothing ultimately changes to onchain storage; and for offchain db we get a single write instead of one per node. If offchain db also had(has?) cache, then what we have now is the most efficient. I would actually not change anything here, but simply add a new
Or this ^ - which is equivalent when we check count only ever goes up. Efficiency-wise, it's no problem for the gadget to call a runtime api per finality notification, or even once per block.
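A rough sketch of the "count only ever goes up" check, assuming the gadget can obtain the current leaf count from the runtime on each finality notification. `ResetDetector` and `observe` are hypothetical names for illustration, not part of the gadget, and how the count is fetched is left out on purpose.

```rust
/// Tracks the leaf count reported by the runtime and flags a reset
/// whenever the count moves backwards.
#[derive(Default)]
struct ResetDetector {
    last_seen: Option<u64>,
}

impl ResetDetector {
    /// Returns `true` if `leaf_count` is lower than the previous observation.
    fn observe(&mut self, leaf_count: u64) -> bool {
        let reset = matches!(self.last_seen, Some(prev) if leaf_count < prev);
        self.last_seen = Some(leaf_count);
        reset
    }
}

fn main() {
    let mut detector = ResetDetector::default();
    // Hypothetical sequence of leaf counts, one per finality notification.
    for count in [10u64, 11, 12, 0, 1] {
        if detector.observe(count) {
            println!("pallet reset detected at leaf count {}", count);
        }
    }
}
```

The check is cheap enough to run once per notification, since it only compares two integers after the runtime call returns.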
Yes, I'm experimenting with something close to the second option. I did some tests with a PoC version and indeed this is more efficient than what we have now. Also, unexpectedly it's faster when generating proofs. I guess it's because the DB probably does cache the subtrees when first accessed. There are 2 problems however:
On the other hand this should simplify the client gadget a bit because we wouldn't need to know the first mmr block and compute leaf/node indexes.
Yes, this is actually easier to use than |
"more efficient" on what ops, based on what tests/benchmarks?
Yup, you're getting hot cache hits. Just as a note, generating proofs is not "in the hot path", we shouldn't make design decisions for small optimizations there.
I suggest we take the easy route now (use |
The time needed to generate 10k-100k blocks
Agree. I'll try the |
I believe this is fixed now.
We've been seeing `mmr_generateProof` call failures on rococo over the last few weeks.

1. InconsistentStore

The first source of these errors is `mmr_lib::Error::InconsistentStore`. This issue is due to the offchain worker not being triggered on every block.
Solution
This can be mitigated by firing notifications on the initial block sync too, as suggested by @serban300 and done here:
Lederstrumpf@a571c51
However, this should already be solved by #12753 - will test once runtime 9330 is active on rococo.
2. CodecError

The other issue is `codec::Error`. The source of these errors is that the onchain runtime version's API may not match the one the client was built with. As such, the failures we've seen changed based on which runtime version was current (unless historic chain state is queried).

For example, on a `v0.9.31` or `v0.9.32` release node, `mmr_generateProof` works for blocks with onchain runtime version `9321` or `9310`, that is, from block 2744973 onwards, but fails with the codec error for earlier blocks. In this case, we changed the API for runtime `9310` to use block numbers in lieu of leaf indices with PR #12345, so moving from u64 → u32 leads to the codec error thrown at substrate/frame/merkle-mountain-range/rpc/src/lib.rs, lines 188 to 195 in b0e994c.
Confirmed that this is the root of the codec error since changing back to leaf inputs in the proof generation API atop v0.9.31 resolves this:
Lederstrumpf@72cdcad
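To illustrate why the width change breaks decoding, here is a small std-only simulation. It is not the `parity-scale-codec` crate: it hand-rolls the SCALE layout for a short list (single-byte compact length prefix followed by little-endian fixed-width items), and it assumes the argument is passed as a list of integers. The helper names and error strings are made up for this sketch.

```rust
/// Encode a short list of u32 values: compact length prefix (single-byte mode,
/// valid for fewer than 64 items) followed by little-endian items.
fn encode_u32_list(items: &[u32]) -> Vec<u8> {
    assert!(items.len() < 64, "sketch only handles single-byte length prefixes");
    let mut out = vec![(items.len() as u8) << 2];
    for item in items {
        out.extend_from_slice(&item.to_le_bytes());
    }
    out
}

/// Decode the same layout, but expecting 8-byte (u64) items.
fn decode_u64_list(bytes: &[u8]) -> Result<Vec<u64>, String> {
    let (&prefix, mut rest) = bytes.split_first().ok_or("empty input")?;
    if prefix & 0b11 != 0 {
        return Err("unsupported length prefix".into());
    }
    let len = (prefix >> 2) as usize;
    let mut out = Vec::with_capacity(len);
    for _ in 0..len {
        if rest.len() < 8 {
            return Err("not enough bytes for a u64 item".into());
        }
        let (head, tail) = rest.split_at(8);
        let mut buf = [0u8; 8];
        buf.copy_from_slice(head);
        out.push(u64::from_le_bytes(buf));
        rest = tail;
    }
    if !rest.is_empty() {
        return Err("trailing bytes after last item".into());
    }
    Ok(out)
}

fn main() {
    // A client built against the new API sends block numbers (u32);
    // an older runtime still expects leaf indices (u64).
    let encoded = encode_u32_list(&[1_000_000]); // arbitrary block number
    match decode_u64_list(&encoded) {
        Ok(values) => println!("decoded {:?}", values),
        Err(err) => println!("decode failed: {}", err),
    }
}
```

The length prefix still says "one item", but only four payload bytes follow, so the older decoder runs out of input. That matches the failure pattern described above: the call works once client and runtime agree on the item width.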
Solution
Keeping the mmr API interface stable would avoid this.
If we change the API interface, the issue will disappear once the runtime aligned with changes on the API server becomes active. It will then still fail for historic queries, which I see the following options for: