BEEFY: Rococo⇄Sepolia deployment stalled #3080

Closed
Lederstrumpf opened this issue Jan 26, 2024 · 0 comments
Labels
T15-bridges This PR/Issue is related to bridges.

The Rococo⇄Sepolia bridge is currently stalled since it's not receiving any new commitments from relayers.
https://sepolia.etherscan.io//address/0xe6e799ebb05ac563f36037f9538d13c4e2649f8b

The last successful submitFinal call was https://sepolia.etherscan.io/tx/0xc8dcf52dcddfd3157e6d49ff2f43f4daa860a3a10becda36f6877dfabdaeda1e on Jan 19th containing a commitment to block 8810076.

However, thereafter Snowfork's relayers went down with the following commitment processing errors:

1. Jan 19 17:00:59 ip-172-31-41-212 start-beefy-relayer.sh[844030]: {"@timestamp":"2024-01-19T17:00:59.292636461Z","IsHandover":false,"commitment":{"blockNumber":8810100,"nextValidatorSetID":15087,"validatorSetID":15086},"level":"warning","message":"Discarded commitment with depth not fast forward","validatorSetID":15086}
2. Jan 19 17:00:59 ip-172-31-41-212 start-beefy-relayer.sh[844030]: {"@timestamp":"2024-01-19T17:00:59.66543264Z","IsHandover":true,"commitment":{"blockNumber":8810108,"nextValidatorSetID":15087,"validatorSetID":15087},"level":"warning","message":"Discarded invalid commitment","validatorSetID":15086}
3. Jan 19 17:01:26 ip-172-31-41-212 start-beefy-relayer.sh[844030]: {"@timestamp":"2024-01-19T17:01:26.449254523Z","error":"commitment has unexpected validatorSetID: blockNumber=8810707 validatorSetID=15088 expectedValidatorSetID=15086","level":"fatal","message":"Unhandled error"}
  1. error (inappropriately thrown as a warning only) via https://github.com/snowfork/snowbridge/blob/13db09317fad428af4e2bb8faf590cb3f17ad97c/relayer/relays/beefy/polkadot-listener.go#L137-L139 and
  2. error thrown via https://github.com/snowfork/snowbridge/blob/13db09317fad428af4e2bb8faf590cb3f17ad97c/relayer/relays/beefy/polkadot-listener.go#L85-L91
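
For triage, the relayer log lines above can be decoded mechanically. The following is a minimal sketch, assuming each line consists of a journald-style prefix followed by a single JSON object; `parse_relayer_log` is a hypothetical helper and not part of the Snowbridge relayer.

```python
import json

def parse_relayer_log(line):
    """Split a log line into its text prefix and its decoded JSON payload."""
    # Assumption: the first "{" on the line starts the JSON payload.
    prefix, brace, rest = line.partition("{")
    if not brace:
        raise ValueError("no JSON payload found")
    return prefix.strip(), json.loads(brace + rest)

# Log line 3 from the list above, copied verbatim.
line = (
    'Jan 19 17:01:26 ip-172-31-41-212 start-beefy-relayer.sh[844030]: '
    '{"@timestamp":"2024-01-19T17:01:26.449254523Z",'
    '"error":"commitment has unexpected validatorSetID: blockNumber=8810707 '
    'validatorSetID=15088 expectedValidatorSetID=15086",'
    '"level":"fatal","message":"Unhandled error"}'
)

prefix, entry = parse_relayer_log(line)
print(entry["level"], entry["message"])
```

Filtering on `entry["level"]` separates the two warnings from the fatal error that took the relayer down.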

The reason for these errors is that BEEFY finality had stalled (gossip messages were likely lost during a prior reset), and tranches of Rococo validators were reverted to a state snapshot before 17:00 UTC. After restoring from the snapshot, the validators resumed voting, but only voted on mandatory blocks to catch up with GRANDPA finality, and hence only produced finality proofs for the mandatory blocks. As a result, the relayers could only relay commitments to mandatory blocks, all of which are affected by the off-by-one bug described in polkadot-fellows/runtimes#160.

Details on the issue's cause

(see also https://hackmd.io/w48qUMd8TUiYvFxH9Vtcgg)

The structure of a Commitment relayed to Ethereum is the following

Commitment {
    block_num: <N>,
    auth_set: <auth_set_of<N>>,
    ..
}

while the current structure of an MMRLeaf (also see polkadot-fellows/runtimes#160, paritytech/substrate#11797, paritytech/polkadot#6577) is

// contents for leaf index <N-1> added by block <N>
MmrLeaf {
    version: <leaf-data-format-version>,
    (
        parent_num: <N-1>,
        parent_hash: <hash_of_<N-1>>,
    ),
    extra_data: <para_heads_of_<N-1>>,
    next_auth_set: <next_auth_set_of<N-1>>,
}

For the mandatory block 8810076, the payload for Commitment and MMRLeaf was thus

Commitment {
    block_num: 8810076,
    auth_set: 15087,
    ..
}

MMRLeaf {
    next_auth_set: 15087,
    ..
}

since, given that <N-1> was still in the prior session, next_auth_set_of<N-1> referred to the current auth set 15087.

On the relayer, this payload then fails on the check that auth_set == next_auth_set - 1 since they are currently in fact equal on mandatory blocks:
https://github.com/snowfork/snowbridge/blob/13db09317fad428af4e2bb8faf590cb3f17ad97c/relayer/relays/beefy/polkadot-listener.go#L113
Had the relayer sent it to the bridge, it would still have failed on
https://github.com/snowfork/snowbridge/blob/13db09317fad428af4e2bb8faf590cb3f17ad97c/contracts/src/BeefyClient.sol#L362-L364
Hence, the Ethereum bridge never handled any mandatory blocks.
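
The mismatch can be reproduced with a small model. This is only a sketch of the condition as described above (`auth_set == next_auth_set - 1`), not the relayer's or the contract's actual code; the class and function names are hypothetical, and `parent_num` is simply derived as `block_num - 1`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Commitment:
    block_num: int
    auth_set: int       # validator set id that signed block N

@dataclass(frozen=True)
class MmrLeaf:
    parent_num: int     # N-1
    next_auth_set: int  # next_auth_set_of<N-1> under the current leaf layout

def passes_handover_check(commitment, leaf):
    """Model of the check described above: auth_set == next_auth_set - 1."""
    return commitment.auth_set == leaf.next_auth_set - 1

# Payload quoted above for the mandatory block: both ids are 15087.
mandatory = Commitment(block_num=8810076, auth_set=15087)
mandatory_leaf = MmrLeaf(parent_num=8810075, next_auth_set=15087)
print(passes_handover_check(mandatory, mandatory_leaf))  # False

# Hypothetical block inside the same session, where the leaf's parent is
# already in session 15087 and so points at the following set.
in_session = Commitment(block_num=8810200, auth_set=15087)
in_session_leaf = MmrLeaf(parent_num=8810199, next_auth_set=15088)
print(passes_handover_check(in_session, in_session_leaf))  # True
```

Only commitments to blocks that are not on a session boundary satisfy the check, which is why the bridge depends on non-mandatory BEEFY finalizations.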

The above off-by-one issue had been masked so far because, under normal operating conditions, validators vote not only on the mandatory blocks but also on blocks in between them (provided BEEFY finality is not too far behind GRANDPA finality). It was therefore safe for the bridge to skip the mandatory blocks that live on session boundaries.
New BEEFY rounds will commence prior to next_session_start iff best_beefy + NEXT_POWER_OF_TWO((best_grandpa - best_beefy + 1) / 2) < next_session_start (see https://spec.polkadot.network/sect-finality#defn-beefy-round-number).
But given the large lag of BEEFY finality behind GRANDPA at the time, this condition was never satisfied, so validators voted only on mandatory blocks and no new BEEFY rounds were started in session 15088 (corresponding to auth_set 15087).
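
To see how a large lag defeats the condition, here is a minimal sketch of the inequality exactly as stated above. It assumes integer division and that NEXT_POWER_OF_TWO returns the smallest power of two greater than or equal to its argument; the function names and all block numbers other than 8810108 are hypothetical.

```python
def next_power_of_two(x):
    """Smallest power of two >= x (returns 1 for x <= 1)."""
    return 1 if x <= 1 else 1 << (x - 1).bit_length()

def starts_round_before_session_end(best_beefy, best_grandpa, next_session_start):
    """True if the next BEEFY round target falls before the next session start."""
    target = best_beefy + next_power_of_two((best_grandpa - best_beefy + 1) // 2)
    return target < next_session_start

# Small lag (hypothetical numbers): the target is 100 + 8 = 108, inside the session.
print(starts_round_before_session_end(100, 110, 200))  # True

# Large lag (hypothetical GRANDPA head and session start): the target is
# 8810108 + 1024 = 8811132, past the session boundary, so only the
# mandatory block gets voted on.
print(starts_round_before_session_end(8810108, 8812000, 8810708))  # False
```

Because the offset is roughly half the lag rounded up to a power of two, any lag larger than about twice the remaining session length pushes the target past the boundary.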

Consequently, the earliest block voted on after 8810108 is 8810707, which already belongs to session 15089 (corresponding to auth_set 15088), hence the relayer threw error 3 above. Had the commitment reached the Solidity contract, it would have errored on https://github.com/snowfork/snowbridge/blob/13db09317fad428af4e2bb8faf590cb3f17ad97c/contracts/src/BeefyClient.sol#L353-L355 instead, since auth_set 15088 exceeds both its currentValidatorSet and nextValidatorSet.
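
The skipped set can be illustrated with a simplified model of the set-id acceptance rule. This sketch assumes the light client only accepts commitments signed by its current or next validator set, as the description above implies; it is not the contract's code, and `accepts_validator_set` is a hypothetical name.

```python
def accepts_validator_set(commitment_set_id, current_set_id, next_set_id):
    """Accept only commitments signed by the current or the next validator set."""
    return commitment_set_id in (current_set_id, next_set_id)

# State implied by error 3: the client still expects set 15086, so its
# next set is 15087.
CURRENT_SET, NEXT_SET = 15086, 15087

# Block 8810707 is signed by set 15088, which the client has never been told about.
print(accepts_validator_set(15088, CURRENT_SET, NEXT_SET))  # False

# A commitment from set 15087 would pass this particular check.
print(accepts_validator_set(15087, CURRENT_SET, NEXT_SET))  # True
```

Without an accepted handover into set 15087, every later commitment is signed by a set that is at least two steps ahead of the client.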

Solution

The long-term fix is to address the off-by-one error (polkadot-fellows/runtimes#160), but this still leaves the current deployment bricked, since the fix will only take effect in a future runtime.
If BEEFY finality hadn't progressed yet, we could also have deployed a modified client that initiates voting rounds at an offset from the current definition (https://spec.polkadot.network/sect-finality#defn-beefy-round-number). Now, however, this would regress BEEFY finality, and it would only have been a temporary solution if we revert paritytech/polkadot#6577.

As such, to address the stalled deployment, our current tally of options is the following:

  1. reset the bridge, i.e. redeploy the contracts on Sepolia with an initial block higher than the gap without non-mandatory BEEFY finalizations, reset nonces in the BridgeHub runtime, and clear assets registered on AssetHub
  2. manually aggregate (i.e. outside the voting protocol) a signed commitment for any block in session 15087 (other than the mandatory block already finalized), and likewise for any later sessions that only had commitments to the mandatory blocks.