Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix https://github.com/status-im/nim-beacon-chain/issues/944 #955

Closed
wants to merge 1 commit into from

Conversation

tersec
Copy link
Contributor

@tersec tersec commented Apr 30, 2020

prysmaticlabs/prysm#5658 has more extensive discussion of this. Consensys/teku#1693 shows one example of a fix, not the same one as here.

There's so much off with this block -- it requests an absurd slot in the 10^15 range, its attestations aren't correct, and its signature is incorrect.

The other fixes mostly target the attestations, but those appear to be an artifact of the zcli-type tools bypassing the usual verification pathways, and fixing that wouldn't really make the beacon_node pathways any more robust. So, instead, do something else useful, and move the block signature check earlier. There's a cost to this, potentially, but (a) it's supposed to be paid regardless, and (b) it's a fixed cost, instead of processing (here) 2 quadrillion slots before figuring out the signature's incorrect.

The conceptual downside to the committee size-based approach is that, in theory, this regards the committees as they'll appear 2 quadrillion slots from now, not their current size, so it's easier to justify rejecting the block based on the signature.

Finally, there are obvious issues with processing blocks of unbounded length -- but they're largely an artifact of these test tools, as there should not be use cases in actual testnets or mainnets for such jumps. Could be worth guarding against.

@arnetheduck
Copy link
Member

well, couldn't a signed block have the same issue? also, doesn't the spec determin in which order things are checked? if yes, we should probably be doing it in the spec order?

there are checks around saying we shouldn't consider blocks from the future anyway so I'm a bit hesitant to apply just like that...

@tersec
Copy link
Contributor Author

tersec commented Apr 30, 2020

well, couldn't a signed block have the same issue? also, doesn't the spec determin in which order things are checked? if yes, we should probably be doing it in the spec order?

there are checks around saying we shouldn't consider blocks from the future anyway so I'm a bit hesitant to apply just like that...

Yes, a signed block could. Then one would be left with a block with invalid attestations that according to the spec, wouldn't be detected for quadrillions of slots. prysmaticlabs/prysm#5658 (comment) makes this point well, quoting the spec state transition pseudocode, and:

The log you pasted in the original issue shows that Prysm is starting process_slots and you terminated it in 15s. Based on the spec, we have implemented this correctly and are doing exactly what is intended. Like i said in the previous comment, blocks would not have reached this state transition core function from any layer of the application. pcli is calling state transition directly, as intended.

Even without this fix, once ncli_transition "finishes" (I hardcoded the process_slots() slot target to 30) its request, it's correctly rejected, just as it should be, first because of the signature, which is checked before any of its contents, correctly according to the pseudocode which https://github.com/ethereum/eth2.0-specs/blob/v0.11.1/specs/phase0/beacon-chain.md#beacon-chain-state-transition-function shows:

def state_transition(state: BeaconState, signed_block: SignedBeaconBlock, validate_result: bool=True) -> BeaconState:
    block = signed_block.message
    # Process slots (including those with no blocks) since block
    process_slots(state, block.slot)
    # Verify signature
    if validate_result:
        assert verify_block_signature(state, signed_block)
    # Process block
    process_block(state, block)
    # Verify state root
    if validate_result:
        assert block.state_root == hash_tree_root(state)
    # Return post-state
    return state

@tersec
Copy link
Contributor Author

tersec commented Apr 30, 2020

This is akin to #922 / #927 which shows another example of where ncli_transition hits problems no beacon_node ever would. At some point, this becomes a bit ad hoc -- the spec could be argued to be buggy insofar as it allows this, and/or "correct" fuzzing needs to be more sophisticated about this rather than gradually push people into reimplementing every check that occurs at front-end block/attestation/etc ingestion in the usual flow again into the back-end processing, in case one tries it with zcli/pcli/ncli/etc.

I'd be fine with not merging this, too, but the class of fuzzed issues, real insofar as they can be reproduced, but probably not achievable (because they require an invalid beacon state and/or requiring bypassing checks which exist in the spec, but not in the zcli class of tools) won't stop with this one -- we need a broader policy.

@tersec
Copy link
Contributor Author

tersec commented Apr 30, 2020

A final reply, to address:

there are checks around saying we shouldn't consider blocks from the future anyway so I'm a bit hesitant to apply just like that...

Specifically. Yes, there are checks, for example, in BlockPool.isValidBeaconBlock(...):

  if not (signed_beacon_block.message.slot <= current_slot):
    debug "isValidBeaconBlock: block is from a future slot",
      signed_beacon_block_message_slot = signed_beacon_block.message.slot,
      current_slot = current_slot
    return false

Which reflects https://github.com/ethereum/eth2.0-specs/blob/v0.11.1/specs/phase0/p2p-interface.md#global-topics saying:

The block is not from a future slot (with a MAXIMUM_GOSSIP_CLOCK_DISPARITY allowance) -- i.e. validate that signed_beacon_block.message.slot <= current_slot (a client MAY queue future blocks for processing at the appropriate slot).

But by following the beacon chain state transition function specification as we do, a tool which directly calls state_transition(...) directly simply ... skips all of that. Doesn't do any of it. Which is what I'm getting at with my concern about this style of fuzzing gradually requiring recreating all these checks which already exist, in state_transition(...) or process_deposit(...) or etc pathways.

@arnetheduck
Copy link
Member

I guess the issue with those checks is that they depend on the local clock, making them unsuitable for "hard" logic like the one in state_transition but rather "soft" logic that might keep the block around for a bit based on some heuristic.

@arnetheduck
Copy link
Member

another way to put this: in ncli, as a matter of quality of implementation, it would perhaps make sense to have a tolerance parameter, because as far as testing tools go, it could be useful to try to apply some valid crazy future blocks just to see what happens (with rewards for example).

@tersec
Copy link
Contributor Author

tersec commented Apr 30, 2020

another way to put this: in ncli, as a matter of quality of implementation, it would perhaps make sense to have a tolerance parameter, because as far as testing tools go, it could be useful to try to apply some valid crazy future blocks just to see what happens (with rewards for example).

Would this imply also having libnfuzz.a/libnfuzz.so/ncli_transition not simply run state_transition(...) (as in this case), but run some part of the usual validation that blocks would go through in general? Because while that'd be reasonable, (a) the tolerance parameter wouldn't make much sense otherwise, as it's tolerance on nothing, and (b) that'd qualitatively change the sort of tool zcli would be, both to (b)(1) not be just a slightly-different-syntax version of the zcli/pcli idea, and (b)(2) potentially (re)create a need for something which didn't either (b)(2)(i) drag in lots of dependencies beyond the state transition function itself or (b)(2)(ii) recreate semi-duplicate, cut-down versions/variations of existing checks.

@gnattishness
Copy link
Contributor

My couple cents:

... my concern about this style of fuzzing gradually requiring recreating all these checks which already exist, in state_transition(...) ... pathways.

Agreed, sounds a very valid concern.
The fuzzing should be checking for compliance with the spec, and other crashes.
If the spec assumes certain conditions during the state transition (particularly if validation is specified at outer levels), then the fuzzing should ensure that it complies with this.
I.e the fuzzing pre-processing or harnesses should aim to ensure inputs are provided that, during normal operation, can be reasonably expected to reach the state transition function (or other relevant targets) - so BeaconStates that can be feasibly produced by a compliant implementation, as their content isn't untrusted input.

For reasonable performance, we should also be limitting the number of slots needing to be processed.

For consistency and predictable behavior, I would advocate for ncli_transition to continue to directly expose state_transition. This also avoids maintenance costs and the risk for the added checks to result in bugs present in ncli_transition but not in "real" operation, or for them to hide real bugs.

If extra checks end up being necessary, I'd think it preferable to run it as a separate ncli endpoint so there's less confusion about what the command actually does.

@arnetheduck
Copy link
Member

If the spec assumes certain conditions during the state transition (particularly if validation is specified at outer levels), then the fuzzing should ensure that it complies with this.

hm, this is a tricky one - the more the fuzzing inputs are "constrained", the less resilient the state transition itself becomes against odd cases and we might miss some combination of parameters - ie one of the benefits of fuzzing is that even if there are bugs in other layers, what gets through to the transition is handled "gracefully".

Based on fuzzing findings, would it make sense to put some additional checks in the transition function itself?

@tersec
Copy link
Contributor Author

tersec commented May 4, 2020

hm, this is a tricky one - the more the fuzzing inputs are "constrained", the less resilient the state transition itself becomes against odd cases and we might miss some combination of parameters - ie one of the benefits of fuzzing is that even if there are bugs in other layers, what gets through to the transition is handled "gracefully".

Definitely a concern. It's a false positive/false negative tradeoff.

Based on fuzzing findings, would it make sense to put some additional checks in the transition function itself?

Primarily, I'd like whatever the fixes are to be consistent across clients, so that each client doesn't contort their transition functions in slightly different ways which subtly differ from the spec to create interop issues in a testnet or mainnet.

So my specific proposal:

  • for now, accept the false negatives (constrain fuzzing), or just don't fix fuzzed issues only found in the transition function itself, but where there's an obvious check elsewhere which would have picked it up.

  • Simultaneously, work with the EF and other Eth2 implementation teams to try to create a unified idea of what sort of fuzzing-resistance, if any, the state transition function itself should have and how that should ramify out to the rest of the spec.

@tersec
Copy link
Contributor Author

tersec commented May 6, 2020

This isn't meant to close any future discussion, but closing this issue, because I'm not going to merge this particular PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants