Pass a correct session end height to NewSession #1545

Merged · 1 commit merged into staging on Apr 20, 2023

Conversation

@msmania (Contributor) commented on Apr 18, 2023


With session rollover (#1536), we accept relays for older sessions. For example, a node at block 101 accepts a relay for the session at height 97.

There is a bug in the relay validation function `relay.Validate`. In the example above, the function sees the following values:
```
ctx.BlockHeight() = 101
sessionBlockHeight = r.Proof.SessionBlockHeight = 97
sessionCtx.BlockHeight() = 97
```

and, if the corresponding session is not cached, it passes `sessionCtx` and `ctx` to `NewSession`. This may return a wrong session because the second argument is supposed to be the context at the end of the session, which `ctx` is not in this case.

The proposed fix: if `ctx` is beyond the end of the relay's session, get a new context at the session end height and pass that to `NewSession` instead.
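
For illustration, here is a minimal Go sketch of that idea (not the actual pocket-core code): `Ctx`, `getCtxAt`, and the fixed `blocksPerSession` are assumed stand-ins, and the end-height formula is inferred from the example heights in the description (session 97, next session 101).

```go
package main

import "fmt"

// Ctx is a minimal stand-in for the SDK context used by relay validation.
type Ctx struct{ height int64 }

func (c Ctx) BlockHeight() int64 { return c.height }

// sessionEndHeight returns the last block of the session starting at
// sessionBlockHeight, assuming a fixed number of blocks per session.
func sessionEndHeight(sessionBlockHeight, blocksPerSession int64) int64 {
	return sessionBlockHeight + blocksPerSession - 1
}

// sessionEndCtx picks the context to pass as the "session end" argument:
// if the latest context is already past the session's end, rewind to a
// context at the end height; otherwise the latest context is still inside
// the session and can be used as-is.
func sessionEndCtx(latest Ctx, getCtxAt func(int64) (Ctx, error),
	sessionBlockHeight, blocksPerSession int64) (Ctx, error) {
	end := sessionEndHeight(sessionBlockHeight, blocksPerSession)
	if latest.BlockHeight() <= end {
		return latest, nil
	}
	return getCtxAt(end)
}

func main() {
	getCtxAt := func(h int64) (Ctx, error) { return Ctx{height: h}, nil }
	// The example above: node at block 101, relay for session height 97,
	// 4 blocks per session, so the session ends at block 100.
	endCtx, _ := sessionEndCtx(Ctx{height: 101}, getCtxAt, 97, 4)
	fmt.Println(endCtx.BlockHeight()) // prints 100, not 101
}
```
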
@nodiesBlade (Contributor) commented on Apr 18, 2023

So if I'm understanding correctly, the second argument is there to account for nodes that get jailed between blocks within a specific session (i.e. blocks 97-100).

Since we're allowing session rollovers, we need to check the state of the jailed nodes at the last block of the session (i.e. block 100). This could happen if a node serves a session rollover relay after restarting the pocket-core process, which would then no longer have the cached session, since it's only kept in memory.

Does that sound about right? If so, LGTM.

@msmania (Contributor, Author) commented on Apr 19, 2023

Thank you for your comment. It's not limited to jailing. Any of newstake/editstake/unstake/jail/unjail in the middle of a session can change the session node set. This means the session node set is not finalized until the session end block is committed. This behavior is defined by `ValidateClaim`.

> This could happen if a node serves a session rollover relay after restarting the pocket-core process, which would then no longer have the cached session ...

Not exactly. This could be a problem if a node caches the session while serving a rollover relay for 97, the session node set changes at the next session height 101, AND the cached session is not cleared before the node starts receiving claims for session 97.

If that happens, the cached session is not correct, so the node accepts claims that should be rejected, or vice versa.

@nodiesBlade (Contributor) commented on Apr 19, 2023

I think your concerns make sense, and this fix applies to both scenarios.

> Any of newstake/editstake/unstake/jail/unjail in the middle of a session can change the session node set.

(This may be unrelated to your issue, but) this sounds like session caching is fundamentally busted. If we cache the session by `SessionHeader`, then most actors won't reflect the state of validators/servicers for the session's blocks (start+1, ..., end], because a servicer will usually only cache the state at the start of the session and keep it until the cache is cleared (when it submits the claim). This means that, depending on when a servicer receives a relay, it's possible for two servicers to cache different 'session states'.
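
A toy Go sketch of that divergence (purely illustrative; the types and the fake `nodeSetAt` state change are assumptions, not pocket-core code): two servicers that fill a header-keyed cache at different heights freeze different node sets for the same session.

```go
package main

import "fmt"

// SessionHeader is a stand-in for the key used by the session cache.
type SessionHeader struct {
	AppPubKey          string
	Chain              string
	SessionBlockHeight int64
}

// nodeSetAt fakes "derive the session node set from state at this height";
// here the set changes mid-session at height 99 (e.g. B unstakes, E stakes in).
func nodeSetAt(height int64) []string {
	if height < 99 {
		return []string{"A", "B", "C", "D"}
	}
	return []string{"A", "C", "D", "E"}
}

// getSession fills the cache on first use with whatever state is visible at
// the caller's current height, then returns that frozen entry forever after.
func getSession(cache map[SessionHeader][]string, h SessionHeader, currentHeight int64) []string {
	if s, ok := cache[h]; ok {
		return s
	}
	s := nodeSetAt(currentHeight)
	cache[h] = s
	return s
}

func main() {
	h := SessionHeader{AppPubKey: "app", Chain: "0001", SessionBlockHeight: 97}
	servicer1 := map[SessionHeader][]string{}
	servicer2 := map[SessionHeader][]string{}
	// Servicer 1 first sees a relay at height 97, servicer 2 only at height 100:
	fmt.Println(getSession(servicer1, h, 97))  // [A B C D]
	fmt.Println(getSession(servicer2, h, 100)) // [A C D E]
}
```
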

@msmania (Contributor, Author) commented on Apr 19, 2023

> (This may be unrelated to your issue, but) this sounds like session caching is fundamentally busted.

Yes, and I think that's why, as a mitigation, we call `ClearSessionCache()` in several transactions that impact active sessions and at every new session.

I agree that the current design of the session cache is not effective. The problem just wasn't exposed much until LeanPocket.
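
Roughly, the mitigation looks like this (a hedged sketch, not the actual `ClearSessionCache()` implementation; the message names and the session-boundary check are assumptions): drop the whole cache whenever a transaction can change a node set, and at every session boundary.

```go
package main

import "fmt"

// sessionCache is a stand-in for the cached sessions, keyed by a header string.
type sessionCache struct{ sessions map[string][]string }

func (c *sessionCache) clear() { c.sessions = map[string][]string{} }

// onTx drops everything cached when a transaction may change a session node set.
func (c *sessionCache) onTx(msgType string) {
	switch msgType {
	case "stake", "edit_stake", "unstake", "jail", "unjail":
		c.clear()
	}
}

// onNewBlock drops everything cached at every session boundary
// (with 4 blocks per session, sessions start at heights 1, 5, 9, ... 97, 101).
func (c *sessionCache) onNewBlock(height, blocksPerSession int64) {
	if height%blocksPerSession == 1 {
		c.clear()
	}
}

func main() {
	c := &sessionCache{sessions: map[string][]string{
		"97/app/0001": {"A", "B", "C", "D"},
	}}
	c.onTx("jail") // the node set may have changed; invalidate everything
	fmt.Println(len(c.sessions)) // prints 0
}
```
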

@nodiesBlade (Contributor) commented on Apr 19, 2023

Hmm, the fundamental issue is still that nodes do not have a deterministic way to keep the same session state while handling relays. I think your fix patches one of the problems that is a direct result of session rollover, but it's definitely something to think about in a future PR. Thank you for your explanations! This all makes sense and LGTM!

Each node has its own `SessionCache` in LP to isolate any shared concerns, except the first node, which holds a reference to `GlobalSessionCache`. In hindsight, I think this should be named `GlobalConsensusCache`, and no node (regardless of LP) should share this cache, to prevent a consensus hiccup.

So in your scenario:

> If that happens, the cached session is not correct, so the node accepts claims that should be rejected, or vice versa.

All validators should still share the same session state and reject the claim. I'm not sure this can actually happen with LP, given that nodes have an individual session cache for handling relays and share a single common cache for consensus, unless it's the first node in LP. Can you confirm whether my understanding is correct here?

@Olshansk added the bug label on Apr 19, 2023
@Olshansk added this to the Network Scalability milestone on Apr 19, 2023
@msmania (Contributor, Author) commented on Apr 20, 2023

First of all, thank you for signing off!

> In hindsight, I think this should be named `GlobalConsensusCache`, and no node (regardless of LP) should share this cache, to prevent a consensus hiccup.

Personally, the name `GlobalSessionCache` sounds right to me because it caches sessions, after all. `GlobalConsensusCache` sounds like it caches some data in the Tendermint layer.

> All validators should still share the same session state and reject the claim. I'm not sure this can actually happen with LP, given that nodes have an individual session cache for handling relays and share a single common cache for consensus, unless it's the first node in LP. Can you confirm whether my understanding is correct here?

I was thinking of the following scenario. What prevents this from happening is that we clear the caches every session. If we didn't, I think any secondary node could hit a wrong-apphash error, like Node2 below. This is all theoretical; maybe I'm wrong and missing something.


SessionNodeCount = 4

Block 101: Session A/B/C/D
Block 102: Session A/B/C/D
Block 103: Session A/B/C/D
Block 104: Session A/C/D/E (B out; E in)

Node1 with LP: hosting A (and not hosting E)

  • SessionCache[A] = GlobalSessionCache = A/B/C/D
  • replays a claim for E --> should be approved (potential wrong apphash error)

Node2 with LP: hosting B

  • SessionCache[B] = A/B/C/D
  • replays a claim for B --> should be rejected (potential wrong apphash error)

Anyway, this is unrelated to this PR. Maybe I should create a new issue.

@msmania merged commit 8d4836b into staging on Apr 20, 2023
@msmania deleted the toshi/bugfix-session-rollover branch on April 20, 2023 at 16:03