Pass a correct session end height to NewSession
#1545
With session rollover (#1536), we accept relays for older sessions. For example, a node at block 101 accepts a relay for session height 97. There is a bug in the relay validation function `relay.Validate`. In the example above, the function sees the following values:

```
ctx.BlockHeight() = 101
sessionBlockHeight = r.Proof.SessionBlockHeight = 97
sessionCtx.BlockHeight() = 97
```

If the corresponding session is not cached, it passes `sessionCtx` and `ctx` to `NewSession`. This may return a wrong session, because the second argument is supposed to be the context at the end of the session, but in this case it is not. The proposed fix: if `ctx` is beyond the relay session, we get a new context at the session end and pass it to `NewSession`.
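To make the height arithmetic of the proposed fix concrete, here is a minimal standalone sketch. It is not the pocket-core implementation: `sessionEndHeight` and `endContextHeight` are hypothetical helper names, and the session length of 4 blocks is an assumption inferred from the 97 to 100 example in this thread.

```go
package main

import "fmt"

// sessionEndHeight returns the last block height of the session starting at
// sessionStart, assuming a session spans blocksPerSession consecutive blocks.
// (Hypothetical helper, not a pocket-core function.)
func sessionEndHeight(sessionStart, blocksPerSession int64) int64 {
	return sessionStart + blocksPerSession - 1
}

// endContextHeight picks the height whose context should be used as the
// "session end" argument: the current height while the session is still in
// progress, or the session's last block once the chain has moved past it.
func endContextHeight(current, sessionStart, blocksPerSession int64) int64 {
	end := sessionEndHeight(sessionStart, blocksPerSession)
	if current > end {
		return end
	}
	return current
}

func main() {
	// Node at block 101 validating a rollover relay for session height 97.
	fmt.Println(endContextHeight(101, 97, 4))
	// Node at block 99, still inside the session that started at 97.
	fmt.Println(endContextHeight(99, 97, 4))
}
```

In the rollover case the sketch clamps to the session's last block instead of using the node's current height, which is the behavior the description asks for.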
So if I'm understanding correctly, the second argument accounts for nodes that get jailed between blocks within a specific session (i.e. blocks 97-100*). Since we're allowing session rollovers, we need to check the state of the jailed nodes at the last block (i.e. block 100*) of a specific session. This could happen if a node serves a session rollover relay after restarting the pocket core process, which would then no longer have the cached session, since it is only kept in memory. Does that sound about right? If so, LGTM.
Thank you for your comment. It's not limited to jail. Any of newstake/editstake/unstake/jail/unjail in the middle of a session can change the session node set. This means the session node set is not finalized until the session end block is committed. It's defined by the behavior of
Not exactly. This could be a problem if a node caches a session rollover relay for 97 and the session node set is changed at the next session height 101, AND the cached session is not cleared until the node starts receiving claims for session 97. If that happens, the cached session is not correct, so the node accepts claims that should be rejected, or vice versa.
I think your concern(s) make sense, and this fix applies to both scenarios.
(May be unrelated to your issue.) But this sounds like session caching is fundamentally busted. If we cache the session by SessionHeader, then most actors won't reflect the state of validators/servicers for session blocks [start+1, ..., end], because servicers will usually only cache the state at the start of the session until the cache is cleared (when it submits the claim). This means that, depending on when a servicer receives a relay, it's possible for two servicers to cache different 'session states'.
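The divergence described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions: `stakedAt`, `sessionCache`, `sameNodes`, and `demoState` are hypothetical names, the node letters and heights are made up for illustration, and none of this reflects pocket-core's real cache or session types.

```go
package main

import (
	"fmt"
	"sort"
)

// stakedAt is a hypothetical stand-in for chain state: the staked node names
// as of each block height.
type stakedAt map[int64][]string

// sessionCache stores one node set per session start height and never
// refreshes an entry once it exists (the behavior questioned above).
type sessionCache map[int64][]string

// get returns the cached node set for the session, or computes it from the
// state at height now (the moment the first relay arrives) and caches it.
func (c sessionCache) get(state stakedAt, sessionStart, now int64) []string {
	if nodes, ok := c[sessionStart]; ok {
		return nodes
	}
	nodes := append([]string(nil), state[now]...)
	sort.Strings(nodes)
	c[sessionStart] = nodes
	return nodes
}

// sameNodes reports whether two sorted node sets are identical.
func sameNodes(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// demoState models node D being replaced by E in the middle of session 97.
func demoState() stakedAt {
	return stakedAt{
		97:  {"A", "B", "C", "D"},
		98:  {"A", "B", "C", "D"},
		99:  {"A", "B", "C", "E"},
		100: {"A", "B", "C", "E"},
	}
}

func main() {
	state := demoState()
	early, late := sessionCache{}, sessionCache{}
	a := early.get(state, 97, 97) // first relay arrives at block 97
	b := late.get(state, 97, 99)  // first relay arrives at block 99
	fmt.Println(a, b, sameNodes(a, b))
}
```

In this model the two servicers hold different node sets for the same session key, purely because their first relay arrived at different heights.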
Yes, and I think that's why as a mitigation we call I agree that the current design of the session cache is not effective. It hasn't been exposed so much until LeanPocket.
Hmm.. the fundamental issue still lies in that nodes do not have a deterministic way to save the same session state throughout handling relays. I think your fix patches up one of the problems as a direct result of session rollover, but definitely something to think about in the future PR. Thank you for your explanations! This all makes sense and LGTM! Each node has its own So in your scenario:
All validators should still share the same session state and reject the claim. I'm not sure if this can actually happen with LP, given that the nodes have an individual cache for sessions when handling relays and share a single common cache for consensus, unless it's the first node in LP. Can you confirm if my understanding is correct here?
First of all, thank you for signing off!
Personally the name
I was thinking of the following scenario. What prevents this from happening is that we clear caches every session. If we don't, I think any secondary node can hit the wrong apphash error, like the case of Node2 below. This is all theoretical. Maybe I'm wrong and missing something.
SessionNodeCount = 4
Block 101: Session A/B/C/D
Node1 with LP: hosting A (and not hosting E)
Node2 with LP: hosting B
Anyway, this is unrelated to this PR. Maybe I should create a new issue.