Fix/5193 stackerdb decoherence #5197
Conversation
…te machine are using
… on irrecoverable error
…as to verify that connection pinning prevents decoherence
LGTM, will approve once we can confirm it resolves the issue.
LGTM, just flagged one typo
…ignore flag set Signed-off-by: Jacinta Ferrant <[email protected]>
…ting for unnecessary signatures Signed-off-by: Jacinta Ferrant <[email protected]>
This allows us to avoid hitting block 240, which is when the stackers get unstacked and the chain stalls, making `partial_tenure_fork` less flaky
I am testing this on mainnet along with my other in-flight PRs, and I think I'm getting OOM'ed. I need to confirm first.
Will also run this branch to see if I can reproduce.
Co-authored-by: Brice Dobry <[email protected]>
…erence Fix/5193 stackerdb decoherence
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
This fixes #5193 by having all p2p state machines (namely, both the epoch 2.x and Nakamoto inv syncs and StackerDB sync) track and report their pinned connections to the peer network, so they won't be pruned. The cause of the decoherence seems to have been that once a peer's outbound neighbor count exceeded `[connection_opts].soft_max_neighbors_per_org` (or one of the other similar limits), the pruner would simply close the newest connections until the number of connections was brought back down. This would often happen during StackerDB sync (and would also happen in inv sync), with the effect that a node with many neighbors would fail to synchronize its StackerDB replicas. I suspect this was also the cause of the decoherence we saw with larger Nakamoto testnets, where the soft limits on the number of neighbors were exceeded.
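For illustration, here is a minimal Rust sketch of the pinning idea described above. The type and method names (`PeerPruner`, `pin`, `unpin`, `prune`) are assumptions made up for this example, not the actual stacks-core identifiers; the point is only that the pruner enforces the soft neighbor cap while never closing a connection that an in-flight state machine has pinned.

```rust
use std::collections::HashSet;

/// Hypothetical pruner sketch -- not the real stacks-core type.
struct PeerPruner {
    /// Event IDs pinned by in-flight state machines (inv sync, StackerDB sync).
    pinned: HashSet<usize>,
    /// Soft cap analogous to `[connection_opts].soft_max_neighbors_per_org`.
    soft_max_neighbors: usize,
}

impl PeerPruner {
    /// A state machine pins a connection before it starts using it.
    fn pin(&mut self, event_id: usize) {
        self.pinned.insert(event_id);
    }

    /// ...and unpins it when it finishes (or hits an irrecoverable error).
    fn unpin(&mut self, event_id: usize) {
        self.pinned.remove(&event_id);
    }

    /// Close connections until we're under the soft limit, but never touch a
    /// pinned one, so in-flight syncs keep their peers mid-conversation.
    fn prune(&mut self, open_events: &mut Vec<usize>) -> Vec<usize> {
        let mut closed = Vec::new();
        while open_events.len() > self.soft_max_neighbors {
            // Newest connections are at the end; close the newest un-pinned one.
            let victim = open_events.iter().rposition(|e| !self.pinned.contains(e));
            match victim {
                Some(pos) => closed.push(open_events.remove(pos)),
                None => break, // everything left is pinned
            }
        }
        closed
    }
}
```

Previously, the pruner had no notion of "pinned", so a connection opened by StackerDB or inv sync looked no different from any other and was the first to go once the soft limits were hit.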
You can see the effect of this PR in `/v2/neighbors` -- inbound and outbound peer entries now report an `age` (in seconds), which should rarely be reset thanks to the pinning. Before, neighbors would come and go very quickly as state machines connected to them and the pruner immediately disconnected them.

Leaving this as a draft for now so I can test it live with the Nakamoto testnet signers.
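One quick way to observe this is to poll the endpoint and watch whether a given neighbor's `age` keeps growing instead of resetting. A rough sketch, assuming the `reqwest` (with the "blocking" and "json" features) and `serde_json` crates; the node URL and the exact JSON field names below are illustrative assumptions, not a confirmed schema:

```rust
use std::{thread, time::Duration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Poll a few times; an `age` that increases across polls suggests the
    // connection is pinned and surviving the pruner.
    for _ in 0..3 {
        let body: serde_json::Value =
            reqwest::blocking::get("http://localhost:20443/v2/neighbors")?.json()?;
        if let Some(peers) = body["outbound"].as_array() {
            for peer in peers {
                println!("{}:{} age={}s", peer["ip"], peer["port"], peer["age"]);
            }
        }
        thread::sleep(Duration::from_secs(30));
    }
    Ok(())
}
```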