Fix: snappy downloader #5393
base: develop
Conversation
…if there's download pressure
… so that we only forward results that contain blocks (drop tx and stackerdb messages)
…-network/stacks-blockchain into fix/relayer-drain-channel
…ded), and merge un-sent NetworkResult's in order to keep the queue length bound to at most one outstanding NetworkResult
… and clean out completed tenures based on whether or not we see them processed
There's still something weird happening with this PR. My test node has repeatedly gotten itself stuck at the same block height for hours, with not even so much as an attempt to download missing block data (despite it witnessing the Bitcoin chain advancing). Need to dig more into this.
…o current or next reward cycle
…-network/stacks-blockchain into fix/relayer-drain-channel
Okay, this is now working again. The fix was to disconnect from nodes that served seemingly-stale data via their unconfirmed tenure downloader interface. There's at least one Stacks 2.5 node out there still running, and it was consistently replying to the unconfirmed downloader and inadvertently preventing it from making progress (since the bug caused the downloader to wait forever for the remote peer's stale view to be corrected).
…tale (otherwise we would cease to make progress if the node never caught up), and throttle down unconfirmed download checks
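To make the stale-peer handling concrete, here is a minimal sketch of the idea under stated assumptions: the types (`UnconfirmedTenureDownloader`, `PeerView`), fields, and method names below are invented for illustration and do not correspond to the real stacks-blockchain downloader code. It shows the two behaviors described above: treating peers whose reported view lags ours as stale (so the node can disconnect instead of waiting forever), and throttling how often unconfirmed download checks run.

```rust
// Hypothetical sketch only -- these types and fields are stand-ins for the
// idea described in the comment above, not the actual stacks-blockchain code.
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct UnconfirmedTenureDownloader {
    /// Last time we polled peers for unconfirmed tenure data.
    last_check: Instant,
    /// Minimum interval between unconfirmed download checks (throttling).
    check_interval: Duration,
}

struct PeerView {
    /// Highest tenure the peer claims to know about.
    highest_tenure_height: u64,
}

impl UnconfirmedTenureDownloader {
    /// Peers whose reported view lags our own are treated as stale: we
    /// disconnect from them rather than wait for their view to catch up.
    fn stale_peers(
        &self,
        our_tenure_height: u64,
        peer_views: &HashMap<u64, PeerView>,
    ) -> Vec<u64> {
        peer_views
            .iter()
            .filter(|(_, view)| view.highest_tenure_height < our_tenure_height)
            .map(|(peer_id, _)| *peer_id)
            .collect()
    }

    /// Throttle: only run an unconfirmed download pass if enough time has
    /// elapsed since the last one.
    fn should_check(&mut self, now: Instant) -> bool {
        if now.duration_since(self.last_check) >= self.check_interval {
            self.last_check = now;
            true
        } else {
            false
        }
    }
}
```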
This LGTM! I just had one minor refactoring request.
LGTM, just a few comments.
LGTM
I think this breaks the simple_neon_integration test. I don't see this failing anywhere else (it passes on develop with prom metrics enabled). It seems there was a change to the Prometheus metric in this PR that is screwing it up.
Will reapprove once simple_neon_integration test is fixed.
This fixes a few bugs in the relayer and networking stack:
- It removes a convoy effect that can happen when the node is under load. Before, the channel between the p2p thread and the relayer thread could grow unbounded if the relayer couldn't keep up with bursts of `NetworkResult`s. In this PR, the p2p thread merges outstanding `NetworkResult`s into a single `NetworkResult` and drops / consolidates obsolete data, which both minimizes the relayer's total workload and minimizes the time between receiving a data-bearing message and processing it. (A rough sketch of this merge-and-consolidate scheme appears after this list.)
- It fixes the block downloader so that it detects and deprioritizes unhealthy replicas during block download, so that most of the time the node only queries replicas that can actually serve it data. It also improves error and retry logging in the downloader.
- To stress-test the downloader, it adds an option to disable block-push altogether, so the node is forced to download everything.
- It fixes an off-by-one error in the p2p stack which was preventing it from caching reward sets. Instead, the p2p stack would always fetch reward sets from disk, which led to performance degradation. (An illustrative sketch of this kind of cache-key off-by-one also appears below.)
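As a rough illustration of the convoy-effect fix in the first item, here is a minimal sketch of a one-slot, merge-on-publish channel between the p2p thread and the relayer thread. The `NetworkResult` shown here is a simplified stand-in (two Vec fields), not the real type, and the merge policy (keep blocks, keep only the freshest transactions) is only an assumption about what "drops / consolidates obsolete data" could look like.

```rust
// Illustrative sketch: `NetworkResult`, its fields, and this one-slot channel
// are simplified stand-ins, not the real stacks-blockchain types or API.
use std::sync::{Condvar, Mutex};

#[derive(Default)]
struct NetworkResult {
    /// Newly-arrived blocks (kept on merge).
    blocks: Vec<Vec<u8>>,
    /// Transactions / StackerDB chunks (assumed to be consolidated on merge,
    /// since stale copies are obsolete by the time the relayer runs).
    transactions: Vec<Vec<u8>>,
}

impl NetworkResult {
    /// Fold a newer result into this one, keeping the queue bounded to a
    /// single outstanding NetworkResult.
    fn merge(&mut self, newer: NetworkResult) {
        self.blocks.extend(newer.blocks);
        // Keep only the freshest transaction set instead of queueing them all.
        self.transactions = newer.transactions;
    }
}

/// A one-slot "channel" between the p2p thread and the relayer thread.
#[derive(Default)]
struct ResultSlot {
    pending: Mutex<Option<NetworkResult>>,
    ready: Condvar,
}

impl ResultSlot {
    /// p2p thread: publish a result, merging it with any un-consumed one.
    fn publish(&self, result: NetworkResult) {
        let mut slot = self.pending.lock().unwrap();
        let merged = match slot.take() {
            Some(mut pending) => {
                // A previous result is still un-consumed: fold the new one in.
                pending.merge(result);
                pending
            }
            None => result,
        };
        *slot = Some(merged);
        self.ready.notify_one();
    }

    /// Relayer thread: take the single merged result, blocking until one exists.
    fn consume(&self) -> NetworkResult {
        let mut slot = self.pending.lock().unwrap();
        loop {
            if let Some(result) = slot.take() {
                return result;
            }
            slot = self.ready.wait(slot).unwrap();
        }
    }
}
```

With this structure, no matter how bursty the p2p thread's output is, the relayer only ever sees at most one merged `NetworkResult` per wakeup.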
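Separately, the reward-set caching item describes an off-by-one that defeated the cache entirely. The following is purely illustrative of that bug class (a mismatched cache key causing permanent misses and disk fallbacks); the names are hypothetical and the actual error in the p2p stack may differ.

```rust
// Purely illustrative: shows how an off-by-one in the cache key can turn a
// reward-set cache into a permanent miss. These names are hypothetical and do
// not mirror the actual stacks-blockchain p2p code.
use std::collections::HashMap;

struct RewardSet; // stand-in for the real reward set data

struct RewardSetCache {
    by_cycle: HashMap<u64, RewardSet>,
}

impl RewardSetCache {
    fn get_or_load(
        &mut self,
        reward_cycle: u64,
        load_from_disk: impl Fn(u64) -> RewardSet,
    ) -> &RewardSet {
        // Buggy version: inserting under `reward_cycle + 1` means a later
        // lookup with `reward_cycle` never hits, so every call falls through
        // to the (slow) disk load:
        //
        //   self.by_cycle.insert(reward_cycle + 1, load_from_disk(reward_cycle));
        //
        // Fixed version: key the cache with the same cycle used for lookups.
        self.by_cycle
            .entry(reward_cycle)
            .or_insert_with(|| load_from_disk(reward_cycle))
    }
}
```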