Fix: snappy downloader #5393

Open. Wants to merge 38 commits into base: develop

Conversation

jcnelson
Member

@jcnelson jcnelson commented Oct 28, 2024

This fixes a few bugs in the relayer and networking stack:

  • It removes a convoy effect that can happen when the node is under load. Before, the channel between the p2p thread and relayer thread could grow unbounded if the relayer couldn't keep up with bursts of NetworkResults. In this PR, the p2p thread merges outstanding NetworkResults into a single NetworkResult and drops / consolidates obsolete data, which both minimizes the relayer's total workload and minimizes the time between receiving a data-bearing message and processing it.

  • It fixes the block downloader to detect and deprioritize unhealthy replicas during block download, so that most of the time the node only queries replicas that can serve it data. It also improves error and retry logging in the downloader.

  • To stress-test the downloader, it adds an option to disable block-push altogether, so the node is forced to download everything.

  • It fixes an off-by-one error in the p2p stack which was preventing it from caching reward sets. Instead, the p2p stack would always fetch reward sets from disk, which led to performance degradation.
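
The first bullet's merge-instead-of-enqueue idea can be sketched roughly as below. This is an illustration only, not the code in this PR: `PendingResult`, `Mailbox`, and their fields are hypothetical stand-ins, and the real `NetworkResult` carries far more data. The sketch assumes a single shared slot in place of a channel, so that at most one outstanding result ever waits for the consumer.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical, simplified stand-in for a NetworkResult.
/// Only the fields needed to illustrate merging are modeled.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct PendingResult {
    /// Newest chain height the producer has observed (assumed field).
    pub tip_height: u64,
    /// Block payloads keyed by a placeholder block id, so duplicates collapse.
    pub blocks: HashMap<u64, Vec<u8>>,
}

impl PendingResult {
    /// Fold a newer result into this one: keep the highest tip, union the blocks.
    pub fn merge(&mut self, newer: PendingResult) {
        self.tip_height = self.tip_height.max(newer.tip_height);
        self.blocks.extend(newer.blocks);
    }
}

/// A one-slot mailbox shared by the producer and consumer threads.
#[derive(Clone, Default)]
pub struct Mailbox {
    slot: Arc<Mutex<Option<PendingResult>>>,
}

impl Mailbox {
    /// Producer side: merge into the waiting result instead of queueing another.
    pub fn submit(&self, result: PendingResult) {
        let mut slot = self.slot.lock().unwrap();
        let mut merged = slot.take().unwrap_or_default();
        merged.merge(result);
        *slot = Some(merged);
    }

    /// Consumer side: take whatever has accumulated, leaving the slot empty.
    pub fn take(&self) -> Option<PendingResult> {
        self.slot.lock().unwrap().take()
    }
}
```

Under these assumptions, a burst of submissions costs the consumer a single `take()`, and the backlog stays bounded no matter how far the consumer falls behind.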

@jcnelson jcnelson requested a review from a team as a code owner October 28, 2024 20:55
jferrant previously approved these changes Oct 28, 2024
jcnelson and others added 20 commits October 28, 2024 17:48
… so that we only forward results that contain blocks (drop tx and stackerdb messages)
…ded), and merge un-sent NetworkResult's in order to keep the queue length bound to at most one outstanding NetworkResult
… and clean out completed tenures based on whether or not we see them processed
@jcnelson jcnelson changed the title Fix: drain relayer channel Fix: snappy downloader Nov 1, 2024
@jcnelson
Member Author

jcnelson commented Nov 3, 2024

There's still something weird happening with this PR. My test node has repeatedly gotten itself stuck at the same block height for hours, with not even so much as an attempt to download missing block data (despite it witnessing the Bitcoin chain advancing). Need to dig more into this.

@jcnelson
Member Author

jcnelson commented Nov 5, 2024

Okay, this is now working again. The fix was to disconnect from nodes that served seemingly-stale data via their unconfirmed tenure downloader interface. There's at least one Stacks 2.5 node out there still running, and it was consistently replying to the unconfirmed downloader and inadvertently preventing it from making progress (since the bug caused the downloader to wait forever for the remote peer's stale view to be corrected).
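
As a rough illustration of the policy described here (deprioritize peers that fail to serve data, disconnect from peers that serve stale data), a minimal sketch follows. All names (`Outcome`, `PeerHealth`, string peer addresses) are hypothetical and are not taken from the downloader's actual code; the failure counter and the sort order are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Result of one (hypothetical) download attempt against a peer.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    /// The peer served the requested data.
    Served,
    /// The request failed or timed out.
    Failed,
    /// The peer replied, but with a view older than our own.
    Stale,
}

/// Tracks consecutive failures per peer. The address type is a placeholder.
#[derive(Debug, Default)]
pub struct PeerHealth {
    failures: HashMap<String, u32>,
}

impl PeerHealth {
    /// Record an outcome. Returns true if the caller should disconnect.
    pub fn record(&mut self, peer: &str, outcome: Outcome) -> bool {
        match outcome {
            Outcome::Served => {
                // A success clears the peer's failure history.
                self.failures.remove(peer);
                false
            }
            Outcome::Failed => {
                *self.failures.entry(peer.to_string()).or_insert(0) += 1;
                false
            }
            Outcome::Stale => {
                // Do not wait for the peer to catch up; drop it instead.
                self.failures.remove(peer);
                true
            }
        }
    }

    /// Order candidates so peers with fewer recent failures are tried first.
    /// The sort is stable, so ties keep their original order.
    pub fn prioritize(&self, mut peers: Vec<String>) -> Vec<String> {
        peers.sort_by_key(|p| self.failures.get(p).copied().unwrap_or(0));
        peers
    }
}
```

Treating a stale reply as grounds for disconnecting, rather than as a retryable failure, is what keeps a peer that never catches up from stalling progress in this sketch.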

…tale (otherwise we would cease to make progress if the node never caught up), and throttle down unconfirmed download checks
Contributor

@obycode obycode left a comment

This LGTM! I just had one minor refactoring request.

Member

@kantai kantai left a comment

LGTM, just a few comments.

jferrant previously approved these changes Nov 5, 2024
Contributor

@obycode obycode left a comment

LGTM

@jferrant
Collaborator

jferrant commented Nov 5, 2024

I think this breaks the simple_neon_integration test. I don't see this failing anywhere else (it passes on develop with prom metrics enabled). It seems there was a change to the Prometheus metric in this PR that is screwing it up.

Collaborator

@jferrant jferrant left a comment

Will reapprove once the simple_neon_integration test is fixed.

4 participants