Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[consensus] Update proposer metrics #19655

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

arun-koshy
Copy link
Contributor

@arun-koshy arun-koshy commented Oct 2, 2024

Description

Adding a few metrics that will help with the smart ancestor selection investigations

  • Set leader timestamp for the parent round of the current threshold clock round in threshold clock. This will allow for us to get better block_proposal_leader_wait_ms metric values
  • Update block_proposal_leader_wait_count whenever we hit the case where leaders don't exist during proposal
  • Add metric for the interval between propsals.

Test plan

private-testnet


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:

Copy link

vercel bot commented Oct 2, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

3 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Oct 2, 2024 6:38am
sui-kiosk ⬜️ Ignored (Inspect) Oct 2, 2024 6:38am
sui-typescript-docs ⬜️ Ignored (Inspect) Oct 2, 2024 6:38am

@arun-koshy
Copy link
Contributor Author

arun-koshy commented Oct 2, 2024

All of the changes to the metrics can be seen reflected on the left side of the graphs and the right side is "main". Let me know what you think.

  • We can see the rate at which we have to call back in to try new block because leaders were missing.
    Screenshot 2024-10-02 at 10 55 09 AM
  • We can see the the average wait time for a leader AFTER a quorum has been reached which is around 3ms. With this we will have the quorum receive latency + leader wait time separated to show us which is taking most of the time.
    Screenshot 2024-10-02 at 10 55 37 AM
  • And this is what block proposal interval will look like.
    Screenshot 2024-10-02 at 10 55 53 AM

.add_blocks(accepted_blocks.iter().map(|b| b.reference()).collect())
// Get max round of accepted blocks. This will be equal to the threshold
// clock round, either by advancing the threshold clock round by being
// greater than current clock round or by equaling the current clock round.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the case? Blocks older than current threshold clock round can get accepted as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only added the case in the comment for greater and equal but blocks less than the clock round are essentially ignored by threshold clock

self.context
.metrics
.node_metrics
.block_proposal_leader_wait_count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use a separate metric for counting the number of times leader is not found. block_proposal_leader_wait_count is tied to block_proposal_leader_wait_ms, so when the average wait is ~250ms, we know the leader is missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the confusion for me with these metrics is that it doesn't just include leader wait time, it includes the quorum receive wait time which can make this metric a little misleading. Separating them brings more clarity. Though I guess we could always subtract this metric from quorum receive latency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants