High regen queue job wait time #5384

Closed
twoeths opened this issue Apr 19, 2023 · 5 comments

twoeths commented Apr 19, 2023

Describe the bug

With the new gossip flow, the Job Wait Time of the regen queue is almost 1s; this is likely one of the reasons for the I/O lag issue.

Screenshot 2023-04-19 at 13 32 46
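For context, this is roughly what the metric measures: the delay between a regen job being pushed onto the queue and the moment it starts executing. A minimal sketch, with illustrative names (this is not Lodestar's actual JobItemQueue):

```ts
// Minimal sketch of a serial job queue that logs job wait time.
// All names here are illustrative, not Lodestar's actual implementation.
type Job = {
  run: () => Promise<void>;
  addedTimeMs: number;
};

class SimpleJobQueue {
  private jobs: Job[] = [];
  private running = false;

  push(run: () => Promise<void>): void {
    this.jobs.push({run, addedTimeMs: Date.now()});
    void this.runNext();
  }

  private async runNext(): Promise<void> {
    if (this.running) return;
    const job = this.jobs.shift();
    if (!job) return;
    this.running = true;
    // "Job wait time" = time spent sitting in the queue before execution starts
    const jobWaitTimeMs = Date.now() - job.addedTimeMs;
    console.log(`job wait time: ${jobWaitTimeMs}ms`);
    try {
      await job.run();
    } finally {
      this.running = false;
      void this.runNext();
    }
  }
}
```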

Right now, new gossip work is only accepted while the regen queue length is below 16.

Screenshot 2023-04-19 at 13 34 43
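The gating check is roughly the following (REGEN_CAN_ACCEPT_WORK_THRESHOLD is the real constant; the function around it is a sketch):

```ts
// Backpressure sketch: only accept new gossip work while the regen queue is
// short. The constant name is from the codebase; the function is illustrative.
const REGEN_CAN_ACCEPT_WORK_THRESHOLD = 16;

function regenCanAcceptWork(regenQueueLength: number): boolean {
  return regenQueueLength < REGEN_CAN_ACCEPT_WORK_THRESHOLD;
}
```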

Expected behavior

  • This is from stable (v1.7): the regen queue length is only 0 or 1 and the job wait time is far lower

Screenshot 2023-04-19 at 13 40 31

Screenshot 2023-04-19 at 13 40 52

Question

  • We have a state cache of 96 items; why are there attestations whose head block state is not in that cache, and is it worth always regenerating states for them? Maybe we should skip some of them (see the sketch after this list), per:
    // TODO (LH): Enforce a maximum skip distance for unaggregated attestations.
  • Consider reducing REGEN_CAN_ACCEPT_WORK_THRESHOLD (right now it's 16) cc @dapplion
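As a sketch of what a maximum skip distance could look like (MAX_SKIP_DISTANCE and the function shape are hypothetical, not an agreed design):

```ts
// Hedged sketch of the Lighthouse TODO above: instead of regenerating a state
// for every attestation, ignore ones whose head block is too many slots
// behind the attestation slot. MAX_SKIP_DISTANCE is a hypothetical knob.
const MAX_SKIP_DISTANCE = 32; // slots, illustrative value

function shouldRegenForAttestation(attestationSlot: number, headBlockSlot: number): boolean {
  const skipDistance = attestationSlot - headBlockSlot;
  return skipDistance <= MAX_SKIP_DISTANCE;
}
```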

twoeths commented Apr 19, 2023

When I zoom in, some Job Wait Time values are > 4s, which is too high.

Screenshot 2023-04-19 at 15 00 19

@dapplion

@tuyennhv for this type of metric it's best to capture a histogram to understand the real distribution of values over a longer timeframe. That would allow us to answer the question: what % of regen jobs over a day take > 1s?
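For example, with prom-client (which Lodestar's metrics are built on), assuming an illustrative metric name and my own guess at bucket boundaries:

```ts
import {Histogram} from "prom-client";

// Histogram of regen queue job wait time in seconds.
// Metric name and bucket boundaries are illustrative, not Lodestar's config.
const regenQueueJobWaitTime = new Histogram({
  name: "lodestar_regen_queue_job_wait_time_seconds",
  help: "Time a regen job spends queued before it starts executing",
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
});

// At job start:
// regenQueueJobWaitTime.observe((Date.now() - addedTimeMs) / 1000);
```

The fraction of jobs over 1s is then a simple ratio of bucket counters in PromQL.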


twoeths commented Apr 20, 2023

@dapplion about 25% of them take > 1s

Screenshot 2023-04-20 at 11 16 35

This issue happens randomly on the stable mainnet node but more consistently on the unstable mainnet node (could be because unstable likely validates all attestations with the new gossip flow).


twoeths commented Apr 20, 2023

Sometimes the getBlockSlotState() call in precomputeEpoch takes up to 3s.

Screenshot 2023-04-20 at 16 09 57

Screenshot 2023-04-20 at 16 10 05

Screenshot 2023-04-20 at 16 10 19
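A plausible explanation: when the requested slot is ahead of the block's state, getBlockSlotState() has to advance the state slot by slot, and every epoch boundary triggers a full epoch transition. A rough sketch of that shape (function names here are illustrative stand-ins, not Lodestar's state-transition API):

```ts
interface BeaconStateLike {
  slot: number;
}

const SLOTS_PER_EPOCH = 32;

// Illustrative stubs; the real per-slot and per-epoch processing live in
// @lodestar/state-transition and are far more involved.
function processSlot(_state: BeaconStateLike): void {}
function processEpochTransition(_state: BeaconStateLike): void {
  // Rewards/penalties, balance updates, shuffling, etc. This is the step
  // that can take seconds on a mainnet-sized state.
}

function processSlotsTo(state: BeaconStateLike, targetSlot: number): void {
  while (state.slot < targetSlot) {
    processSlot(state);
    if ((state.slot + 1) % SLOTS_PER_EPOCH === 0) {
      processEpochTransition(state);
    }
    state.slot += 1;
  }
}
```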

Update: the long epoch transition should be investigated in #5409.


twoeths commented Apr 24, 2023

Latest 1h chart over the last 24h on the unstable mainnet node:

Screenshot 2023-04-25 at 05 08 40

vs the stable mainnet node:

Screenshot 2023-04-25 at 05 09 16

twoeths closed this as completed Apr 24, 2023