Implement regen metrics #2153

Closed
dapplion opened this issue Mar 9, 2021 · 6 comments
Labels
prio-medium Resolve this some time soon (tm). scope-metrics All issues with regards to the exposed metrics. scope-performance Performance issue and ideas to improve performance.

Comments

@dapplion
Contributor

dapplion commented Mar 9, 2021

  • Cache hits vs cache miss
  • Cache frequency usage, last usage per item
@dapplion dapplion self-assigned this Mar 9, 2021
@dapplion dapplion changed the title Implement state cache metrics Implement regen metrics Apr 10, 2021
@dapplion
Contributor Author

dapplion commented Apr 10, 2021

@dapplion I think we should close this in favor of suggestion above:
add metrics to regen module

From #2285 (comment)

This was referenced May 2, 2021
@dapplion dapplion added the scope-metrics All issues with regards to the exposed metrics. label May 2, 2021
@dapplion dapplion added the scope-performance Performance issue and ideas to improve performance. label Jun 11, 2021
@dapplion dapplion added the prio-medium Resolve this some time soon (tm). label Jun 27, 2021
@g11tech
Contributor

g11tech commented Jun 27, 2021

@dapplion please assign

@dapplion dapplion assigned g11tech and unassigned dapplion Jun 27, 2021
@dapplion
Contributor Author

> @dapplion please assign

If you need help designing the metrics for this component, reach out to me or Cayman. Be aware that the regen logic is quite complex, with many entrypoints. Getting the metrics right will be much harder than for the fork choice and will require a good understanding of its mechanics.

@g11tech
Contributor

g11tech commented Jun 27, 2021

@dapplion thanks, will try to figure it out while I implement other, simpler metrics. Will circle back on this once I have a preliminary understanding. 👍

@g11tech
Contributor

g11tech commented Jul 10, 2021

OK, I think I have a preliminary understanding of the regen module, on the basis of which I propose the following metrics structure:

regenStateCacheLookupTotal: gauge<entrypointFn,callingModule>
regenStateCacheLookupHits: gauge<entrypointFn,callingModule>

regenCPStateCacheLookupTotal: gauge<entrypointFn,callingModule>
regenCPStateCacheLookupHits: gauge<entrypointFn,callingModule>

where the entrypointFn label is one of ['getPreState', 'getCheckpointState', 'getState', 'getBlockSlotState']
and the callingModule label is one of ['validateGossipBlock', 'validateGossipAttestation', 'validateGossipAggregateAndProof', 'validateGossipVoluntaryExit', 'produceBlock', 'produceAttestationData', 'getProposerDuties', 'getAttesterDuties', 'getSyncCommitteeDuties', 'onFinalized']

This gives good slicing, dicing, and rollups to figure out and debug what is happening.
In a Grafana dashboard panel they can be plotted as stacked series, and module-wise sum by (callingModule) graphs can be plotted as well.
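
A minimal sketch of the proposed structure, for illustration only. It uses a hypothetical in-memory `LabeledCounter` rather than any real metrics library, and the helper names `recordLookup` and `hitRate` are assumptions, not part of the regen module. It shows how one total/hits pair, labeled by entrypointFn and callingModule, can be rolled up per calling module in the same way a sum by (callingModule) query would.

```typescript
// Label values taken from the proposal above; the types themselves are illustrative.
type EntrypointFn = "getPreState" | "getCheckpointState" | "getState" | "getBlockSlotState";
type CallingModule = string; // e.g. "validateGossipBlock", "produceBlock", "onFinalized"

interface RegenLabels {
  entrypointFn: EntrypointFn;
  callingModule: CallingModule;
}

// Hypothetical in-memory stand-in for a labeled metric (not a real metrics client).
class LabeledCounter {
  private readonly values = new Map<string, number>();

  private static key(labels: RegenLabels): string {
    return `${labels.entrypointFn}|${labels.callingModule}`;
  }

  inc(labels: RegenLabels, amount = 1): void {
    const key = LabeledCounter.key(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + amount);
  }

  // Roll up across all entrypoints for one calling module.
  sumByCallingModule(callingModule: CallingModule): number {
    let total = 0;
    this.values.forEach((value, key) => {
      if (key.split("|")[1] === callingModule) total += value;
    });
    return total;
  }
}

const regenStateCacheLookupTotal = new LabeledCounter();
const regenStateCacheLookupHits = new LabeledCounter();

// Record one cache lookup and whether it was served from the cache.
function recordLookup(labels: RegenLabels, hit: boolean): void {
  regenStateCacheLookupTotal.inc(labels);
  if (hit) regenStateCacheLookupHits.inc(labels);
}

// Hit ratio for one calling module; 0 when no lookups were recorded.
function hitRate(callingModule: CallingModule): number {
  const total = regenStateCacheLookupTotal.sumByCallingModule(callingModule);
  return total === 0 ? 0 : regenStateCacheLookupHits.sumByCallingModule(callingModule) / total;
}
```

Keeping total and hits as two separately labeled series means the hit ratio can be derived at query time for any combination of labels, instead of being fixed when the metric is recorded.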

@dapplion @wemeetagain let me know your thoughts on the same.

@wemeetagain
Member

I think the lookup / hits is a good start.

Here's how I see the regen module: there are a few different loads that may happen, from best to worst:

  • the requested state is in the cache, no further processing required
  • the requested state at a prior slot is in the cache, slots must be dialed forward, possibly hitting an epoch transition
  • a prior state is in the cache, blocks must be replayed to the requested state
  • a prior state is in the cache, blocks must be replayed to the requested state, then slots must be dialed forward

And then there's the error case, when the state can't be regenerated for whatever reason.
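
The cases above could be turned into a single label value per regen request. The sketch below is only an illustration under assumptions: the `RegenLoad` enum, the `RegenWork` summary, and `classifyRegen` are hypothetical names, and it assumes the caller can report how many blocks were replayed and how many slots were dialed forward after the last available block or cached state.

```typescript
// Cases from the list above, ordered best to worst, plus the error case.
enum RegenLoad {
  CacheHit = "cache_hit",
  SlotsOnly = "slots_only",
  BlocksOnly = "blocks_only",
  BlocksAndSlots = "blocks_and_slots",
  Error = "error",
}

// Hypothetical summary of the work one regen request required.
interface RegenWork {
  blocksReplayed: number; // blocks replayed on top of a cached prior state
  slotsAdvanced: number; // slots dialed forward after the last block or cached state
  failed: boolean; // the state could not be regenerated
}

function classifyRegen(work: RegenWork): RegenLoad {
  if (work.failed) return RegenLoad.Error;
  if (work.blocksReplayed > 0) {
    return work.slotsAdvanced > 0 ? RegenLoad.BlocksAndSlots : RegenLoad.BlocksOnly;
  }
  return work.slotsAdvanced > 0 ? RegenLoad.SlotsOnly : RegenLoad.CacheHit;
}
```

Used as a label on a request counter, a value like this would show how often each load occurs without needing a separate metric per case.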

I think some other things I'd like to know, where metrics could help:

  • how often are we needing to reprocess slots? epochs? blocks?
    • how many blocks? (not very actionable, but maybe interesting)
  • how long do regen tasks take?
  • how often are there regen errors?
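
The last two questions (task duration and error frequency) could be answered by wrapping each regen call. This is a rough sketch, not the actual regen code: `RegenTaskStats` and `measureRegen` are hypothetical names, the wrapper is synchronous for brevity (a real regen entrypoint is likely asynchronous and would need to await the task), and the clock is injectable only to make the example easy to check.

```typescript
// Hypothetical running totals for regen tasks.
interface RegenTaskStats {
  calls: number;
  errors: number;
  totalMs: number;
}

// Run `task`, counting the call, any thrown error, and the elapsed time.
// `now` defaults to Date.now and can be replaced with a fake clock.
function measureRegen<T>(stats: RegenTaskStats, task: () => T, now: () => number = Date.now): T {
  const start = now();
  stats.calls += 1;
  try {
    return task();
  } catch (e) {
    stats.errors += 1;
    throw e; // the error is counted, then propagated unchanged
  } finally {
    stats.totalMs += now() - start; // duration is recorded for failures too
  }
}
```

Recording the duration in `finally` means failed regen attempts still contribute to the timing data, which matters if failures tend to be the slow cases.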

3 participants