
Prometheus metrics not updating when quickly downloading blocks #2987

Open · CharlieC3 opened this issue Jan 7, 2022 · 15 comments

@CharlieC3 (Member)

Describe the bug
When spinning up a stacks-node a few hundred blocks behind the tip, the Prometheus metrics report the initial block height at boot, but this figure is rarely updated until the stacks-node has finished syncing to the tip. This makes the node appear to be stalled or delayed for an hour or more when it has actually been downloading blocks and progressing along the blockchain the whole time.

For example, the Prometheus metrics pulled from a stacks-node below show the node stuck at block 43941 for over an hour, then suddenly jumping to the tip height at that time. However, our logs show the stacks-node was steadily appending blocks during this "delayed" period, progressing incrementally over the 1+ hour.

Steps To Reproduce

  1. Spin up a stacks-node a few hundred blocks behind the chain tip
  2. Compare the block height recorded via Prometheus metrics vs logs

Expected behavior
The Prometheus metrics should update the block height figure soon after the new block has been added.
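As an illustration only, here is a minimal sketch of the expected update path, assuming the Rust prometheus crate; the gauge name stacks_node_block_height and the append loop are hypothetical stand-ins, not the node's actual code.

```rust
use prometheus::{register_int_gauge, IntGauge};

fn main() {
    // Hypothetical gauge mirroring the node's reported block height.
    let tip_height: IntGauge =
        register_int_gauge!("stacks_node_block_height", "Current Stacks tip height")
            .expect("failed to register gauge");

    // Stand-in for the block-processing loop during a fast initial sync.
    for appended_height in 43_941..44_000_i64 {
        // ... append the block to chain state here ...

        // The gauge should be set as each block is appended, rather than
        // only once the node reaches the chain tip.
        tip_height.set(appended_height);
    }
}
```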

@kantai (Member) commented Jan 10, 2022

@wileyj (Contributor) commented Jan 20, 2022

For myself - this metric also seems to be off, not sure why since it was working previously:

stacks_node_active_miners_total

@CharlieC3 (Member, Author)

If possible I'd also like the stacks_node_mempool_outstanding_txs metric to be re-assessed.

Currently it will increment or decrement the number of outstanding TXs for a stacks-node. However, it assumes that when a node starts it has 0 TXs in its mempool. Often when a node starts it may have dozens, hundreds, or thousands of TXs in its mempool. As a result, it's pretty common to see this metric dip well into negative numbers for a while after bootup.

It's also very common to see this metric show numbers 6X higher than what the API's mempool endpoint shows, which makes me think the decrement logic either isn't always being triggered when it should be, or isn't present in all code paths.

Ideally this metric would accurately represent the number of TXs in a stacks-node's mempool.
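For illustration, a minimal sketch of this failure mode, assuming the Rust prometheus crate; the metric name matches the one discussed, but the startup/decrement flow shown is a simplified stand-in for the node's real logic.

```rust
use prometheus::{register_int_gauge, IntGauge};

fn main() {
    let outstanding: IntGauge = register_int_gauge!(
        "stacks_node_mempool_outstanding_txs",
        "Transactions outstanding in the node's mempool"
    )
    .expect("failed to register gauge");

    // On restart the gauge starts at 0 even though the on-disk mempool
    // may already hold thousands of transactions.
    // If any of those pre-existing transactions are mined or expire first ...
    for _ in 0..3 {
        outstanding.dec(); // ... the gauge dips below zero.
    }
    assert_eq!(outstanding.get(), -3);

    // One possible fix: seed the gauge from the mempool database at boot
    // (e.g. outstanding.set(initial_count)) so that later inc()/dec()
    // calls start from the real count rather than from zero.
}
```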

@wileyj (Contributor) commented Jan 31, 2022

The number of neighbors is sometimes not being returned.
@CharlieC3, these are the 2 metrics you were mentioning, correct?
Edit - I wonder if this is related to the boot-from-snapshot metrics weirdness.

stacks_node_neighbors_outbound
stacks_node_neighbors_inbound

@CharlieC3 (Member, Author)

@wileyj We're actually querying the /v2/neighbors endpoint directly and converting that JSON data to a prometheus metric because it's treated as an availability test. It's possible this is not an issue with the stacks-node's reporting. I think we can exclude these two metrics from this issue, at least for now until we have clearer evidence that it's a reporting issue.

@wileyj (Contributor) commented Jan 31, 2022

@wileyj We're actually querying the /v2/neighbors endpoint directly and converting that JSON data to a prometheus metric because it's treated as an availability test. It's possible this is not an issue with the stacks-node's reporting. I think we can exclude these two metrics from this issue, at least for now until we have clearer evidence that it's a reporting issue.

Got it, thanks - I think I'll track these values anyway while testing to see if anything strange is reported (at the very least, it'll show us it's reproducible).

@wileyj (Contributor) commented Feb 1, 2022

stacks_node_mempool_outstanding_txs

There doesn't seem to be an easy way to get this value on startup (Greg has a binary he wrote, but it's not a quick exec).
Another option I'll try here is to simply reset the gauge on startup.

@CharlieC3 (Member, Author)

@wileyj I think the gauge currently does get reset on startup, but that causes the aforementioned problem where the gauge can dip into negative numbers if some mempool TXs leave soon after the node is started.

I'm surprised this number of TXs can't be fetched from the mempool database. Why isn't something like

select count(txid) from mempool;

a valid count of mempool TXs?

@kantai (Member) commented Feb 1, 2022

Why isn't something like

select count(txid) from mempool;

a valid count of mempool TXs?

The mempool database keeps transactions around even after they've been "confirmed". The reason for this is to deal with forks/reorgs (which happen pretty frequently, like 10% of blocks): a transaction that is confirmed in one block may not be confirmed in the other fork, so miners should include the transaction in that fork as well. Because of this, the mempool keeps those transactions around until they've become sufficiently old.
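To make the overcounting concrete, here is a hedged sketch using the rusqlite crate; the accept_time column and the 24-hour window are assumptions made for the example, and as explained above, neither query can distinguish confirmed transactions from truly outstanding ones.

```rust
use rusqlite::{params, Connection};
use std::time::{SystemTime, UNIX_EPOCH};

fn mempool_counts(db_path: &str) -> rusqlite::Result<(i64, i64)> {
    let conn = Connection::open(db_path)?;

    // Counts every row, including transactions that were already confirmed
    // on some fork but are retained in case of a reorg.
    let total: i64 =
        conn.query_row("SELECT COUNT(txid) FROM mempool", [], |row| row.get(0))?;

    // A rough approximation: count only rows accepted within a retention
    // window (24h here, as an assumption). This still cannot tell
    // confirmed transactions apart from outstanding ones.
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs() as i64;
    let recent: i64 = conn.query_row(
        "SELECT COUNT(txid) FROM mempool WHERE accept_time > ?1",
        params![now - 24 * 3600],
        |row| row.get(0),
    )?;

    Ok((total, recent))
}
```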

@CharlieC3 (Member, Author)

@kantai I see - Is there a key @wileyj can filter on in that table to determine the count of TXs for the node's mempool? Does the node itself have a way of tracking the number of TXs in its mempool?

@kantai (Member) commented Feb 1, 2022

No, the node doesn't have a way of tracking the number of transactions in its mempool in a strict sense. That's why the Prometheus metric that attempts to track pending transactions behaves the way it currently does.

@CharlieC3 (Member, Author)

Thanks @kantai.

It's sounding like there's no way the stacks_node_mempool_outstanding_txs metric can easily be restored after a restart without making larger changes. Getting accurate access to this metric would be valuable for Hiro, although I'm not sure how big of a change it would entail, so it's likely out of scope for this issue. It seems an additional attribute would have to be added to each TX being tracked in the mempool database.
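To illustrate the kind of change being speculated about here, a hedged sketch using rusqlite; the confirmed_height column, the migration, and the helper function are all hypothetical and not part of the current mempool schema.

```rust
use rusqlite::Connection;

// Purely illustrative: a hypothetical `confirmed_height` column on the
// mempool table (not the project's actual schema) would make the
// outstanding count exact at boot.
fn seed_outstanding_count(conn: &Connection) -> rusqlite::Result<i64> {
    // Hypothetical one-time migration adding a confirmation marker;
    // the error is ignored if the column already exists.
    let _ = conn.execute_batch(
        "ALTER TABLE mempool ADD COLUMN confirmed_height INTEGER DEFAULT NULL;",
    );

    // With such a marker, the gauge could be seeded exactly on startup.
    conn.query_row(
        "SELECT COUNT(txid) FROM mempool WHERE confirmed_height IS NULL",
        [],
        |row| row.get(0),
    )
}
```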

@wileyj (Contributor) commented Feb 3, 2022

@CharlieC3 #3033

Not addressed in this PR is the mempool metric - that's going to be a heavier lift, so we can look at that later.

wileyj added a commit that referenced this issue Feb 4, 2022
@jcnelson (Member)

I see #3033 got merged, so I'm going to close this out. If there's a need for a separate mempool metric, please open an issue for that instead.

@CharlieC3 (Member, Author)

Reopening because we're still seeing this issue in the latest 2.05 release.
