
Prometheus metrics not updating when quickly downloading blocks #2987

Open · CharlieC3 opened this issue Jan 7, 2022 · 15 comments

@CharlieC3 (Member)

Describe the bug
When spinning up a stacks-node a few hundred blocks behind the tip, the Prometheus metrics report the initial block height at boot, but this figure is rarely updated until the stacks-node has finished syncing to the tip. This makes the node appear to be stalled or delayed for an hour or more when it has actually been downloading blocks and progressing along the blockchain the whole time.

For example, the Prometheus metrics pulled from a stacks-node below show the node stuck at block 43941 for over an hour, then suddenly jumping to the tip height at that time. However, our logs show the stacks-node was steadily appending blocks during this "delayed" period, progressing incrementally over the 1+ hour.

Steps To Reproduce

  1. Spin up a stacks-node a few hundred blocks behind the chain tip
  2. Compare the block height recorded via Prometheus metrics vs logs

Expected behavior
The Prometheus metrics should update the block height figure soon after the new block has been added.
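As an illustration only, here is a minimal sketch of the expected update path, assuming the Rust prometheus crate; the gauge name stacks_node_block_height and the append loop are hypothetical stand-ins, not the node's actual code.

```rust
use prometheus::{register_int_gauge, IntGauge};

fn main() {
    // Hypothetical gauge mirroring the node's reported block height.
    let tip_height: IntGauge =
        register_int_gauge!("stacks_node_block_height", "Current Stacks tip height")
            .expect("failed to register gauge");

    // Stand-in for the block-processing loop during a fast initial sync.
    for appended_height in 43_941..44_000_i64 {
        // ... append the block to chain state here ...

        // The gauge should be set as each block is appended, rather than
        // only once the node reaches the chain tip.
        tip_height.set(appended_height);
    }
}
```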

@kantai (Member) commented Jan 10, 2022

@wileyj (Contributor) commented Jan 20, 2022

For myself - this metric also seems to be off, not sure why since it was working previously:

stacks_node_active_miners_total

@CharlieC3 (Member, Author)

If possible I'd also like the stacks_node_mempool_outstanding_txs metric to be re-assessed.

Currently it will increment or decrement the number of outstanding TXs for a stacks-node. However, it assumes that when a node starts it has 0 TXs in its mempool. Often when a node starts it may have dozens, hundreds, or thousands of TXs in its mempool. As a result, it's pretty common to see this metric dip well into negative numbers for a while after bootup.

It's also very common to see this metric show numbers 6X higher than what the API's mempool endpoint shows, which makes me think the decrement logic either isn't always being triggered when it should be, or isn't present in all code paths.

Ideally this metric would accurately represent the number of TXs in a stacks-node's mempool.
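For illustration, a minimal sketch of this failure mode, assuming the Rust prometheus crate; the metric name matches the one discussed, but the startup/decrement flow shown is a simplified stand-in for the node's real logic.

```rust
use prometheus::{register_int_gauge, IntGauge};

fn main() {
    let outstanding: IntGauge = register_int_gauge!(
        "stacks_node_mempool_outstanding_txs",
        "Transactions outstanding in the node's mempool"
    )
    .expect("failed to register gauge");

    // On restart the gauge starts at 0 even though the on-disk mempool
    // may already hold thousands of transactions.
    // If any of those pre-existing transactions are mined or expire first ...
    for _ in 0..3 {
        outstanding.dec(); // ... the gauge dips below zero.
    }
    assert_eq!(outstanding.get(), -3);

    // One possible fix: seed the gauge from the mempool database at boot
    // (e.g. outstanding.set(initial_count)) so that later inc()/dec()
    // calls start from the real count rather than from zero.
}
```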

@wileyj (Contributor) commented Jan 31, 2022

The number of neighbors is sometimes not being returned.
@CharlieC3, these are the 2 metrics you were mentioning, correct?
Edit - I wonder if this is related to the boot-from-snapshot metrics weirdness.

stacks_node_neighbors_outbound
stacks_node_neighbors_inbound

@CharlieC3 (Member, Author)

@wileyj We're actually querying the /v2/neighbors endpoint directly and converting that JSON data to a prometheus metric because it's treated as an availability test. It's possible this is not an issue with the stacks-node's reporting. I think we can exclude these two metrics from this issue, at least for now until we have clearer evidence that it's a reporting issue.

@wileyj (Contributor) commented Jan 31, 2022

@wileyj We're actually querying the /v2/neighbors endpoint directly and converting that JSON data to a prometheus metric because it's treated as an availability test. It's possible this is not an issue with the stacks-node's reporting. I think we can exclude these two metrics from this issue, at least for now until we have clearer evidence that it's a reporting issue.

Got it, thanks - I think I'll track these values anyway while testing to see if anything strange is reported (at the very least, it'll show us it's reproducible).

@wileyj (Contributor) commented Feb 1, 2022

stacks_node_mempool_outstanding_txs

There doesn't seem to be an easy way to get this value on startup (Greg has a binary he wrote, but it's not a quick exec).
Another option I'll try here is to simply reset the gauge on startup.

@CharlieC3 (Member, Author)

@wileyj I think the gauge currently does get reset on startup, but that causes the aforementioned problem where the gauge can dip into negative numbers if some mempool TXs leave soon after the node is started.

I'm surprised this number of TXs can't be fetched from the mempool database. Why isn't something like

select count(txid) from mempool;

a valid count of mempool TXs?

@kantai (Member) commented Feb 1, 2022

Why isn't something like

select count(txid) from mempool;

a valid count of mempool TXs?

The mempool database keeps transactions around even after they've been "confirmed". The reason for this is to deal with forks/reorgs (which happen pretty frequently, like 10% of blocks): a transaction that is confirmed in one block may not be confirmed in the other fork, so miners should include the transaction in that fork as well. Because of this, the mempool keeps those transactions around until they've become sufficiently old.
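To make the overcounting concrete, here is a hedged sketch using the rusqlite crate; the accept_time column and the 24-hour window are assumptions made for the example, and as explained above, neither query can distinguish confirmed transactions from truly outstanding ones.

```rust
use rusqlite::{params, Connection};
use std::time::{SystemTime, UNIX_EPOCH};

fn mempool_counts(db_path: &str) -> rusqlite::Result<(i64, i64)> {
    let conn = Connection::open(db_path)?;

    // Counts every row, including transactions that were already confirmed
    // on some fork but are retained in case of a reorg.
    let total: i64 =
        conn.query_row("SELECT COUNT(txid) FROM mempool", [], |row| row.get(0))?;

    // A rough approximation: count only rows accepted within a retention
    // window (24h here, as an assumption). This still cannot tell
    // confirmed transactions apart from outstanding ones.
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs() as i64;
    let recent: i64 = conn.query_row(
        "SELECT COUNT(txid) FROM mempool WHERE accept_time > ?1",
        params![now - 24 * 3600],
        |row| row.get(0),
    )?;

    Ok((total, recent))
}
```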

@CharlieC3 (Member, Author)

@kantai I see - Is there a key @wileyj can filter on in that table to determine the count of TXs for the node's mempool? Does the node itself have a way of tracking the number of TXs in its mempool?

@kantai (Member) commented Feb 1, 2022

No, the node doesn't have a way of tracking the number of transactions in its mempool in a strict sense. That's why the Prometheus metric that attempts to track pending transactions behaves the way it currently does.

@CharlieC3 (Member, Author)

Thanks @kantai.

It's sounding like there's no way the stacks_node_mempool_outstanding_txs metric can easily be restored after a restart without making larger changes. Getting accurate access to this metric would be valuable for Hiro, although I'm not sure how big of a change it would entail, so it's likely out of scope for this issue. It seems an additional attribute would have to be added to each TX being tracked in the mempool database.
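To illustrate the kind of change being speculated about here, a hedged sketch using rusqlite; the confirmed_height column, the migration, and the helper function are all hypothetical and not part of the current mempool schema.

```rust
use rusqlite::Connection;

// Purely illustrative: a hypothetical `confirmed_height` column on the
// mempool table (not the project's actual schema) would make the
// outstanding count exact at boot.
fn seed_outstanding_count(conn: &Connection) -> rusqlite::Result<i64> {
    // Hypothetical one-time migration adding a confirmation marker;
    // the error is ignored if the column already exists.
    let _ = conn.execute_batch(
        "ALTER TABLE mempool ADD COLUMN confirmed_height INTEGER DEFAULT NULL;",
    );

    // With such a marker, the gauge could be seeded exactly on startup.
    conn.query_row(
        "SELECT COUNT(txid) FROM mempool WHERE confirmed_height IS NULL",
        [],
        |row| row.get(0),
    )
}
```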

@wileyj (Contributor) commented Feb 3, 2022

@CharlieC3 #3033

Not addressed in this PR is the mempool metric - that's going to be a heavier lift, so we can look at that later.

wileyj added a commit that referenced this issue Feb 4, 2022
@jcnelson (Member)

I see #3033 got merged, so I'm going to close this out. If there's a need for a separate mempool metric, please open an issue for that instead.

@CharlieC3 (Member, Author)

Reopening because we're still seeing this issue in the latest 2.05 release.
