
Improve tracked metrics #1001

Open
BrunoCampana opened this issue Apr 11, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@BrunoCampana
Member

BrunoCampana commented Apr 11, 2024

The objective of this issue is to propose a series of improvements to the metrics produced by hathor-core.

The improvement proposals are as follows:

  1. Type of metrics
  2. Metrics syntax
  3. Metric labels
  4. Subset of health metrics

1 Type of metrics

Problem:
The following metrics are defined as "gauge" type, but are actually of "counter" type:

  • peer_connection_received_messages
  • peer_connection_sent_messages
  • peer_connection_received_bytes
  • peer_connection_sent_bytes
  • peer_connection_received_txs
  • peer_connection_discarded_txs
  • peer_connection_received_blocks
  • peer_connection_discarded_blocks
  • completed_jobs
  • blocks_found
  • transaction_cache_hits
  • transaction_cache_misses

Reference: https://prometheus.io/docs/concepts/metric_types/

Solution: in the metrics file read by node_exporter, change the declared type of these metrics from "# TYPE <metric> gauge" to "# TYPE <metric> counter".
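
In the Prometheus text exposition format, the fix amounts to changing the TYPE line for each affected metric (the metric value shown here is illustrative):

```
# Before
# TYPE peer_connection_received_messages gauge
peer_connection_received_messages 12345

# After
# TYPE peer_connection_received_messages counter
peer_connection_received_messages 12345
```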

2 Metrics syntax

Problem:
It is considered good practice to name "counter" metrics that are unitless (i.e., with no SI unit such as seconds) after the entity being counted, in the plural, followed by the suffix "_total". Example:
peer_connection_received_messages_total

Reference: https://prometheus.io/docs/practices/naming/

Suggestion:
Make the following changes to the metric names:

  • From peer_connection_received_messages to peer_connection_received_messages_total
  • From peer_connection_sent_messages to peer_connection_sent_messages_total
  • From peer_connection_received_bytes to peer_connection_received_bytes_total
  • From peer_connection_sent_bytes to peer_connection_sent_bytes_total
  • From peer_connection_received_txs to peer_connection_received_txs_total
  • From peer_connection_discarded_txs to peer_connection_discarded_txs_total
  • From peer_connection_received_blocks to peer_connection_received_blocks_total
  • From peer_connection_discarded_blocks to peer_connection_discarded_blocks_total
  • From completed_jobs to completed_jobs_total
  • From blocks_found to found_blocks_total
  • From transaction_cache_hits to transaction_cache_hits_total
  • From transaction_cache_misses to transaction_cache_misses_total
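
As a side note, the official Python client (prometheus_client) already appends the _total suffix to counters in the text exposition, so metrics declared as Counter objects get the suggested names for free. For the file-based export, the renaming rule itself can be sketched as a small pure-Python helper (the RENAMES table and default rule below simply restate the list above; this is not hathor-core code):

```python
# Map of metrics whose new name is not just "old name + _total"
# (taken from the rename list in this issue).
RENAMES = {
    'blocks_found': 'found_blocks_total',
}


def renamed(metric: str) -> str:
    """Map an old metric name to its suggested new name."""
    if metric in RENAMES:
        return RENAMES[metric]
    # Default rule for the remaining counters: append the suffix once.
    return metric if metric.endswith('_total') else metric + '_total'


print(renamed('blocks_found'))            # found_blocks_total
print(renamed('transaction_cache_hits'))  # transaction_cache_hits_total
```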

3 Metric labels

Problem:
It is considered good practice for labels not to have high cardinality in their values. A common rule of thumb is that a label should not have a cardinality greater than 10.

The following metrics have high cardinality:

  • peer_connection_received_messages
  • peer_connection_sent_messages
  • peer_connection_received_bytes
  • peer_connection_sent_bytes
  • peer_connection_received_txs
  • peer_connection_discarded_txs
  • peer_connection_received_blocks
  • peer_connection_discarded_blocks

What are the implications of this?

  • Generation of many, possibly unnecessary, time series, making analysis unnecessarily complex.
  • A considerable increase in the consumption of computing resources.

Reference: https://prometheus.io/docs/practices/naming/#labels

Suggestion:

  • Check whether it is really necessary to maintain these metrics for full node observability purposes.
  • If not necessary: remove them.
  • If necessary: mitigate the problem of high resource consumption and the complexity of analyzing metrics for the average full node operators, using the solution proposed in the following section.
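
To illustrate why per-peer labels are the problem, here is some back-of-the-envelope arithmetic (the peer count is a hypothetical number; in practice the peer_id label is effectively unbounded):

```python
# Each distinct peer_id label value creates a separate time series
# for every metric that carries the label.
peer_count = 50                # hypothetical number of connected peers
peer_connection_metrics = 8    # the eight peer_connection_* metrics listed above

series = peer_count * peer_connection_metrics
print(series)  # 400 time series, versus 8 without the peer_id label
```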

4 Subset of health metrics

Problem:
Most of the metrics produced by the full node can be useful for overall observability of the full node, but are not useful for routine monitoring of the full node's health.

For most node operators, it will only be interesting to monitor a subset of metrics, possibly the following, from those that currently exist:

  • connected_peers
  • best_block_height
  • blocks
  • transactions
  • websocket_connections

This subset of metrics, or something close to it, should be enough for routine monitoring of the full node's health, along with monitoring of computational resources, which are not mentioned here because they are generated directly by node_exporter from the status of the host running hathor-core.

Forcing average node operators to monitor all of these metrics will likely have the following impacts:

  • Unnecessary additional consumption of computational resources in the monitoring systems of their full nodes.
  • Harder monitoring, because the large set of available metrics will obscure which ones are actually useful to them.

Solution:
Metrics tracking is only performed when the full node is run with the --prometheus option. One solution is to add an argument to this option, so that it is possible to limit the tracking of metrics to only the minimum subset necessary for routine monitoring of the full node's health.

For example:
--prometheus=health or --prometheus=observability.
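
A minimal sketch of how such a flag could be parsed, using argparse (hathor-core's real CLI wiring differs; the flag name and values are the ones proposed above, and treating a bare --prometheus as "observability" is an assumption to preserve today's behavior):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--prometheus',
    nargs='?',
    const='observability',  # bare --prometheus keeps the current full export
    choices=['health', 'observability'],
    help='enable metrics; "health" exports only the minimal health subset',
)

args = parser.parse_args(['--prometheus=health'])
print(args.prometheus)  # health
```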

@BrunoCampana BrunoCampana added the enhancement New feature or request label Apr 11, 2024
@luislhl
Collaborator

luislhl commented Apr 12, 2024

1 Type of metrics

Agreed ✅

2 Metrics syntax

Agreed ✅

3 Metric labels

I think that, as a start, we could remove their connection_string label. It doesn't seem to be needed. The other label leading to high cardinality is the peer_id, but it can't be removed without compromising the metrics.

Check whether it is really necessary to maintain these metrics for full node observability purposes.

We were mostly interested in keeping metrics related to the network usage when communicating with peers, and some other peer-related metrics were also added just because they were already easily available: #467

But I'm not sure we should remove them; we never know when they could be useful for us or someone else. See my next point.

Quite considerable increase in the consumption of computing resources.

I think the main resource this will consume is storage in the Prometheus server. Other resources will only be consumed if there are queries being performed over these time series.

I think the common use for them will involve only very sporadic queries, so they shouldn't incur performance impacts for the server. Plus, storage is cheap, and Prometheus only keeps metrics for a limited retention period by default.

We should have warnings in our docs, though, about the high cardinality of these metrics. Anyone who wants to store metrics for longer periods of time should be aware that the high-cardinality ones should probably be left out of their long-term storage solution, or should be aggregated to decrease the number of data points.

If necessary: mitigate the problem of high resource consumption and the complexity of analyzing metrics for the average full node operators, using the solution proposed in the following section.

I agree with this, though. Most people won't need these metrics.

4 Subset of health metrics

I agree with the overall suggestion.

I don't like the proposed parameter values --prometheus=health and --prometheus=observability. I think I would go with --prometheus=basic and --prometheus=complete, or something similar.

I would probably also add some more metrics to the list of basic metrics; I don't think we should limit it strictly to the metrics we think are useful for health monitoring. Some metrics are very simple and won't do any harm, and leaving all of them in the complete group could cause problems. For instance, if I wanted best_block_weight for whatever reason, or hash_rate, which are simple metrics, and they were in the complete group, I would also have to collect the peer_connection_* metrics, which we know cause the bigger storage problem.

My take is that we should only move these to the complete group:

  • peer_connection_*
  • transaction_cache_*
  • tx_storage_total_sst_files_size
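
The grouping above could be expressed as a simple filter (a sketch only; the metric names come from this thread, but the prefix-matching approach is an assumption about how the selection could be implemented):

```python
# Metrics (by prefix or exact name) that would move to the "complete" group.
COMPLETE_ONLY_PREFIXES = (
    'peer_connection_',
    'transaction_cache_',
    'tx_storage_total_sst_files_size',
)


def metric_group(name: str) -> str:
    """Return which export group a metric belongs to."""
    if name.startswith(COMPLETE_ONLY_PREFIXES):
        return 'complete'
    return 'basic'


print(metric_group('best_block_height'))           # basic
print(metric_group('peer_connection_sent_bytes'))  # complete
```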

@jansegre
Member

jansegre commented Oct 3, 2024

It's not clear to me if there's any change in the hathor-core needed to improve these metrics. Can you help clarify? @BrunoCampana @luislhl

@jansegre
Member

This would probably require a migration path to change the names, but it seems like an improvement.
