
Improve tracked metrics #1001

Open
BrunoCampana opened this issue Apr 11, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@BrunoCampana
Member

BrunoCampana commented Apr 11, 2024

The objective of this issue is to propose a series of improvements to the metrics produced by hathor-core.

The improvement proposals are as follows:

  1. Type of metrics
  2. Metrics syntax
  3. Metric labels
  4. Subset of health metrics

1 Type of metrics

Problem:
The following metrics are defined as "gauge" type, but are actually of "counter" type:

  • peer_connection_received_messages
  • peer_connection_sent_messages
  • peer_connection_received_bytes
  • peer_connection_sent_bytes
  • peer_connection_received_txs
  • peer_connection_discarded_txs
  • peer_connection_received_blocks
  • peer_connection_discarded_blocks
  • completed_jobs
  • blocks_found
  • transaction_cache_hits
  • transaction_cache_misses

Reference: https://prometheus.io/docs/concepts/metric_types/

Solution: in the metrics file read by node_exporter, change the declared type of these metrics from "# TYPE <metric> gauge" to "# TYPE <metric> counter".
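
In the Prometheus text exposition format, the fix amounts to changing the TYPE line for each affected metric (the metric value shown here is illustrative):

```
# Before
# TYPE peer_connection_received_messages gauge
peer_connection_received_messages 12345

# After
# TYPE peer_connection_received_messages counter
peer_connection_received_messages 12345
```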

2 Metrics syntax

Problem:
It is considered good practice to name "counter" metrics that are unitless (i.e., with no SI unit such as seconds) after the entity being counted, in the plural, followed by the suffix "_total". Example:
peer_connection_received_messages_total

Reference: https://prometheus.io/docs/practices/naming/

Suggestion:
Make the following changes to the metric names:

  • From peer_connection_received_messages to peer_connection_received_messages_total
  • From peer_connection_sent_messages to peer_connection_sent_messages_total
  • From peer_connection_received_bytes to peer_connection_received_bytes_total
  • From peer_connection_sent_bytes to peer_connection_sent_bytes_total
  • From peer_connection_received_txs to peer_connection_received_txs_total
  • From peer_connection_discarded_txs to peer_connection_discarded_txs_total
  • From peer_connection_received_blocks to peer_connection_received_blocks_total
  • From peer_connection_discarded_blocks to peer_connection_discarded_blocks_total
  • From completed_jobs to completed_jobs_total
  • From blocks_found to found_blocks_total
  • From transaction_cache_hits to transaction_cache_hits_total
  • From transaction_cache_misses to transaction_cache_misses_total
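
As a side note, the official Python client (prometheus_client) already appends the _total suffix to counters in the text exposition, so metrics declared as Counter objects get the suggested names for free. For the file-based export, the renaming rule itself can be sketched as a small pure-Python helper (the RENAMES table and default rule below simply restate the list above; this is not hathor-core code):

```python
# Map of metrics whose new name is not just "old name + _total"
# (taken from the rename list in this issue).
RENAMES = {
    'blocks_found': 'found_blocks_total',
}


def renamed(metric: str) -> str:
    """Map an old metric name to its suggested new name."""
    if metric in RENAMES:
        return RENAMES[metric]
    # Default rule for the remaining counters: append the suffix once.
    return metric if metric.endswith('_total') else metric + '_total'


print(renamed('blocks_found'))            # found_blocks_total
print(renamed('transaction_cache_hits'))  # transaction_cache_hits_total
```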

3 Metric labels

Problem:
It is considered good practice for labels not to have high cardinality in their values. A common rule of thumb is that a label should not have a cardinality greater than 10.

The following metrics have high cardinality:

  • peer_connection_received_messages
  • peer_connection_sent_messages
  • peer_connection_received_bytes
  • peer_connection_sent_bytes
  • peer_connection_received_txs
  • peer_connection_discarded_txs
  • peer_connection_received_blocks
  • peer_connection_discarded_blocks

What are the implications of this?

  • Generation of many, possibly unnecessary, time series, making analysis unnecessarily complex.
  • A considerable increase in the consumption of computing resources.

Reference: https://prometheus.io/docs/practices/naming/#labels

Suggestion:

  • Check whether it is really necessary to maintain these metrics for full node observability purposes.
  • If not necessary: remove them.
  • If necessary: mitigate the problem of high resource consumption and the complexity of analyzing metrics for the average full node operators, using the solution proposed in the following section.
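
To illustrate why per-peer labels are the problem, here is some back-of-the-envelope arithmetic (the peer count is a hypothetical number; in practice the peer_id label is effectively unbounded):

```python
# Each distinct peer_id label value creates a separate time series
# for every metric that carries the label.
peer_count = 50                # hypothetical number of connected peers
peer_connection_metrics = 8    # the eight peer_connection_* metrics listed above

series = peer_count * peer_connection_metrics
print(series)  # 400 time series, versus 8 without the peer_id label
```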

4 Subset of health metrics

Problem:
Most of the metrics produced by the full node can be useful for overall observability of the full node, but are not useful for routine monitoring of the full node's health.

For most node operators, it will only be interesting to monitor a subset of metrics, possibly the following, from those that currently exist:

  • connected_peers
  • best_block_height
  • blocks
  • transactions
  • websocket_connections

This subset of metrics, or something close to it, should be enough for routine monitoring of the full node's health, along with monitoring of computational resources, which are not mentioned here because they are generated directly by node_exporter from the status of the host running hathor-core.

Forcing average node operators to monitor all of these metrics will likely have the following impacts:

  • Unnecessary additional consumption of computational resources in the monitoring systems of their full nodes.
  • Harder monitoring, because the large set of available metrics will obscure which ones are actually useful to them.

Solution:
Metrics tracking is only performed when the full node is run with the --prometheus option. One solution is to add an argument to this option, so that it is possible to limit the tracking of metrics to only the minimum subset necessary for routine monitoring of the full node's health.

For example:
--prometheus=health or --prometheus=observability.
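
A minimal sketch of how such a flag could be parsed, using argparse (hathor-core's real CLI wiring differs; the flag name and values are the ones proposed above, and treating a bare --prometheus as "observability" is an assumption to preserve today's behavior):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--prometheus',
    nargs='?',
    const='observability',  # bare --prometheus keeps the current full export
    choices=['health', 'observability'],
    help='enable metrics; "health" exports only the minimal health subset',
)

args = parser.parse_args(['--prometheus=health'])
print(args.prometheus)  # health
```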

@BrunoCampana BrunoCampana added the enhancement New feature or request label Apr 11, 2024
@luislhl
Collaborator

luislhl commented Apr 12, 2024

1 Type of metrics

Agreed ✅

2 Metrics syntax

Agreed ✅

3 Metric labels

I think that, as a start, we could remove their connection_string label. It doesn't seem to be needed. The other label leading to high cardinality is the peer_id, but it can't be removed without compromising the metrics.

Check whether it is really necessary to maintain these metrics for full node observability purposes.

We were mostly interested in keeping metrics related to the network usage when communicating with peers, and some other peer-related metrics were also added just because they were already easily available: #467

But I'm not sure we should remove them; we never know when they could be useful for us or someone else. See my next point.

Quite considerable increase in the consumption of computing resources.

I think the main resource this will consume is storage in the Prometheus server. Other resources will only be consumed if there are queries being performed over these time series.

I think the common use for them will involve only very sporadic queries, so they shouldn't incur performance impacts for the server. Plus, storage is cheap, and Prometheus only keeps metrics for a limited retention period by default.

We should have warnings in our docs, though, about the high cardinality of these metrics. Anyone who wants to store metrics for longer periods of time should be aware that the high-cardinality ones should probably be left out of their long-term storage solution, or should be aggregated to decrease the number of data points.

If necessary: mitigate the problem of high resource consumption and the complexity of analyzing metrics for the average full node operators, using the solution proposed in the following section.

I agree with this, though. Most people won't need these metrics.

4 Subset of health metrics

I agree with the overall suggestion.

I don't like the proposed parameter values --prometheus=health and --prometheus=observability. I think I would go with --prometheus=basic and --prometheus=complete, or something similar.

I would probably also add some more metrics to the list of basic metrics; I don't think we should limit it strictly to the metrics we think are useful for health monitoring. Some metrics are very simple and won't do any harm, and leaving all of them in the complete group could cause problems. For instance, if I wanted best_block_weight for whatever reason, or hash_rate, which are simple metrics, and they were in the complete group, I would also have to collect the peer_connection_* metrics, which we know cause the bigger storage problem.

My take is that we should only move these to the complete group:

  • peer_connection_*
  • transaction_cache_*
  • tx_storage_total_sst_files_size
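
The grouping above could be expressed as a simple filter (a sketch only; the metric names come from this thread, but the prefix-matching approach is an assumption about how the selection could be implemented):

```python
# Metrics (by prefix or exact name) that would move to the "complete" group.
COMPLETE_ONLY_PREFIXES = (
    'peer_connection_',
    'transaction_cache_',
    'tx_storage_total_sst_files_size',
)


def metric_group(name: str) -> str:
    """Return which export group a metric belongs to."""
    if name.startswith(COMPLETE_ONLY_PREFIXES):
        return 'complete'
    return 'basic'


print(metric_group('best_block_height'))           # basic
print(metric_group('peer_connection_sent_bytes'))  # complete
```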

@jansegre
Member

jansegre commented Oct 3, 2024

It's not clear to me if there's any change in the hathor-core needed to improve these metrics. Can you help clarify? @BrunoCampana @luislhl

@jansegre
Member

This would probably require a migration path to change the names, but it seems like an improvement.
