Improve tracked metrics #1001
Comments
1 Type of metrics
Agreed ✅
2 Metrics syntax
Agreed ✅
3 Metric labels
I think that, as a start, we could remove their
We were mostly interested in keeping metrics related to the network usage when communicating with peers, and some other peer-related metrics were also added just because they were already easily available: #467 But I'm not sure we should remove them, we never know when they could be useful for us or someone else. See my next point.
I think the main resource this will consume is storage in the Prometheus server. Other resources will only be consumed if queries are run over these time series. I think the common use for them will involve only very sporadic queries, so they shouldn't have a performance impact on the server. Plus, storage is cheap, and by default Prometheus only keeps metrics for a specified amount of time. We should have warnings in our docs, though, about the high cardinality of these metrics. Anyone who wants to store metrics for longer periods of time should be aware that the high-cardinality ones should probably be left out of their long-term storage solution, or should be aggregated to decrease the number of data points.
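To make the aggregation idea concrete, here is a rough sketch of collapsing per-peer series into one series per metric before long-term storage. The label names (`peer_id`, `network`) and the `aggregate` helper are assumptions for illustration only, not the labels hathor-core actually emits. In a real Prometheus setup the same effect would more likely come from a recording rule built on `sum without (...)`.

```python
from collections import defaultdict


def aggregate(samples, drop=("peer_id",)):
    """Sum samples that become identical once the `drop` labels are removed.

    `samples` is an iterable of (metric_name, labels_dict, value) tuples.
    Returns a dict keyed by (metric_name, sorted tuple of remaining labels).
    """
    totals = defaultdict(float)
    for name, labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[(name, kept)] += value
    return dict(totals)


# Hypothetical samples: two peers on the same network.
samples = [
    ("peer_connection_received_bytes", {"network": "mainnet", "peer_id": "a"}, 100.0),
    ("peer_connection_received_bytes", {"network": "mainnet", "peer_id": "b"}, 250.0),
    ("peer_connection_sent_bytes", {"network": "mainnet", "peer_id": "a"}, 40.0),
]

aggregated = aggregate(samples)
for (name, labels), value in sorted(aggregated.items()):
    print(name, dict(labels), value)
```

Three per-peer series become two aggregated ones here; with hundreds of peers the reduction is proportionally larger, at the cost of losing the per-peer breakdown.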
I agree with this, though. Most people won't need these metrics.
4 Subset of health metrics
I agree with the overall suggestion. I don't like the proposed parameter values
I would probably also add some more metrics to the list of basic metrics. I don't think we should necessarily limit it to the metrics we think are useful for use cases. Some metrics are very simple and won't do any harm, and leaving all of them to the
My take is that we should only move these to the
It's not clear to me if there's any change in the hathor-core needed to improve these metrics. Can you help clarify? @BrunoCampana @luislhl
This would probably require a migration path to change the names, but it seems like an improvement.
The objective of this issue is to propose a series of improvements to the metrics produced by hathor-core.
The improvement proposals are as follows:
1 Type of metrics
Problem:
The following metrics are defined as "gauge" type, but are actually of "counter" type:
peer_connection_received_messages
peer_connection_sent_messages
peer_connection_received_bytes
peer_connection_sent_bytes
peer_connection_received_txs
peer_connection_discarded_txs
peer_connection_received_blocks
peer_connection_discarded_blocks
completed_jobs
blocks_found
transaction_cache_hits
transaction_cache_misses
Reference: https://prometheus.io/docs/concepts/metric_types/
Solution: change the type of these metrics in the file read by node_exporter, from:
# TYPE gauge
to:
# TYPE counter
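A minimal sketch of this change, assuming the metrics file uses the standard Prometheus text format, where each type line reads `# TYPE <metric_name> <type>`. The helper name `retype_as_counter` and the sample input are hypothetical. In practice the type would more likely be fixed where hathor-core defines each metric, rather than by post-processing the file as done here.

```python
import re

# Metric names taken from the list in this issue.
COUNTER_METRICS = frozenset({
    "peer_connection_received_messages",
    "peer_connection_sent_messages",
    "peer_connection_received_bytes",
    "peer_connection_sent_bytes",
    "peer_connection_received_txs",
    "peer_connection_discarded_txs",
    "peer_connection_received_blocks",
    "peer_connection_discarded_blocks",
    "completed_jobs",
    "blocks_found",
    "transaction_cache_hits",
    "transaction_cache_misses",
})

_TYPE_LINE = re.compile(r"^# TYPE (\S+) gauge$")


def retype_as_counter(exposition, names=COUNTER_METRICS):
    """Rewrite '# TYPE <name> gauge' to '# TYPE <name> counter' for `names`."""
    out = []
    for line in exposition.splitlines():
        match = _TYPE_LINE.match(line)
        if match and match.group(1) in names:
            line = f"# TYPE {match.group(1)} counter"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical file contents: one mis-typed counter and one genuine gauge.
sample = (
    "# HELP completed_jobs Number of completed jobs\n"
    "# TYPE completed_jobs gauge\n"
    "completed_jobs 42.0\n"
    "# TYPE connected_peers gauge\n"
    "connected_peers 7.0\n"
)

print(retype_as_counter(sample))
```

Only the listed metrics are retyped; `connected_peers` stays a gauge because its value can go down as well as up.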
2 Metrics syntax
Problem:
It is considered good practice to name "counter" metrics that are unitless (there is no SI unit such as seconds, etc.) with the name of the entity being counted in the plural followed by the word "total". Example:
peer_connection_received_messages_total
Reference: https://prometheus.io/docs/practices/naming/
Suggestion:
Make the following changes to the metric names:
peer_connection_received_messages to peer_connection_received_messages_total
peer_connection_sent_messages to peer_connection_sent_messages_total
peer_connection_received_bytes to peer_connection_received_bytes_total
peer_connection_sent_bytes to peer_connection_sent_bytes_total
peer_connection_received_txs to peer_connection_received_txs_total
peer_connection_discarded_txs to peer_connection_discarded_txs_total
peer_connection_received_blocks to peer_connection_received_blocks_total
peer_connection_discarded_blocks to peer_connection_discarded_blocks_total
completed_jobs to completed_jobs_total
blocks_found to found_blocks_total
transaction_cache_hits to transaction_cache_hits_total
transaction_cache_misses to transaction_cache_misses_total
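The renaming can be sketched as a small mapping, assuming each new name is the old name with `_total` appended, with `blocks_found` as the one explicit exception. The helper names and the `peer_id` label in the usage example are hypothetical. A mapping like this could also serve as a starting point for a migration path, for instance to rewrite dashboards and alert rules that still use the old names.

```python
import re

OLD_NAMES = [
    "peer_connection_received_messages",
    "peer_connection_sent_messages",
    "peer_connection_received_bytes",
    "peer_connection_sent_bytes",
    "peer_connection_received_txs",
    "peer_connection_discarded_txs",
    "peer_connection_received_blocks",
    "peer_connection_discarded_blocks",
    "completed_jobs",
    "blocks_found",
    "transaction_cache_hits",
    "transaction_cache_misses",
]

# Names whose replacement is not simply "<old>_total".
OVERRIDES = {"blocks_found": "found_blocks_total"}


def counter_name(old):
    """Return the proposed counter name for an existing metric name."""
    if old in OVERRIDES:
        return OVERRIDES[old]
    return old if old.endswith("_total") else old + "_total"


RENAMES = {old: counter_name(old) for old in OLD_NAMES}

# A sample line starts with the metric name, followed by labels and/or a value.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(.*)$", re.S)


def rename_sample(line):
    """Rename the metric at the start of one sample line, if it is listed."""
    match = _SAMPLE.match(line)
    if not match:
        return line  # comment lines ('# ...') and blank lines pass through
    name, rest = match.groups()
    return RENAMES.get(name, name) + rest


print(rename_sample('peer_connection_sent_bytes{peer_id="abc"} 1024.0'))
```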
3 Metric labels
Problem:
It is considered good practice that labels do not have a high cardinality in their values. A benchmark is that a label should not have a cardinality greater than 10.
The following metrics have high cardinality:
peer_connection_received_messages
peer_connection_sent_messages
peer_connection_received_bytes
peer_connection_sent_bytes
peer_connection_received_txs
peer_connection_discarded_txs
peer_connection_received_blocks
peer_connection_discarded_blocks
What is the implication of this?
Reference: https://prometheus.io/docs/practices/naming/#labels
Suggestion:
4 Subset of health metrics
Problem:
Most of the metrics produced by the full node can be useful for overall observability of the full node, but are not useful for routine monitoring of the full node's health.
For most node operators, it will only be interesting to monitor a subset of metrics, possibly the following, from those that currently exist:
connected_peers
best_block_height
blocks
transactions
websocket_connections
This subset of metrics, or something close to it, should be enough for routine monitoring of the full node's health, alongside monitoring of computational resources. The latter are not listed here because node_exporter generates them directly from the status of the hathor_core host.
Forcing the average node operators to monitor all of these metrics will likely cause the following impacts:
Solution:
Metrics tracking is only performed when the full node is run with the --prometheus option. One solution is to add an argument to this option, so that it is possible to limit the tracking of metrics to only the minimum subset necessary for routine monitoring of the full node's health. For example:
--prometheus=health
or
--prometheus=observability
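A rough sketch of how such an option could be parsed, using Python's argparse. This is not hathor-core's actual CLI code: the parser, the `select_metrics` helper, and the choice to treat a bare --prometheus as the full "observability" set (to keep existing invocations working) are all assumptions for illustration.

```python
import argparse

# The health subset proposed in this issue.
HEALTH_METRICS = frozenset({
    "connected_peers",
    "best_block_height",
    "blocks",
    "transactions",
    "websocket_connections",
})


def build_parser():
    parser = argparse.ArgumentParser(prog="full-node-sketch")
    parser.add_argument(
        "--prometheus",
        nargs="?",                # the value is optional
        const="observability",    # bare --prometheus keeps tracking everything
        default=None,             # flag absent: no metrics tracking
        choices=["health", "observability"],
        help="track metrics; 'health' limits tracking to the basic subset",
    )
    return parser


def select_metrics(mode, all_metrics):
    """Return the metrics to export for the given --prometheus mode."""
    if mode is None:
        return {}
    if mode == "health":
        return {k: v for k, v in all_metrics.items() if k in HEALTH_METRICS}
    return dict(all_metrics)


# Usage with an explicit argument list and hypothetical metric values.
args = build_parser().parse_args(["--prometheus=health"])
metrics = {"connected_peers": 7, "blocks": 1200, "transaction_cache_hits": 10}
print(select_metrics(args.prometheus, metrics))
```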