-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Switch to the Python Prometheus library #3256
Conversation
Awesome! I maybe should have mentioned that ideally the changes would be backwards compatible? |
(I also added some metrics in #3252 I'm afraid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woo, this looks great! I haven't gone through it to carefully, but I did run it on jki.re which picked up a few typos in the names.
The main thing that I think we're missing is keeping the user/system CPU metrics. We've had problems in the past where we see spikes in system CPU, so we'd really want to keep that I think.
metrics.register_callback( | ||
"pending_destinations", | ||
LaterGauge( | ||
"synapse_federation_client_pending_destinations", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're in synapse_federation_transaction_queue_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
push_rules_invalidation_counter = Counter( | ||
"synapse_push_bulk_push_role_evaluator_push_rules_invalidation_counter", "") | ||
push_rules_state_size_counter = Counter( | ||
"synapse_push_bulk_push_role_evaluator_push_rules_state_size_counter", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_rule_
not _role_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
synapse/http/request_metrics.py
Outdated
response_size = metrics.register_counter( | ||
"response_size", labels=["method", "servlet", "tag"] | ||
response_size = Counter( | ||
"synapse_http_request_response_size", "", ["method", "servlet", "tag"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http_server
rather than http_request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
synapse/metrics/__init__.py
Outdated
"python_gc_time", | ||
"Time taken to GC (ms)", | ||
["gen"], | ||
buckets=[1, 2, 5, 10, 25, 50, 100, 250, 500, 1000], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gen 2 can sometimes take 30s, so probably worth adding a few more buckets. Getting an idea of large GC times gives us an insight into why things might occasionally be pausing for tens of seconds ,which is useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added 2.5, 5, 7.5, 15, 30, 45, and 60 second buckets.
synapse/util/caches/__init__.py
Outdated
# Check if the metric is already registered. Unregister it, if so. | ||
metric_name = "cache_%s_%s" % (cache_type, cache_name) | ||
if metric_name in collectors_by_name.keys(): | ||
REGISTRY.unregister(collectors_by_name[metric_name]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would they already be registered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests, because otherwise prom throws an error. We could get around this by making them top level and putting the names as labels always, as prom_client will throw an exception if you register something with the same name twice (like in tests).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, might be worth just putting a comment that its for tests.
Thanks for the review @erikjohnston. I'm going to add back the CPU metrics, and then I'll put it back up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running this on jki.re and it seems like the only issue is milliseconds vs seconds for the histogram metrics.
synapse/storage/_base.py
Outdated
sql_query_timer = metrics.register_distribution("query_time", labels=["verb"]) | ||
sql_txn_timer = metrics.register_distribution("transaction_time", labels=["desc"]) | ||
sql_query_timer = Histogram("synapse_storage_query_time", "", ["verb"]) | ||
sql_txn_timer = Histogram("synapse_storage_transaction_time", "", ["desc"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default histogram buckets expect them to be observed in seconds rather than milliseconds? I think we'll need to go through and change all the observations to use seconds instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(That last approval was meant to be request changes... though once we have corrected the histogram units I think it'll be ready to merge!)
We probably want to add a note to the changelog and docs about which metrics have changed and how to update existing graphs (e.g use |
Some follow-up work (including the changelog?) is happening in #3274 |
Changes in synapse v0.31.0 (2018-06-06) ====================================== Most notable change from v0.30.0 is to switch to python prometheus library to improve system stats reporting. WARNING this changes a number of prometheus metrics in a backwards-incompatible manner. For more details, see `docs/metrics-howto.rst <docs/metrics-howto.rst#removal-of-deprecated-metrics--time-based-counters-becoming-histograms-in-0310>`_. Bug Fixes: * Fix metric documentation tables (PR #3341) * Fix LaterGuage error handling (694968f) * Fix replication metrics (b7e7fd2) Changes in synapse v0.31.0-rc1 (2018-06-04) ========================================== Features: * Switch to the Python Prometheus library (PR #3256, #3274) * Let users leave the server notice room after joining (PR #3287) Changes: * daily user type phone home stats (PR #3264) * Use iter* methods for _filter_events_for_server (PR #3267) * Docs on consent bits (PR #3268) * Remove users from user directory on deactivate (PR #3277) * Avoid sending consent notice to guest users (PR #3288) * disable CPUMetrics if no /proc/self/stat (PR #3299) * Add local and loopback IPv6 addresses to url_preview_ip_range_blacklist (PR #3312) Thanks to @thegcat! * Consistently use six's iteritems and wrap lazy keys/values in list() if they're not meant to be lazy (PR #3307) * Add private IPv6 addresses to example config for url preview blacklist (PR #3317) Thanks to @thegcat! * Reduce stuck read-receipts: ignore depth when updating (PR #3318) * Put python's logs into Trial when running unit tests (PR #3319) Changes, python 3 migration: * Replace some more comparisons with six (PR #3243) Thanks to @NotAFile! * replace some iteritems with six (PR #3244) Thanks to @NotAFile! * Add batch_iter to utils (PR #3245) Thanks to @NotAFile! * use repr, not str (PR #3246) Thanks to @NotAFile! * Misc Python3 fixes (PR #3247) Thanks to @NotAFile! * Py3 storage/_base.py (PR #3278) Thanks to @NotAFile! * more six iteritems (PR #3279) Thanks to @NotAFile! * More Misc. py3 fixes (PR #3280) Thanks to @NotAFile! * remaining isintance fixes (PR #3281) Thanks to @NotAFile! * py3-ize state.py (PR #3283) Thanks to @NotAFile! * extend tox testing for py3 to avoid regressions (PR #3302) Thanks to @krombel! * use memoryview in py3 (PR #3303) Thanks to @NotAFile! Bugs: * Fix federation backfill bugs (PR #3261) * federation: fix LaterGauge usage (PR #3328) Thanks to @intelfx!
See #3218