Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Switch to the Python Prometheus library #3256

Merged
merged 23 commits into from
May 28, 2018
Merged

Switch to the Python Prometheus library #3256

merged 23 commits into from
May 28, 2018

Conversation

hawkowl
Copy link
Contributor

@hawkowl hawkowl commented May 22, 2018

See #3218

@erikjohnston
Copy link
Member

Awesome! I maybe should have mentioned that ideally the changes would be backwards compatible?

@erikjohnston
Copy link
Member

(I also added some metrics in #3252 I'm afraid)

@hawkowl hawkowl changed the base branch from master to develop May 22, 2018 15:58
@hawkowl hawkowl changed the title [WIP] Switch to the Python Prometheus library Switch to the Python Prometheus library May 23, 2018
@hawkowl hawkowl requested a review from erikjohnston May 23, 2018 01:35
Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woo, this looks great! I haven't gone through it to carefully, but I did run it on jki.re which picked up a few typos in the names.

The main thing that I think we're missing is keeping the user/system CPU metrics. We've had problems in the past where we see spikes in system CPU, so we'd really want to keep that I think.

metrics.register_callback(
"pending_destinations",
LaterGauge(
"synapse_federation_client_pending_destinations",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're in synapse_federation_transaction_queue_

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

push_rules_invalidation_counter = Counter(
"synapse_push_bulk_push_role_evaluator_push_rules_invalidation_counter", "")
push_rules_state_size_counter = Counter(
"synapse_push_bulk_push_role_evaluator_push_rules_state_size_counter", "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_rule_ not _role_

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

response_size = metrics.register_counter(
"response_size", labels=["method", "servlet", "tag"]
response_size = Counter(
"synapse_http_request_response_size", "", ["method", "servlet", "tag"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http_server rather than http_request

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

"python_gc_time",
"Time taken to GC (ms)",
["gen"],
buckets=[1, 2, 5, 10, 25, 50, 100, 250, 500, 1000],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gen 2 can sometimes take 30s, so probably worth adding a few more buckets. Getting an idea of large GC times gives us an insight into why things might occasionally be pausing for tens of seconds ,which is useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 2.5, 5, 7.5, 15, 30, 45, and 60 second buckets.

# Check if the metric is already registered. Unregister it, if so.
metric_name = "cache_%s_%s" % (cache_type, cache_name)
if metric_name in collectors_by_name.keys():
REGISTRY.unregister(collectors_by_name[metric_name])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would they already be registered?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests, because otherwise prom throws an error. We could get around this by making them top level and putting the names as labels always, as prom_client will throw an exception if you register something with the same name twice (like in tests).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, might be worth just putting a comment that its for tests.

@erikjohnston erikjohnston assigned hawkowl and unassigned erikjohnston May 23, 2018
@hawkowl
Copy link
Contributor Author

hawkowl commented May 23, 2018

Thanks for the review @erikjohnston.

I'm going to add back the CPU metrics, and then I'll put it back up.

Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running this on jki.re and it seems like the only issue is milliseconds vs seconds for the histogram metrics.

sql_query_timer = metrics.register_distribution("query_time", labels=["verb"])
sql_txn_timer = metrics.register_distribution("transaction_time", labels=["desc"])
sql_query_timer = Histogram("synapse_storage_query_time", "", ["verb"])
sql_txn_timer = Histogram("synapse_storage_transaction_time", "", ["desc"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default histogram buckets expect them to be observed in seconds rather than milliseconds? I think we'll need to go through and change all the observations to use seconds instead.

@erikjohnston erikjohnston assigned hawkowl and unassigned erikjohnston May 24, 2018
Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(That last approval was meant to be request changes... though once we have corrected the histogram units I think it'll be ready to merge!)

@erikjohnston
Copy link
Member

We probably want to add a note to the changelog and docs about which metrics have changed and how to update existing graphs (e.g use histogram_quantile(0.5, ...) to get at the averages)

@hawkowl hawkowl merged commit 81717e8 into develop May 28, 2018
@hawkowl hawkowl deleted the 3218-official-prom branch May 28, 2018 13:30
@richvdh
Copy link
Member

richvdh commented May 29, 2018

Some follow-up work (including the changelog?) is happening in #3274

richvdh added a commit that referenced this pull request Jun 4, 2018
fix bug introduced in #3256
neilisfragile added a commit that referenced this pull request Jun 6, 2018
Changes in synapse v0.31.0 (2018-06-06)
======================================

Most notable change from v0.30.0 is to switch to python prometheus library to improve system
stats reporting. WARNING this changes a number of prometheus metrics in a
backwards-incompatible manner. For more details, see
`docs/metrics-howto.rst <docs/metrics-howto.rst#removal-of-deprecated-metrics--time-based-counters-becoming-histograms-in-0310>`_.

Bug Fixes:

* Fix metric documentation tables (PR #3341)
* Fix LaterGuage error handling (694968f)
* Fix replication metrics (b7e7fd2)

Changes in synapse v0.31.0-rc1 (2018-06-04)
==========================================

Features:

* Switch to the Python Prometheus library (PR #3256, #3274)
* Let users leave the server notice room after joining (PR #3287)

Changes:

* daily user type phone home stats (PR #3264)
* Use iter* methods for _filter_events_for_server (PR #3267)
* Docs on consent bits (PR #3268)
* Remove users from user directory on deactivate (PR #3277)
* Avoid sending consent notice to guest users (PR #3288)
* disable CPUMetrics if no /proc/self/stat (PR #3299)
* Add local and loopback IPv6 addresses to url_preview_ip_range_blacklist (PR #3312) Thanks to @thegcat!
* Consistently use six's iteritems and wrap lazy keys/values in list() if they're not meant to be lazy (PR #3307)
* Add private IPv6 addresses to example config for url preview blacklist (PR #3317) Thanks to @thegcat!
* Reduce stuck read-receipts: ignore depth when updating (PR #3318)
* Put python's logs into Trial when running unit tests (PR #3319)

Changes, python 3 migration:

* Replace some more comparisons with six (PR #3243) Thanks to @NotAFile!
* replace some iteritems with six (PR #3244) Thanks to @NotAFile!
* Add batch_iter to utils (PR #3245) Thanks to @NotAFile!
* use repr, not str (PR #3246) Thanks to @NotAFile!
* Misc Python3 fixes (PR #3247) Thanks to @NotAFile!
* Py3 storage/_base.py (PR #3278) Thanks to @NotAFile!
* more six iteritems (PR #3279) Thanks to @NotAFile!
* More Misc. py3 fixes (PR #3280) Thanks to @NotAFile!
* remaining isintance fixes (PR #3281) Thanks to @NotAFile!
* py3-ize state.py (PR #3283) Thanks to @NotAFile!
* extend tox testing for py3 to avoid regressions (PR #3302) Thanks to @krombel!
* use memoryview in py3 (PR #3303) Thanks to @NotAFile!

Bugs:

* Fix federation backfill bugs (PR #3261)
* federation: fix LaterGauge usage (PR #3328) Thanks to @intelfx!
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants