
[metrics] add metrics to monitor concurrent tasks in network #472

Merged
merged 1 commit into main on Jul 29, 2022

Conversation

akichidis
Contributor

@akichidis akichidis commented Jul 11, 2022

Following the work on #463, I believe @velvia made a good point about adding metrics to monitor when we reach the maximum capacity on concurrent tasks. I took the opportunity to add those now, since we are still actively adding metrics, and we can keep them until things stabilise. Once we gain strong confidence in our nodes and consider this an implementation detail, we can probably remove them.

use prometheus::{default_registry, register_int_gauge_vec_with_registry, IntGaugeVec, Registry};
use std::sync::Arc;

pub trait NetworkMetrics {
Contributor Author

Ideally I would use only one struct to report the metrics for both worker and primary nodes, as those would normally be deployed as separate nodes. However, currently the worker & primary are deployed as part of the same binary in SUI, and this would lead to re-registering the same metric (which would lead to an error). So I had to split those, but to minimise the implementation I used a common interface.
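
To illustrate the split described here, below is a rough sketch of how a shared trait can back two separately registered metric sets; only NetworkMetrics and the worker_network_concurrent_tasks name appear in this diff, so the method name, struct layout, and the primary variant are illustrative guesses rather than the code in this PR.

use prometheus::{register_int_gauge_vec_with_registry, IntGaugeVec, Registry};

// Common interface so call sites don't care whether they run inside a
// worker or a primary; each implementation registers its gauges under a
// different prefix, avoiding the duplicate-registration error when both
// node types live in the same binary (and share one registry).
pub trait NetworkMetrics {
    fn network_concurrent_tasks(&self) -> &IntGaugeVec;
}

pub struct WorkerNetworkMetrics {
    network_concurrent_tasks: IntGaugeVec,
}

impl WorkerNetworkMetrics {
    pub fn new(registry: &Registry) -> Self {
        Self {
            network_concurrent_tasks: register_int_gauge_vec_with_registry!(
                "worker_network_concurrent_tasks",
                "The number of concurrent tasks running in the network connector",
                &["module", "network"],
                registry
            )
            .unwrap(),
        }
    }
}

impl NetworkMetrics for WorkerNetworkMetrics {
    fn network_concurrent_tasks(&self) -> &IntGaugeVec {
        &self.network_concurrent_tasks
    }
}

// A PrimaryNetworkMetrics twin would mirror this with a "primary_" prefix.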

Contributor

@huitseeker huitseeker left a comment

I'd be OK with the theory of what we're doing here (keeping a finger on the pulse of concurrency), but I'm wary of the very extensive code changes this is introducing and actually concerned we may never manage to either remove those metrics or maintain them well.

  1. is there another tack we could take to reduce the maintenance burden, using e.g. the recently introduced tokio-metrics? https://tokio.rs/blog/2022-02-announcing-tokio-metrics
  2. if not, is there a reduced, simpler variant of the metrics reporting that would e.g. wake up a background task every second and report the concurrency value in the bounded executors of each of the worker & primary? This would not require instrumentation of the whole code base, I feel (a rough sketch of that idea follows after this comment).

@velvia may have other ideas.
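
For reference, a minimal sketch of the background-reporter idea from point 2, assuming the bounded executor exposes its semaphore and configured capacity (both assumptions; the actual BoundedExecutor may not expose either), so the hot path never touches metrics:

use std::sync::Arc;
use prometheus::IntGauge;
use tokio::sync::Semaphore;
use tokio::time::{interval, Duration};

// Hypothetical reporter: once per second, sample how many permits of the
// bounded executor's semaphore are in use and push that value into a gauge.
pub fn spawn_concurrency_reporter(
    semaphore: Arc<Semaphore>, // assumed to be exposed by the BoundedExecutor
    capacity: usize,           // its configured bound
    gauge: IntGauge,
) {
    tokio::spawn(async move {
        let mut tick = interval(Duration::from_secs(1));
        loop {
            tick.tick().await;
            let in_flight = capacity.saturating_sub(semaphore.available_permits());
            gauge.set(in_flight as i64);
        }
    });
}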

@velvia
Contributor

velvia commented Jul 12, 2022

@huitseeker @akichidis just had a look at this. Here's my take:

I think in terms of manually maintaining our own metrics -- most of the change in this PR really has to do with adding an inner Metrics struct to each network facility. My guess is that this is going to be hard to avoid with this approach.
I had a thought that in the BoundedExecutor itself, where it acquires and releases permits, we could possibly update the gauge, which would only help a little bit.
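
A rough sketch of that idea, assuming a simplified BoundedExecutor built around a tokio semaphore (names and structure here are illustrative, not the actual narwhal type):

use std::sync::Arc;
use prometheus::IntGauge;
use tokio::sync::Semaphore;
use tokio::task::JoinHandle;

// Simplified BoundedExecutor: the gauge is bumped when a permit is
// acquired and decremented when the task finishes and the permit drops,
// so the instrumentation lives in one place instead of every handler.
pub struct BoundedExecutor {
    semaphore: Arc<Semaphore>,
    concurrent_tasks: IntGauge,
}

impl BoundedExecutor {
    pub async fn spawn<F>(&self, fut: F) -> JoinHandle<F::Output>
    where
        F: std::future::Future + Send + 'static,
        F::Output: Send + 'static,
    {
        let permit = self
            .semaphore
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        let gauge = self.concurrent_tasks.clone();
        gauge.inc();
        tokio::spawn(async move {
            let out = fut.await;
            drop(permit); // release the concurrency slot...
            gauge.dec();  // ...and reflect it in the metric
            out
        })
    }
}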

Is there a higher-level place which could collect this metric periodically? Then each network (e.g. primary, etc.) only has to expose a method, and the higher-level thing could call down to get the metric. This would be less intrusive, in that one would not need to pass in a separate Metrics struct just for monitoring this one thing.

As for Tokio Metrics, I think that is a good approach, though it requires some investment too. It should return many more metrics overall and be really good and rich, but it will likely take a few days.

@akichidis
Contributor Author

@huitseeker @velvia thanks both for your comments. So my thoughts regarding the above:

  1. Tokio metrics seem to be super useful and something we can/should evaluate. From a quick look though, they don't seem to offer a way to inject custom metrics, but rather report a series of standard ones (more info here; a rough usage sketch follows after this comment). Here we want to monitor custom values which don't fit in the realm of task / runtime metrics, or at least not in the obvious way we need them.
  2. @huitseeker your idea of having a separate task periodically report the values definitely sounds like a way for us to avoid injecting all this "inner" code into the handlers. The thing is that in the end we'll need to build and maintain this mechanism, and to be honest I don't know how much less hassle it will be compared to what we currently have. We'll need pretty much the tagging I've introduced (a bounded executor needs to be identified by the corresponding networking handler), and we'll definitely need a way to "push" the values to this background collector. It may seem that this PR is introducing many changes, but keep in mind that all the boilerplate is now there; if we want to add additional metrics, it will only be a few lines of change.

That being said, I am open to giving approach (2) mentioned above a quick try and comparing the complexity.
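
For context on point 1, a rough sketch of the tokio-metrics pattern from the linked announcement; the API and field names are taken from that post and may have changed since, and, as noted, there is no obvious hook for a custom bounded-executor gauge:

use std::time::Duration;
use tokio_metrics::TaskMonitor;

#[tokio::main]
async fn main() {
    let monitor = TaskMonitor::new();

    // Every future wrapped with `instrument` is tracked: polls,
    // scheduling delay, idle time, and so on.
    tokio::spawn(monitor.instrument(async {
        // ... handler work ...
    }));

    // Periodically sample the standard task metrics; there is no slot
    // here for the custom "concurrent tasks in the bounded executor"
    // value discussed above.
    for task_metrics in monitor.intervals().take(3) {
        println!("task metrics: {:?}", task_metrics);
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}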

Contributor

@huitseeker huitseeker left a comment

OK, after the experience of #559 I recognize how wise this is and I am now in favor of introducing this PR with minor changes.
I think this mostly needs a rebase.

&["module", "network"],
registry
)
.unwrap(),
Contributor

I'd like to not panic the node if this fails: metrics are not enough to justify a crash, IMHO.
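
One possible non-panicking variant, sketched with the names from the snippet above (the explicit registration, the tracing call, and the keep-the-unregistered-gauge fallback are assumptions, not code from this PR):

use prometheus::{opts, IntGaugeVec, Registry};

// Register the metric explicitly and log a warning on failure instead
// of unwrapping; the node keeps running with an unregistered gauge.
fn new_concurrent_tasks_gauge(registry: &Registry) -> IntGaugeVec {
    let gauge = IntGaugeVec::new(
        opts!(
            "worker_network_concurrent_tasks",
            "The number of concurrent tasks running in the network connector"
        ),
        &["module", "network"],
    )
    .expect("metric definition is statically valid");

    if let Err(e) = registry.register(Box::new(gauge.clone())) {
        tracing::warn!("failed to register metric: {}", e);
    }
    gauge
}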

Contributor Author

Ah, that should happen primarily if we have already registered the same metric somewhere else in the codebase. Having this here will protect us from accidentally declaring the same metric in two different places without noticing, which I believe is quite important. So my expectation is that this will be caught early in our e2e tests before it even hits staging, or, worst case, production.

Comment on lines 52 to 57
network_concurrent_tasks: register_int_gauge_vec_with_registry!(
"worker_network_concurrent_tasks",
"The number of concurrent tasks running in the network connector",
&["module", "network"],
registry
)
.unwrap(),
Contributor

see above re: the avoidable panic

Contributor Author

Similar to this #472 (comment)

@huitseeker
Contributor

@akichidis you may need to rebase this one to pass checks: we changed a linter in #485

@akichidis akichidis force-pushed the feat/add-metrics-on-bounded-executor branch 2 times, most recently from c6a5b00 to 04a2e75 on July 28, 2022 14:29
@akichidis
Contributor Author

@huitseeker @asonnino @bmwill @velvia I've rebased the PR on the latest master; could you please re-review?

Contributor

@huitseeker huitseeker left a comment

Thank you very much 🎉

@akichidis akichidis force-pushed the feat/add-metrics-on-bounded-executor branch from 25a0d25 to 51b02a0 on July 29, 2022 16:56
@akichidis akichidis force-pushed the feat/add-metrics-on-bounded-executor branch from 51b02a0 to e48a271 on July 29, 2022 17:36
@akichidis akichidis merged commit c17ba23 into main Jul 29, 2022
@akichidis akichidis deleted the feat/add-metrics-on-bounded-executor branch July 29, 2022 17:58
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 14, 2022
We operate an executor with a bound on the concurrent number of messages (see MystenLabs#463, MystenLabs#559, MystenLabs#706).
PR MystenLabs#472 added logging for the bound being hit.

We expect the executors to operate for a long time at this limit (e.g. in recovery situations).
The spammy logging is not useful.

This removes the logging of the concurrency bound being hit.
Fixes MystenLabs#759
huitseeker added a commit that referenced this pull request Aug 15, 2022
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 16, 2022
huitseeker added a commit that referenced this pull request Aug 16, 2022
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 30, 2022
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 30, 2022