-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Conversation
Thereby incorporate bandwidth measurement along the lines previously done by libp2p itself.
For wasm32 we need to enable unstable features to make `task::Builder::local` available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
I have not been involved much in the noise changes, thus I would like to ask others to review that part. In case no one is available I can take a deeper look myself.
I would suggest the needsburnin
label to make sure we don't introduce any regressions. What do you think?
@twittner made a suggestion for simplification for the bandwidth-tracking which I'm now trying to incorporate. Essentially it may not be necessary to collect multiple data points and the calculation could be done on-demand in the accessor methods. I will update this PR once I tried that out. |
I've been told that only works for branches that are not on forks, so wouldn't work for this PR. |
So, with the most recent commits, I went down the same path as libp2p/rust-libp2p#1670, i.e. to push any further bandwidth calculations to the outer layers. This seems the most efficient as it allows each consumer to perform calculations based on the total inbound/outbound bytes at its own pace. @mxinden would you be so kind to take another look, since I also changed the metrics to only report totals instead of precomputed averages. |
@tomaka I'd appreciate your opinion on whether we should enable the sending of legacy handshake payloads for |
We do need the legacy handshake for now. Polkadot 0.8.22 seems to be stable, but it's only been released a few days ago. |
@@ -1133,7 +1133,7 @@ struct Metrics { | |||
kbuckets_num_nodes: GaugeVec<U64>, | |||
listeners_local_addresses: Gauge<U64>, | |||
listeners_errors_total: Counter<U64>, | |||
network_per_sec_bytes: GaugeVec<U64>, | |||
network_bytes_total: GaugeVec<U64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
network_bytes_total: GaugeVec<U64>, | |
network_bytes_total: CounterVec<U64>, |
Gauge
is for metrics that go up and down, and Counter
is for metrics that only ever increase over time except for restarts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm mistaken, but from the prometheus
crate docs, counters have essentially inc()
and inc_by()
. Gauges can be set
, which is what we need to do for these totals, i.e. these are counter values from existing counters (the running totals) reported as a gauge over time, so this is a monotonic gauge. I'm happy to be corrected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The problem is that this is not only a change in the programmatic API of the prometheus
library.
Whether a metric is a counter or a gauge is ultimately reported to the Prometheus server running as a separate process, and then (I think?) to Grafana.
Since we know that the value can only every increase, we could use Counter::get
to get the current value, then call inc_by
.
Doing this would be racy if there were multiple clones of that Counter
, but since we only ever access that Counter
from a single location, we know that nothing will modify the counter between the call to get
and the call to inc_by
.
@mxinden Any opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem of updating a counter from an existing counter is often discussed in the Prometheus ecosystem. In general we suggest to implement a custom collector, see robustperception for general information, tikv/rust-prometheus#303 specifically for the Rust client and my comment at the very end tikv/rust-prometheus#303 (comment).
The reason why Counter
does not implement set
is, that it allows newcomers to easily shoot themselfes in the foot and thus it was not deemed worth the additional ergonomic for this advanced use case.
I see 3 ways moving forward:
-
Keep it as a
Gauge
. This would expose the metric with theTYPE gauge
header. As @tomaka mentioned above this is ingested by Prometheus and forwarded to Grafana, but only used for query suggestions, thus not a big issue. -
Use a
Counter
. As mentioned by @tomaka this is potentially racy if it ever gets set concurrently. -
Use a custom
impl Collector
. Quite verbose.
I would be fine with using a Gauge
here for now. I would just ask to add a comment above stating that it is monotonic. In the future we can look into a custom Collector
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mxinden Thanks for your input.
I would be fine with using a Gauge here for now. I would just ask to add a comment above stating that it is monotonic. In the future we can look into a custom Collector.
I added a comment to the code. I don't have a strong preference in any case.
Out of curiosity, if gauges "must go up and down" and counters "must only increase", what do you use for a monotonically decreasing metric? I'm not sure I understand the significance of the discussions of counter vs gauge. It would almost seem more sensible to directly offer and use Atomic
for both use-cases, designating it as a "gauge" or "counter" based on the semantic meaning of the metric. I guess I just don't find the distinction between gauge and counter very useful, technically, especially if the counter can "jump", both forward and back to 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you use for a monotonically decreasing metric
So far I have never come across a use case for a monotonically decreasing metric. Can you come up with something that can not also be modeled by inverting a monotonically increasing counter?
I'm not sure I understand the significance of the discussions of counter vs gauge.
The benefit of differentiating the two comes into play only at query (aggregation time). E.g. one can apply the rate
function to a counter knowing that the underlying data never breaks monotonicity. I think the best summary is given by robust perception here.
Let me know if the above makes sense @romanb. More than happy to put more effort into a detailed explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far I have never come across a use case for a monotonically decreasing metric. [..]
A metric for a bounded resource with fixed capacity and its exhaustion over time? I don't know. The restriction not to be able to do so just seems a bit arbitrary.
The benefit of differentiating the two comes into play only at query (aggregation time). E.g. one can apply the rate function to a counter knowing that the underlying data never breaks monotonicity. I think the best summary is given by robust perception here.
Ok, so there are special query functions with specific characteristics for gauges and counters, like rate()
(counter) vs deriv()
for gauges, both essentially yielding the derivative but the former makes special use of counter properties to better handle things like resets on restarts. Thanks for the pointers @mxinden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
push any further bandwidth calculations to the outer layers.
👍 thanks for following all the way through with this.
@@ -1133,7 +1133,7 @@ struct Metrics { | |||
kbuckets_num_nodes: GaugeVec<U64>, | |||
listeners_local_addresses: Gauge<U64>, | |||
listeners_errors_total: Counter<U64>, | |||
network_per_sec_bytes: GaugeVec<U64>, | |||
network_bytes_total: GaugeVec<U64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem of updating a counter from an existing counter is often discussed in the Prometheus ecosystem. In general we suggest to implement a custom collector, see robustperception for general information, tikv/rust-prometheus#303 specifically for the Rust client and my comment at the very end tikv/rust-prometheus#303 (comment).
The reason why Counter
does not implement set
is, that it allows newcomers to easily shoot themselfes in the foot and thus it was not deemed worth the additional ergonomic for this advanced use case.
I see 3 ways moving forward:
-
Keep it as a
Gauge
. This would expose the metric with theTYPE gauge
header. As @tomaka mentioned above this is ingested by Prometheus and forwarded to Grafana, but only used for query suggestions, thus not a big issue. -
Use a
Counter
. As mentioned by @tomaka this is potentially racy if it ever gets set concurrently. -
Use a custom
impl Collector
. Quite verbose.
I would be fine with using a Gauge
here for now. I would just ask to add a comment above stating that it is monotonic. In the future we can look into a custom Collector
.
I'm marking as "needs-burnin" to note that this PR needs testing and prevent accidental merges. |
Deployed. Node health can be followed here: http://34.71.135.129:9090/graph?g0.range_input=6h&g0.expr=polkadot_sync_peers&g0.tab=0 |
bot rebase (doing this to restart the tests) |
bot rebase |
Rebasing. |
bot rebase |
Rebasing. |
Is there any objection to merge this? Since restarting the CI jobs doesn't seem to work either, could you please push a commit? |
bot merge |
Trying merge. |
commit f8c83bd Author: Roman Borschel <[email protected]> Date: Tue Aug 18 07:59:32 2020 +0200 Add support for sourced metrics. (paritytech#6895) * Add support for sourced metrics. A sourced metric is a metric that obtains its values from an existing source, rather than the values being independently recorded. It thus allows collecting metrics from existing counters or gauges without having to duplicate them in a dedicated prometheus counter or gauge (and hence another atomic value). The first use-case is to feed the bandwidth counters from libp2p directly into prometheus. * Tabs, not spaces. * Tweak bandwidth counter registration. * Add debug assertion for variable labels and values. * Document monotonicity requirement for sourced counters. * CI * Update client/network/src/service.rs Co-authored-by: Max Inden <[email protected]> Co-authored-by: Max Inden <[email protected]> commit 8e1ed7d Author: Shawn Tabrizi <[email protected]> Date: Mon Aug 17 22:59:23 2020 +0200 WeightInfo for System, Timestamp, and Utility (paritytech#6868) * initial updates to system * fix compile * Update writer.rs * update weights * finish system weights * timestamp weights * utility weight * Fix overflow in weight calculations * add back weight notes * Update for whitelisted benchmarks * add trait bounds * Revert "add trait bounds" This reverts commit 12b08b7. * Update weights for unaccounted for read commit 399421a Author: Wei Tang <[email protected]> Date: Mon Aug 17 21:07:30 2020 +0200 Derive Clone for AlwaysCanAuthor, NeverCanAuthor, CanAuthorWithNativeVersion (paritytech#6906) commit 287ecc2 Author: Wei Tang <[email protected]> Date: Mon Aug 17 19:36:29 2020 +0200 pow: add access to pre-digest for algorithm verifiers (paritytech#6900) * pow: fetch pre-runtime digest to verifier * Add Other error type * Fix log target and change docs to refer to pre_runtime commit 488b7c7 Author: Wei Tang <[email protected]> Date: Mon Aug 17 13:41:09 2020 +0200 babe, aura, pow: only call check_inherents if authoring version is compatible (paritytech#6862) * pow: check can_author_with before calling check_inherents * babe: check can_author_with before calling check_inherents * aura: check can_author_with before calling check_inherents * Fix node and node template compile * Add missing comma * Put each parameter on its own line * Add debug print * Fix line width too long * Fix pow line width issue commit fc743da Author: Pierre Krieger <[email protected]> Date: Mon Aug 17 11:19:16 2020 +0200 Add a DirectedGossip struct (paritytech#6803) * Add a DirectedGossip struct * Move protocol from prototype::new to biuld * More traits impls * Explain ordering * Apply suggestions from code review Co-authored-by: Toralf Wittner <[email protected]> * Address concerns * Add basic test * Concerns * More concerns * Remove QueueSenderPrototype * Rename * Apply suggestions from code review Co-authored-by: Max Inden <[email protected]> Co-authored-by: Toralf Wittner <[email protected]> Co-authored-by: parity-processbot <> Co-authored-by: Max Inden <[email protected]> commit 0079140 Author: Bastian Köcher <[email protected]> Date: Sun Aug 16 00:05:36 2020 +0200 Don't take the origin in `can_set_code` (paritytech#6899) It makes no sense that `can_set_code` takes the origin for checking it. Everybody reusing this function is only interested in the other checks that are done by this function. The origin should be checked by every dispatchable individually. commit cd3b62b Author: Seun Lanlege <[email protected]> Date: Sat Aug 15 10:08:31 2020 +0100 RpcHandlers Refactorings (paritytech#6846) * allow access to the underlying Pubsub instance from RpcHandlers * bump Cargo.lock * no more Arc<RpcHandlers> * bump Cargo.lock * Debug,. * Arc<RpcHandlers> * RpcHandler * RpcHandlers::io_handler * remove chain spec from cli * address pr comments * remove stray newline Co-authored-by: Ashley <[email protected]> Co-authored-by: Tomasz Drwięga <[email protected]> Co-authored-by: Ashley <[email protected]> commit eec7d71 Author: Max Inden <[email protected]> Date: Fri Aug 14 18:15:45 2020 +0200 client/authority-discovery: Revert query interval change (paritytech#6897) Revert the accidental query interval change from every one minute back to every 10 minutes. commit 13b0650 Author: Roman Borschel <[email protected]> Date: Fri Aug 14 10:41:47 2020 +0200 Update to libp2p-0.23. (paritytech#6870) * Update to libp2p-0.23. Thereby incorporate bandwidth measurement along the lines previously done by libp2p itself. * Tweak dependencies for wasm32 compilation. For wasm32 we need to enable unstable features to make `task::Builder::local` available. * Simplify dependencies. * Simplify. Leave the calculation of bytes sent/received per second to the outer layers of the code, subject to their own individual update intervals. * Cleanup * Re-add lost dev dependency. * Avoid division by zero. * Remove redundant metric. * Enable sending of noise legacy handshakes. * Add comment about monotonic gauge. * CI commit 0e703a5 Author: Alan Sapede <[email protected]> Date: Fri Aug 14 04:15:59 2020 -0400 Adds debug logs to EVM frame (paritytech#6887) commit f16cbc1 Author: Kian Paimani <[email protected]> Date: Thu Aug 13 23:30:22 2020 +0200 More renaming to move away from phragmen. (paritytech#6886) commit 8993a75 Author: André Silva <[email protected]> Date: Thu Aug 13 19:38:14 2020 +0100 network: don't log re-discovered addresses (paritytech#6881) * network: move LruHashSet to network crate utils * network: don't log re-discovered external addresses * Update client/network/src/utils.rs Co-authored-by: mattrutherford <[email protected]> Co-authored-by: mattrutherford <[email protected]> commit 4d3c948 Author: Alexander Popiak <[email protected]> Date: Thu Aug 13 18:54:05 2020 +0200 add runtime migrations to release notes/changelog (paritytech#6875) commit d019a66 Author: Wei Tang <[email protected]> Date: Thu Aug 13 14:53:42 2020 +0200 pallet-evm: avoid double fee payment (paritytech#6858) * pallet-evm: avoid double fee payment * Only skip fee payment for successful calls commit ed4f7a1 Author: Bastian Köcher <[email protected]> Date: Wed Aug 12 21:35:10 2020 +0200 Make `HexDisplay` useable in `no_std` (paritytech#6883) Actually I use this quite often when debugging some WASM bugs and there is no harm in enabling it by default. Before I just always copied it everytime I needed it. commit 473a23f Author: Max Inden <[email protected]> Date: Wed Aug 12 16:16:40 2020 +0200 client/authority-discovery: Introduce AuthorityDiscoveryService (paritytech#6760) * client/authority-discovery: Rename AuthorityDiscovery to XXXWorker * client/authority-discovery: Introduce AuthorityDiscoveryService Add a basic `AuthorityDiscoveryService` implementation which enables callers to get the addresses for a given `AuthorityId` from the local cache. * client/authority-discovery: Split into worker and service mod Move `Service` and `Worker` to their own Rust modules resulting in the following file structure. ├── build.rs ├── Cargo.toml └── src ├── error.rs ├── lib.rs ├── service.rs ├── tests.rs ├── worker │ ├── addr_cache.rs │ ├── schema │ │ └── dht.proto │ └── tests.rs └── worker.rs * client/authority-discovery: Cache PeerId -> AuthorityId mapping * client/authority-discovery: Update priority group on interval Instead of updating the authority discovery peerset priority group each time a new DHT value is found, update it regularly on an interval. This removes the need for deterministic random selection. Instead of trying to return a random stable set of `Multiaddr`s, the `AddrCache` now returns a random set on each call. * client/authority-discovery: Implement Service::get_authority_id * client/authority-discovery: Use HashMap instead of BTreeMap * client/authority-discovery: Rework priority group interval * client/authority-discovery: Fix comment * bin/node/cli: Update authority discovery constructor * client/authority-discovery: Fuse from_service receiver * client/authority-discovery: Remove Rng import * client/authority-discovery: Ignore Multiaddr without PeerId * client/authority-discovery/service: Add note on returned None * client/authority-discovery/addr_cache: Replace double clone with deref commit c495f89 Author: Cecile Tonglet <[email protected]> Date: Wed Aug 12 16:07:11 2020 +0200 Add async test helper to timeout and provide a task_executor automatically (paritytech#6651) * Initial commit Forked at: 60e3a69 Parent branch: origin/master * Add async test helper to timeout and provide a task_executor automatically * simplify error message to avoid difference between CI and locally * forgot env var * Use runtime env var instead of build env var * Rename variable to SUBSTRATE_TEST_TIMEOUT * CLEANUP Forked at: 60e3a69 Parent branch: origin/master * Apply suggestions from code review Co-authored-by: Bastian Köcher <[email protected]> * Re-export from test-utils * Default value to 120 * fix wrong crate in ci * Revert "Default value to 120" This reverts commit 8e45871. * Fix version * WIP Forked at: 60e3a69 Parent branch: origin/master * WIP Forked at: 60e3a69 Parent branch: origin/master * WIP Forked at: 60e3a69 Parent branch: origin/master * remove feature flag * fix missing dependency * CLEANUP Forked at: 60e3a69 Parent branch: origin/master * fix test * Removed autotests=false * Some doc... * Apply suggestions from code review Co-authored-by: Bastian Köcher <[email protected]> * WIP Forked at: 60e3a69 Parent branch: origin/master * WIP Forked at: 60e3a69 Parent branch: origin/master * Update test-utils/src/lib.rs Co-authored-by: Bastian Köcher <[email protected]> commit 0c3cdf1 Author: mattrutherford <[email protected]> Date: Wed Aug 12 12:53:21 2020 +0100 Implement tracing::Event handling & parent_id for spans and events (paritytech#6672) * implement events handling, implement parent_id for spans & events * add events to sp_io::storage * update test * add tests * adjust limit * let tracing crate handle parent_ids * re-enable current-id tracking * add test for threads with CurrentSpan * fix log level * remove redundant check for non wasm traces * remove duplicate definition in test * Adding conditional events API * prefer explicit parent_id over current, enhance test * limit changes to client::tracing event implementation * remove From impl due to fallback required on parent_id * implement SPAN_LIMIT change event log output * change version of tracing-core * update dependancies * revert limit * remove duplicate dependency * Apply suggestions from code review Co-authored-by: Bastian Köcher <[email protected]> Co-authored-by: Matt Rutherford <[email protected]> Co-authored-by: Benjamin Kampmann <[email protected]> Co-authored-by: Bastian Köcher <[email protected]> commit 5b809d2 Author: Wei Tang <[email protected]> Date: Wed Aug 12 12:46:28 2020 +0200 pallet-evm: fix wrong logic in mutate_account_basic (paritytech#6786) * pallet-evm: fix wrong logic in mutate_account_basic * Add test for mutate account commit f6d66db Author: Pierre Krieger <[email protected]> Date: Wed Aug 12 11:58:01 2020 +0200 Add a warning if users pass --sentry or --sentry-nodes (paritytech#6779) * Add a warning if users pass --sentry or --sentry-nodes * Apply suggestions from code review Co-authored-by: Max Inden <[email protected]> * Fix text Co-authored-by: parity-processbot <> Co-authored-by: Max Inden <[email protected]> commit d7979d0 Author: Shaopeng Wang <[email protected]> Date: Wed Aug 12 21:21:36 2020 +1200 Implement 'transactional' annotation for runtime functions. (paritytech#6763) * Implement 'transactional' annotation for runtime functions. * Allow function attributes for dispatchable calls in decl_module. * decl_module docs: add transactional function example. * decl_module docs: add function attributes notes. * Fix license header. commit d4efdf0 Author: Pierre Krieger <[email protected]> Date: Wed Aug 12 10:58:16 2020 +0200 Fuse the import queue receiver (paritytech#6876) * Fix the import queue receiver * Add logging commit a20fbd5 Author: André Silva <[email protected]> Date: Tue Aug 11 22:21:45 2020 +0100 docs: fix references to code of conduct document (paritytech#6879) commit e7cc595 Author: h4x3rotab <[email protected]> Date: Wed Aug 12 04:12:34 2020 +0800 Add Phala Network SS58 address type (paritytech#6758) commit fe3fc04 Author: André Silva <[email protected]> Date: Tue Aug 11 20:55:15 2020 +0100 docs: convert code of conduct to markdown (paritytech#6878) commit 72addfa Author: Kian Paimani <[email protected]> Date: Tue Aug 11 17:07:17 2020 +0200 Fix wrong staking doc about transaction payment. (paritytech#6873) * Fx paritytech#4616 * Fix paritytech#4616 commit 4064378 Author: André Silva <[email protected]> Date: Tue Aug 11 16:05:59 2020 +0100 grandpa: change some logging from trace to debug (paritytech#6872) * grandpa: change some logging from trace to debug * grandpa: cleanup unused import commit 6f57582 Author: Nikolay Volf <[email protected]> Date: Tue Aug 11 18:05:31 2020 +0300 Move to upstream wasmtime, refactor globals snapshot (paritytech#6759) * refactor globals snapshot * ignore test * update pwasm-utils ref * line width * add doc comment for internal struct * add explanation for iteration * Demote rustdoc to a comment * use 0.14 Co-authored-by: Sergei Shulepov <[email protected]> commit a362997 Author: Arkadiy Paronyan <[email protected]> Date: Tue Aug 11 16:12:30 2020 +0200 Block packet size limit (paritytech#6398) * Block packet size limit * Update client/network/src/protocol.rs Co-authored-by: Pierre Krieger <[email protected]> * Add block response limit Co-authored-by: Pierre Krieger <[email protected]>
This PR updates substrate to
libp2p-0.23
. The main changes are the second step in noise spec-compatibility upgrades and simplified bandwidth tracking in libp2p itself.Regarding the noise upgrade, so far I did not enable the legacy mode here, assuming that the
libp2p-0.22
upgrade is already widely deployed. If we wish to be conservative, I can enable sending of the legacy handshake payloads. Please let me know.Regarding the bandwidth tracking, I tried to incorporate roughly the same moving-average calculation for inbound and outbound network traffic that was previously in libp2p, i.e. to maintain the existing functionality and APIs of
sc-network
.