feat: add task counter pairs #6114

Merged: 1 commit merged on Jun 9, 2024

conradludgate
Contributor

@conradludgate conradludgate commented Oct 27, 2023

Motivation

#4073 (comment)

Metrics like active_tasks_count or injection_queue_depth are fast-moving gauges, and even taking a snapshot every few seconds doesn't say much about what's going on inside Tokio. It would be better to use two counters: one for additions and one for removals.

We're hoping to add a Prometheus exporter for the Tokio metrics information, but a sample interval of 15 seconds will likely miss a lot of task spikes. I could implement some level of eager aggregation, but as the linked comment says, you can still miss some even with a sample interval of 500 ms.
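To illustrate the motivation, here is a minimal sketch of how a scraper could use a pair of counters. The `Snapshot` type and `spawned_between` function are hypothetical names made up for this example; they are not part of Tokio's API.

```rust
/// Hypothetical snapshot of two monotonic counters, as a scraper would read them.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Snapshot {
    /// Total number of tasks ever spawned.
    pub spawned: u64,
    /// Total number of tasks that have finished.
    pub completed: u64,
}

impl Snapshot {
    /// Tasks alive at the instant of the snapshot. This is all a gauge can report.
    pub fn active(&self) -> u64 {
        self.spawned - self.completed
    }
}

/// Tasks spawned between two scrapes. Tasks that started and finished
/// entirely between the scrapes are still counted here.
pub fn spawned_between(prev: Snapshot, next: Snapshot) -> u64 {
    next.spawned - prev.spawned
}
```

If 1,000 short-lived tasks run between two scrapes, `active()` is unchanged at both scrape points, while `spawned_between` still reports the 1,000 tasks.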

Solution

In CountedLinkedList, replace the count: usize with a pair of u64s that can only be incremented. One u64 for added items and one for removed items.

Open to bikeshedding on the terminology
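A rough sketch of the idea, assuming a simplified list backed by `VecDeque` rather than Tokio's intrusive linked list. `CountedList` and its method names are illustrative only and do not reproduce the actual `CountedLinkedList` code.

```rust
use std::collections::VecDeque;

/// Simplified stand-in for a counted list (hypothetical, for illustration).
pub struct CountedList<T> {
    items: VecDeque<T>,
    /// Only ever incremented: total items pushed.
    added: u64,
    /// Only ever incremented: total items popped.
    removed: u64,
}

impl<T> CountedList<T> {
    pub fn new() -> Self {
        Self { items: VecDeque::new(), added: 0, removed: 0 }
    }

    pub fn push_front(&mut self, item: T) {
        self.items.push_front(item);
        self.added += 1;
    }

    pub fn pop_back(&mut self) -> Option<T> {
        let item = self.items.pop_back();
        // Count a removal only when something was actually removed.
        if item.is_some() {
            self.removed += 1;
        }
        item
    }

    pub fn added(&self) -> u64 {
        self.added
    }

    pub fn removed(&self) -> u64 {
        self.removed
    }

    /// The old gauge value is still recoverable as the difference.
    pub fn len(&self) -> u64 {
        self.added - self.removed
    }
}
```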


Open questions

  • should the active task API return all 3 values in 1, rather than require 3 separate lock calls?
  • what other APIs are currently gauges and should be counters?

@github-actions github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR labels Oct 27, 2023
@conradludgate
Contributor Author

Gauge-like metrics:

  • num_workers (a constant quantity, doesn't count)
  • num_blocking_threads
  • num_idle_blocking_threads
  • injection_queue_depth
  • worker_local_queue_depth
  • blocking_queue_depth

num_blocking_threads

Can be treated as blocking_threads_created - blocking_threads_released. Would require 2 atomics, unless it is acceptable to make this a u64 which encodes 2 u32s (how many apps will create 4 billion blocking threads?!)
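A sketch of the "one u64 encoding two u32s" option, assuming the created count lives in the high 32 bits and the released count in the low 32 bits. `PackedThreadCounters` and its methods are hypothetical names, and the layout is one possible choice rather than anything decided in this PR. Note that `AtomicU64` is not available on every target.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Adding this value increments the high half (threads created) by one.
const CREATED_ONE: u64 = 1 << 32;
/// Mask selecting the low half (threads released).
const LOW_MASK: u64 = 0xFFFF_FFFF;

/// Two u32 counters packed into one atomic (hypothetical layout).
pub struct PackedThreadCounters(AtomicU64);

impl PackedThreadCounters {
    pub const fn new() -> Self {
        Self(AtomicU64::new(0))
    }

    pub fn thread_created(&self) {
        self.0.fetch_add(CREATED_ONE, Ordering::Relaxed);
    }

    pub fn thread_released(&self) {
        // Caveat: after 2^32 releases the low half would carry into the
        // high half; this sketch assumes that never happens.
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    /// A single load yields a mutually consistent (created, released) pair.
    pub fn snapshot(&self) -> (u32, u32) {
        let v = self.0.load(Ordering::Relaxed);
        ((v >> 32) as u32, (v & LOW_MASK) as u32)
    }

    /// The gauge value: created minus released.
    pub fn num_threads(&self) -> u32 {
        let (created, released) = self.snapshot();
        created.wrapping_sub(released)
    }
}
```

The upside of packing is that both values come from the same atomic load, so the difference can never be observed as negative.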

num_idle_blocking_threads

Same as above, although likely will need 2 u64 counters. blocking_active_total - blocking_idle_total.

injection_queue_depth

injection_pushed - injection_popped. Requires 2 u64 atomic counters.

worker_local_queue_depth

Requires no additional counters, since we already have head and tail. They are u32 quantities though and will likely overflow, which makes this tricky. I appreciate that adding extra atomics to this path might introduce a noticeable latency spike, so I am fine with ignoring this one.
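To make the overflow concern concrete, here is a simplified sketch that treats head and tail as plain wrapping u32 indices. The real local queue's layout is more involved, and `queue_depth` and `pushed_between` are hypothetical helpers.

```rust
/// Depth of a queue whose head and tail are wrapping u32 indices.
/// Wrapping subtraction stays correct even after tail has overflowed.
pub fn queue_depth(head: u32, tail: u32) -> u32 {
    tail.wrapping_sub(head)
}

/// Items pushed between two scrapes of the tail index.
/// Only correct if fewer than 2^32 pushes happened in between; a scraper
/// that expects a monotonic counter would also see the wrap as a reset.
pub fn pushed_between(prev_tail: u32, next_tail: u32) -> u32 {
    next_tail.wrapping_sub(prev_tail)
}
```

The depth itself is unaffected by wrapping; the tricky part is exporting head and tail as ever-increasing counters, because a 32-bit value wraps quickly on a busy worker.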

blocking_queue_depth

Same as the other blocking gauges.

@hawkw
Member

hawkw commented Oct 30, 2023

IMO using two counters rather than a gauge is definitely more correct for these metrics, so I'm 👍 on this change.

@Darksonn Darksonn added A-tokio Area: The main tokio crate M-metrics Module: tokio/runtime/metrics labels Nov 5, 2023
@Darksonn Darksonn requested a review from hawkw November 5, 2023 14:13
@Darksonn
Contributor

Any status update on this?

@conradludgate
Contributor Author

I'll try and fix up the flaky tests tomorrow.

Any opinions on the API? Since it's likely that the pair will be accessed together and not separately, taking 2 locks rather than just 1 is a bit unfortunate. This should probably return a tuple instead of having 2 functions.

@Darksonn
Contributor

Returning a tuple makes sense to me. You could even define a struct with two fields to give better names than .0 and .1 to the two properties.
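A sketch of what that suggestion could look like, assuming the counters sit behind a `Mutex`. `TaskCounts` and `TaskList` are hypothetical names for illustration, not the types used in the PR.

```rust
use std::sync::Mutex;

/// Named fields instead of `.0` and `.1` (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TaskCounts {
    pub added: u64,
    pub removed: u64,
}

impl TaskCounts {
    /// Number of tasks currently alive.
    pub fn active(&self) -> u64 {
        self.added - self.removed
    }
}

/// Stand-in for a lock-protected task list (hypothetical type).
pub struct TaskList {
    counts: Mutex<TaskCounts>,
}

impl TaskList {
    pub fn new() -> Self {
        Self { counts: Mutex::new(TaskCounts { added: 0, removed: 0 }) }
    }

    pub fn record_added(&self) {
        self.counts.lock().unwrap().added += 1;
    }

    pub fn record_removed(&self) {
        self.counts.lock().unwrap().removed += 1;
    }

    /// One lock acquisition returns both counters, so they are consistent
    /// with each other.
    pub fn task_counts(&self) -> TaskCounts {
        *self.counts.lock().unwrap()
    }
}
```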

@conradludgate conradludgate force-pushed the metrics-counter-pairs branch 8 times, most recently from 4ad7dd4 to 5355563 Compare November 28, 2023 11:08
@Darksonn
Contributor

Hi, it looks like the conflicting PR has been merged now. Sorry that it took so long to get back to you after that. Are you still interested in working on this?

@conradludgate
Contributor Author

> Are you still interested in working on this?

Yes, I will rebase accordingly. Are there any other changes you think should be included?

@Darksonn
Contributor

Hmm, overall it looks good, but I don't love the naming of CounterPair and CounterPair::len.

@conradludgate
Contributor Author

Since the sharded list makes use of atomics, I've moved from added/removed to added/count so that is_empty() only needs 1 atomic access.
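A sketch of the added/count layout described above, with the removed total derived from the other two. `AddedAndCount` and its method names are hypothetical, and the sharding itself is left out.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Tracks total insertions plus the current length (hypothetical type).
pub struct AddedAndCount {
    /// Only ever incremented.
    added: AtomicU64,
    /// Current number of items; goes up and down.
    count: AtomicUsize,
}

impl AddedAndCount {
    pub const fn new() -> Self {
        Self { added: AtomicU64::new(0), count: AtomicUsize::new(0) }
    }

    pub fn on_insert(&self) {
        self.added.fetch_add(1, Ordering::Relaxed);
        self.count.fetch_add(1, Ordering::Relaxed);
    }

    pub fn on_remove(&self) {
        self.count.fetch_sub(1, Ordering::Relaxed);
    }

    /// Needs only one atomic load, unlike comparing added against removed.
    pub fn is_empty(&self) -> bool {
        self.count.load(Ordering::Relaxed) == 0
    }

    pub fn len(&self) -> usize {
        self.count.load(Ordering::Relaxed)
    }

    pub fn added(&self) -> u64 {
        self.added.load(Ordering::Relaxed)
    }

    /// Derived from two separate loads, so it can be momentarily stale
    /// under concurrent updates; saturating_sub avoids underflow.
    pub fn removed(&self) -> u64 {
        self.added().saturating_sub(self.len() as u64)
    }
}
```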

> Hmm, overall it looks good, but I don't love the naming of CounterPair and CounterPair::len.

I'm tempted to remove it then and we can stick with start_task_count and active_task_count functions.

@conradludgate
Contributor Author

Also renamed start_tasks to spawned_tasks, as it is likely more intuitive.

@Darksonn
Contributor

Darksonn commented May 3, 2024

There's a CI failure:

FAIL [   0.386s] tokio::rt_metrics num_active_tasks

--- STDOUT:              tokio::rt_metrics num_active_tasks ---

running 1 test
test num_active_tasks ... FAILED

failures:

failures:
    num_active_tasks

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 22 filtered out; finished in 0.33s


--- STDERR:              tokio::rt_metrics num_active_tasks ---
thread 'num_active_tasks' panicked at tokio/tests/rt_metrics.rs:104:5:
assertion `left == right` failed
  left: 0
 right: 1
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:72:14
   2: core::panicking::assert_failed_inner
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:343:17
   3: core::panicking::assert_failed
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:298:5
   4: rt_metrics::num_active_tasks
             at ./tests/rt_metrics.rs:104:5
   5: rt_metrics::num_active_tasks::{{closure}}
             at ./tests/rt_metrics.rs:85:22
   6: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@conradludgate
Contributor Author

> There's a CI failure

Seems to affect only 32-bit ARM in the multi-threaded case. My guess is that the self.count.fetch_sub(1, Ordering::Relaxed) in the worker thread is not being synchronised before the self.count.load(Ordering::Relaxed) in the test thread when reading the metrics.

I don't think it makes too much sense to use a stronger ordering. Maybe in cfg(test) we could go with SeqCst, but I don't like that idea very much. I think I would rather remove the assertion for the multi-threaded test, or at least only include it on x86_64/aarch64, where it does seem to always work.
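For reference, one generic pattern for tests that read a counter updated with Relaxed ordering from another thread is to poll until a deadline instead of asserting once. This is only a sketch of that pattern with a hypothetical `wait_for_count` helper; it is not what this PR does, and it has not been checked against the failing CI job.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{Duration, Instant};

/// Polls `counter` until it equals `expected` or `timeout` elapses.
/// Returns true if the expected value was observed in time.
pub fn wait_for_count(counter: &AtomicUsize, expected: usize, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if counter.load(Ordering::Relaxed) == expected {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Give other threads a chance to run before checking again.
        std::thread::yield_now();
    }
}
```

A test would then assert on the return value of `wait_for_count` rather than on a single load, which tolerates the update arriving slightly after the point where the test first looks.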

@conradludgate
Contributor Author

Opted to remove the flaky assert - it always works locally for me. I can't figure out a reliable construction to guarantee the test passes with only Relaxed ordering. I think this is good enough.

@Darksonn
Contributor

Any updates on this?

Contributor

@Darksonn Darksonn left a comment

Thank you.

@Darksonn Darksonn merged commit 341b5da into tokio-rs:master Jun 9, 2024
83 checks passed
This was referenced Jul 22, 2024