
Reduce contention in timer driver #6504

Open
3 tasks done
Darksonn opened this issue Apr 21, 2024 · 12 comments
Labels
  • A-tokio: Area: The main tokio crate
  • E-hard: Call for participation: Experience needed to fix: Hard / a lot
  • E-help-wanted: Call for participation: Help is requested to fix this issue
  • M-runtime: Module: tokio/runtime
  • M-time: Module: tokio/time
  • T-performance: Topic: performance and benchmarks

Comments

@Darksonn
Contributor

Darksonn commented Apr 21, 2024

I have often heard reports of performance issues caused by large numbers of timers. Specifically, this is due to contention on the mutex in the timer driver.

Some ideas:

  • If a Sleep instance is dropped before it is polled, its destructor still locks the mutex to check whether it needs to be removed from the driver. This seems unnecessary, and it will be a common case for timeouts (a rough sketch of a lock-free fast path follows this list).
  • We could shard the timer into multiple instances with one per runtime thread.
  • It's worth thinking about whether our timers are optimized for the case where most timers complete normally (often the case of tokio::time::sleep), or the case where most timers are cancelled before they complete (often the case for timeouts). Can we do anything to differentiate this in the driver?
  • Perhaps a completely different implementation is appropriate?
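For the first point, a minimal sketch of the idea, assuming a hypothetical TimerEntry with an atomic "registered" flag. The names and structure here are illustrative, not Tokio's actual internals:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

// Stand-in for the shared driver state guarded by the contended mutex.
struct TimerWheel { /* wheel levels, entry lists, ... */ }

impl TimerWheel {
    fn remove(&mut self, _entry: &TimerEntry) { /* unlink the entry from the wheel */ }
}

struct TimerEntry {
    // Set on first poll, when the entry is inserted into the wheel.
    registered: AtomicBool,
    driver: Arc<Mutex<TimerWheel>>,
}

impl Drop for TimerEntry {
    fn drop(&mut self) {
        // Fast path: the entry was never polled, so the driver never saw it
        // and there is nothing to unlink; skip the mutex entirely.
        if !self.registered.load(Ordering::Acquire) {
            return;
        }
        // Slow path: the entry may still be linked into the wheel.
        self.driver.lock().unwrap().remove(self);
    }
}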

Before spending a lot of time on a large change, please check with us first by posting here or on Discord so that we can discuss whether your implementation idea is what we want.

PRs related to this issue:

  • Lazily init timers on first poll (#6512)
  • Using sharded locks instead of a global lock for Timers (#6534)
  • Avoid traversing entries in the time wheel twice (#6584)
@Darksonn Darksonn added E-help-wanted Call for participation: Help is requested to fix this issue. A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime M-time Module: tokio/time E-hard Call for participation: Experience needed to fix: Hard / a lot T-performance Topic: performance and benchmarks labels Apr 21, 2024
@conradludgate
Contributor

conradludgate commented Apr 22, 2024

I think sharding could go a long way with relatively little code. On TimerEntry creation, the entry would acquire a random shard index, modulo the number of shards configured on the runtime. DashMap shards by default with 4 * num_cpus(). I think starting with the number of worker threads would go a long way on its own in reducing the lock contention. Making it configurable later will help with any such corner cases in the future.

On second thoughts, driving each shard adds more complexity...

@wathenjiang
Contributor

wathenjiang commented Apr 22, 2024

Hi. I would like to provide a possible solution here.

I think that for most web applications (and probably most applications), cancelling timeouts before they complete is the main source of performance issues.

(1) Background

In production environments, servers that handle network data requests often set timeouts for connections or requests.

See https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/

However, concurrent timeout operations across multiple threads can suffer from lock contention.

This is the reason some web servers use their own timeout crate instead of the timeout in tokio. See https://github.com/cloudflare/pingora/tree/main/pingora-timeout

(2) Solution

We can assign a TimeHandle (or Inner in time/mod.rs) to each worker thread. Then we move the multiple TimeHandles from driver::Handle into worker::Shared, and each worker thread is assigned a TimeHandle in the same manner as remote: Box<[Remote]>.

The details here can be further discussed and considered, and this is only a preliminary idea :)

Yes, we want to make the timer thread-local for every worker thread, to reduce the lock contention.

Since sleep/timeout must be executed in the worker thread, each sleep and timeout can find its thread-local timer.

(3) Performance improvement

  • There is no global lock for creating sleeps or timeouts. Each timer is thread-local.
  • Because of work-stealing, a sleep or timeout might be dropped on a worker thread other than the one that created it, in which case one worker thread needs to lock the thread-local timer of another worker thread. But we can assume that the task distribution is sufficiently uniform, so lock acquisition is also spread uniformly. For that reason, we may need an index on TimerShared to tell which timer it came from.
  • On the other hand, as the background says, under high concurrency most timers are cancelled before they complete (often the case for timeouts), as in the following code, which is a situation encountered by most web servers. Because creation and drop are both local, we get good performance.
  • In the Driver::park_timeout method, we take all the timer locks to find the smallest Instant, which is used as the timeout for polling the driver. Because we acquire the driver first, the scope of lock contention here is small (see the sketch after the example below).
use std::time::Duration;
use tokio::time::timeout;

#[tokio::main]
async fn main() {
    // a very quick async task wrapped in a timeout; the timer is
    // cancelled (dropped) as soon as the future completes
    let fut = async { 1 };
    let _res = timeout(Duration::from_secs(1), fut).await;
}
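A minimal sketch of that Driver::park_timeout step, using an illustrative TimerShard type rather than Tokio's real driver types: lock each shard briefly, read its earliest deadline, and park until the minimum.

use std::sync::Mutex;
use std::time::Instant;

// Illustrative shard type; the real driver state is different.
struct TimerShard { /* per-shard wheel and entries */ }

impl TimerShard {
    // Earliest deadline registered in this shard, if any.
    fn next_expiration(&self) -> Option<Instant> {
        None // placeholder
    }
}

// Lock each shard one after the other and take the minimum deadline.
// Each lock is held only long enough to read one value, so the window
// for contention with worker threads stays small.
fn next_wakeup(shards: &[Mutex<TimerShard>]) -> Option<Instant> {
    shards
        .iter()
        .filter_map(|shard| shard.lock().unwrap().next_expiration())
        .min()
}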

@conradludgate
Contributor

conradludgate commented Apr 22, 2024

Since sleep/timeout must be executed in the worker thread, each sleep and timeout can find its thread-local timer.

This is not true. You can spawn a separate thread for the tokio runtime and register/poll the timeout on a different thread by using Handle::enter() on that thread.

This is how https://github.com/smol-rs/async-compat works

@wathenjiang
Contributor

Yes, your view is correct. More precisely, sleep and timeout require the handle stored in TLS.

@wathenjiang
Contributor

I think we can achieve our goal this way:

  • Assigning a thread-local timer to each worker thread keeps the timer-polling logic straightforward.
  • For sleeps and timeouts created on other threads, we can determine which timer to place them in based on the calculated Instant (sharding).

The advantage of doing it this way is that we do not need to introduce the additional complexity of sharding.

Always sharding by Instant is another solution I came up with, but because it loses some thread locality, the cost of lock contention might be higher. I prefer the first option, but the second is also acceptable (a rough sketch of Instant-based sharding follows).
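A rough sketch of that second option, assuming a hypothetical shard_by_deadline helper; the epoch and millisecond bucketing are arbitrary illustrative choices:

use std::time::Instant;

// Pick a shard index from the timer's deadline rather than from the creating
// thread. This spreads timers across shards independently of which thread
// created them, but creation and cancellation lose thread locality.
fn shard_by_deadline(deadline: Instant, epoch: Instant, num_shards: usize) -> usize {
    let millis = deadline.saturating_duration_since(epoch).as_millis() as usize;
    millis % num_shards
}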

@Darksonn
Contributor Author

Here's what I'm thinking. We shard the timer into several shards. When timers are created from within the runtime, they register with whichever timer shard is associated with the current runtime thread. Timers from outside the runtime pick a timer shard randomly.

When a timer is cancelled before completion, it must lock the shard it was registered with. That may or may not be the one on the current thread, depending on whether it has been moved across threads. It could be worth asking someone with a running production system to instrument their timers to see how often that happens.

Anyway, when we poll the IO/timer driver, it's generally done on just one thread. To start with, I don't think we should try to change that. The driver can just lock all shards one after the other, get the minimum timer from each one, and use that for its sleep duration.

I'd like to avoid locking the timer driver mutex in drop of Sleep when not necessary. For example, if the Sleep has not yet been polled, or if it has completed, then I think we can probably avoid taking the mutex.

None of this should happen in one PR. I think these changes can be done incrementally.
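A minimal sketch of that shard-selection rule, with illustrative names rather than a proposed API; a round-robin counter stands in for the random choice for timers created outside the runtime:

use std::sync::atomic::{AtomicUsize, Ordering};

// Fallback counter for timers created outside the runtime; a random source
// would work just as well here.
static NEXT_SHARD: AtomicUsize = AtomicUsize::new(0);

// Pick the shard a new timer registers with. `current_worker` would come
// from the runtime's thread-local context when available.
fn pick_shard(num_shards: usize, current_worker: Option<usize>) -> usize {
    match current_worker {
        Some(worker_index) => worker_index % num_shards,
        None => NEXT_SHARD.fetch_add(1, Ordering::Relaxed) % num_shards,
    }
}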

@wathenjiang
Contributor

I would like to solve this issue with several incremental PRs. It might start with reducing the unnecessary lock contention when most timeouts are cancelled before they complete (often the case for timeouts).

@Darksonn
Contributor Author

That's great! Specifically, what change do you intend to start with in the first PR?

@wathenjiang
Contributor

Hi. I think the most pressing issue currently is that even if a timeout has never been registered, cancelling it still takes the lock. This will be my first change.

@Darksonn
Contributor Author

I agree. That would also be my first change. Go for it. :)

wathenjiang added a commit to wathenjiang/tokio that referenced this issue Apr 23, 2024
See tokio-rs#6504

Below are relevant benchmark results of the current version on m1 mac:
single_thread_timeout   time:   [21.869 ns 21.987 ns 22.135 ns]
                        change: [-3.4429% -2.0709% -0.8759%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

multi_thread_timeout-8  time:   [4.4835 ns 4.6138 ns 4.7614 ns]
                        change: [-4.3554% +0.1643% +4.5114%] (p = 0.95 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) high mild
  1 (1.00%) high severe

Below are relevant benchmark results of this version:

single_thread_timeout   time:   [40.227 ns 40.416 ns 40.691 ns]
                        change: [+81.321% +82.817% +84.121%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

multi_thread_timeout-8  time:   [183.16 ns 186.02 ns 188.21 ns]
                        change: [+3765.0% +3880.4% +3987.4%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low severe
  6 (6.00%) low mild
@conradludgate
Contributor

I tried removing the lock call on drop, but it wasn't super trivial due to the happens-before requirements; Loom was not happy. The not-yet-polled state can be hardcoded. The deregistered state could perhaps be set when the sleep returns Ready, but I'm not sure that will solve the Loom problems, since that path also only reads the deregistered state atomically.

@carllerche
Member

Here's what I'm thinking. We shard the timer into several shards. When timers are created from within the runtime, they register with whichever timer shard is associated with the current runtime thread. Timers from outside the runtime pick a timer shard randomly.

When a timer is cancelled before completion, it must lock the shard it was registered with. That may or may not be the one on the current thread, depending on whether it has been moved across threads. It could be worth asking someone with a running production system to instrument their timers to see how often that happens.

Anyway, when we poll the IO/timer driver, it's generally done on just one thread. To start with, I don't think we should try to change that. The driver can just lock all shards one after the other, get the minimum timer from each one, and use that for its sleep duration.

I'd like to avoid locking the timer driver mutex in drop of Sleep when not necessary. For example, if the Sleep has not yet been polled, or if it has completed, then I think we can probably avoid taking the mutex.

None of this should happen in one PR. I think these changes can be done incrementally.

This strategy sounds good to me.
