
threadpool: spawn new tasks onto a random worker #683

Merged: 2 commits merged into tokio-rs:master from submit-random on Oct 3, 2018
Conversation

@ghost commented Oct 3, 2018

Motivation

An IO handle is assigned to a reactor the first time it is polled, and once assigned, it stays with that reactor forever.

Now that every worker thread in tokio-threadpool drives its own reactor, we'd like to distribute the IO work among reactors as evenly as possible. That is only possible if each newly spawned task gets its first poll on a random worker.

Currently, a task spawned from inside the threadpool always goes into the local worker's queue. If a task spawns a bunch of child IO tasks, a disproportionate number of them will be polled for the first time by the current worker thread. Work stealing doesn't help much here: many tasks get polled before other worker threads have a chance to steal them, which skews the distribution of IO handles across reactors.
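
For illustration, here is a minimal sketch (tokio 0.1-style APIs; the address and the trivial write-back handler are placeholders, not code from this PR) of the kind of accept loop where this skew shows up:

```rust
extern crate tokio;

use tokio::net::TcpListener;
use tokio::prelude::*;

fn main() {
    let addr = "127.0.0.1:3000".parse().unwrap();
    let listener = TcpListener::bind(&addr).unwrap();

    // The accept loop itself is a single task running on one worker thread.
    let server = listener
        .incoming()
        .map_err(|e| eprintln!("accept error: {}", e))
        .for_each(|socket| {
            // With the current strategy, every task spawned here goes into the
            // local worker's queue, so most sockets get their first poll (and
            // therefore their reactor assignment) on that same worker.
            tokio::spawn(
                tokio::io::write_all(socket, &b"hello\n"[..])
                    .map(|_| ())
                    .map_err(|e| eprintln!("write error: {}", e)),
            );
            Ok(())
        });

    tokio::run(server);
}
```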

Solution

When spawning a new task, always pick a random worker and push the task into that worker's inbound queue.

Note: this heuristic is not perfect, but seems to be an improvement over the current task spawning strategy. We might tweak it further in the future.
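
Conceptually, the new spawn path looks something like this sketch (hypothetical Pool/Worker/Task types, with a Mutex'd VecDeque standing in for the real lock-free inbound queues and the rand crate standing in for the pool's RNG; not the actual tokio-threadpool code):

```rust
extern crate rand;

use std::collections::VecDeque;
use std::sync::Mutex;

struct Task;

struct Worker {
    // The real inbound queue is lock-free; a Mutex keeps the sketch short.
    inbound: Mutex<VecDeque<Task>>,
}

struct Pool {
    workers: Vec<Worker>,
}

impl Pool {
    fn spawn(&self, task: Task) {
        // Pick a random worker instead of the current one, so a task's first
        // poll (and therefore its reactor assignment) is spread evenly.
        let idx = rand::random::<usize>() % self.workers.len();
        self.workers[idx].inbound.lock().unwrap().push_back(task);
    }
}
```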

I've expanded the tokio-reactor benchmark. Here's our baseline using tokio-io-pool:

test io_pool::notify_many    ... bench:  34,479,546 ns/iter (+/- 10,232,063)

This is tokio-threadpool before PR #660 (at commit 886511c), with a single global reactor:

test threadpool::notify_many ... bench:  87,539,637 ns/iter (+/- 4,874,898)

After PR #660 (commit d35d051), with a reactor per worker thread:

test threadpool::notify_many ... bench:  57,666,138 ns/iter (+/- 6,762,130)

Finally, this PR, with randomized task submission:

test threadpool::notify_many ... bench:  46,792,382 ns/iter (+/- 5,140,460)

Still not as fast as tokio-io-pool, but we're getting there.

cc @jonhoo

@ghost requested a review from carllerche October 3, 2018 12:48
@jonhoo (Contributor) commented Oct 3, 2018

Interesting... tokio-io-pool uses a slightly different heuristic here: anything that gets spawned onto the pool (e.g. with Runtime::spawn, Runtime::spawn_all, or through a Handle) is sent to a random worker, whereas anything that is spawned on the pool (through tokio::spawn) stays on the local worker. The idea behind this was that when you add a new thing onto the pool, it's likely the result of something like an accept, and then it should definitely be distributed evenly (as you observe in this PR). Once something is on the pool, though, we want to keep spawn fast, so we may as well spawn it locally. Specifically, the observation is that I/O tasks don't generally spawn lots of other I/O tasks.

This might be trickier in tokio proper since the listener will also be running on the pool, so we don't have a good way of distinguishing the two...
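
Roughly, the distinction could be sketched with a thread-local worker index (hypothetical Pool/Task types, not the actual tokio-io-pool code; the rand crate stands in for the random pick):

```rust
extern crate rand;

use std::cell::Cell;
use std::collections::VecDeque;
use std::sync::Mutex;

struct Task;

struct Pool {
    queues: Vec<Mutex<VecDeque<Task>>>,
}

thread_local! {
    // Some(index) on the pool's worker threads, None on every other thread.
    static WORKER_INDEX: Cell<Option<usize>> = Cell::new(None);
}

impl Pool {
    fn spawn(&self, task: Task) {
        let target = match WORKER_INDEX.with(|w| w.get()) {
            // Spawned *on* the pool (tokio::spawn from a worker): stay local.
            Some(local) => local,
            // Spawned *onto* the pool (Runtime::spawn or via a Handle): random.
            None => rand::random::<usize>() % self.queues.len(),
        };
        self.queues[target].lock().unwrap().push_back(task);
    }
}
```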

@ghost (Author) commented Oct 3, 2018

This might be trickier in tokio proper since the listener will also be running on the pool, so we don't have a good way of distinguishing the two...

Yep, that's why I'm leaning towards always spawning a new task onto a random worker - just in case.

Once something is on the pool though, we want to keep spawn fast, so we may as well spawn it locally.

I'm not as experienced with the real-world use of Tokio, but I wonder: Is the cost of spawning something we should be particularly worried about?

My gut feeling is that tasks are spawned relatively rarely and that the cost of spawning tends to be negligible in the grand scheme of things. On the other hand, I'd wager tasks are yielded much more often than they're spawned, so we should strive to optimize task switching as much as possible.

@jonhoo (Contributor) commented Oct 3, 2018

Yeah, you could be right that spawning is relatively rare on the critical path, so this may just be a non-issue :)

@carllerche (Member) left a review comment:

Looks fine to me; I just had a question about the dependency version bump in tokio-reactor/Cargo.toml, but it's not a blocker.

@@ -27,9 +27,9 @@ mio = "0.6.14"
 num_cpus = "1.8.0"
 parking_lot = "0.6.3"
 slab = "0.4.0"
-tokio-executor = { version = "0.1.1", path = "../tokio-executor" }
+tokio-executor = { version = "0.1.5", path = "../tokio-executor" }
 tokio-io = { version = "0.1.6", path = "../tokio-io" }
carllerche (Member):

Why was this bumped? Were the minimal versions wrong?

ghost (Author):

That was just a desperate attempt while I was struggling to patch the dependencies. :)

In the end, moving [patch.crates-io] to the root Cargo.toml fixed the issue.

Thanks for reporting this to rust-lang/cargo!

@@ -89,3 +89,20 @@ serde = "1.0"
 serde_derive = "1.0"
 serde_json = "1.0"
 time = "0.1"
+
+[patch.crates-io]
carllerche (Member):

I filed rust-lang/cargo#6126 to address this.

@carllerche (Member) commented:

I'm using this patch to bench the hyper hello example:

With this patch:

$ wrk -t1 -d10 -c50 http://127.0.0.1:3000/
Running 10s test @ http://127.0.0.1:3000/
  1 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   368.86us   69.77us   4.96ms   90.77%
    Req/Sec    75.06k     6.64k   79.88k    88.00%
  746917 requests in 10.00s, 62.68MB read
Requests/sec:  74679.24
Transfer/sec:      6.27MB

Requests/sec: 74679.24

Against current_thread::Runtime:

$ wrk -t1 -d10 -c50 http://127.0.0.1:3000/
Running 10s test @ http://127.0.0.1:3000/
  1 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   603.60us   84.34us   5.56ms   91.32%
    Req/Sec    81.17k     2.25k   82.73k    98.00%
  807192 requests in 10.00s, 67.74MB read
Requests/sec:  80708.63
Transfer/sec:      6.77MB

Requests/sec: 80708.63

So, it seems like there is still some source of additional overhead there.

@carllerche (Member) commented:

Hyper is doing basic spawning here, so in theory it should be distributed across the runtime.

It could be that multithreading provides no added benefit in this benchmark, but I am not 100% sure what is going on.

@ghost (Author) commented Oct 3, 2018

Here's the same hyper hello benchmark on my machine.

Baseline tokio-io-pool:

$ wrk -t1 -d10 -c50 http://127.0.0.1:3000/
Running 10s test @ http://127.0.0.1:3000/
  1 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   362.54us  603.80us  14.48ms   97.47%
    Req/Sec    91.21k     3.24k   96.54k    72.00%
  907058 requests in 10.00s, 76.12MB read
Requests/sec:  90675.35
Transfer/sec:      7.61MB

tokio-threadpool before PR #660:

$ wrk -t1 -d10 -c50 http://127.0.0.1:3000/
Running 10s test @ http://127.0.0.1:3000/
  1 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   452.42us  743.86us  26.01ms   96.34%
    Req/Sec    77.83k     3.30k   85.16k    83.00%
  774089 requests in 10.00s, 64.96MB read
Requests/sec:  77374.46
Transfer/sec:      6.49MB

PR #660:

$ wrk -t1 -d10 -c50 http://127.0.0.1:3000/
Running 10s test @ http://127.0.0.1:3000/
  1 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   418.65us    0.90ms  18.85ms   97.56%
    Req/Sec    86.94k     2.85k   92.42k    78.00%
  865452 requests in 10.01s, 72.63MB read
Requests/sec:  86487.88
Transfer/sec:      7.26MB

This PR:

$ wrk -t1 -d10 -c50 http://127.0.0.1:3000/
Running 10s test @ http://127.0.0.1:3000/
  1 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   420.79us  812.86us  18.17ms   97.32%
    Req/Sec    87.12k     3.24k   95.14k    71.00%
  867375 requests in 10.01s, 72.79MB read
Requests/sec:  86687.78
Transfer/sec:      7.28MB

@carllerche (Member) commented:

@stjepang Interesting. What is the system that you are running the benchmarks on?

@carllerche (Member) commented:

@stjepang either way, I have approved the PR, so feel free to merge 👍

@ghost (Author) commented Oct 3, 2018

My machine:

  • OS: GNU/Linux, kernel v4.17.19
  • Arch: x86-64
  • CPU: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
  • Cores: 2 physical / 4 logical

@ghost merged commit e27b0a4 into tokio-rs:master Oct 3, 2018
@ghost deleted the submit-random branch October 3, 2018 21:09
@ghost mentioned this pull request Nov 13, 2018