Use a scalable RW lock in tokio-reactor #517

Merged: 1 commit merged into tokio-rs:master from sharded-rwlock on Aug 9, 2018

Conversation

@ghost commented Aug 3, 2018

Replaces std::sync::RwLock around Slab<ScheduledIo> in tokio-reactor with a custom implementation of a scalable reader-writer lock.

This lock reduces contention among readers considerably, at the expense of higher memory consumption and slower writers. However, write operations happen only when registering and unregistering IO handles, so this tradeoff is hopefully worth it.
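For readers who haven't seen the technique, here is a minimal sketch of the idea behind such a lock. The name `ShardedRwLock`, the closure-based API, and the hash-based shard selection are invented for the example; the lock in this PR differs in detail, but the tradeoff is the same: reads lock only one shard, writes must lock every shard.

```rust
use std::cell::UnsafeCell;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;
use std::thread;

const NUM_SHARDS: usize = 8;

/// Illustrative sharded reader-writer lock: the protected value lives in an
/// `UnsafeCell`, and a fixed set of `RwLock<()>` shards provides the locking.
pub struct ShardedRwLock<T> {
    shards: Vec<RwLock<()>>,
    data: UnsafeCell<T>,
}

// The same bounds `std::sync::RwLock<T>` imposes.
unsafe impl<T: Send> Send for ShardedRwLock<T> {}
unsafe impl<T: Send + Sync> Sync for ShardedRwLock<T> {}

impl<T> ShardedRwLock<T> {
    pub fn new(value: T) -> Self {
        ShardedRwLock {
            shards: (0..NUM_SHARDS).map(|_| RwLock::new(())).collect(),
            data: UnsafeCell::new(value),
        }
    }

    /// Pick a shard for the current thread (here simply by hashing its `ThreadId`).
    fn shard_index(&self) -> usize {
        let mut hasher = DefaultHasher::new();
        thread::current().id().hash(&mut hasher);
        (hasher.finish() as usize) % NUM_SHARDS
    }

    /// A reader locks only its own shard, so readers on different threads
    /// mostly touch different locks (and different cache lines).
    pub fn read<R>(&self, f: impl FnOnce(&T) -> R) -> R {
        let _guard = self.shards[self.shard_index()].read().unwrap();
        // At least one shard is read-locked, so no writer can hold all shards.
        f(unsafe { &*self.data.get() })
    }

    /// A writer locks every shard in order, excluding all readers and writers;
    /// this is the "slower writes" half of the tradeoff.
    pub fn write<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        let _guards: Vec<_> = self.shards.iter().map(|s| s.write().unwrap()).collect();
        // Every shard is write-locked: we have exclusive access to the data.
        f(unsafe { &mut *self.data.get() })
    }
}

fn main() {
    let lock = ShardedRwLock::new(0u64);
    lock.write(|n| *n += 1);
    assert_eq!(lock.read(|n| *n), 1);
}
```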

I added a simple benchmark to tokio-reactor/examples (I couldn't figure out a better place to put it; ideas?). The benchmark spawns a number of tasks. Each task registers a custom MIO handle and polls it a number of times. The handle alternates between the "not ready" and "ready" states on each successive call to poll.

With std::sync::RwLock, the benchmark runs in 0.590 sec.
With parking_lot::RwLock, the benchmark runs in 0.520 sec.
With sharded_lock::RwLock, the benchmark runs in 0.430 sec.

So this is an improvement of around 27%.

@seanmonstar I hope this has some effect on your benchmarks mentioned in #426.

@jonhoo This lock is similar to your drwmutex, except it manually assigns IDs to threads and stores them in thread-locals, while your lock executes CPUID to figure out the current core ID and select the appropriate shard.
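For illustration, the lazy thread-index assignment works roughly like this (a sketch with invented names, not the exact code in this PR): a global counter hands out an index the first time each thread touches the lock, and the index is cached in a thread-local.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Global counter that hands out one index per thread, on first use.
static NEXT_INDEX: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Lazily filled in the first time the current thread asks for its index.
    static THREAD_INDEX: Cell<Option<usize>> = Cell::new(None);
}

/// Returns a small, stable index for the calling thread. A sharded lock can
/// use `thread_index() % num_shards` to pick a shard without executing CPUID.
fn thread_index() -> usize {
    THREAD_INDEX.with(|cell| match cell.get() {
        Some(idx) => idx,
        None => {
            let idx = NEXT_INDEX.fetch_add(1, Ordering::Relaxed);
            cell.set(Some(idx));
            idx
        }
    })
}

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| println!("thread index: {}", thread_index())))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```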

@seanmonstar (Member)

I ran hyper's hello world example in release mode with 3 different configurations, benchmarked with wrk -t1 -d10 -c50, running each 3 times to get an average. These are my results:

  • Using current_thread:
    1. 33,865 requests/sec
    2. 33,189 requests/sec
    3. 33,406 requests/sec
  • Using tokio::run from 0.1.7:
    1. 28,108 requests/sec
    2. 27,703 requests/sec
    3. 27,699 requests/sec
  • Using tokio::run with this branch:
    1. 28,353 requests/sec
    2. 28,573 requests/sec
    3. 28,322 requests/sec

@carllerche (Member)

I expect the main difference between current_thread and tokio::run is the fact that the reactor must talk across threads to the task executor.

@carllerche (Member)

However, it does look like this results in a small improvement, which is good.

@tobz (Member) commented Aug 4, 2018

(deleted my first two comments because I need to redo all the tests, and I didn't want to cloud this PR with what might be an issue with tokio itself on macOS)

@carllerche (Member)

What is the conclusion to this? Should it be merged at this point?

@tobz (Member) commented Aug 8, 2018

My conclusion, after chatting with @stjepang, was that I likely wasn't seeing a performance improvement over master because of the performance regression on macOS.

It seems like for everyone else, this has a performance benefit. I expect it would have a benefit for me if the performance regression was taken care of.

Seems like it's worth merging.

@ghost (Author) commented Aug 8, 2018

@tobz Maybe we should also run benchmarks after rebasing on top of #534 :)

@tobz (Member) commented Aug 9, 2018

After working with @stjepang to figure out the macOS issue, I finally have some numbers to report:

  • low concurrency (single client): 13.6s vs 13s (0.1.7 is faster, 4.5%)
  • medium concurrency (8 clients): 11.8s vs 13.7s (sharded is faster, 14%)
  • high concurrency (32 clients): 9.25s vs 10.5s (sharded is faster, 11.5%)

This was running synchrotron against a single backend Redis node, and then running redis-benchmark with a set number of requests, and seeing how long the run took to process all the requests. I executed each benchmark run after a 15 second sleep (to try and give the CPU time to cool off and avoid throttling), ran the benchmark 10 times per case, and took an average of the best 5 runs.

The numbers could easily shift by a percentage point or two from OS jitter, CPU thermal throttling, and so on. It seems very likely that the PR adds minimal overhead in the low-concurrency case.

@ghost (Author) commented Aug 9, 2018

@tobz Those numbers are looking pretty good! Again, thank you for taking so much time to investigate the issue. :)

PR #534 fixed the performance regression, so (after rebasing on top of it) perhaps we'd see improvements even in @seanmonstar's benchmarks.

@carllerche I think we can merge this now. A possible further improvement might be to let the reactor's builder specify the number of threads (or the concurrency level) that will use the reactor, but it's probably not worth it.

@carllerche (Member)

I can merge this as is if you want. I have a couple of thoughts:

  • Should this sharded RW lock live in crossbeam? It looks generally useful.
  • I was actually thinking of sharding the slab as well. So, instead of a single slab, there would be one slab per shard. This would allow the data itself to be sharded and allow writes to be sharded, etc...

Either way, the numbers show an improvement already, so merging seems fine to me.

@ghost (Author) commented Aug 9, 2018

> Should this sharded RW lock live in crossbeam? It looks generally useful.

We are considering this, but feel somewhat hesitant because it might be a better fit for something like parking_lot or even a dedicated crate. cc @Amanieu - thoughts?

The part that assigns each thread an 'index' also seems generally useful.

> I was actually thinking of sharding the slab as well. So, instead of a single slab, there would be one slab per shard. This would allow the data itself to be sharded and allow writes to be sharded, etc...

This is also a good idea, but probably not worth the effort, IMO. The reason is that writes are comparatively rare (they only occur when registering or unregistering an Evented), while reads are much more common (they occur on every poll and notification). The benefits of sharding the slab will be apparent only if writes are frequent enough.

I believe the next most promising step in reducing contention is creating a reactor per thread (#424).

@carllerche merged commit fd36054 into tokio-rs:master on Aug 9, 2018
@ghost deleted the sharded-rwlock branch on August 9, 2018 at 18:25
ghost pushed a commit to crossbeam-rs/crossbeam-utils that referenced this pull request on Oct 1, 2018:
This PR introduces `ShardedLock`, which is a variant of `RwLock` where concurrent read operations don't contend as much.

`ShardedLock` consists of a bunch of smaller `RwLock`s called *shards*.

A read operation read-locks a single shard only. The shard is chosen based on the current thread so that contention is reduced as much as possible.

A write operation has to write-lock every shard in order. That means we trade faster reads for slower writes.

Another way of looking at it: sharded locking is a poor man's hardware lock elision. While `parking_lot`'s `RwLock` does use HLE on platforms that support it, we have to resort to `ShardedLock` on the others.

This PR is basically just a port of [another one](tokio-rs/tokio#517) that added a sharded RW lock to `tokio-reactor` in order to improve the performance of contended reads. I was told that `ShardedLock` would also be useful in [salsa](https://github.com/nikomatsakis/salsa). Let's add it to Crossbeam so that anyone can use it.
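For reference, the `ShardedLock` that landed in `crossbeam_utils::sync` exposes the same interface as `std::sync::RwLock` (`read`/`write` returning guards). A minimal usage sketch; the crossbeam-utils version requirement is an assumption:

```rust
// Usage sketch of crossbeam's `ShardedLock` (assumes crossbeam-utils 0.6+
// as a dependency); the interface mirrors `std::sync::RwLock`.
use crossbeam_utils::sync::ShardedLock;

fn main() {
    let lock = ShardedLock::new(5);

    // Multiple readers can hold the lock at once; each read-locks only
    // the shard associated with its thread.
    {
        let r1 = lock.read().unwrap();
        let r2 = lock.read().unwrap();
        assert_eq!(*r1 + *r2, 10);
    }

    // A writer write-locks every shard in order, gaining exclusive access.
    {
        let mut w = lock.write().unwrap();
        *w += 1;
    }

    assert_eq!(*lock.read().unwrap(), 6);
}
```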