Use a scalable RW lock in tokio-reactor #517
Conversation
I ran hyper's hello world example in release mode with 3 different configurations, benchmarked against […].

I expect the main difference between […].

However, it does look like this results in a small improvement, which is good.
(deleted my first two comments because I need to redo all the tests, and I didn't want to cloud this PR with what might be an issue with tokio itself on macOS)
What is the conclusion to this? Should it be merged at this point?
My conclusion, after chatting with @stjepang, was that I likely wasn't seeing a performance improvement over master due to the performance regression on macOS. It seems like for everyone else, this has a performance benefit, and I expect it would have one for me as well once the regression is taken care of. Seems like it's worth merging.
After working with @stjepang to figure out the macOS issue, I finally have some numbers to report:
[…]

This was running […]. The numbers could easily shift a percentage point or two either way: OS jitter, CPU thermal throttling, etc. It seems very likely that the PR adds minimal overhead in the low-concurrency case.
@tobz Those numbers are looking pretty good! Again, thank you for taking so much time to investigate the issue. :) PR #534 fixed the performance regression, so (after rebasing on top of it) perhaps we'd get some improvements even in @seanmonstar's benchmarks. @carllerche I think we can merge this now. A possible further improvement might be to specify the number of threads (or concurrency level) using the reactor in its builder, but it's probably not worth it.
I can merge this as is if you want. I have a couple of thoughts:

[…]

Either way, the numbers already show an improvement, so merging seems fine to me.
We are considering this, but feel somewhat hesitant because it might be a better fit for something like `crossbeam`. The part that assigns each thread an 'index' also seems generally useful.
This is also a good idea, but probably not worth the effort, IMO. The reason is that writes are comparatively rare (they only occur when registering or unregistering an I/O handle). I believe the next most promising step in reducing contention is creating a reactor per thread (#424).
This PR introduces `ShardedLock`, which is a variant of `RwLock` where concurrent read operations don't contend as much. `ShardedLock` consists of a bunch of smaller `RwLock`s called *shards*. A read operation read-locks a single shard only. The shard is chosen based on the current thread so that contention is reduced as much as possible. A write operation has to write-lock every shard in order. That means we trade faster reads for slower writes.

Another way of looking at it is: sharded locking is just poor man's hardware lock elision. While `parking_lot`'s `RwLock` does use HLE on platforms that support it, we have to resort to `ShardedLock` on others.

This PR is basically just a port of [another one](tokio-rs/tokio#517) that added a sharded RW lock to `tokio-reactor` in order to improve the performance of contended reads. I was told that `ShardedLock` would also be useful in [salsa](https://github.com/nikomatsakis/salsa). Let's add it to Crossbeam so that anyone can use it.
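To make the trade-off concrete, here is a minimal sketch of the sharded-locking idea (my own illustration, not crossbeam's actual implementation; `NUM_SHARDS`, the guard types, and the hash-based shard choice are all assumptions):

```rust
use std::cell::UnsafeCell;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::ops::{Deref, DerefMut};
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};
use std::thread;

// Illustrative shard count; a real implementation might scale it
// with the number of CPUs.
const NUM_SHARDS: usize = 8;

// The value lives in a single `UnsafeCell`; the shards are plain
// `RwLock<()>`s used purely for synchronization.
pub struct ShardedLock<T> {
    shards: Vec<RwLock<()>>,
    value: UnsafeCell<T>,
}

unsafe impl<T: Send> Send for ShardedLock<T> {}
unsafe impl<T: Send + Sync> Sync for ShardedLock<T> {}

impl<T> ShardedLock<T> {
    pub fn new(value: T) -> Self {
        ShardedLock {
            shards: (0..NUM_SHARDS).map(|_| RwLock::new(())).collect(),
            value: UnsafeCell::new(value),
        }
    }

    // A read operation read-locks a single shard chosen from the
    // current thread, so readers on different threads mostly touch
    // different locks (and different cache lines).
    pub fn read(&self) -> ReadGuard<'_, T> {
        let mut hasher = DefaultHasher::new();
        thread::current().id().hash(&mut hasher);
        let shard = &self.shards[hasher.finish() as usize % NUM_SHARDS];
        ReadGuard {
            _guard: shard.read().unwrap(),
            value: unsafe { &*self.value.get() },
        }
    }

    // A write operation write-locks every shard, in order, which
    // excludes all readers as well as other writers.
    pub fn write(&self) -> WriteGuard<'_, T> {
        WriteGuard {
            _guards: self.shards.iter().map(|s| s.write().unwrap()).collect(),
            value: unsafe { &mut *self.value.get() },
        }
    }
}

pub struct ReadGuard<'a, T> {
    _guard: RwLockReadGuard<'a, ()>,
    value: &'a T,
}

impl<T> Deref for ReadGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.value
    }
}

pub struct WriteGuard<'a, T> {
    _guards: Vec<RwLockWriteGuard<'a, ()>>,
    value: &'a mut T,
}

impl<T> Deref for WriteGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        &*self.value
    }
}

impl<T> DerefMut for WriteGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut *self.value
    }
}
```

Because every writer acquires the shards in the same fixed order, two writers can never deadlock against each other.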
Replaces `std::sync::RwLock` around `Slab<ScheduledIo>` in `tokio-reactor` with a custom implementation of a scalable reader-writer lock. This lock reduces contention among readers considerably, at the expense of higher memory consumption and slower writers. However, write operations happen only when registering and unregistering I/O handles, so this tradeoff is hopefully worth it.

I added a simple benchmark to `tokio-reactor/examples` (couldn't figure out a better place to put it - ideas?). The benchmark spawns a number of tasks. Each task registers a custom MIO handle and polls it a number of times. The handle's state switches between "not ready" and "ready" on each successive call to `poll`.

- With `std::sync::RwLock`, the benchmark runs in 0.590 sec.
- With `parking_lot::RwLock`, the benchmark runs in 0.520 sec.
- With `sharded_lock::RwLock`, the benchmark runs in 0.430 sec.

So this is an improvement of around 27%.
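For a feel of the workload being measured, here is a simplified stand-in (an assumed shape, not the actual `tokio-reactor/examples` benchmark, which registers custom MIO handles): reads dominate and writes are rare, mirroring poll-heavy reactor traffic.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Instant;

fn main() {
    // Stand-in for `Slab<ScheduledIo>`: a shared structure that is
    // read on every "poll" and written only on (un)registration.
    let lock = Arc::new(RwLock::new(vec![0u64; 1024]));
    let start = Instant::now();

    let handles: Vec<_> = (0..8)
        .map(|_| {
            let lock = Arc::clone(&lock);
            thread::spawn(move || {
                for i in 0..1_000_000u32 {
                    if i % 10_000 == 0 {
                        // Rare write: stands in for registering or
                        // unregistering an I/O handle.
                        lock.write().unwrap()[0] += 1;
                    } else {
                        // Hot path: stands in for checking readiness.
                        let _ = lock.read().unwrap()[0];
                    }
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    println!("elapsed: {:?}", start.elapsed());
}
```

Swapping `std::sync::RwLock` for `parking_lot::RwLock` or a sharded lock in a harness like this is how one would produce the three timings above.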
@seanmonstar I hope this has some effect on your benchmarks mentioned in #426.
@jonhoo This lock is similar to your `drwmutex`, except it manually assigns IDs to threads and stores them in thread-locals, while your lock executes CPUID to figure out the current core ID and select the appropriate shard.
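A sketch of that thread-indexing scheme (names like `NEXT_INDEX` are hypothetical): a global counter assigns each thread a distinct index on first use and caches it in a thread-local, so no CPUID instruction is needed per lock acquisition.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// A global counter hands out the next free index.
static NEXT_INDEX: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Each thread claims an index the first time it asks; the
    // thread-local caches it for all later lookups.
    static THREAD_INDEX: usize = NEXT_INDEX.fetch_add(1, Ordering::Relaxed);
}

fn thread_index() -> usize {
    THREAD_INDEX.with(|i| *i)
}

fn main() {
    // A shard lookup would then be `thread_index() % num_shards`,
    // with no CPUID (and no syscall) on the hot path.
    println!("this thread's index: {}", thread_index());
}
```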