Use a scalable RW lock in tokio-reactor #517

Merged: 1 commit merged into tokio-rs:master from sharded-rwlock on Aug 9, 2018

Conversation

@ghost commented Aug 3, 2018

Replaces std::sync::RwLock around Slab<ScheduledIo> in tokio-reactor with a custom implementation of a scalable reader-writer lock.

This lock reduces contention among readers considerably, at the expense of higher memory consumption and slower writers. However, write operations happen only when registering and unregistering IO handles, so this tradeoff is hopefully worth it.
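For readers who haven't seen the technique, here is a minimal sketch of the idea behind such a lock. The name `ShardedRwLock`, the closure-based API, and the hash-based shard selection are invented for the example; the lock in this PR differs in detail, but the tradeoff is the same: reads lock only one shard, writes must lock every shard.

```rust
use std::cell::UnsafeCell;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;
use std::thread;

const NUM_SHARDS: usize = 8;

/// Illustrative sharded reader-writer lock: the protected value lives in an
/// `UnsafeCell`, and a fixed set of `RwLock<()>` shards provides the locking.
pub struct ShardedRwLock<T> {
    shards: Vec<RwLock<()>>,
    data: UnsafeCell<T>,
}

// The same bounds `std::sync::RwLock<T>` imposes.
unsafe impl<T: Send> Send for ShardedRwLock<T> {}
unsafe impl<T: Send + Sync> Sync for ShardedRwLock<T> {}

impl<T> ShardedRwLock<T> {
    pub fn new(value: T) -> Self {
        ShardedRwLock {
            shards: (0..NUM_SHARDS).map(|_| RwLock::new(())).collect(),
            data: UnsafeCell::new(value),
        }
    }

    /// Pick a shard for the current thread (here simply by hashing its `ThreadId`).
    fn shard_index(&self) -> usize {
        let mut hasher = DefaultHasher::new();
        thread::current().id().hash(&mut hasher);
        (hasher.finish() as usize) % NUM_SHARDS
    }

    /// A reader locks only its own shard, so readers on different threads
    /// mostly touch different locks (and different cache lines).
    pub fn read<R>(&self, f: impl FnOnce(&T) -> R) -> R {
        let _guard = self.shards[self.shard_index()].read().unwrap();
        // At least one shard is read-locked, so no writer can hold all shards.
        f(unsafe { &*self.data.get() })
    }

    /// A writer locks every shard in order, excluding all readers and writers;
    /// this is the "slower writes" half of the tradeoff.
    pub fn write<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        let _guards: Vec<_> = self.shards.iter().map(|s| s.write().unwrap()).collect();
        // Every shard is write-locked: we have exclusive access to the data.
        f(unsafe { &mut *self.data.get() })
    }
}

fn main() {
    let lock = ShardedRwLock::new(0u64);
    lock.write(|n| *n += 1);
    assert_eq!(lock.read(|n| *n), 1);
}
```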

I added a simple benchmark to tokio-reactor/examples (I couldn't figure out a better place to put it; ideas?). The benchmark spawns a number of tasks. Each task registers a custom MIO handle and polls it a number of times. The handle alternates between the "not ready" and "ready" states on each successive call to poll.

With std::sync::RwLock, the benchmark runs in 0.590 sec.
With parking_lot::RwLock, the benchmark runs in 0.520 sec.
With sharded_lock::RwLock, the benchmark runs in 0.430 sec.

So this is an improvement of around 27%.

@seanmonstar I hope this has some effect on your benchmarks mentioned in #426.

@jonhoo This lock is similar to your drwmutex, except it manually assigns IDs to threads and stores them in thread-locals, while your lock executes CPUID to figure out the current core ID and select the appropriate shard.
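For illustration, the lazy thread-index assignment works roughly like this (a sketch with invented names, not the exact code in this PR): a global counter hands out an index the first time each thread touches the lock, and the index is cached in a thread-local.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Global counter that hands out one index per thread, on first use.
static NEXT_INDEX: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Lazily filled in the first time the current thread asks for its index.
    static THREAD_INDEX: Cell<Option<usize>> = Cell::new(None);
}

/// Returns a small, stable index for the calling thread. A sharded lock can
/// use `thread_index() % num_shards` to pick a shard without executing CPUID.
fn thread_index() -> usize {
    THREAD_INDEX.with(|cell| match cell.get() {
        Some(idx) => idx,
        None => {
            let idx = NEXT_INDEX.fetch_add(1, Ordering::Relaxed);
            cell.set(Some(idx));
            idx
        }
    })
}

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| println!("thread index: {}", thread_index())))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```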

@seanmonstar (Member)

I ran hyper's hello world example in release mode with 3 different configurations, benchmarked with wrk -t1 -d10 -c50, running each 3 times to get an average. These are my results:

  • Using current_thread:
    1. 33,865 requests/sec
    2. 33,189 requests/sec
    3. 33,406 requests/sec
  • Using tokio::run from 0.1.7:
    1. 28,108 requests/sec
    2. 27,703 requests/sec
    3. 27,699 requests/sec
  • Using tokio::run with this branch:
    1. 28,353 requests/sec
    2. 28,573 requests/sec
    3. 28,322 requests/sec

@carllerche (Member)

I expect the main difference between current_thread and tokio::run is the fact that the reactor must talk across threads to the task executor.

@carllerche (Member)

However, it does look like this results in a small improvement, which is good.

@tobz (Member) commented Aug 4, 2018

(deleted my first two comments because I need to redo all the tests, and I didn't want to cloud this PR with what might be an issue with tokio itself on macOS)

@carllerche (Member)

What is the conclusion to this? Should it be merged at this point?

@tobz (Member) commented Aug 8, 2018

My conclusion, after chatting with @stjepang, was that I likely wasn't seeing a performance improvement over master because of the performance regression on macOS.

It seems like for everyone else, this has a performance benefit. I expect it would have a benefit for me if the performance regression was taken care of.

Seems like it's worth merging.

@ghost (Author) commented Aug 8, 2018

@tobz Maybe we should also run benchmarks after rebasing on top of #534 :)

@tobz (Member) commented Aug 9, 2018

After working with @stjepang to figure out the macOS issue, I finally have some numbers to report:

  • low concurrency (single client): 13.6s vs 13s (0.1.7 is faster, 4.5%)
  • medium concurrency (8 clients): 11.8s vs 13.7s (sharded is faster, 14%)
  • high concurrency (32 clients): 9.25s vs 10.5s (sharded is faster, 11.5%)

This was running synchrotron against a single backend Redis node, and then running redis-benchmark with a set number of requests, and seeing how long the run took to process all the requests. I executed each benchmark run after a 15 second sleep (to try and give the CPU time to cool off and avoid throttling), ran the benchmark 10 times per case, and took an average of the best 5 runs.

The numbers could easily shift by a percentage point or two from OS jitter, CPU thermal throttling, and so on. It seems very likely that the PR adds minimal overhead in the low-concurrency case.

@ghost (Author) commented Aug 9, 2018

@tobz Those numbers are looking pretty good! Again, thank you for taking so much time to investigate the issue. :)

PR #534 fixed the performance regression, so (after rebasing on top of it) perhaps we'd see improvements even in @seanmonstar's benchmarks.

@carllerche I think we can merge this now. A possible further improvement might be to let the reactor's builder specify the number of threads (or the concurrency level) that will use the reactor, but it's probably not worth it.

@carllerche (Member)

I can merge this as is if you want. I have a couple of thoughts:

  • Should this sharded RW lock live in crossbeam? It looks generally useful.
  • I was actually thinking of sharding the slab as well. So, instead of a single slab, there would be one slab per shard. This would allow the data itself to be sharded and allow writes to be sharded, etc...

Either way, the numbers show an improvement already, so merging seems fine to me.

@ghost (Author) commented Aug 9, 2018

> Should this sharded RW lock live in crossbeam? It looks generally useful.

We are considering this, but feel somewhat hesitant because it might be a better fit for something like parking_lot or even a dedicated crate. cc @Amanieu - thoughts?

The part that assigns each thread an 'index' also seems generally useful.

> I was actually thinking of sharding the slab as well. So, instead of a single slab, there would be one slab per shard. This would allow the data itself to be sharded and allow writes to be sharded, etc...

This is also a good idea, but probably not worth the effort, IMO. The reason is that writes are comparatively rare (they only occur when registering or unregistering an Evented), while reads are much more common (they occur on every poll and notification). The benefits of sharding the slab will be apparent only if writes are frequent enough.

I believe the next most promising step in reducing contention is creating a reactor per thread (#424).

@carllerche merged commit fd36054 into tokio-rs:master on Aug 9, 2018
@ghost deleted the sharded-rwlock branch on August 9, 2018 at 18:25
ghost pushed a commit to crossbeam-rs/crossbeam-utils that referenced this pull request on Oct 1, 2018:
This PR introduces `ShardedLock`, which is a variant of `RwLock` where concurrent read operations don't contend as much.

`ShardedLock` consists of a bunch of smaller `RwLock`s called *shards*.

A read operation read-locks a single shard only. The shard is chosen based on the current thread so that contention is reduced as much as possible.

A write operation has to write-lock every shard in order. That means we trade faster reads for slower writes.

Another way of looking at it: sharded locking is a poor man's hardware lock elision. While `parking_lot`'s `RwLock` does use HLE on platforms that support it, we have to resort to `ShardedLock` on the others.

This PR is basically just a port of [another one](tokio-rs/tokio#517) that added a sharded RW lock to `tokio-reactor` in order to improve the performance of contended reads. I was told that `ShardedLock` would also be useful in [salsa](https://github.com/nikomatsakis/salsa). Let's add it to Crossbeam so that anyone can use it.
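For reference, the `ShardedLock` that landed in `crossbeam_utils::sync` exposes the same interface as `std::sync::RwLock` (`read`/`write` returning guards). A minimal usage sketch; the crossbeam-utils version requirement is an assumption:

```rust
// Usage sketch of crossbeam's `ShardedLock` (assumes crossbeam-utils 0.6+
// as a dependency); the interface mirrors `std::sync::RwLock`.
use crossbeam_utils::sync::ShardedLock;

fn main() {
    let lock = ShardedLock::new(5);

    // Multiple readers can hold the lock at once; each read-locks only
    // the shard associated with its thread.
    {
        let r1 = lock.read().unwrap();
        let r2 = lock.read().unwrap();
        assert_eq!(*r1 + *r2, 10);
    }

    // A writer write-locks every shard in order, gaining exclusive access.
    {
        let mut w = lock.write().unwrap();
        *w += 1;
    }

    assert_eq!(*lock.read().unwrap(), 6);
}
```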