Inefficient implementation of futex #853

Open
nyh opened this issue Feb 12, 2017 · 2 comments

nyh commented Feb 12, 2017

As the comments in linux.cc explain, our current futex implementation was not optimized for performance, because the only real use we saw for it was in gcc's __cxa_guard* functions, which are very rarely contended - and call the futex functions only when contended. So we cared only about the correctness of the futex implementation, not its performance.

However, it turns out that golang does use futexes heavily - @benoit-canet noted 10,000 per second (across a few cores) in one run. At that rate, having such a poor implementation is not a good idea. Some of the problems with the current implementation:

  1. All the futexes, even unrelated ones, contend on one lock for the list of futexes.
  2. If a futex is reused often, we constantly create wait queue objects and then delete them.
  3. And probably other things :-)

In #23 there was one idea for how we could have a lock-free futex implementation. But it is likely that, with some thought, even a locking implementation could be made better.

nyh commented Feb 12, 2017

Some untested quick hacks we can use to determine which inefficiencies might be causing performance problems in an actual use case:

  1. We can trivially replace the single queues_mutex and queues by an array of (say) 100 of those (mutexes and maps), and pick the uaddr%100'th mutex and map (see the sketch after this list). This should take contention on this one lock out of the equation.

  2. The constant allocation and deallocation of the wait queue objects may be slow. We can comment out the queues.erase(i) line in the code to take it out of the equation.
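
To make idea 1 concrete, here is a rough, untested sketch of what the sharding could look like. The names (futex_bucket, NBUCKETS, bucket_for) are illustrative only, not the actual code in linux.cc:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct waitqueue;                 // stand-in for the per-futex wait queue type

constexpr std::size_t NBUCKETS = 100;

// One bucket = one mutex protecting one map, so unrelated futexes
// (which usually hash to different buckets) no longer share a single lock.
struct futex_bucket {
    std::mutex mtx;
    std::unordered_map<std::uintptr_t, waitqueue*> queues;
};

static futex_bucket buckets[NBUCKETS];

// Pick the bucket for a given futex user address (the uaddr%100 idea).
static futex_bucket& bucket_for(const void* uaddr)
{
    return buckets[reinterpret_cast<std::uintptr_t>(uaddr) % NBUCKETS];
}
```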

nyh commented Feb 12, 2017

Another untested idea:
The 100 maps in the above idea could be protected by spinlocks (yikes!) instead of mutexes, because we know they will only be held for very short durations (and our spinlock does not yet have a short-wait spinning optimization).
Then, each "waitqueue" in the map could instead be a lock-free queue of waiters, <lockfree/queue-mpsc.hh>.
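
A rough, untested sketch of the shape of this idea, using a simplified Treiber-style lock-free list as a stand-in for the real <lockfree/queue-mpsc.hh> queue (whose exact API is not shown here):

```cpp
#include <atomic>

// One waiter per sleeping thread; in OSv this would carry whatever is
// needed to wake the thread.
struct waiter {
    waiter* next = nullptr;
};

// Per-address list of waiters. FUTEX_WAIT pushes lock-free; FUTEX_WAKE
// detaches the whole list with a single atomic exchange and wakes the waiters.
struct futex_waiters {
    std::atomic<waiter*> head{nullptr};

    void push(waiter* w) {
        waiter* old = head.load(std::memory_order_relaxed);
        do {
            w->next = old;
        } while (!head.compare_exchange_weak(old, w,
                     std::memory_order_release,
                     std::memory_order_relaxed));
    }

    waiter* pop_all() {
        return head.exchange(nullptr, std::memory_order_acquire);
    }
};
```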

wkozaczuk added a commit that referenced this issue Oct 16, 2022
This patch adds a new benchmark - misc-futex-perf.cc. The goal is to
indirectly measure the performance of the futex syscall implemented in OSv
and compare it to Linux, and thus guide us in implementing the improvements
described in issue #853.

The benchmark program does this by implementing a special mutex - fmutex -
based on the futex syscall, following the algorithm specified in
Ulrich Drepper's paper "Futexes Are Tricky". The test is similar to
misc-mutex2.cc written by Nadav Har'El. It takes three parameters: a
mandatory number of threads (nthreads), a computation length (worklen),
and an optional number of mutexes (nmutexes), which defaults to 1.
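
For reference, a minimal sketch of what a Drepper-style fmutex can look like, assuming the standard Linux SYS_futex interface; the actual implementation in misc-futex-perf.cc may differ in details:

```cpp
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

class fmutex {
    // 0 = unlocked, 1 = locked, 2 = locked with possible waiters
    std::atomic<int> state{0};

    static long futex(std::atomic<int>* uaddr, int op, int val) {
        // std::atomic<int> has the same representation as int on the
        // platforms this sketch targets.
        return syscall(SYS_futex, reinterpret_cast<int*>(uaddr), op, val,
                       nullptr, nullptr, 0);
    }
public:
    void lock() {
        int c = 0;
        if (state.compare_exchange_strong(c, 1)) {
            return;                      // fast path: the mutex was free
        }
        if (c != 2) {
            c = state.exchange(2);       // mark "locked, waiters possible"
        }
        while (c != 0) {
            futex(&state, FUTEX_WAIT_PRIVATE, 2);  // sleep while value is 2
            c = state.exchange(2);
        }
    }
    void unlock() {
        if (state.fetch_sub(1) != 1) {   // state was 2: someone may be waiting
            state.store(0);
            futex(&state, FUTEX_WAKE_PRIVATE, 1);  // wake one waiter
        }
    }
};
```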

The test groups all nthreads * nmutexes threads into nmutexes sets of
nthreads threads. Each thread in a set repeatedly takes the group's mutex
(one out of nmutexes) in a loop, increments the group counter, and then does
some short computation of the specified length outside the lock. The test
runs for 30 seconds, and shows the average total number of lock-protected
counter increments per second. The number of cpus is set by using the '-c'
option passed to run.py in the case of OSv, and using taskset -c 0..n when
running the same program on the host.
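
The per-thread loop then has roughly the following shape (illustrative only; names and details are not taken from misc-futex-perf.cc, and the fmutex type is the sketch above):

```cpp
#include <atomic>

// Hypothetical shape of one benchmark worker thread.
void worker(fmutex& group_mutex, long& group_counter,
            int worklen, const std::atomic<bool>& stop)
{
    while (!stop.load(std::memory_order_relaxed)) {
        group_mutex.lock();
        ++group_counter;                 // the lock-protected increment being measured
        group_mutex.unlock();
        for (volatile int i = 0; i < worklen; ++i) {
            // simulated computation of length `worklen`, outside the lock
        }
    }
}
```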

The results of the test, showing the number of total increments (across
the counters of all groups of threads) per second for both OSv and the
Linux host, are below. The table also shows the total number of futex
syscall (wake) calls, captured by adding an atomic counter in the futex
implementation for OSv.

     +------------------+----------------------------+----------------------+
     | Run parameters   |       On OSv guest         | On Linux host (op/s) |
     |                  |  (op/s)      (futex calls) |                      |
     +------------------+----------------------------+----------------------+
     | 1 0   1 (1 cpu)  |  5.1353e+07              0 | 5.21169e+07          |
     | 2 0   1 (2 cpus) |  2.26067e+07       345,745 | 1.78575e+07          |
     | 4 0   1 (4 cpus) |  4.93204e+07          2342 | 1.41494e+07          |
     | 1 500 1 (1 cpu)  |  5.67558e+06             0 | 5.75555e+06          |
     | 2 500 1 (2 cpus) |  9.19294e+06          3618 | 9.78263e+06          |
     | 4 500 1 (4 cpus) |  5.65933e+06        38,243 | 6.87465e+06          |
     | 4 500 2 (4 cpus) |  8.30834e+06           266 | 1.15537e+07          |
     | 4 500 4 (4 cpus) |  1.06216e+07           111 | 1.16908e+07          |
     | 4 500 8 (4 cpus) |  1.39291e+07           101 | 1.31845e+07          |
     +------------------+----------------------------+----------------------+

The results are surprising and somewhat confusing. For example, rows 2 and 3
show OSv outperforming Linux by a large margin. Also, row 7 (4 500 2) shows
OSv performance worse by ~30% even though the number of futex calls is
quite low. Possibly there is a flaw in this test, or there is some other
explanation.

Signed-off-by: Waldemar Kozaczuk <[email protected]>
Message-Id: <[email protected]>