Inefficient implementation of futex #853

Open
nyh opened this issue Feb 12, 2017 · 2 comments

nyh commented Feb 12, 2017

As the comments in linux.cc explain, our current futex implementation was not optimized for performance, because the only real use we saw for it was in gcc's __cxa_guard* functions, which are very rarely contended - and call the futex functions only when contended. So we cared only about the correctness of the futex implementation, not its performance.

However, it turns out that golang does use futexes heavily - @benoit-canet noted 10,000 per second (across a few cores) in one run. At that rate, having such a poor implementation is not a good idea. Some of the problems with the current implementation:

  1. All the futexes, even unrelated ones, contend on one lock for the list of futexes.
  2. If a futex is reused often, we constantly create wait queue objects and then delete them.
  3. And probably other things :-)

In #23 there was one idea for how we could have a lock-free futex implementation. But it is likely that, with some thought, even a locking implementation could be made better.

nyh commented Feb 12, 2017

Some untested quick hacks we can use to determine which inefficiencies might be causing performance problems in an actual use case:

  1. We can trivially replace the single queues_mutex and queues by an array of (say) 100 of those (mutexes and maps), and pick the uaddr%100'th mutex and map (see the sketch after this list). This should take contention on this one lock out of the equation.

  2. The constant allocation and deallocation of the wait queue objects may be slow. We can comment out the queues.erase(i) line in the code to take it out of the equation.
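
To make idea 1 concrete, here is a rough, untested sketch of what the sharding could look like. The names (futex_bucket, NBUCKETS, bucket_for) are illustrative only, not the actual code in linux.cc:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct waitqueue;                 // stand-in for the per-futex wait queue type

constexpr std::size_t NBUCKETS = 100;

// One bucket = one mutex protecting one map, so unrelated futexes
// (which usually hash to different buckets) no longer share a single lock.
struct futex_bucket {
    std::mutex mtx;
    std::unordered_map<std::uintptr_t, waitqueue*> queues;
};

static futex_bucket buckets[NBUCKETS];

// Pick the bucket for a given futex user address (the uaddr%100 idea).
static futex_bucket& bucket_for(const void* uaddr)
{
    return buckets[reinterpret_cast<std::uintptr_t>(uaddr) % NBUCKETS];
}
```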

nyh commented Feb 12, 2017

Another untested idea:
The 100 maps in the above idea could be protected by spinlocks (yikes!) instead of mutexes, because we know they will only be held for very short durations (and our spinlock does not yet have a short-wait spinning optimization).
Then, each "waitqueue" in the map could instead be a lock-free queue of waiters, <lockfree/queue-mpsc.hh>.
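
A rough, untested sketch of the shape of this idea, using a simplified Treiber-style lock-free list as a stand-in for the real <lockfree/queue-mpsc.hh> queue (whose exact API is not shown here):

```cpp
#include <atomic>

// One waiter per sleeping thread; in OSv this would carry whatever is
// needed to wake the thread.
struct waiter {
    waiter* next = nullptr;
};

// Per-address list of waiters. FUTEX_WAIT pushes lock-free; FUTEX_WAKE
// detaches the whole list with a single atomic exchange and wakes the waiters.
struct futex_waiters {
    std::atomic<waiter*> head{nullptr};

    void push(waiter* w) {
        waiter* old = head.load(std::memory_order_relaxed);
        do {
            w->next = old;
        } while (!head.compare_exchange_weak(old, w,
                     std::memory_order_release,
                     std::memory_order_relaxed));
    }

    waiter* pop_all() {
        return head.exchange(nullptr, std::memory_order_acquire);
    }
};
```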

wkozaczuk added a commit that referenced this issue Oct 16, 2022
This patch adds a new benchmark - misc-futex-perf.cc. The goal is to
indirectly measure the performance of the futex syscall implemented in OSv
and compare it to Linux, and thus guide us in implementing the improvements
described in issue #853.

The benchmark program does this by implementing a special mutex - fmutex -
based on the futex syscall, following the algorithm specified in
Ulrich Drepper's paper "Futexes Are Tricky". The test is similar to
misc-mutex2.cc written by Nadav Har'El. It takes three parameters: a
mandatory number of threads (nthreads), a computation length (worklen),
and an optional number of mutexes (nmutexes), which defaults to 1.
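
For reference, a minimal sketch of what a Drepper-style fmutex can look like, assuming the standard Linux SYS_futex interface; the actual implementation in misc-futex-perf.cc may differ in details:

```cpp
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

class fmutex {
    // 0 = unlocked, 1 = locked, 2 = locked with possible waiters
    std::atomic<int> state{0};

    static long futex(std::atomic<int>* uaddr, int op, int val) {
        // std::atomic<int> has the same representation as int on the
        // platforms this sketch targets.
        return syscall(SYS_futex, reinterpret_cast<int*>(uaddr), op, val,
                       nullptr, nullptr, 0);
    }
public:
    void lock() {
        int c = 0;
        if (state.compare_exchange_strong(c, 1)) {
            return;                      // fast path: the mutex was free
        }
        if (c != 2) {
            c = state.exchange(2);       // mark "locked, waiters possible"
        }
        while (c != 0) {
            futex(&state, FUTEX_WAIT_PRIVATE, 2);  // sleep while value is 2
            c = state.exchange(2);
        }
    }
    void unlock() {
        if (state.fetch_sub(1) != 1) {   // state was 2: someone may be waiting
            state.store(0);
            futex(&state, FUTEX_WAKE_PRIVATE, 1);  // wake one waiter
        }
    }
};
```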

The test groups all nthreads * nmutexes threads into nmutexes sets of
nthreads threads. Each thread in a set repeatedly takes the group's mutex
(one out of nmutexes) in a loop, increments the group counter, and then does
some short computation of the specified length outside the lock. The test
runs for 30 seconds, and shows the average total number of lock-protected
counter increments per second. The number of cpus is set by using the '-c'
option passed to run.py in the case of OSv, and using taskset -c 0..n when
running the same program on the host.
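
The per-thread loop then has roughly the following shape (illustrative only; names and details are not taken from misc-futex-perf.cc, and the fmutex type is the sketch above):

```cpp
#include <atomic>

// Hypothetical shape of one benchmark worker thread.
void worker(fmutex& group_mutex, long& group_counter,
            int worklen, const std::atomic<bool>& stop)
{
    while (!stop.load(std::memory_order_relaxed)) {
        group_mutex.lock();
        ++group_counter;                 // the lock-protected increment being measured
        group_mutex.unlock();
        for (volatile int i = 0; i < worklen; ++i) {
            // simulated computation of length `worklen`, outside the lock
        }
    }
}
```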

The results of the test, showing the number of total increments (across
the counters of all groups of threads) per second for both OSv and the
Linux host, are below. The table also shows the total number of futex
syscall (wake) calls, captured by adding an atomic counter in the futex
implementation for OSv.

     +------------------+----------------------------+----------------------+
     | Run parameters   |       On OSv guest         | On Linux host (op/s) |
     |                  |  (op/s)      (futex calls) |                      |
     +------------------+----------------------------+----------------------+
     | 1 0   1 (1 cpu)  |  5.1353e+07              0 | 5.21169e+07          |
     | 2 0   1 (2 cpus) |  2.26067e+07       345,745 | 1.78575e+07          |
     | 4 0   1 (4 cpus) |  4.93204e+07          2342 | 1.41494e+07          |
     | 1 500 1 (1 cpu)  |  5.67558e+06             0 | 5.75555e+06          |
     | 2 500 1 (2 cpus) |  9.19294e+06          3618 | 9.78263e+06          |
     | 4 500 1 (4 cpus) |  5.65933e+06        38,243 | 6.87465e+06          |
     | 4 500 2 (4 cpus) |  8.30834e+06           266 | 1.15537e+07          |
     | 4 500 4 (4 cpus) |  1.06216e+07           111 | 1.16908e+07          |
     | 4 500 8 (4 cpus) |  1.39291e+07           101 | 1.31845e+07          |
     +------------------+----------------------------+----------------------+

The results are surprising and somewhat confusing. For example, rows 2 and 3
show OSv outperforming Linux by a large margin. Also, row 7 (4 500 2) shows
OSv performance worse by ~30% even though the number of futex calls is
quite low. Possibly there is a flaw in this test, or there is some other
explanation.

Signed-off-by: Waldemar Kozaczuk <[email protected]>
Message-Id: <[email protected]>