
Proposal: lock-free alternative to condvar #23

Open

nyh opened this issue Jul 17, 2013 · 7 comments

Comments

@nyh
Contributor

nyh commented Jul 17, 2013

Condvar uses locks (mutexes).

Our condvar implementation could be changed to not use an internal mutex (condvar->m) and instead use some lock-free queue data structure, but this would gain little, because we are still left with the user mutex. I.e., to properly use a condvar one should use the idiom:

waiter:
     mtx.lock();
     if(!condition) {    // or even "while" instead of "if"
        cv.wait(mtx);
     }
     mtx.unlock();

waker:
    mtx.lock();
    condition=true;
    mtx.unlock();
    cv.wake_all();

Note how the waker needs to use a mutex, even if the wake_all() implementation doesn't use one internally, and if we have concurrent wakers some of them may go to sleep because of this "contention" - even when nobody is actually waiting on the condition variable.

To design a truly lock-free alternative to condvar, we also need to avoid the user mutex, so we can't use the traditional condvar API and need to use a slightly different one.

The reason that the user mutex was needed was to make the test of the condition and the wait (in the waiter thread) one critical section - so that we don't lose a wakeup that happens right after we tested the condition.
Instead of using a mutex for this purpose, we can do something similar to our wait_until() implementation, in some way remembering that this thread is about to check the condition, so that a wakeup at this point will avoid a wait, even if the condition evaluated to false.

Basically, this will result in an API similar to our existing wait_until()/wake(), except that wait_until() will register with a given object (call this perhaps a waitqueue?) and the target of a "wake" will not be a thread, but rather a waitqueue. For example:

waiter:
         waitq.wait_until([&] { return condition; });

waker:
         condition = true;
         waitq.wake_all();

Note how no mutex is needed when setting condition = true (it should be an atomic variable, though, and can be set with memory_order_relaxed) - wait_until() will correctly handle the case where it tests the condition, it is still false, and both the condition = true and the wake_all() arrive before it actually starts waiting.

P.S. For simplicity, we can implement a first version of this new API using an internal mutex (just like condvar now has, and so do Linux's wait queues) and not a lock-free data structure. It won't be entirely lock-free, but at least the user will not have to use an additional mutex!
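
For illustration, here is a minimal sketch of what such a first, internal-mutex waitqueue could look like. Only the names wait_until()/wake_one()/wake_all() come from the proposal above; the rest (std::mutex, std::condition_variable, member names) are assumptions for the sketch, not OSv code:

// Minimal sketch of the internal-mutex version described above; the
// internals are illustrative assumptions, not an existing implementation.
#include <condition_variable>
#include <mutex>

class waitqueue {
public:
    // Returns only when pred() is true. A waker that sets the condition and
    // then calls wake_one()/wake_all() cannot be lost, because pred() is
    // re-checked under the internal mutex before every sleep.
    template <typename Pred>
    void wait_until(Pred pred) {
        std::unique_lock<std::mutex> lk(_mtx);
        _cv.wait(lk, pred);
    }
    void wake_one() {
        std::lock_guard<std::mutex> lk(_mtx);
        _cv.notify_one();
    }
    void wake_all() {
        std::lock_guard<std::mutex> lk(_mtx);
        _cv.notify_all();
    }
private:
    std::mutex _mtx;              // internal only - the user adds no mutex
    std::condition_variable _cv;
};

With this, the waiter/waker idiom above needs no user mutex at all; the only remaining lock is the waitqueue's internal one, which a later version could replace with a lock-free structure.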

@avikivity
Member

This is easy to misuse. Unless 'condition' is something that can be evaluated atomically, it can fail.

Perhaps

void waitqueue::wait_until(std::atomic<bool>& flag)

?
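
A hypothetical sketch of such a flag-only overload, reusing the internal mutex and condition variable from the waitqueue sketched in the previous comment (illustrative only, not an existing OSv API); restricting the API to a single std::atomic<bool> leaves nothing racy for the caller to evaluate:

// Hypothetical sketch only; assumes a matching declaration in the
// waitqueue class sketched above.
#include <atomic>

void waitqueue::wait_until(std::atomic<bool>& flag) {
    std::unique_lock<std::mutex> lk(_mtx);
    _cv.wait(lk, [&] { return flag.load(std::memory_order_relaxed); });
}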

@nyh
Contributor Author

nyh commented Jul 17, 2013

This is a very good point... I did mention that "condition" needs to be
atomic in my proposal, but I probably didn't fully appreciate the gravity
of this issue.
The existing sched::wait_until() also shares the same problem - you can't
check a condition which cannot be atomically tested.

Restricting the condition to a boolean flag, this becomes sort-of an event
semaphore, which is also useful, but a bit more limited. I'll try to think
if I can come up with something in the middle between a boolean flag and a
completely general (and easy to screw up) condition. We can go back to
semaphores (using atomic integer operations), but I was hoping for
something more general.
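
As a sketch of that "semaphore from atomic integer operations" direction, here is what a counting semaphore built on the proposed waitqueue might look like (purely illustrative; the class name and the CAS-inside-the-predicate trick are my assumptions, not existing code):

// Illustrative sketch only: a counting semaphore whose posters never take a
// mutex; relies on the proposed waitqueue re-checking the predicate safely.
#include <atomic>

class lockfree_semaphore {
public:
    void post() {
        _count.fetch_add(1, std::memory_order_release);
        _waitq.wake_one();
    }
    void wait() {
        _waitq.wait_until([&] {
            int c = _count.load(std::memory_order_relaxed);
            // try to take one unit; if we race with another waiter the
            // predicate simply evaluates to false and is re-checked
            return c > 0 && _count.compare_exchange_weak(c, c - 1);
        });
    }
private:
    std::atomic<int> _count{0};
    waitqueue _waitq;   // the waitqueue sketched in the opening comment
};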


@loudinthecloud
Contributor

On 07/17/2013 02:45 PM, nyh wrote:

Condvar uses locks (mutexes).

Our condvar implementation could be changed to not use an internal
mutex (condvar->m) and rather use some lock-free queue data structure,
but this will gain little, because we are still left with the /user/
mutex. I.e., to properly use a condvar one should use the idiom:

A condvar always expects to get a user mutex, so I guess it can be used
to guard its internal structure as well as to synchronize work amongst
threads, so we don't need the internal (non-user) mutex in any case. No?

waiter:
mtx.lock();
if(!condition) { // or even "while" instead of "if"
cv.wait(mtx);
}
mtx.unlock();

The use cases I'm seeing are more like this:

mtx.lock();
while (!exit) {
    cv.wait(mtx);
    // process work
}
mtx.unlock();

Check out zvol_geom_worker() and ramdisk_thread_fn() for example.

waker:
    mtx.lock();
    condition = true;
    mtx.unlock();
    cv.wake_all();

You called wake() outside of the mutex, which is not how it's done in the
networking stack and zfs code. I guess this is an established idiom - we
can use the user mutex to protect the wait_list of the condvar.

Basically, the waker can "queue work", like this:

mtx.lock();
condition = true;
cv.wake_one();
mtx.unlock();

NOTE: we already discussed that the scheduler needs to handle wake inside a
lock, so I'm disregarding any problems related to this. Also, I'm showing
how it is done in the networking stack and zfs code.

Note how the waker needs to use a mutex, even if the wake_all()
implementation doesn't internally, and if we have concurrent wakers
some of them may go to sleep because of this "contention" - even if
nobody is even waiting on the condition variable.

Then wake_all() should be used wisely by the user, in order to avoid
such contention.

I think we need to define all the condvar use cases first so we'll be
able to efficiently answer all of them.

To design a truly lock-free alternative to condvar, we also need to
avoid the user mutex, so we can't use the traditional condvar API and
need to use a slightly different one.

The reason that the user mutex was needed was to make the test of the
condition, and the wait (in the waiter thread) one critical section -
so that we don't lose a wakeup that happens after we tested the condition.

True, but this idea is mixed with our wait_until() implementation...

Instead of using a mutex for this purpose, we can do something similar
to our wait_until() implementation, in some way remembering that this
thread is about to check the condition, so that a wakeup in this point
will avoid a wait, even if the condition evaluated to false.

Basically, this will result in an API similar to our existing
wait_until/wake(), just that wait_until() will register in a given
object (call this perhaps a waitqueue?) and the target of a "wake"
will not be a thread, but rather a waitqueue. For example:

waiter:
waitq.wait_until([&] { return condition; });

waker:
condition = true;
waitq.wake_all();

Note how no mutex is needed when setting condition = true (it should
be an atomic variable, though, and set with memory_order_relaxed) -
wait_until will correctly handle the case where it tests condition, it
is still false, but before it starts waiting both the condition=true
and the wake_all() come.

P.S. For simplicity, we can implement a first version of this new API
using an internal mutex (just like condvar now has, and so do Linux's
wait queues) and not a lock-free data structure. It won't be entirely
lock-free, but at least the user will not have to use an additional mutex!

I agree that we need a lockfree alternative for a "waitqueue" when the
user is writing lockfree code. Actually I would use your mpsc lockfree
queue in the pcpu worker threads if Avi weren't insisting it is so slow.

@loudinthecloud
Contributor

Also, if we use the condvar as a waiting list, I don't see a contention problem, even if we call wake_all().

Waker:

mtx.lock();
condition = true;
cv.wake_all();
mtx.unlock();

Waiters:

mtx.lock();
cv.wait_until(mtx, condition);
mtx.unlock();

There are a few things to note here:

  1. After cv.wait() returns, the waiter acquires and holds the mutex only briefly, which statistically reduces contention.
  2. If the waiters are on the same CPU, there should be almost no contention at all.
  3. If the waiters are on different CPUs, there is a very low chance that these threads will all get scheduled at the same time (IPI timing + a scheduling decision which should give weight to how long the current thread has been running); combine this with point 1 above and you get a very low chance of contention.

So, to summarize, I think that most of the issues we're seeing are actually due to a bad scheduling policy.

@nyh
Contributor Author

nyh commented Jul 17, 2013

On Wed, Jul 17, 2013 at 3:58 PM, loudinthecloud [email protected]:

The use cases I'm seeing are more like this:

mtx.lock();
while (!exit) {
    cv.wait(mtx);
    // process work
}
mtx.unlock();

This is a very interesting point.

Code that is written this way - with one mutex used both for doing the
actual work and for the wakeup mechanism - will definitely not be suitable
for the new API I was trying to propose.

In particular, in code like you mention above, threads in the "process work"
stage are not waiting, and still, a wakeup attempt will hang (because the
waker needs to take the mutex) instead of doing nothing. In many cases,
this is a good-enough design, but I was thinking about cases that want to
avoid waits as much as possible.

waker:
mtx.lock();
condition=true;
mtx.unlock();
cv.wake_all();

You called wake() outside of the mutex, which is not how it's done in
networking stack and zfs code. I guess this is an established idiom - we
can use user mutex to protect the wait_list of the condvar.

I was always under the assumption that holding the lock while waking is bad
practice, but now I'm beginning to think it is done as a solution to the
thread-exits-after-wait problem. I also saw in the pthread_cond_signal(3)
manual page a curious sentence:

" The pthread_cond_broadcast() or pthread_cond_signal() functions may be
called by a thread whether or not it currently owns the mutex
that
threads calling pthread_cond_wait() or pthread_cond_timedwait()
have
associated with the condition variable during their waits; however,
if
predictable scheduling behavior is required, then that mutex shall
be
locked by the thread calling pthread_cond_broadcast()
or
pthread_cond_signal(). "

I don't know what this "predictable scheduling" issue refers to...

Note how the waker needs to use a mutex, even if the wake_all()
implementation doesn't internally, and if we have concurrent wakers
some of them may go to sleep because of this "contention" - even if
nobody is even waiting on the condition variable.

Then wake_all() should be used wisely by the user, in order to avoid
such contention.

I was thinking of the following scenario (I'm not sure if I have a better
example):

Imagine that we have a lockfree work queue - many threads put new work in, and
a single thread does this work.
If the work queue is empty, the worker thread wait()s on a condvar.
Whenever new work is pushed, the condvar is wake()ed.
Now, the issue is that in many cases the worker thread is busy and rarely
wait()s, and still, wake() is done for every item. But if wake() needs a
mutex (surrounding the code which adds an item to the work queue, and
possibly also the wake itself), it not only becomes slower - we also start
to have sleeps when several threads try to add items concurrently.
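
For concreteness, a rough sketch of that scenario with the proposed API, assuming a hypothetical lock-free MPSC queue whose single consumer may safely test emptiness while producers push (all names here are illustrative, not existing OSv code):

// Illustrative sketch only - mpsc_queue, work_item and process() are
// hypothetical; the point is that the producers' path takes no mutex.
mpsc_queue<work_item*> q;   // hypothetical lock-free multi-producer queue
waitqueue waitq;            // the waitqueue proposed above

// producers (many threads): push and wake, no mutex, no sleeping
void produce(work_item* item) {
    q.push(item);
    waitq.wake_one();
}

// the single worker thread
void worker() {
    for (;;) {
        // assumes the single consumer may test emptiness concurrently
        // with producers, as typical MPSC queues allow
        waitq.wait_until([&] { return !q.empty(); });
        work_item* item;
        while (q.pop(item)) {
            process(item);  // hypothetical per-item work
        }
    }
}

Every push still calls wake_one(), but when the worker is busy that call finds nobody asleep; with a truly lock-free waitqueue the producers would never block at all.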

I agree that we need a lockfree alternative for a "waitqueue" when the
user is writing lockfree code. Actually I would use your mpsc lockfree
queue in the pcpu worker threads if Avi weren't insisting it is so slow.

This is a good point. Reading your and Avi's comments, I think the
consensus is that my idea for a "lockfree alternative to condvar" will only
be useful to add waits to other lockfree algorithms; algorithms which
already use a mutex can continue to use the regular condvar and not suffer
additional overheads over what they are already suffering.


@loudinthecloud
Contributor

On 07/18/2013 12:13 AM, nyh wrote:

On Wed, Jul 17, 2013 at 3:58 PM, loudinthecloud
[email protected]:

The use cases I'm seeing are more like this:

mtx.lock();
while (!exit) {
    cv.wait(mtx);
    // process work
}
mtx.unlock();

This is a very interesting point.

Code that is written this way - with one mutex used for both doing the
actual work, and the wakeup mechanism - will definitely not be
suitable for
the new API I was trying to propose.

In particular, code like you mention above, while in the "process work"
stage are not waiting, and still, a wakeup attempt will hang (because the
waker needs to take the mutex) instead of doing nothing. In many cases,
this is a good-enough design, but I was thinking about cases that want to
avoid waits as much as possible.

The code above illustrates the use case of a single consumer;
a multi-consumer use case can probably look like this:

mtx.lock();
while (!exit) {
    cv.wait(mtx);
    item = q.pop();
    mtx.unlock();
    // do something with item
    mtx.lock();
}
mtx.unlock();

The mutex isn't held for a long time, probably not more than 50ns in
each loop (assuming we ditch the condvar's private mutex).

waker:
mtx.lock();
condition=true;
mtx.unlock();
cv.wake_all();

You called wake() outside of the mutex, which is not how it's done in
networking stack and zfs code. I guess this is an established idiom - we
can use user mutex to protect the wait_list of the condvar.

I was always under the assumption that holding the lock while waking
is bad
practice, but now I'm beginning to think it is done as a solution to the
thread-exits-after-wait problem. I also saw in the pthread_cond_signal(3)
manual page a curious sentence:

" The pthread_cond_broadcast() or pthread_cond_signal() functions may be
called by a thread whether or not it currently owns the mutex
that
threads calling pthread_cond_wait() or pthread_cond_timedwait()
have
associated with the condition variable during their waits; however,
if
predictable scheduling behavior is required, then that mutex shall
be
locked by the thread calling pthread_cond_broadcast()
or
pthread_cond_signal(). "

I don't know what this "predictable scheduling" issue refers to...

I agree about the thread-exit-after-wait bug.

I am guessing that predictable scheduling means that you can be sure
the waiters will not wake before you release the mutex, which is good for
avoiding spurious wakeups.

Also, it means that you don't need to do cv.wait_until(....); a simple
wait() (and wake() within a lock) is enough to guarantee the order of
execution.

Note how the waker needs to use a mutex, even if the wake_all()
implementation doesn't internally, and if we have concurrent wakers
some of them may go to sleep because of this "contention" - even if
nobody is even waiting on the condition variable.

Then wake_all() should be used wisely by the user, in order to avoid
such contention.

I was thinking of the following scenario (I'm not sure if I have a better
example)

Imagine that we have a lockfree work queue - many threads put new
work, and
a single thread does this work.
If the work queue is empty, the worker thread wait()s on a condvar.
Whenever new work is pushed, the condvar is wake()ed.
Now, the issue is that in many cases the worker thread is busy and rarely
wait()s, and still, wake() is done for every item. But if wake() needs a
mutex (surrounding the code which adds an item to the work queue, and
possibly also the wake itself), it not only becomes slower - we also start
to have sleeps when several threads try to go add items concurrently.

Ok, so in a single-consumer multi-producer scenario like this, using a
mutex shouldn't be very problematic unless you have hundreds of threads
that are concurrently producing work very often. With 5, 10, 15
threads... the contention on the mutex should be very low as it is held
very briefly, just for queuing work.

But nevertheless...

I agree that we need a lockfree alternative for a "waitqueue" when the
user is writing lockfree code. Actually I would use your mpsc lockfree
queue in the pcpu worker threads if Avi weren't insisting it is so slow.

This is a good point. Reading your and Avi's comments, I think the
consensus is that my idea for a "lockfree alternative to condvar" will
only
be useful to add waits to other lockfree algorithms; Algorithms which
already use a mutex can continue to use the regular condvar and not suffer
additional overheads over what they are already suffering.

We need a lockfree waitqueue/condvar to solve the interaction between
the lockless pcpu memory allocator and its worker threads.

penberg pushed a commit that referenced this issue Aug 26, 2013
Fix mincore() to deal with unmapped addresses like msync() does.

@bonzini

bonzini commented Sep 29, 2013

I think it doesn't make much sense to make condvar itself lock-free, since it's usually coupled with a mutex. However, there are other interesting sync primitives for which a lock-free version could be possible.

An interesting one has the following primitives:

waitset::prepare_wait();
waitset::wait(); // possibly add a timeout and call cancel_wait() when it happens
waitset::cancel_wait(); // not really needed for condvars, except for timeouts
waitset::wake_one();
waitset::wake_all();

Where a wake_one() or wake_all() can make a subsequent wait() exit immediately as long as a prepare_wait() came before it. In some implementations cancel_wait() detects this situation and wakes up another thread, which makes it easier to implement timeouts.

For example, a condvar wait is implemented like this (wakes are trivial):

ws.prepare_wait();
mutex.unlock();
ws.wait();
mutex.lock();

This works because if someone sneaks in between unlock and wait, and signals the condvar, you will not go into the wait.
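
To make the protocol concrete, here is a minimal, non-lock-free sketch of such a waitset, built on an internal mutex and a per-thread waiter record (everything in it is an illustrative assumption, not an existing implementation):

// Minimal non-lock-free sketch of the waitset primitives described above.
#include <condition_variable>
#include <list>
#include <mutex>

class waitset {
    struct waiter {
        std::condition_variable cv;
        bool notified = false;
    };
public:
    void prepare_wait() {
        std::lock_guard<std::mutex> lk(_mtx);
        _me.notified = false;
        _waiters.push_back(&_me);
    }
    // returns immediately if a wake arrived any time after prepare_wait()
    void wait() {
        std::unique_lock<std::mutex> lk(_mtx);
        _me.cv.wait(lk, [] { return _me.notified; });
        _waiters.remove(&_me);
    }
    void cancel_wait() {
        std::lock_guard<std::mutex> lk(_mtx);
        _waiters.remove(&_me);
        if (_me.notified) {       // a wake was consumed on our behalf:
            wake_one_locked();    // pass it on to another waiter
        }
    }
    void wake_one() {
        std::lock_guard<std::mutex> lk(_mtx);
        wake_one_locked();
    }
    void wake_all() {
        std::lock_guard<std::mutex> lk(_mtx);
        for (waiter* w : _waiters) {
            w->notified = true;
            w->cv.notify_one();
        }
    }
private:
    void wake_one_locked() {
        for (waiter* w : _waiters) {
            if (!w->notified) {
                w->notified = true;
                w->cv.notify_one();
                return;
            }
        }
    }
    std::mutex _mtx;
    std::list<waiter*> _waiters;
    static thread_local waiter _me;   // for simplicity: one waitset wait
                                      // in flight per thread
};
thread_local waitset::waiter waitset::_me;

A lock-free version would have to replace the internal mutex and the waiter list, which is where the discussion below comes in.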

Now how to make it lock-free? No idea :) but this paper could help; it has a list-based lock-free set data structure using hazard pointers and double-word CAS (cmpxchg16b), but hazard pointers are more or less trivially replaced by RCU. Each thread must use a separate key; keys are kept sorted so the key could be a combination of a CPU number and timestamp...

The same primitive can be used also for futex:

futex_wait(p_futex, value)
    ws = waitset corresponding to address p_futex
    if (*p_futex != value)
       return EWOULDBLOCK;
    ws.prepare_wait();
    if (*p_futex != value) {
       ws.cancel_wait();
       return EWOULDBLOCK;
    } else
       ws.wait();
