Add "wait morphing" to improve condvar wake performance #24

Closed

nyh opened this issue Jul 18, 2013 · 2 comments

nyh commented Jul 18, 2013

The existing condvar_wake_one()/condvar_wake_all() code wakes up one or all of the threads sleeping on the condvar, and they all wake up and try to take the user mutex associated with the condvar.

This creates two performance problems:

  1. The "thundering herd" problem - When there are many threads waiting on the condvar and they are all woken with condvar_wake_all(), all these threads start to run (requiring a lot of work on scheduling, IPIs, etc.) but all of them except one will immediately go back to sleep waiting to take the user mutex. See a description of this problem and how it was solved in Linux in https://lwn.net/Articles/32746/.
  2. The "locked wakeup" problem - Often, condvar_wake is done with the user mutex locked (see a good discussion in http://www.domaigne.com/blog/computing/condvars-signal-with-mutex-locked-or-not/). Most of the BSD code we use does this, for example. When condvar_wake() wakes the thread, if it wakes up quickly enough, it can try to take the mutex before the waker released it, and need to go back to sleep. This problem is especially noticable when both threads run on a single CPU, and the scheduler decides to context-switch to the wakee when it is woken.

Both problems can be solved by a technique called wait morphing: when the condvar has an associated mutex, condvar_wake_*() does not wake the waiting threads, but rather moves them from waiting on the condition variable to waiting on the mutex. Each of these threads will later be woken (separately) when the mutex becomes available to it, and when it wakes up it does not need to take the mutex, because the mutex was already taken on its behalf.
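
As a minimal sketch of the core move - with illustrative stand-in types, not OSv's actual ones (a fuller sketch, including the lock hand-off, follows the implementation ideas in the next comment):

    #include <deque>

    // One wait record per sleeping thread; wake() is a stub standing in
    // for the real thread wakeup.
    struct wait_record { bool woken = false; void wake() { woken = true; } };

    struct toy_mutex {
        bool held = false;
        std::deque<wait_record*> waiters;   // threads sleeping on the mutex
    };

    struct toy_condvar {
        std::deque<wait_record*> waiters;   // threads sleeping on the condvar
        toy_mutex* user_mutex = nullptr;    // mutex associated with the condvar

        void wake_all() {
            while (!waiters.empty()) {
                wait_record* wr = waiters.front();
                waiters.pop_front();
                if (user_mutex && user_mutex->held) {
                    user_mutex->waiters.push_back(wr); // morph: no wakeup at all
                } else if (user_mutex) {
                    user_mutex->held = true;           // mutex free: take it and
                    wr->wake();                        // wake exactly this waiter
                } else {
                    wr->wake();                        // no mutex: old behavior
                }
            }
        }
    };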

It is not clear that this issue actually has a lot of performance impact on any of our use cases (once the scheduler bugs are eliminated), but it seems there's popular demand for solving it.

@ghost ghost assigned nyh Jul 18, 2013
nyh commented Jul 18, 2013

Here are some tentative implementation ideas for the condvar/mutex wait morphing:

  1. Probably we should no longer support the old spin-based mutex implementation. Making the following changes to both it and the newer lockfree::mutex would be a waste of time.
  2. Condvar and mutex each use a "wait record" structure that contains a link to the next record and a thread pointer. In mutex it is linked_item<sched::thread*>; in condvar it is struct ccondvar_waiter. Both already have the same content, but we now need them to be the same type, so convert both to use linked_item<sched::thread*> (and create an alias for it, "struct waitrecord", pointers to which we can use in the C-language condvar.h).
  3. Add to the waitrecord type a method wake(), which does wr->value->wake_with([&] { wr->value = nullptr; }). This is the idiom we always use to wake a waiter (in both mutex and condvar), and it will be nice to encapsulate it in a method.
  4. Add the assumption that when a condvar is waited on concurrently, all users must wait with the same mutex, or the results are undefined. pthread_cond_wait(3) makes the same assumption. This way it is enough to remember the last mutex associated with the condvar, and we won't need to remember the mutex associated with each waitrecord. This assumption could be lifted fairly easily, but I don't think it's worth the extra effort.
  5. Add a new mutex operation, mutex->lockfor(waitrecord *wr). It will be similar to lock(), but rather than taking the mutex for the current thread, it will take it for a different thread which is already waiting on the given waitrecord. lockfor(wr) will attempt to take the lock immediately, and if successful will wake the given waitrecord (wr->wake(), explained above); if it can't get the lock, it will add the given wr (not some local wr as in lock()) to the mutex's queue.
  6. Now, condvar->wake_one()/wake_all(), for each wr popped from the condvar's wait queue, will first check whether the condvar has an associated mutex instead of waking the waitrecord as we do now. If it does, it will call mutex->lockfor(wr) and not wake wr - it will be woken either now or later, when it gets the mutex. Note how in a condvar_wake_all() the first popped wr might have a chance to be woken immediately, but the rest will all just be queued on the mutex's wait queue and not cause any context switches.
  7. When condvar_wait() is woken from its wait-record (the wait_until() in condvar_wait), it now knows that it already has the user mutex! So it doesn't need to take it again.
  8. The case of timeout is somewhat special - a timeout that occurs while still waiting on the condition variable will work just as it does now, but the mutex-grabbing stage cannot time out: a condvar_timed_wait will wait forever for the mutex (just like any mutex->lock() does). I think this is fine, but we need to consider it. A sketch tying ideas 2-7 together follows this list.
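
Here is a hedged end-to-end sketch of ideas 2-7 above. The types are illustrative stand-ins (a std::condition_variable inside the wait record plays the role of sched::thread::wake_with(), and everything is named toy_*); the real implementation may differ:

    #include <condition_variable>
    #include <deque>
    #include <mutex>

    // Ideas 2+3: one wait record per sleeping thread, with wake()
    // encapsulating the wakeup idiom.
    struct wait_record {
        std::mutex m;
        std::condition_variable cv;
        bool woken = false;
        void wake() {
            std::lock_guard<std::mutex> g(m);
            woken = true;
            cv.notify_one();
        }
        void wait() {
            std::unique_lock<std::mutex> g(m);
            cv.wait(g, [this] { return woken; });
        }
    };

    struct toy_mutex {
        std::mutex internal;             // protects 'held' and 'waiters'
        bool held = false;
        std::deque<wait_record*> waiters;

        // Idea 5: take the mutex on behalf of the thread behind wr.
        void lock_for(wait_record* wr) {
            std::lock_guard<std::mutex> g(internal);
            if (!held) {
                held = true;
                wr->wake();              // got it immediately: wake the waiter
            } else {
                waiters.push_back(wr);   // otherwise queue wr; no wakeup
            }
        }
        void lock() {                    // ordinary lock(), via a local record
            wait_record wr;
            lock_for(&wr);
            wr.wait();
        }
        void unlock() {
            std::lock_guard<std::mutex> g(internal);
            if (waiters.empty()) { held = false; return; }
            wait_record* next = waiters.front();
            waiters.pop_front();
            next->wake();                // hand the lock straight to the next waiter
        }
    };

    struct toy_condvar {
        std::mutex internal;
        std::deque<wait_record*> waiters;
        toy_mutex* user_mutex = nullptr; // idea 4: one mutex per condvar

        void wait(toy_mutex& um) {
            wait_record wr;
            {
                std::lock_guard<std::mutex> g(internal);
                user_mutex = &um;
                waiters.push_back(&wr);
            }
            um.unlock();
            wr.wait();
            // Idea 7: no um.lock() here - when wake() fired, the mutex had
            // already been taken for us by lock_for() or the unlock() hand-off.
        }
        void wake_all() {
            std::lock_guard<std::mutex> g(internal);
            while (!waiters.empty()) {   // idea 6: morph instead of waking
                wait_record* wr = waiters.front();
                waiters.pop_front();
                user_mutex->lock_for(wr);
            }
        }
    };

The key property is visible in lock_for() and unlock(): a morphed waiter is woken exactly once, at the moment the mutex is handed to it, and wait() returns already owning the mutex.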

nyh commented Jul 30, 2013

Done, in commit aa3a624 and adjacent commits.

@nyh nyh closed this as completed Jul 30, 2013
penberg pushed a commit that referenced this issue Aug 26, 2013
Fix mincore() to deal with unmapped addresses like msync() does.

This fixes a SIGSEGV in libunwind's access_mem() when leak detector is
enabled:

   (gdb) bt
  #0  page_fault (ef=0xffffc0003ffe7008) at ../../core/mmu.cc:871
  #1  <signal handler called>
  #2  ContiguousSpace::block_start_const (this=<optimized out>, p=0x77d2f3968)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/oops/oop.inline.hpp:411
  #3  0x00001000008ae16c in GenerationBlockStartClosure::do_space (this=0x2000001f9100, s=<optimized out>)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/memory/generation.cpp:242
  #4  0x00001000007f097c in DefNewGeneration::space_iterate (this=0xffffc0003fb68c00, blk=0x2000001f9100, usedOnly=<optimized out>)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/memory/defNewGeneration.cpp:480
  #5  0x00001000008aca0e in Generation::block_start (this=<optimized out>, p=<optimized out>)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/memory/generation.cpp:251
  #6  0x0000100000b06d2f in os::print_location (st=st@entry=0x2000001f9560, x=32165017960, verbose=verbose@entry=false)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/runtime/os.cpp:868
  #7  0x0000100000b11b5b in os::print_register_info (st=0x2000001f9560, context=0x2000001f9740)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:839
  #8  0x0000100000c6cde8 in VMError::report (this=0x2000001f9610, st=st@entry=0x2000001f9560)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/utilities/vmError.cpp:551
  #9  0x0000100000c6da3b in VMError::report_and_die (this=this@entry=0x2000001f9610)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/utilities/vmError.cpp:984
  #10 0x0000100000b1109f in JVM_handle_linux_signal (sig=11, info=0x2000001f9bb8, ucVoid=0x2000001f9740,
      abort_if_unrecognized=<optimized out>)
      at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:528
  #11 0x000000000039f242 in call_signal_handler (frame=0x2000001f9b10) at ../../arch/x64/signal.cc:69
  #12 <signal handler called>
  #13 0x000000000057d721 in access_mem ()
  #14 0x000000000057cb1d in dwarf_get ()
  #15 0x000000000057ce51 in _ULx86_64_step ()
  #16 0x00000000004315fd in backtrace (buffer=0x1ff9d80 <memory::alloc_tracker::remember(void*, int)::bt>, size=20)
      at ../../libc/misc/backtrace.cc:16
  #17 0x00000000003b8d99 in memory::alloc_tracker::remember (this=0x1777ae0 <memory::tracker>, addr=0xffffc0004508de00, size=54)
      at ../../core/alloctracker.cc:59
  #18 0x00000000003b0504 in memory::tracker_remember (addr=0xffffc0004508de00, size=54) at ../../core/mempool.cc:43
  #19 0x00000000003b2152 in std_malloc (size=54) at ../../core/mempool.cc:723
  #20 0x00000000003b259c in malloc (size=54) at ../../core/mempool.cc:856
  #21 0x0000100001615e4c in JNU_GetStringPlatformChars (env=env@entry=0xffffc0003a4dc1d8, jstr=jstr@entry=0xffffc0004591b800,
      isCopy=isCopy@entry=0x0) at ../../../src/share/native/common/jni_util.c:801
  #22 0x000010000161ada6 in Java_java_io_UnixFileSystem_getBooleanAttributes0 (env=0xffffc0003a4dc1d8, this=<optimized out>,
      file=<optimized out>) at ../../../src/solaris/native/java/io/UnixFileSystem_md.c:111
  #23 0x000020000021ed8e in ?? ()
  #24 0x00002000001faa58 in ?? ()
  #25 0x00002000001faac0 in ?? ()
  #26 0x00002000001faa50 in ?? ()
  #27 0x0000000000000000 in ?? ()

Spotted by Avi Kivity.
penberg pushed a commit that referenced this issue Jan 13, 2014
See

  scripts/trace.py prof-wait -h

The command uses the sched_wait and sched_wait_ret tracepoints to
calculate the amount of time a thread was waiting. Samples are
collected and presented in the form of a call-graph tree.

By default, callees are closer to the root. To invert the order, pass
-r|--caller-oriented.

If there is too much output, it can be narrowed down using the
--max-levels and --min-duration options.

The time range presented can be narrowed using the --since and --until
options, which accept timestamps.

Example:

  scripts/trace.py prof-wait --max-levels 3 trace-file

=== Thread 0xffffc0003eaeb010 ===

12.43 s (100.00%, #7696) All
 |-- 12.43 s (99.99%, #7658) sched::thread::do_wait_until
 |    |-- 10.47 s (84.22%, #6417) condvar::wait(lockfree::mutex*, unsigned long)
 |    |    condvar_wait
 |    |    |-- 6.47 s (52.08%, #6250) cv_timedwait
 |    |    |    txg_delay
 |    |    |    dsl_pool_tempreserve_space
 |    |    |    dsl_dir_tempreserve_space
 |    |    |    dmu_tx_try_assign
 |    |    |    dmu_tx_assign
 |    |    |
 |    |    |-- 2.37 s (19.06%, #24) arc_read_nolock
 |    |    |    arc_read
 |    |    |    dsl_read
 |    |    |    traverse_visitbp
 |    |    |
 |    |    |-- 911.75 ms (7.33%, #3) txg_wait_open
 |    |    |    dmu_tx_wait
 |    |    |    zfs_write
 |    |    |    vfs_file::write(uio*, int)
 |    |    |    sys_write
 |    |    |    pwritev
 |    |    |    writev
 |    |    |    __stdio_write
 |    |    |    __fwritex
 |    |    |    fwrite
 |    |    |    0x100000005a5f
 |    |    |    osv::run(std::string, int, char**, int*)

By default every thread has a separate tree, because duration is best
interpreted in the context of a particular thread. There is, however,
an option to merge samples from all threads into one tree:
-m|--merge-threads. It may be useful if you want to inspect all paths
going into or out of a particular function. The direction can be
changed with the -r|--caller-oriented option. The function name is
passed via the --function parameter.

Example: check where zfs_write() blocks:

  scripts/trace.py prof-wait -rm --function=zfs_write trace-file

7.46 s (100.00%, #7314) All
 zfs_write
 |-- 6.48 s (86.85%, #6371) dmu_tx_assign
 |    |-- 6.47 s (86.75%, #6273) dmu_tx_try_assign
 |    |    dsl_dir_tempreserve_space
 |    |    |-- 6.47 s (86.75%, #6248) dsl_pool_tempreserve_space
 |    |    |    txg_delay
 |    |    |    cv_timedwait
 |    |    |    condvar_wait
 |    |    |    condvar::wait(lockfree::mutex*, unsigned long)
 |    |    |    sched::thread::do_wait_until
 |    |    |
 |    |    |-- 87.87 us (0.00%, #24) mutex_lock
 |    |    |    sched::thread::do_wait_until
 |    |    |
 |    |    \-- 6.40 us (0.00%, #1) dsl_dir_tempreserve_impl
 |    |         mutex_lock
 |    |         sched::thread::do_wait_until
 |    |
 |    \-- 7.32 ms (0.10%, #98) mutex_lock
 |         sched::thread::do_wait_until
 |
 |-- 911.75 ms (12.22%, #3) dmu_tx_wait
 |    txg_wait_open
 |    condvar_wait
 |    condvar::wait(lockfree::mutex*, unsigned long)
 |    sched::thread::do_wait_until

Signed-off-by: Tomasz Grabiec <[email protected]>
Signed-off-by: Pekka Enberg <[email protected]>