Mutex is slower than standard library's on Intel i9 and ARM CPUs #338
Comments
Did you check against stable rustc or nightly rustc? Nightly rustc has switched from pthreads to using futexes directly on Linux. |
It's stable rustc, version 1.60.0. The futex support seems to have been merged in 2020 (rust-lang/rust#93740). I looked through the source code and confirmed this.
|
The futex-based mutex was merged in March of this year: rust-lang/rust#95035. It will land in rustc 1.62, which is currently nightly. Could you try the benchmarks on nightly too? |
Sure. I ran the benchmarks on:
Intel i7-10750H (6 cores, 12 hyperthreads)
Intel i9-10980XE (18 cores, 36 hyperthreads)
ARM Neoverse-N1 (80 cores)
I had run the test with the wrong number of threads on ARM for stable Rust; here are the re-evaluated numbers.
|
Nightly std locks have a nontrivial performance gain on the Intel chips. The improvement is subtle on the ARM server. They are generally faster than parking_lot, except on my laptop with small thread counts. |
Could you try testing with one small change to see if it makes a difference? Change the if condition in this code:

```rust
// If there is no queue, try spinning a few times
if state & PARKED_BIT == 0 && spinwait.spin() {
    state = self.state.load(Ordering::Relaxed);
    continue;
}
```

to this:

```rust
// Try spinning a few times
if spinwait.spin() {
    state = self.state.load(Ordering::Relaxed);
    continue;
}
```
|
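For readers following along, here is a minimal, self-contained sketch of the difference between the two spin conditions discussed above. The `SpinWait` type and bit constant are simplified stand-ins for parking_lot's internals, not the real implementation; only the shape of the condition is the point.

```rust
use std::hint;

// Simplified stand-in for parking_lot's PARKED_BIT: set when at least
// one thread is sleeping in the wait queue for this lock.
const PARKED_BIT: usize = 2;

// Simplified stand-in for parking_lot's SpinWait: allows a bounded
// number of spin attempts with exponential backoff, then gives up.
struct SpinWait {
    counter: u32,
}

impl SpinWait {
    fn new() -> Self {
        SpinWait { counter: 0 }
    }

    // Returns true while spinning is still worthwhile; false once the
    // caller should fall back to parking (sleeping).
    fn spin(&mut self) -> bool {
        if self.counter >= 10 {
            return false;
        }
        self.counter += 1;
        for _ in 0..(1u32 << self.counter) {
            hint::spin_loop();
        }
        true
    }
}

// Original condition: only spin if no other thread is already parked,
// so spinners do not starve threads that are sleeping in the queue.
fn should_spin_original(state: usize, spinwait: &mut SpinWait) -> bool {
    state & PARKED_BIT == 0 && spinwait.spin()
}

// Tweaked condition: always spin a few times before parking, even if
// other threads are already asleep.
fn should_spin_tweaked(_state: usize, spinwait: &mut SpinWait) -> bool {
    spinwait.spin()
}

fn main() {
    // With PARKED_BIT set, the original condition skips spinning...
    let mut sw = SpinWait::new();
    assert!(!should_spin_original(PARKED_BIT, &mut sw));
    // ...while the tweaked condition still spins.
    let mut sw2 = SpinWait::new();
    assert!(should_spin_tweaked(PARKED_BIT, &mut sw2));
    println!("ok");
}
```

The trade-off, as the benchmark numbers later in the thread suggest, is that unconditional spinning can help at low thread counts but burns CPU (and delays sleeping threads) under heavy contention.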
As you requested. Stable version: rustc 1.60.0 (7737e0b5c 2022-04-04)
Intel i7-10750H (6 cores, 12 hyperthreads)
Intel i9-10980XE (18 cores, 36 hyperthreads)
ARM Neoverse-N1 (80 cores)
Nightly version: rustc 1.62.0-nightly (055bf4ccd 2022-04-25)
Intel i7-10750H (6 cores, 12 hyperthreads)
Intel i9-10980XE (18 cores, 36 hyperthreads)
ARM Neoverse-N1 (80 cores)
|
Adding this tweak turns out to be a bit slower under contention in general, but it's faster for small numbers of threads on the Intel chips. |
I don't think much can be done about this; it's fundamentally part of how parking_lot works. You may want to explore other alternatives like https://github.com/kprotty/usync which is based on Windows's SRWLock. |
FWIW, 5 iterations is a very small amount of work. This makes threads hit the sleeping path almost immediately, and whichever sleeps faster under contention has higher throughput. parking_lot effectively emulates futex in userspace, which handles sleeping under contention worse than futex in the kernel (from what I've seen anecdotally). In practice, lock usage either has some delay between attempts from actual work being done, or the work in the critical section takes longer than 5 floating point additions and multiplications. Would recommend instead trying |
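To illustrate the point about workload size, here is a hedged sketch of a micro-benchmark that puts configurable work both inside and outside the critical section, rather than only a handful of floating point ops under the lock. The `bench` function and its parameters are hypothetical names for illustration, not the benchmark harness used in this thread; it uses `std::sync::Mutex`, but parking_lot's `Mutex` could be swapped in the same way.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

// Hypothetical micro-benchmark sketch: each thread alternates between
// work under the lock (`work_in`) and independent work outside it
// (`work_out`). Larger `work_out` reduces contention on the mutex.
fn bench(threads: usize, iters: usize, work_in: usize, work_out: usize) -> f64 {
    let lock = Arc::new(Mutex::new(0.0f64));
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let lock = Arc::clone(&lock);
            thread::spawn(move || {
                let mut local = 1.0f64;
                for _ in 0..iters {
                    {
                        // Work inside the critical section.
                        let mut shared = lock.lock().unwrap();
                        for _ in 0..work_in {
                            *shared += local * 1.000_1;
                        }
                    }
                    // Work between critical sections, done without
                    // holding the lock.
                    for _ in 0..work_out {
                        local = local * 1.000_1 + 0.1;
                    }
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed().as_secs_f64()
}

fn main() {
    // Example run: 4 threads, with more work outside the lock than
    // inside, so threads are not hitting the sleeping path immediately.
    let elapsed = bench(4, 1_000, 50, 200);
    assert!(elapsed >= 0.0);
    println!("elapsed: {elapsed:.4}s");
}
```

Comparing runs with small versus large `work_in`/`work_out` values shows how quickly the "whichever sleeps faster wins" regime kicks in when the per-iteration work is tiny.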
PR #419 seems to be able to fix this issue. Before:
After:
|
AMD Ryzen Threadripper PRO 3975WX 32-Cores
|
I ran the mutex benchmark using this command (for 36 system cores, for example):
parking_lot's mutex is faster only on the Intel CPUs and with smaller numbers of threads. It gets drastically slower when all cores/hyperthreads are utilized. Please tell me if anything was done wrong.
Intel i7-10750H (6 cores, 12 threads)
Intel i9-10980XE (18 cores, 36 threads)
ARM Neoverse-N1 (80 cores)