runtime: big performance penalty with runtime.LockOSThread #21827

navytux · 2017-09-10T12:58:16Z

This issue reopens #18023.

There it was observed that if a server goroutine is locked to an OS thread, the locking imposes a big performance penalty compared to the same server code without the handler being locked to an OS thread. The relevant golang-nuts thread discusses this and notes that when runtime.LockOSThread is used, the number of context switches is 10x (ten times, not 1000x) higher than without OS thread locking. #18023 (comment) notes that the context switch can happen because e.g. futex_wake() in the kernel can move the woken process to a different CPU.

Moreover, it was found that lockOSThread is used internally by the Go runtime at essentially every CGo call:

https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107

so even if user code does not use LockOSThread but makes CGo calls on the server side, there is reason to expect a similar kind of slowdown.

With above in mind #18023 (comment) shows a dirty patch that spins a bit in notesleep() before going to kernel to futex_wait(). This way it is shown that 1) large fraction of performance penalty related to LockOSThread can go away, and 2) the case of CGo calls on server can also receive visible speedup:

name        old time/op  new time/op  delta
Unlocked-4   485ns ± 0%   483ns ± 1%     ~     (p=0.188 n=9+10)
Locked-4    5.22µs ± 1%  1.32µs ± 5%  -74.64%  (p=0.000 n=9+10)
CGo-4        581ns ± 1%   556ns ± 0%   -4.27%  (p=0.000 n=10+10)
CGo10-4     2.20µs ± 6%  1.23µs ± 0%  -44.32%  (p=0.000 n=10+9)

The patch is for sure not completely right (and probably far away from being right) as always spinning unconditionally should sometimes bring harm instead of good. But it shows that with proper scheduler tuning it is possible to avoid context switches and perform better.

I attach my original post here for completeness.

Thanks,
Kirill

/cc @rsc, @ianlancetaylor, @dvyukov, @aclements, @bcmills

#18023 (comment):

Let me chime in a bit. On Linux the context switch can happen, if my reading of futex_wake() is correct (which is probably not), because e.g. wake_up_q() via calling wake_up_process() -> try_to_wake_up() -> select_task_rq() can select another cpu

                cpu = cpumask_any(&p->cpus_allowed);

for woken process.

The Go runtime calls futex_wake() in notewakeup() to wake up an M that was previously stopped via stopm() -> notesleep() (the latter calls futexwait()).

When LockOSThread is used an M is dedicated to a G, so when that G blocks, e.g. on chan send, that M, if I understand correctly, has high chances to stop. And if it stops it goes to futexwait, and then a context switch happens when someone wakes it up because e.g. something was sent to the G via channel.

With this thinking the following patch:

diff --git a/src/runtime/lock_futex.go b/src/runtime/lock_futex.go
index 9d55bd129c..418fe1b845 100644
--- a/src/runtime/lock_futex.go
+++ b/src/runtime/lock_futex.go
@@ -146,7 +157,13 @@ func notesleep(n *note) {
                // Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
                ns = 10e6
        }
-       for atomic.Load(key32(&n.key)) == 0 {
+       for spin := 0; atomic.Load(key32(&n.key)) == 0; spin++ {
+               // spin a bit hoping we'll get wakup soon
+               if spin < 10000 {
+                       continue
+               }
+
+               // no luck -> go to sleep heavily to kernel
                gp.m.blocked = true
                futexsleep(key32(&n.key), 0, ns)
                if *cgo_yield != nil {

makes BenchmarkLocked much faster on my computer:

name        old time/op  new time/op  delta
Unlocked-4   485ns ± 0%   483ns ± 1%     ~     (p=0.188 n=9+10)
Locked-4    5.22µs ± 1%  1.32µs ± 5%  -74.64%  (p=0.000 n=9+10)
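
The idea behind the patch (poll the wakeup flag for a bounded number of iterations and only then block) can be tried outside the runtime. The following is a minimal user-space sketch, not the runtime code: the `note` type, its `ch` channel standing in for the futex, and the `sleep`/`wakeup` method names are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// note is a hypothetical user-space stand-in for the runtime's one-shot note:
// key becomes 1 on wakeup, and ch plays the role of the futex.
type note struct {
	key uint32
	ch  chan struct{}
}

func newNote() *note { return &note{ch: make(chan struct{})} }

// wakeup marks the note as signaled and releases any parked waiter.
// It must be called at most once per note.
func (n *note) wakeup() {
	atomic.StoreUint32(&n.key, 1)
	close(n.ch)
}

// sleep polls the key up to maxSpin times and only then blocks.
// It reports whether the wakeup was observed while still spinning.
func (n *note) sleep(maxSpin int) (spun bool) {
	for i := 0; i < maxSpin; i++ {
		if atomic.LoadUint32(&n.key) != 0 {
			return true
		}
	}
	<-n.ch // slow path: park until wakeup
	return false
}

func main() {
	fast := newNote()
	fast.wakeup()
	fmt.Println("already signaled, caught while spinning:", fast.sleep(10000))

	slow := newNote()
	go func() {
		time.Sleep(10 * time.Millisecond)
		slow.wakeup()
	}()
	fmt.Println("late wakeup, caught while spinning:", slow.sleep(10))
}
```

The spin budget is the whole trade-off: a waiter that is woken quickly avoids the park/unpark round trip, while a waiter that is woken late has burned the full budget for nothing.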

I also looked around and found: essentially at every CGo call lockOSThread is used:

https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107

With this in mind I modified the benchmark a bit so that no LockOSThread is explicitly used, but server performs 1 and 10 simple C calls for every request:

CGo-4        581ns ± 1%   556ns ± 0%   -4.27%  (p=0.000 n=10+10)
CGo10-4     2.20µs ± 6%  1.23µs ± 0%  -44.32%  (p=0.000 n=10+9)

which shows the change brings quite visible speedup.

This way I'm not saying my patch is right, but at least it shows that much can be improved. So I suggest to reopen the issue.

Thanks beforehand,
Kirill

/cc @dvyukov, @aclements, @bcmills

full benchmark source:

(tmp_test.go)

package tmp

import (
        "runtime"
        "testing"
)

type in struct {
        c   chan *out
        arg int
}

type out struct {
        ret int
}

func client(c chan *in, arg int) int {
        rc := make(chan *out)
        c <- &in{
                c:   rc,
                arg: arg,
        }
        ret := <-rc
        return ret.ret
}

func _server(c chan *in, argadjust func(int) int) {
        for r := range c {
                r.c <- &out{ret: argadjust(r.arg)}
        }
}

func server(c chan *in) {
        _server(c, func(arg int) int {
                return 3 + arg
        })
}

func lockedServer(c chan *in) {
        runtime.LockOSThread()
        server(c)
        runtime.UnlockOSThread()
}

// server with 1 C call per request
func cserver(c chan *in) {
        _server(c, cargadjust)
}

// server with 10 C calls per request
func cserver10(c chan *in) {
        _server(c, func(arg int) int {
                for i := 0; i < 10; i++ {
                        arg = cargadjust(arg)
                }
                return arg
        })
}

func benchmark(b *testing.B, srv func(chan *in)) {
        inc := make(chan *in)
        go srv(inc)
        for i := 0; i < b.N; i++ {
                client(inc, i)
        }
        close(inc)
}

func BenchmarkUnlocked(b *testing.B)    { benchmark(b, server) }
func BenchmarkLocked(b *testing.B)      { benchmark(b, lockedServer) }
func BenchmarkCGo(b *testing.B)         { benchmark(b, cserver) }
func BenchmarkCGo10(b *testing.B)       { benchmark(b, cserver10) }

(tmp.go)

package tmp

// int argadjust(int arg) { return 3 + arg; }
import "C"

// XXX here because cannot use C in tests directly
func cargadjust(arg int) int {
        return int(C.argadjust(C.int(arg)))
}

typeless · 2017-09-13T06:11:06Z

The kernel scheduler is invoked when 1. an interrupt handler exits (hardware irqs or timer ticks), 2. a syscall exits, 3. the scheduler is explicitly called. This means the worst-case resolution of scheduling is equal to the resolution of the timer ticks when the system is idle (no irqs and no syscall traffic). So it's doubtful that the 100us-ish timeout would work as expected when the granularity of the kernel scheduling is much coarser. And if you check out the manpage of futex, it has a sentence about the timeout saying that "This interval will be rounded up to the system clock granularity". That probably explains why the spinlock gets woken up faster.

dvyukov · 2017-09-13T06:39:14Z

@typeless futex explicitly calls scheduler, so that's number 3

dvyukov · 2017-09-13T06:40:05Z

I don't think this affects cgo calls, notesleep should not be on fast path there.
It should affect cgo callbacks, though.

typeless · 2017-09-13T06:55:01Z

@dvyukov

futex explicitly calls scheduler, so that's number 3

But for the user process being able to call the scheduler, doesn't the process have to be invoked by the scheduler first?

dvyukov · 2017-09-13T06:57:25Z

If it's not running, it won't be able to set n.key either.

typeless · 2017-09-13T07:31:45Z

@dvyukov

Apologies for my ambiguous wording. By processes I actually mean the OS threads.

If it's not running, it won't be able to set n.key either.

Isn't it possible for a running thread to be interrupted and to yield the CPU to others temporarily (unless it can disable the IRQs)? The thread can only get rescheduled at the next scheduling point, which is beyond the control of the thread.

P.S. FWIW a possible way to occupy the CPU exclusively other than disabling the IRQs is to use sched_setscheduler with the SCHED_FIFO policy.

navytux · 2017-09-13T07:34:13Z

I don't think this affects cgo calls, notesleep should not be on fast path there.
It should affect cgo callbacks, though.

@dvyukov original benchmark uses only CGo calls without CGo callbacks. lockOSThread is used on cgocall fast path for every CGo call:

https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107

and somehow, as benchmark shows, it gets intermixed with the scheduler:

CGo-4        581ns ± 1%   556ns ± 0%   -4.27%  (p=0.000 n=10+10)
CGo10-4     2.20µs ± 6%  1.23µs ± 0%  -44.32%  (p=0.000 n=10+9)

dvyukov · 2017-09-13T07:38:00Z

@typeless sorry, I don't see how this is related to the topic.

dvyukov · 2017-09-13T07:41:09Z

@navytux

and somehow, as benchmark shows, it gets intermixed with the scheduler:

What is the result if you comment out lockOSThread/unlockOSThread in cgocall (it's not really needed there)?

navytux · 2017-09-13T08:11:29Z

@dvyukov thanks for the question, I too thought about it just after commenting. So on today's unmodified tip (go version devel +c2f8ed267b Wed Sep 13 07:19:21 2017 +0000 linux/amd64) it gives:

$ benchstat dv0.txt
name     time/op
CGo-4     576ns ± 1%
CGo10-4  2.22µs ± 2%

with adding

diff --git a/src/runtime/cgocall.go b/src/runtime/cgocall.go
index ce4d707e06..decc310088 100644
--- a/src/runtime/cgocall.go
+++ b/src/runtime/cgocall.go
@@ -106,7 +106,7 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
 
        // Lock g to m to ensure we stay on the same stack if we do a
        // cgo callback. In case of panic, unwindm calls endcgo.
-       lockOSThread()
+//     lockOSThread()
        mp := getg().m
        mp.ncgocall++
        mp.ncgo++
@@ -159,7 +159,7 @@ func endcgo(mp *m) {
                raceacquire(unsafe.Pointer(&racecgosync))
        }
 
-       unlockOSThread() // invalidates mp
+//     unlockOSThread() // invalidates mp
 }
 
 // Call from C back to Go.

it becomes:

$ benchstat dv0.txt dv1.txt 
name     old time/op  new time/op  delta
CGo-4     576ns ± 1%   558ns ± 1%  -3.18%  (p=0.000 n=10+10)
CGo10-4  2.22µs ± 2%  2.03µs ± 4%  -8.63%  (p=0.000 n=9+8)

with adding notesleep spin patch on top (so both cgocall and notesleep are patched) it becomes:

$ benchstat dv1.txt dv2.txt 
name     old time/op  new time/op  delta
CGo-4     558ns ± 1%   552ns ± 2%   -1.00%  (p=0.021 n=10+10)
CGo10-4  2.03µs ± 4%  1.17µs ± 1%  -42.45%  (p=0.000 n=8+8)

which shows the speedup is not only related to lockOSThread and somehow generally (?) applies to scheduler or some other details of CGo calls.

dvyukov · 2017-09-13T08:20:02Z

which shows the speedup is not only related to lockOSThread and somehow generally (?) applies to scheduler or some other details of CGo calls.

This makes sense now.

I guess it is general tradeoff between latency and burning CPU.
If we do more aggressive scheduler spinning, it can make sense to do it directly in findrunnable/stoplockedm.
But generally it's unclear to me if this translates to real world improvements. Synthetic synchronization tests are usually misleading. If a locked goroutine blocks for more than 10us, then we just burn 10us on CPU time in vain.

navytux · 2017-09-13T08:24:47Z

If in cgocall I also disable entersyscall/exitsyscall even without notesleep spin patch, so whole patch is below:

diff --git a/src/runtime/cgocall.go b/src/runtime/cgocall.go
index ce4d707e06..70fbe7e7b1 100644
--- a/src/runtime/cgocall.go
+++ b/src/runtime/cgocall.go
@@ -106,7 +106,7 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
 
        // Lock g to m to ensure we stay on the same stack if we do a
        // cgo callback. In case of panic, unwindm calls endcgo.
-       lockOSThread()
+//     lockOSThread()
        mp := getg().m
        mp.ncgocall++
        mp.ncgo++
@@ -129,9 +129,9 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
        // "system call", run the Go code (which may grow the stack),
        // and then re-enter the "system call" reusing the PC and SP
        // saved by entersyscall here.
-       entersyscall(0)
+//     entersyscall(0)
        errno := asmcgocall(fn, arg)
-       exitsyscall(0)
+//     exitsyscall(0)
 
        // From the garbage collector's perspective, time can move
        // backwards in the sequence above. If there's a callback into
@@ -159,7 +159,7 @@ func endcgo(mp *m) {
                raceacquire(unsafe.Pointer(&racecgosync))
        }
 
-       unlockOSThread() // invalidates mp
+//     unlockOSThread() // invalidates mp
 }
 
 // Call from C back to Go.

compared to only lockOSThread commented it becomes:

$ benchstat dv1.txt dv1nosys.txt 
name     old time/op  new time/op  delta
CGo-4     558ns ± 1%   509ns ± 1%   -8.86%  (p=0.000 n=10+9)
CGo10-4  2.03µs ± 4%  0.77µs ± 3%  -61.97%  (p=0.000 n=8+10)

And if I comment out only entersyscall/exitsyscall, so whole patch is:

diff --git a/src/runtime/cgocall.go b/src/runtime/cgocall.go
index ce4d707e06..243688f0af 100644
--- a/src/runtime/cgocall.go
+++ b/src/runtime/cgocall.go
@@ -129,9 +129,9 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
        // "system call", run the Go code (which may grow the stack),
        // and then re-enter the "system call" reusing the PC and SP
        // saved by entersyscall here.
-       entersyscall(0)
+//     entersyscall(0)
        errno := asmcgocall(fn, arg)
-       exitsyscall(0)
+//     exitsyscall(0)
 
        // From the garbage collector's perspective, time can move
        // backwards in the sequence above. If there's a callback into

compared to unmodified tip it becomes:

$ benchstat dv0.txt dv1nosysonly.txt 
name     old time/op  new time/op  delta
CGo-4     576ns ± 1%   515ns ± 1%  -10.62%  (p=0.000 n=10+10)
CGo10-4  2.22µs ± 2%  0.84µs ± 2%  -62.07%  (p=0.000 n=9+10)

so in case of CGo calls the notesleep spin patch brings speedup not because of lockOSThread but due to entersyscall/exitsyscall being there.

dvyukov · 2017-09-13T08:29:45Z

Yes, entersyscall/exitsyscall is what interacts with Go scheduler in case of cgo calls.

navytux · 2017-09-13T08:39:49Z

I guess it is general tradeoff between latency and burning CPU.
If we do more aggressive scheduler spinning, it can make sense to do it directly in findrunnable/stoplockedm.
But generally it's unclear to me if this translates to real world improvements. Synthetic synchronization tests are usually misleading. If a locked goroutine blocks for more than 10us, then we just burn 10us on CPU time in vain.

I generally agree but some kind of adaptive spinning could be brought to notesleep/notewakeup too -
similar to what you did to runtime.lock/unlock in 4e5086b (runtime: improve Linux mutex):

4e5086b9#diff-608e335144c55dc824f257f5a66ac4d3R125
https://github.com/golang/go/blob/c2f8ed26/src/runtime/lock_futex.go#L54

because currently notesleep always unconditionally goes directly to sys_futex in the kernel.

navytux · 2017-09-13T08:46:30Z

And by the way - many fast syscalls is not synthetic benchmark - they appear in real programs either reading fast from cached files or sending/receiving on network (yes network is epolled but still every send/recv goes through full - not raw - syscall). In my experience every such event has potential to shake the scheduler.

dvyukov · 2017-09-13T08:57:14Z

notesleep is not meant for sleeping for brief periods. There are callers for which spinning will be plain harmful. mutex is meant for brief blocking.
Non-blocking network calls should not invoke scheduler, they don't give up P for 20us, while non-blocking read/write takes on a par of 5us.

navytux · 2017-09-13T09:21:15Z

Thanks for feedback. I understand there is a difference between notesleep and mutex, and notesleep is by definition a heavier sleep. With this in mind and an adjusted patch:

diff --git a/src/runtime/lock_futex.go b/src/runtime/lock_futex.go
index 9d55bd129c..5648ef66f3 100644
--- a/src/runtime/lock_futex.go
+++ b/src/runtime/lock_futex.go
@@ -146,7 +146,14 @@ func notesleep(n *note) {
                // Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
                ns = 10e6
        }
-       for atomic.Load(key32(&n.key)) == 0 {
+       for spin := 0; atomic.Load(key32(&n.key)) == 0; spin++ {
+               // spin a bit hoping we'll get wakup soon hopefully without context switch
+               if spin < 10 {
+                       osyield()
+                       continue
+               }
+
+               // no luck -> go to sleep heavily to kernel; this might result in context switch
                gp.m.blocked = true
                futexsleep(key32(&n.key), 0, ns)
                if *cgo_yield != nil {

it still works for both LockOSThread and CGo cases:

$ benchstat dv00.txt dv0+osyield.txt 
name        old time/op  new time/op  delta
Unlocked-4   482ns ± 1%   478ns ± 0%   -0.90%  (p=0.011 n=10+9)
Locked-4    5.08µs ± 1%  1.51µs ± 4%  -70.29%  (p=0.000 n=9+10)
CGo-4        577ns ± 0%   558ns ± 1%   -3.38%  (p=0.000 n=9+10)
CGo10-4     2.22µs ± 3%  1.30µs ± 1%  -41.47%  (p=0.000 n=10+10)

but the spinning now does not wastefully burn CPU and gives up to the OS scheduler via osyield(). This works similarly to the "passive spinning" phase of mutex lock. For mutex N(passive-spin) = 1, and since notesleep is heavier, having N(passive-spin) an order of magnitude larger for it seems logical. The osyield will release the CPU to other threads if there is other work to do and thus hopefully should not have negative impact (this has to be verified).
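
For readers who want to experiment with the passive-spinning shape outside the runtime, here is a rough user-space sketch. It is an analogy under stated assumptions, not the patch itself: `waitYielding` and its `park` callback are hypothetical names, and `runtime.Gosched()` (which yields to the Go scheduler) stands in for the runtime-internal `osyield()` (which yields the OS thread).

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// waitYielding polls flag, yielding between polls, for at most maxYields
// rounds. If the flag is still unset it calls park, which is expected to
// block until the flag has been set, and then rechecks the flag.
// It returns how many times it yielded.
func waitYielding(flag *uint32, maxYields int, park func()) int {
	yields := 0
	for atomic.LoadUint32(flag) == 0 {
		if yields < maxYields {
			runtime.Gosched() // assumption: user-space stand-in for osyield()
			yields++
			continue
		}
		park() // no luck: take the heavy blocking path
	}
	return yields
}

func main() {
	var flag uint32
	done := make(chan struct{})
	go func() {
		atomic.StoreUint32(&flag, 1)
		close(done)
	}()

	parked := false
	y := waitYielding(&flag, 10, func() { parked = true; <-done })
	fmt.Println("yields:", y, "parked:", parked)
}
```

How many yields happen before the flag is seen depends on scheduling, so the printed numbers vary from run to run.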

Non-blocking network calls should not invoke scheduler, they don't give up P for 20us, while non-blocking read/write takes on a par of 5us.

Maybe you are right here. I will try to reverify this once getting to a related topic and will hopefully come back with feedback.

kostix · 2017-09-13T09:22:26Z

@dvyukov, could you please clear up that

Non-blocking network calls should not invoke scheduler, they don't give up P for 20us, while non-blocking read/write takes on a par of 5us.

bit for me?

Do I parse it correctly, that network syscalls are treated specially and P is not removed from under a G which spends more than 20 us in such a syscall, like this is done for all other syscalls?

Or does this merely happen "all by itself" — as a byproduct of such syscalls commonly lasting only about 5 us (thanks to the sockets being non-blocking)?

dvyukov · 2017-09-13T09:30:05Z

It happens for all syscalls/cgocalls if they return faster than 20us.

dvyukov · 2017-09-13T09:34:11Z

but the spinning now does not wastefully burn CPU and gives up to OS scheduler via osyield().

This is still wasteful for some callers. What exactly caller of notesleep is affected by this change? Can we improve the caller instead?

navytux · 2017-09-13T10:05:48Z

For LockOSThread case it seems to be stoplockedm/startlockedm:

(svg)

However for CGo case it is less clear, at least from CPU profile point of view:

(svg)

navytux · 2017-09-13T10:14:12Z

Btw, do you have particular example where osyielding in notesleep will be wasteful? (go scheduler is new to me)

dvyukov · 2017-09-13T10:29:27Z

I would try spinning in findrunnable more, before dropping P, that can yield better result without impacting other cases.

Re stoplockedm, I am still not sure. We could also spin there before dropping P, but then this is an optimization for the case when a locked G is unblocked very quickly. It's unclear how often this happens in real life, and even if it does happen it still won't be fast because of the required thread jumps. But this will hurt other goroutines because we are holding the P.
We could spin just in notesleep when called from stoplockedm, but then at least we need to avoid futexwake as well. And it's still not clear if it is the right thing to do.

If you care about performance of locked goroutines, there seems to be a bunch of other missed optimizations. E.g. when a locked G is woken, we wake another M, but that M cannot execute the G, so it wakes the right M and passes G/P to it. I guess we need to wake the right M right away.

navytux · 2017-09-13T12:19:54Z

@dvyukov thanks for your feedback with knowledge sharing - it is appreciated.

My situation is this: I have a case where my server is performing poorly because of, I think, the go scheduler (tracing shows many spawned goroutines are queued to the same P and are not stolen while I need all of them to execute in parallel to significantly reduce latency; other Ps are idle). There I do not use LockOSThread and CGo at all at the moment. I cannot even describe the problem properly yet because it has not been fully analyzed. I spent some time learning how the go scheduler works to better understand what happens when a goroutine is started, a channel is sent to/received from etc. Along the way I wrote a quick tool to profile how often goroutines are migrated between Ms:

https://lab.nexedi.com/kirr/neo/blob/851864a9/go/gmigrate.go

because unfortunately it seems to be a frequent event and changing M probably means changing CPU and thus losing CPU caches.

After studying I quickly looked around for scheduler bugs here on issue tracker and found LockOSThread case. I tried to test whether I understood at least something via trying to fix it, and so we are here.

So if someone else does not fix the LockOSThread and CGo cases before me, I will hopefully try to give it a fresh look while working on my scheduler-related issues. However scheduler bits are currently lower priority for me compared to proper memory and other tuning - so it will be some time before I could dig in more details on this topic.

Thanks again for feedback and apologies for throttling,
Kirill

dvyukov · 2017-09-13T12:25:29Z

tracing shows many spawned goroutines are queued to the same P and are not stolen while I need all of them to execute in parallel to significantly reduce latency; other Ps are idle

Sounds pretty bad. But does not look like an issue with notesleep.
Repro would be useful (and/or trace file).

RLH · 2017-09-13T12:39:01Z

Could this be a co-tenancy issue? Are there other processes running at the same time the go process is running? If so has GOMAXPROCS been adjusted accordingly? And as Dmitry said a reproducer would be great, if not then a trace so we can see why goroutines aren't being stolen.
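
As a quick way to answer the GOMAXPROCS question on a given machine, a program can print both values at startup. This is only a small sketch; the helper name `schedulerParallelism` is made up for illustration, while `runtime.NumCPU` and `runtime.GOMAXPROCS` are the standard library calls.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerParallelism reports the number of logical CPUs usable by the
// process and the current GOMAXPROCS setting.
func schedulerParallelism() (cpus, procs int) {
	// GOMAXPROCS(0) only queries the current value; it does not change it.
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	cpus, procs := schedulerParallelism()
	fmt.Println("logical CPUs:", cpus)
	fmt.Println("GOMAXPROCS:  ", procs)
}
```

If other busy processes share the machine, a GOMAXPROCS equal to the CPU count can oversubscribe it; the value can be lowered with the GOMAXPROCS environment variable without rebuilding.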

navytux · 2017-09-13T12:52:24Z

It is running on my notebook which has 4 CPU (2 physical * 2 HT). There are 2 programs: server + client. The client tries to issue 512 simultaneous networking requests over same TCP connection to server and then waits for completion, but from what I recall even initiating them go mostly serially. GOMAXPROCS is unadjusted and default to ncpu=4. No other program significantly uses the processor.

I admit networking over loopback is not the same as networking on LAN (on LAN RTT ~= 500μs while for TCP/loopback and separate processes it is around 5-8μs). We do care about the loopback case too though.

I promise I get back to this once dealing with other urgent tunings. Probably in a week or two.

navytux · 2018-07-09T12:51:58Z

@ianlancetaylor, the sleep for 1μs in server causes some scheduler badness for unlocked case:

$ gotip test -count=3 -bench Unlocked tmp3_test.go 
goos: linux
goarch: amd64
BenchmarkUnlocked-4        20000             66237 ns/op
BenchmarkUnlocked-4        20000             66948 ns/op
BenchmarkUnlocked-4        20000             66971 ns/op

$ gotip test -count=3 -trace x.trace -bench Unlocked tmp3_test.go 
goos: linux
goarch: amd64
BenchmarkUnlocked-4       100000             14993 ns/op
BenchmarkUnlocked-4       100000             14758 ns/op
BenchmarkUnlocked-4       200000             15068 ns/op

i.e. it runs ~ 4.5x slower when run without tracing enabled.

For the reference the time for locked case only somewhat increases if tracing is enabled.

$ gotip test -count=3 -bench Locked tmp3_test.go 
goos: linux
goarch: amd64
BenchmarkLocked-4         100000             13924 ns/op
BenchmarkLocked-4         100000             13800 ns/op
BenchmarkLocked-4         100000             14025 ns/op

$ gotip test -count=3 -trace y.trace -bench Locked tmp3_test.go 
goos: linux
goarch: amd64
BenchmarkLocked-4         100000             18203 ns/op
BenchmarkLocked-4         100000             18162 ns/op
BenchmarkLocked-4         100000             18230 ns/op

go version devel +b56e24782f Mon Jul 9 02:18:16 2018 +0000 linux/amd64

navytux · 2018-07-09T12:53:26Z

I generally agree that what @dvyukov suggests should be the way to go, not my silly spinning patch.

omriperi · 2020-03-03T14:27:28Z

Guys any update on this issue?
We want to use a C library that assumes we're not jumping between threads while invoking it (it's an async library) and currently don't know if this is solved or not.
I'll be glad for any update ;)

ianlancetaylor · 2020-03-03T14:50:11Z

@omriperi Many programs and libraries use runtime.LockOSThread without issue. The benchmark here is synthetic and unlikely to correspond to most real use cases. Don't worry about this unless you see unexpected performance problems.

navytux · 2020-03-03T15:00:25Z

I can only restate that the benchmark is not synthetic: #21827 (comment).

ianlancetaylor · 2020-03-04T21:39:04Z

My apologies. It would help a great deal if you could share the real code, so that we can understand the real problem.

navytux · 2020-03-05T09:36:47Z

I apologize as well as I don't actively work on this topic now (the issue is from 2017), and cannot provide "the real code". However I clearly remember that scheduler performance issues start to show in practically all scenarios where either many system calls or Cgo calls are made. My understanding is that LockOSThread only brings those performance issues more to the surface.

Here is some evidence, from what I could remember, to justify my words:

#19574
#19574 (comment)
#19563 (comment)

Many projects switch off from using Go for this reason:

#19574 (comment)

In the end, only the code that does not push on these pain points stays with Go, and due to that there might be an impression that there are no real scheduler/Cgo/LockOSThread performance problems...

aarzilli · 2020-03-05T10:27:28Z

My apologies. It would help a great deal if you could share the real code, so that we can understand the real problem.

I've recently discovered that this problem affects Delve, for example.

Debugger-related calls (ptrace on linux, WaitForDebugEvent/ContinueDebugEvent on windows) need to be done on the same thread as the thread that spawned the target process, for this we use a goroutine that calls runtime.LockOSThread and executes thunks sent to it through a channel.

The code for this is in https://github.com/go-delve/delve/blob/f863be0a172a9c62d679143ec53587ef6255737e/pkg/proc/native/proc.go#L348 which is pretty much identical to the set up in this issue.
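
For readers unfamiliar with the pattern, a minimal sketch of such a set up follows. It is not Delve's code: the `threadExecutor` type and its `do`/`stop` methods are hypothetical names chosen for illustration, and only `runtime.LockOSThread`/`runtime.UnlockOSThread` are real API.

```go
package main

import (
	"fmt"
	"runtime"
)

// threadExecutor runs submitted functions on one dedicated OS thread.
// The type and method names are hypothetical, not Delve's API.
type threadExecutor struct {
	calls chan func()
}

func newThreadExecutor() *threadExecutor {
	e := &threadExecutor{calls: make(chan func())}
	go func() {
		// Pin this goroutine to its OS thread for its whole lifetime.
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		for fn := range e.calls {
			fn()
		}
	}()
	return e
}

// do runs fn on the locked thread and waits for it to finish.
func (e *threadExecutor) do(fn func()) {
	done := make(chan struct{})
	e.calls <- func() {
		defer close(done)
		fn()
	}
	<-done
}

// stop shuts the worker goroutine down; do must not be called afterwards.
func (e *threadExecutor) stop() { close(e.calls) }

func main() {
	e := newThreadExecutor()
	defer e.stop()

	sum := 0
	for i := 1; i <= 3; i++ {
		i := i
		e.do(func() { sum += i })
	}
	fmt.Println("sum computed on the locked thread:", sum)
}
```

Every `do` call costs two channel handoffs between the caller's M and the locked M, which is exactly the round trip the benchmark in this issue measures.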

Back in January I looked into optimizing the performance of conditional breakpoint evaluation. There were several performance problems on our side (note: not all have landed upstream yet); when those were taken care of, the flame graph looked like this:

which shows that a lot of our time was spent inside runtime.findrunnable. By simply moving all the code inside the LockOSThread goroutine I can get this flame graph:

the LockOSThread thing is responsible for 2/3 of the time it takes to evaluate a conditional breakpoint (specifically 0.4ms out of 0.6ms).

ianlancetaylor · 2020-03-05T18:54:26Z

@navytux There is a big difference between cgo overhead, which has gotten somewhat better, and LockOSThread overhead, which remains unclear. As we've discussed before, a cgo call does not call LockOSThread. Only a call from Go to C to Go calls LockOSThread.

navytux · 2020-03-06T08:14:33Z

@ianlancetaylor, I wrote #21827 (comment) with full understanding that LockOSThread is no longer there on Go->C call after 332719f. I still stand on my points based on experience: LockOSThread only highlights Go scheduler performance problems related to crossing Go->non-Go world boundary (Go->C, Go->kernel) that are present there even when goroutines are not locked to OS threads.

Anyway, whatever it is, e.g. Delve case (#21827 (comment); thanks @aarzilli) demonstrates this issue on a real-world program.

mpx · 2020-10-21T12:14:53Z

I recently hit this problem working with GLFW/OpenGL based frameworks. I found a relatively simple program was consuming huge amount of CPU, >50% via park_m and futex.

Go packages commonly use a channel to pass functions that must be executed on the main thread (GLFW/GL requirement). Eg, github.com/faiface/mainthread uses this technique. Channel operations cause a huge number of futex sleep/wake cycles as they swap between the Ms. This adds significant latency to calls (>8-9us).

This cut down example demonstrates the technique, and the kind of CPU profiles it generates:

package main

import (
    "fmt"
    "os"
    "runtime"
    "runtime/pprof"
    "time"
)

const iterations = 1e6

func main() {
    runtime.LockOSThread()

    f, _ := os.Create("cpu.prof")
    defer f.Close()
    _ = pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile()

    // Increasing chan size significantly improves performance when
    // not waiting for completion (keeps the main thread busy).
    calls := make(chan func(), 1)

    go func() {
        for i := 0; i < iterations; i++ {
            done := make(chan struct{})
            calls <- func() {
                close(done)
            }
            <-done // Comment to reduce parking main thread.
        }
        close(calls)
    }()

    // Process calls on the "main" thread.
    t0 := time.Now()
    for fn := range calls {
        fn()
    }
    fmt.Println("Mean:", time.Since(t0)/iterations)
}

chrisprobst · 2020-12-21T08:53:54Z

@mpx @ianlancetaylor Why is this a problem at all? The doc clearly states that LockOSThread causes a single goroutine to be run exclusively on a single thread. This naturally implies that communication via channels (or whatever) will cross thread boundaries, therefore requires locking/context switch & causes overhead. This seems totally reasonable.

@mpx It is true, that graphics stacks require you to use LockOSThread due to thread-local storage often used in old & rotten C libraries. However, passing callbacks from other goroutines usually work differently. You use a locked slice of callbacks and simply queue them. In the render loop you simply check every frame if there are callbacks and run them. If not, wait until the next frame. In this case, there is less chance of lock contention and park & wake-up. Also, your problem is totally not related to Go. As soon as you start using multiple threads, even in C/C++, you need a way of syncing the main render loop.
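
The queueing approach described above could look roughly like the sketch below. It is an illustration under assumptions, not code from any GLFW/OpenGL framework: `callbackQueue`, `post` and `drain` are made-up names, and the single `drain` call in `main` stands in for one iteration of a render loop that would normally run on the thread locked with runtime.LockOSThread.

```go
package main

import (
	"fmt"
	"sync"
)

// callbackQueue collects functions from any goroutine so the render loop
// can run them in batches. The names here are illustrative only.
type callbackQueue struct {
	mu      sync.Mutex
	pending []func()
}

// post queues fn without waiting for it to run.
func (q *callbackQueue) post(fn func()) {
	q.mu.Lock()
	q.pending = append(q.pending, fn)
	q.mu.Unlock()
}

// drain runs everything queued so far and returns how many callbacks ran.
// The lock is released before the callbacks execute, so they may post again.
func (q *callbackQueue) drain() int {
	q.mu.Lock()
	batch := q.pending
	q.pending = nil
	q.mu.Unlock()
	for _, fn := range batch {
		fn()
	}
	return len(batch)
}

func main() {
	var q callbackQueue
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			q.post(func() { fmt.Println("callback from producer", id) })
		}(i)
	}
	wg.Wait()

	// One iteration of a hypothetical render loop: drain, then draw the frame.
	fmt.Println("ran", q.drain(), "callbacks this frame")
}
```

The trade-off compared to the channel version is that producers never park waiting for the main thread, but a callback may wait up to one frame before it runs and its result is not available synchronously.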

@navytux Maybe it would be helpful to summarize the initial problem more clearly?

ianlancetaylor · 2020-12-21T19:51:52Z

@chrisprobst This issue is about speeding up a specific case, for which a reduced test case appears in the initial comment. I agree that using LockOSThread is inevitably slower than not using it. But for this test case the slowdown appears to be unreasonably high.

gopherbot · 2024-02-09T14:52:13Z

Change https://go.dev/cl/562915 mentions this issue: runtime: don't call lockOSThread for every syscall call on Windows

MatejMagat305 · 2024-05-15T06:37:44Z

If I also disable entersyscall/exitsyscall in cgocall, even without the notesleep spin patch, so that the whole patch is as below:

diff --git a/src/runtime/cgocall.go b/src/runtime/cgocall.go
index ce4d707e06..70fbe7e7b1 100644
--- a/src/runtime/cgocall.go
+++ b/src/runtime/cgocall.go
@@ -106,7 +106,7 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
 
        // Lock g to m to ensure we stay on the same stack if we do a
        // cgo callback. In case of panic, unwindm calls endcgo.
-       lockOSThread()
+//     lockOSThread()
        mp := getg().m
        mp.ncgocall++
        mp.ncgo++
@@ -129,9 +129,9 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
        // "system call", run the Go code (which may grow the stack),
        // and then re-enter the "system call" reusing the PC and SP
        // saved by entersyscall here.
-       entersyscall(0)
+//     entersyscall(0)
        errno := asmcgocall(fn, arg)
-       exitsyscall(0)
+//     exitsyscall(0)
 
        // From the garbage collector's perspective, time can move
        // backwards in the sequence above. If there's a callback into
@@ -159,7 +159,7 @@ func endcgo(mp *m) {
                raceacquire(unsafe.Pointer(&racecgosync))
        }
 
-       unlockOSThread() // invalidates mp
+//     unlockOSThread() // invalidates mp
 }
 
 // Call from C back to Go.

then, compared to having only lockOSThread commented out, it becomes:

$ benchstat dv1.txt dv1nosys.txt 
name     old time/op  new time/op  delta
CGo-4     558ns ± 1%   509ns ± 1%   -8.86%  (p=0.000 n=10+9)
CGo10-4  2.03µs ± 4%  0.77µs ± 3%  -61.97%  (p=0.000 n=8+10)

And if I comment out only entersyscall/exitsyscall, so that the whole patch is:

diff --git a/src/runtime/cgocall.go b/src/runtime/cgocall.go
index ce4d707e06..243688f0af 100644
--- a/src/runtime/cgocall.go
+++ b/src/runtime/cgocall.go
@@ -129,9 +129,9 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
        // "system call", run the Go code (which may grow the stack),
        // and then re-enter the "system call" reusing the PC and SP
        // saved by entersyscall here.
-       entersyscall(0)
+//     entersyscall(0)
        errno := asmcgocall(fn, arg)
-       exitsyscall(0)
+//     exitsyscall(0)
 
        // From the garbage collector's perspective, time can move
        // backwards in the sequence above. If there's a callback into

compared to unmodified tip it becomes:

$ benchstat dv0.txt dv1nosysonly.txt 
name     old time/op  new time/op  delta
CGo-4     576ns ± 1%   515ns ± 1%  -10.62%  (p=0.000 n=10+10)
CGo10-4  2.22µs ± 2%  0.84µs ± 2%  -62.07%  (p=0.000 n=9+10)

So in the case of CGo calls, the notesleep spin patch brings a speedup not because of lockOSThread but because of entersyscall/exitsyscall being there.

I would like to have this version as something like an "unsafecgo call" - use only the thread lock ...

navytux mentioned this issue Sep 10, 2017

runtime: unexpectedly large slowdown with runtime.LockOSThread #18023

Closed

ianlancetaylor added this to the Go1.10 milestone Sep 10, 2017

ianlancetaylor added the NeedsInvestigation label Sep 10, 2017

dstiliadis mentioned this issue Aug 27, 2019

Is locking of go-routine to thread mandatory? mdlayher/netlink#146

Closed

navytux mentioned this issue Mar 11, 2020

Reduce the cpu cycles in runtime.findrunnable google/gvisor#1942

Closed

navytux mentioned this issue Apr 21, 2020

runtime: cgo calls are way faster when enabling CPU profile #38325

Open

aarzilli mentioned this issue Nov 29, 2021

proposal: runtime: allow N goroutines to be simultaneously locked to the same OS thread #49848

Closed

CannibalVox mentioned this issue Mar 1, 2022

JIT: consider locking Goroutine thread until JIT exits tetratelabs/wazero#153

Closed

gopherbot added the compiler/runtime label Jul 7, 2022

mknyszek added this to Go Compiler / Runtime Jul 7, 2022

mknyszek removed this from Go Compiler / Runtime Jul 13, 2022

prattmic mentioned this issue Aug 23, 2022

runtime: allow short-term drop of work conservation to increase CPU efficiency #54622

Open

This was referenced Feb 7, 2023

runtime: cgocall is low performance for Conn on Windows #58336

Closed

runtime: eliminate the notion of a "syscall state" #58492

Open

drakkan mentioned this issue Mar 10, 2023

Setuid as user for OsFs operations drakkan/sftpgo#1225

Closed

dominikh mentioned this issue May 13, 2024

Provide access to errno ebitengine/purego#244

Open

gabyhelp mentioned this issue Jul 24, 2024

runtime: improve scaling of lock2 #68578

Closed
