runtime: preserve extra M across calls from C to Go #51676

doujiang24 · 2022-03-15T09:37:34Z

Every call from C to a Go exported function makes 5 sigprocmask calls and 3 sigaltstack calls.

Syscalls during needm:

rt_sigprocmask(SIG_SETMASK, NULL, [], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[], NULL, 8) = 0
sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
sigaltstack({ss_sp=0xc00003e000, ss_flags=0, ss_size=32768}, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0

Syscalls during dropm:

rt_sigprocmask(SIG_SETMASK, ~[], NULL, 8) = 0
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0

We can call PreBindExtraM to bind an extra M after loading the Go .so file and before calling any Go exported function, for better performance.
Nothing changes without this PreBindExtraM call.

Background:
We are building a Go extension for Envoy, which relies heavily on cgo.


doujiang24 · 2022-03-15T11:21:52Z

I finished a draft PR for this proposal: #51679
Any feedback is welcome, thanks!

With PreBindExtraM, C-to-Go calls are ~30x faster in the following simple test case:

hello.go

package main

import "C"

//export AddFromGo
func AddFromGo(a int64, b int64) int64 {
    return a + b
}

func main() {}

hello.c

#include <stdio.h>
#include "libgo-hello.h"
#include <stdlib.h>

int main(int argc, char **argv) {
    long a = 2;
    long b = 3;
    long max = 1;

    if (argc > 1) {
        max = atoi(argv[1]);
    }

    printf("max loop: %ld\n", max);

    PreBindExtraM();

    long r = 0; /* initialized so the final printf is defined even when max is 0 */
    for (long i = 0; i < max; i++) {
        r = AddFromGo(a, b);
    }

    printf("%ld + %ld = %ld\n", a, b, r);
}

benchmark with PreBindExtraM:

$ time ./hello 1000000
max loop: 1000000
2 + 3 = 5

real    0m0.150s
user    0m0.156s
sys     0m0.010s

benchmark without PreBindExtraM (just remove the call):

$ time ./hello 1000000
max loop: 1000000
2 + 3 = 5

real    0m5.088s
user    0m1.536s
sys     0m4.116s

ianlancetaylor · 2022-03-15T17:39:20Z

Even after looking at the pull request I'm not sure precisely what you are proposing.

Is user code expected to call PreBindExtraM? What is the exact semantics of that function? How would you write user documentation for it? Thanks.

doujiang24 · 2022-03-16T01:20:28Z

@ianlancetaylor Thanks.

> Is user code expected to call PreBindExtraM? What is the exact semantics of that function? How would you write user documentation for it? Thanks.

Yes, user code has to call PreBindExtraM to enable this optimization, as shown in hello.c.
Without the additional call to PreBindExtraM, everything works as before; nothing changes.

Let me try to write some documentation for it:

When a C process calls a Go exported function, in short, the flow is:

1. bind an extra M (and also a P, which we don't care about here),
2. execute the Go function,
3. drop the extra M (and P).

Steps 1 (needm) and 3 (dropm) make several signal-related syscalls (5 sigprocmask and 3 sigaltstack calls in total).

To avoid these syscalls, cgo also generates a built-in C function, PreBindExtraM.
You can call PreBindExtraM to pre-bind an extra M after you have loaded the Go .so file and before you call any Go exported function.
After pre-binding the extra M, steps 1 and 3 are skipped when calling any Go exported function.

aclements · 2022-03-16T16:00:08Z

I haven't thought through this deeply, but is the TODO(rsc) comment on dropm relevant to this case? It seems like if the runtime could use TLS to bind an M to a C thread, we wouldn't need to manipulate the sigaltstack so frequently. But I may be wrong about that.

ianlancetaylor · 2022-03-16T16:53:14Z

OK, I think that in effect what the suggested change does is, for a thread created by C, set the g TLS variable to a newly created G and an associated M. However, there is no way to actually release that G and M if the thread exits.

So, I agree: the TODO by @rsc is a better approach. With that approach, the first time a C thread calls into Go we allocate a G and M and set the g TLS variable. Then we just keep that around, but if the thread exits we release that G and M and put the M back on the extram list.

Note that we will get into trouble if the C thread calls Go code, then disables the signal stack, then calls Go code again. Perhaps that case is not worth worrying about.

I'm going to take this out of the proposal process because I think we can get the same effect without an API change.

thepudds · 2022-03-16T18:26:29Z

> Then we just keep that around, but if the thread exits we release that G and M and put the M back on the extram list.

Sorry for the basic question, but does the runtime today already track when a thread created by C exits?

thepudds · 2022-03-16T18:50:00Z

To partly answer my own question, it looks like registering a destructor which would be called on thread exit would be part of the work here...

ianlancetaylor · 2022-03-16T20:49:30Z

Yes, we would use pthread_key_create with a destructor function. We wouldn't actually track when a thread exits as such.

doujiang24 · 2022-03-17T04:02:08Z

Oh, agreed, the TODO by @rsc is a better approach. Using pthread_key_create to register a destructor is a good idea.

> set the g TLS variable to a newly created G and an associated M.

Does it need to create a new g? Maybe using the g0 would be a better choice, as it does now.

Is the following change going in the right direction? I would love to give it a try. Thanks.

1. Call pthread_key_create to register a destructor when loading the Go .so file, maybe in the x_cgo_sys_thread_create function.
2. Call needm in _cgo_wait_runtime_init_done when the thread key value is NULL, and also set the thread key value to a non-NULL value.
3. When the destructor is called, call dropm.

In short, we always try to pre-bind an M in every Go exported function, and drop the M in the destructor to avoid leaking it.

ianlancetaylor · 2022-03-17T04:06:08Z

> Does it need to create a new g? Maybe using the g0 would be a better choice, as it does now.

Yes, that is the right thing to do.

Your set of steps sounds basically right.

doujiang24 · 2022-03-18T13:03:29Z

Okay, stepping out of the pre-bind rabbit hole, maybe changing step 2 to the following is simpler (or what you expected)?
2. Skip dropm when a destructor is registered.

aclements · 2022-03-18T13:25:52Z

I'm not sure I completely understand what you mean, but I think that's the right direction. cgocallback already calls needm if the g TLS isn't set, so it's probably easiest to let it keep doing that, rather than moving responsibility for that to _cgo_wait_runtime_init_done, and just leave that g/m set. That also means we don't need to access this new pthread_key's value from Go; we're only using it for its destructor.

_cgo_wait_runtime_init_done might be a good place to ensure the pthread_key is set to a non-NULL value for that thread (otherwise the destructor won't be called), and possibly a good place to ensure the pthread_key has been created in the first place.

Creating the m in _cgo_wait_runtime_init_done would probably work, but it's sort of on the wrong side of the language divide.

doujiang24 · 2022-03-21T02:39:32Z

Yeah, I mean keep needm in cgocallback, and skip dropm when a destructor is registered by pthread_key_create, since I noticed the following comment on dropm in the source code.

// We may have to keep the current version on systems with cgo
// but without pthreads, like Windows.

> _cgo_wait_runtime_init_done might be a good place to ensure the pthread_key is set to a non-NULL value for that thread

Yeah, this sounds better than x_cgo_sys_thread_create. I will give it a try. Thanks.

doujiang24 · 2022-03-21T11:56:37Z

I have implemented the new approach in CL 387415.
Please take a look and let me know if it's the right direction. If so, I'll continue to improve it.
Any feedback is welcome, thanks.

In CL 387415, we introduced two variables:

x_cgo_pthread_key_created indicates whether we have registered the destructor,
x_cgo_dropm stores the address of the cgodropm function, since I found it hard to import cgodropm from Go into gcc_libinit.c.

gopherbot · 2022-05-30T08:05:35Z

Change https://go.dev/cl/392854 mentions this issue: runtime/cgo: store M for C-created thread in pthread key

thepudds · 2023-03-02T17:47:38Z

Some time ago, I had briefly looked into whether an equivalent solution might be possible on Windows.

FWIW, some people seem to suggest that Fiber Local Storage functions on Windows could provide a destructor call back on thread exit, even if not using Fibers.

It looks like FlsAlloc takes an FlsCallback that is called at thread exit (and fiber deletion):

PFLS_CALLBACK_FUNCTION callback function
An application-defined function. If the FLS slot is in use, FlsCallback is called on fiber deletion, thread exit, and when an FLS index is freed. Specify this function when calling the FlsAlloc function.

Some more from another piece of the Fiber documentation, including how FLS is treated if no fiber switching has happened:

Fiber Local Storage
A fiber can use fiber local storage (FLS) to create a unique copy of a variable for each fiber. If no fiber switching occurs, FLS acts exactly the same as thread local storage. The FLS functions (FlsAlloc, FlsFree, FlsGetValue, and FlsSetValue) manipulate the FLS associated with the current thread. If the thread is executing a fiber and the fiber is switched, the FLS is also switched.

There is also a related discussion in the MSDN forums here about trying to emulate a pthread_key_create destructor on Windows:

https://social.msdn.microsoft.com/Forums/windowsdesktop/en-US/043b1f9f-47f2-4905-a6ab-89c8c6172c28/thread-local-storage?forum=windowssdk

In that discussion, user deltamind106 (who has 5 MSDN karma points) suggests that Fibers are not an appropriate solution. However, user HomeCloset (who has ~6,000 MSDN karma points) contradicts them and seems to suggest Fibers are useful for similar destructor functionality as pthread_key_create... but the discussion is perhaps open to interpretation. There are some other possible approaches that are also discussed there.

For Fiber Local Storage, one wrinkle would be if the user code is itself using fibers and for example deletes the fiber of interest. At that point, as I understand it the destructor would be called before the thread exited, but maybe things could be set up in a way so that scenario is just a performance hit (that is, ~similar performance as happens today) where the M and whatever other resources are released "early" compared to if the fiber hadn't been deleted? And if the fiber is not deleted, then thread exit still properly releases things, which avoids a leak. Maybe?

In any event, this might not work, and please take with a large grain of salt, but I wanted to at least leave a note here in case it is helpful for any future work after the (exciting!) non-Windows version lands.

A comparison instruction was missing in CL 392854. Should fix ARM builders.

For #51676.

Change-Id: Ica27a99be10e595bab4fad35e2e6c00a1c68a662
Reviewed-on: https://go-review.googlesource.com/c/go/+/479255
TryBot-Bypass: Cherry Mui <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
Run-TryBot: Cherry Mui <[email protected]>

gopherbot · 2023-03-24T18:23:26Z

Change https://go.dev/cl/479255 mentions this issue: runtime: fix ARM assembly code in cgocallback

gopherbot · 2023-03-31T19:53:56Z

Change https://go.dev/cl/481061 mentions this issue: runtime/cgo: store M for C-created thread in pthread key

This reapplies CL 392854, with the followup fixes in CL 479255, CL 479915, and CL 481057 incorporated.

CL 392854, by doujiang24 <[email protected]>, speed up C to Go calls by binding the M to the C thread. See below for its description.
CL 479255 is a followup fix for a small bug in ARM assembly code.
CL 479915 is another followup fix to address C to Go calls after the C code uses some stack, but that CL is also buggy.
CL 481057, by Michael Knyszek, is a followup fix for a memory leak bug of CL 479915.

[Original CL 392854 description]

In a C thread, it's necessary to acquire an extra M by using needm while invoking a Go function from C. But, needm and dropm are heavy costs due to the signal-related syscalls. So, we change to not dropm while returning back to C, which means binding the extra M to the C thread until it exits, to avoid needm and dropm on each C to Go call. Instead, we only dropm while the C thread exits, so the extra M won't leak.

When invoking a Go function from C:
Allocate a pthread variable using pthread_key_create, only once per shared object, and register a thread-exit-time destructor.
And store the g0 of the current m into the thread-specified value of the pthread key, only once per C thread, so that the destructor will put the extra M back onto the extra M list while the C thread exits.

When returning back to C:
Skip dropm in cgocallback, when the pthread variable has been created, so that the extra M will be reused the next time invoke a Go function from C.

This is purely a performance optimization. The old version, in which needm & dropm happen on each cgo call, is still correct too, and we have to keep the old version on systems with cgo but without pthreads, like Windows.

This optimization is significant, and the specific value depends on the OS system and CPU, but in general, it can be considered as 10x faster, for a simple Go function call from a C thread.

For the newly added BenchmarkCGoInCThread, some benchmark results:

1. it's 28x faster, from 3395 ns/op to 121 ns/op, in darwin OS & Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
2. it's 6.5x faster, from 1495 ns/op to 230 ns/op, in Linux OS & Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz

[CL 479915 description]

Currently, when C calls into Go the first time, we grab an M using needm, which sets m.g0's stack bounds using the SP. We don't know how big the stack is, so we simply assume 32K. Previously, when the Go function returns to C, we drop the M, and the next time C calls into Go, we put a new stack bound on the g0 based on the current SP. After CL 392854, we don't drop the M, and the next time C calls into Go, we reuse the same g0, without recomputing the stack bounds. If the C code uses quite a bit of stack space before calling into Go, the SP may be well below the 32K stack bound we assumed, so the runtime thinks the g0 stack overflows.

This CL makes needm get a more accurate stack bound from pthread. (In some platforms this may still be a guess as we don't know exactly where we are in the C stack), but it is probably better than simply assuming 32K.

Fixes #51676.
Fixes #59294.

Change-Id: I9bf1400106d5c08ce621d2ed1df3a2d9e3f55494
Reviewed-on: https://go-review.googlesource.com/c/go/+/481061
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Cherry Mui <[email protected]>
Reviewed-by: DeJiang Zhu (doujiang) <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>

gopherbot · 2023-04-17T18:33:14Z

Change https://go.dev/cl/485275 mentions this issue: Revert "runtime/cgo: store M for C-created thread in pthread key"

This reverts CL 481061.

Reason for revert: When built with C TSAN, x_cgo_getstackbound triggers race detection on `g->stacklo` because the synchronization is in Go, which isn't instrumented.

For #51676.
For #59294.
For #59678.

Change-Id: I38afcda9fcffd6537582a39a5214bc23dc147d47
Reviewed-on: https://go-review.googlesource.com/c/go/+/485275
TryBot-Result: Gopher Robot <[email protected]>
Auto-Submit: Michael Pratt <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
Reviewed-by: Than McIntosh <[email protected]>

ianlancetaylor · 2023-04-17T21:19:04Z

Changes were reverted again, so reopening this issue.

gopherbot · 2023-04-17T21:45:36Z

Change https://go.dev/cl/485500 mentions this issue: runtime/cgo: store M for C-created thread in pthread key

aclements · 2023-05-05T17:03:10Z

Reverted again in CL 492995. 😭

gopherbot · 2023-05-17T16:18:09Z

Change https://go.dev/cl/495855 mentions this issue: runtime/cgo: store M for C-created thread in pthread key

gopherbot · 2023-05-31T21:17:10Z

Change https://go.dev/cl/499716 mentions this issue: doc/go1.21: mention improvement to C-to-Go calls

For #51676.
For #58645.

Change-Id: I9045051b5a25c6dfc833eef13e6c105a0d8ae763
Reviewed-on: https://go-review.googlesource.com/c/go/+/499716
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>

doujiang24 added the Proposal label Mar 15, 2022

gopherbot added this to the Proposal milestone Mar 15, 2022

doujiang24 mentioned this issue Mar 15, 2022

runtime/cgo: store M for C-created thread in pthread key #51679

Closed

ianlancetaylor changed the title ~~proposal: cgo: add PreBindExtraM to reduce signal syscall.~~ proposal: cmd/cgo: add PreBindExtraM to reduce signal syscall Mar 15, 2022

ianlancetaylor changed the title ~~proposal: cmd/cgo: add PreBindExtraM to reduce signal syscall~~ runtime: preserve extra M across calls from C to Go Mar 16, 2022

ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 16, 2022

ianlancetaylor modified the milestones: Proposal, Backlog Mar 16, 2022

ianlancetaylor added help wanted and removed Proposal labels Mar 16, 2022

gopherbot closed this as completed in ef0dedc Mar 24, 2023

ianlancetaylor reopened this Apr 17, 2023

gopherbot closed this as completed in 7b87461 Apr 26, 2023

dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 2, 2023

dmitshur modified the milestones: Backlog, Go1.21 May 2, 2023

aclements reopened this May 5, 2023

gopherbot closed this as completed in c426c87 May 17, 2023

golang locked and limited conversation to collaborators May 30, 2024

gopherbot added the FrozenDueToAge label May 30, 2024
