
runtime: non-spinning Ms spin uselessly when work exists #43997

Closed

prattmic opened this issue Jan 29, 2021 · 15 comments
@prattmic
Member

prattmic commented Jan 29, 2021

Background: The scheduler differentiates "spinning" and "non-spinning" Ms. We will allow up to GOMAXPROCS/2 Ms to enter the spinning state (assuming all Ps are busy).

Spinning Ms are eligible to steal runnable goroutines or timers from other Ps and will loop several times over all Ps attempting to steal before giving up and stopping.

For work conservation, the scheduler ensures that there is at least one spinning thread after readying a goroutine. To support this, spinning Ms adhere to a contract: any work submitted when sched.nmspinning > 0 is guaranteed to be discovered.

To maintain this invariant, spinning Ms must:

  • If work is found and this is the last spinning M, wake another M to spin.
  • If work is not found and we are going to stop, perform a “delicate dance” to transition to non-spinning state. This means decrementing sched.nmspinning and then re-checking all Ps for new work to steal in case any was submitted since the earlier steal attempt.
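
In pseudocode, the contract looks roughly like this. This is a minimal editorial sketch, not the runtime's actual code: nmspinning, stealWork, workExists, wakeM, and stopM are simplified stand-ins for sched.nmspinning and the corresponding runtime machinery.

package sched

import "sync/atomic"

// Toy stand-ins for the real runtime machinery.
var nmspinning atomic.Int32 // models sched.nmspinning

func stealWork() bool  { return false } // try to steal from all Ps
func workExists() bool { return false } // recheck all Ps for new work
func wakeM()           {}               // wake another M to spin
func stopM()           {}               // park this M

// spinningM models the contract: work submitted while nmspinning > 0
// must be discovered by some spinning M.
func spinningM() {
	for {
		if stealWork() {
			// Found work. If this was the last spinning M, wake a
			// replacement so the invariant continues to hold.
			if nmspinning.Add(-1) == 0 {
				wakeM()
			}
			return // go run the stolen work
		}
		// The "delicate dance": leave the spinning state first,
		// then recheck, because a submitter that saw nmspinning > 0
		// was relying on some spinning M to notice its work.
		nmspinning.Add(-1)
		if workExists() {
			nmspinning.Add(1) // become spinning again and retry
			continue
		}
		stopM() // safe to sleep; future submitters will wake an M
		return
	}
}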

This is all good for spinning Ms, but there is a bad case for non-spinning Ms.

If there are already enough spinning Ms, the remaining Ms will skip the main steal loop and go “directly” to stop. However, after the stop label comes the steal recheck, which is primarily intended for the spinning -> non-spinning transition but is executed by all Ms.

A non-spinning M can detect work in this loop and jump back to the top to try to steal it (we don’t steal directly in this loop, to avoid duplicating the somewhat complex stealing logic). Unfortunately, this is a non-spinning M, so assuming the other spinning Ms don’t go away it will remain non-spinning and skip right over the steal loop back to stop, at which point it will notice the same work again, jump back to the top, and start all over again.
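
Reduced to a schematic (reusing the stand-ins from the sketch above; the real findrunnable is far larger), the problematic control flow looks like:

func findrunnableSketch(gomaxprocs int32) {
	spinning := false
top:
	// At most GOMAXPROCS/2 Ms may enter the spinning state.
	if !spinning && 2*nmspinning.Load() < gomaxprocs {
		spinning = true
		nmspinning.Add(1)
	}
	if spinning && stealWork() {
		return // found work
	}
	// The "stop" label: reached by spinning Ms that struck out, but
	// also by non-spinning Ms that skipped the steal loop entirely.
	if spinning {
		spinning = false
		nmspinning.Add(-1) // delicate-dance decrement
	}
	if workExists() {
		// A non-spinning M jumps back to top, finds the spinning
		// slots still full, skips stealing, lands here again, sees
		// the same work... a busy loop until a spinning M drains
		// the work or a spinning slot frees up.
		goto top
	}
	stopM()
}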

Note that this can only occur while there is at least one spinning M, so the behavior is bounded: either a spinning M will take the work itself, or it will take some other work (freeing a spinning slot) so that the non-spinning M can transition to spinning and take the work itself.

I argue that the current behavior is nearly always sub-optimal:

  • If there are <= sched.nmspinning total available units of work, then the spinning Ms will take all of the work and the non-spinning one will have to ultimately sleep anyways.
    • Caveat: of course new work can be submitted at any time, so having an extra quasi-spinning M that can jump in as spinning could reduce latency.
    • OTOH, because these “quasi-spinning” Ms aren’t tracked by the scheduler, a spinning M may needlessly wake another spinning M to replace itself even though a “quasi-spinning” M could just take over.
  • If there are > sched.nmspinning total available units of work, then the non-spinning M should eventually get some work. However, the current approach needlessly delays stealing until the existing spinning Ms have found work. I think it would be better if the non-spinning Ms could upgrade themselves to spinning (in excess of the normal GOMAXPROCS/2 limit) if they notice that there is extra work to do, as this would reduce scheduling latency without increasing work (remember that we are basically in a busy-loop anyways).
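
A hedged sketch of that upgrade idea, again in terms of the stand-ins above (this is the suggestion, not shipped code):

// During the stop-path recheck, let a non-spinning M promote itself
// to spinning instead of looping back unchanged.
func recheckSketch(spinning *bool) bool {
	if !workExists() {
		return false // nothing to do; park
	}
	if !*spinning {
		*spinning = true
		nmspinning.Add(1) // deliberately allowed to exceed GOMAXPROCS/2
	}
	return true // retry the steal loop, now as a spinning M
}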

I discovered this behavior on a Google-internal workload which is basically the worst case scenario for this: one M/P/G is running continuously generating very short-lived work that must be stolen by other Ms. The other Ms spend most of their time in the scheduler looking for work. A simple change to make non-spinning Ms skip the steal recheck reduces application wall time by 12% and CPU time by 23%. I hope to create a similar open source reproducer soon.
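
In terms of the schematic above, that simple change amounts to something like the following (sketch only; the open-source prototype is in the CL discussed below):

func stopPathSketch(spinning bool) {
	if !spinning {
		stopM() // non-spinning M: skip the recheck entirely and park
		return
	}
	// ... existing delicate-dance recheck for spinning Ms ...
}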

Note that this behavior has existed since the introduction of spinning Ms, so this is not a new regression.

Something that is not directly related, but perhaps worth investigating is whether we should even do all of the “delicate dance” checks on all spinning Ms, as it should not affect correctness to limit it to only the 1->0 nmspinning transition, which would further reduce usage.
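
Concretely, that might look like the following sketch (as discussed later in this thread, making this safe needs more care):

func stopSpinningSketch() {
	if nmspinning.Add(-1) > 0 {
		stopM() // other spinners remain to uphold the invariant
		return
	}
	// This was the 1->0 transition: only now re-run the rechecks.
	if workExists() {
		nmspinning.Add(1)
		return // retry the steal loop as a spinning M
	}
	stopM()
}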

cc @aclements @mknyszek

See also #18237, #28808

@prattmic added the Performance and NeedsDecision labels Jan 29, 2021
@prattmic added this to the Go1.17 milestone Jan 29, 2021
@prattmic self-assigned this Jan 29, 2021
@ianlancetaylor
Contributor

CC @dvyukov

@gopherbot
Contributor

Change https://golang.org/cl/288232 mentions this issue: runtime: only recheck for wasSpinning

@prattmic
Member Author

prattmic commented Jan 29, 2021

For reference, here's an application reproducer similar to the one with which I diagnosed this issue (though quite a bit noisier):

  1. Build the program from High CPU usage in scheduler google/gvisor#2033 (comment):

$ gcc -pthread -o ./pthread ./pthread.c

  2. Check out https://github.com/google/gvisor. I built at 25284ae3c9b5f8d96686ba5b18b53819a961d34d, but the specific version shouldn't matter much.

  3. Edit WORKSPACE to use a local Go toolchain, like so.

  4. Build and run:

$ bazel build //runsc       
INFO: Analyzed target //runsc:runsc (0 packages loaded, 0 targets configured).
INFO: Found 1 target...       
Target //runsc:runsc up-to-date:
  bazel-out/k8-fastbuild-ST-4c64f0b3d5c7/bin/runsc/runsc_/runsc
INFO: Elapsed time: 27.295s, Critical Path: 21.24s
INFO: 1246 processes: 1 internal, 1245 linux-sandbox.
INFO: Build completed successfully, 1246 total actions
$ sudo time -p bazel-out/k8-fastbuild-ST-4c64f0b3d5c7/bin/runsc/runsc_/runsc do ~/Downloads/pthread 10 100000
real 8.53
user 6.67
sys 14.13

https://golang.org/cl/288232 is a basic prototype that takes aim at the first sub-optimal bullet above by skipping the recheck entirely for non-spinning Ms, allowing them to stop. Before and after that CL, I get:

name        old s      new s      delta
WallTime    12.3 ±63%  12.9 ±22%     ~     (p=0.832 n=30+30)
SystemTime  21.4 ±67%  22.6 ±23%     ~     (p=0.775 n=30+30)
UserTime    9.75 ±38%  8.14 ±24%  -16.45%  (p=0.001 n=27+30)
CpuTime     30.5 ±68%  30.7 ±23%     ~     (p=0.458 n=30+30)

This particular benchmark is unfortunately much noisier than my internal benchmark, but it shows promise (CpuTime samples are the sum of the SystemTime and UserTime samples, so UserTime shouldn't really be able to drop while both SystemTime and CpuTime stay flat).

For reference, my internal workload for the same change:

name        old s      new s      delta
WallTime    1.15 ± 4%  1.00 ± 2%  -12.86%  (p=0.008 n=5+5)
SystemTime  0.43 ±10%  0.36 ± 8%  -16.99%  (p=0.008 n=5+5)
UserTime    4.65 ± 4%  3.32 ± 5%  -28.46%  (p=0.008 n=5+5)
CPUTime     5.08 ± 5%  3.68 ± 5%  -27.49%  (p=0.008 n=5+5)

@dvyukov
Member

dvyukov commented Feb 1, 2021

Interesting find! It's indeed very old and has been present in the scheduler since the rewrite. I guess I added this work-conservation check during the late fine-tuning stage and failed to properly integrate it with the overall logic.

I agree the current spinning behavior is bad, and https://golang.org/cl/288232 looks reasonable to me.

Something that is not directly related, but perhaps worth investigating is whether we should do all of the “delicate dance” checks even on all spinning Ms, as it should not affect correctness to limit it to only the 1->0 nmspinning transition and this would further reduce usage.

I think it should work. The invariant the scheduler maintains is that there is at least 1 spinning M if there is any work, and that that 1 spinning M will wake up more if necessary. Since spinning Ms already checked the work queues of all Ps during the steal phase, it looks reasonable not to re-do that check for all but the last spinning M. The "delicate dance" re-checking is there only to prevent bad edge cases of unrealized parallelism; it's not intended to be part of the performance story.

@prattmic
Member Author

prattmic commented Feb 1, 2021

Thanks for the extra background Dmitry!

For posterity, here is what extreme cases of this can look like in the trace viewer. The pink regions are clumps of proc stop/start events where the non-spinning M is busy looping in findrunnable.

[trace viewer screenshot]

@ChrisHines
Contributor

For what it's worth, I also observed this behavior of the scheduler while working on the fix for #38860 and thought it was curious, but didn't worry about it since it was not the focus of my work at the time. That said, as the submitter of #28808 I welcome effort to improve the scheduler in this area.

@gopherbot
Contributor

Change https://golang.org/cl/307914 mentions this issue: runtime: refactor findrunnable spinning recheck

@gopherbot
Contributor

Change https://golang.org/cl/307913 mentions this issue: runtime: refactor work stealing to dedicated function

@gopherbot
Contributor

Change https://golang.org/cl/307910 mentions this issue: runtime: move delta computation closer to use

@gopherbot
Contributor

Change https://golang.org/cl/307911 mentions this issue: runtime: remove redundant tryWakeP component

@gopherbot
Contributor

Change https://golang.org/cl/307912 mentions this issue: runtime: clarify which work needs spinning coordination

@gopherbot
Contributor

Change https://golang.org/cl/308654 mentions this issue: runtime: move timer recheck after GC recheck

@gopherbot
Contributor

Change https://golang.org/cl/310850 mentions this issue: runtime: skip work recheck for non-spinning Ms

gopherbot pushed a commit that referenced this issue Apr 16, 2021
findrunnable has a couple places where delta is recomputed from a new
pollUntil value. This proves to be a pain in refactoring, as it is easy
to forget to do properly.

Move computation of delta closer to its use, where it is more logical
anyways.

This CL should have no functional changes.

For #43997.
For #44313.

Change-Id: I89980fd7f40f8a4c56c7540cae03ff99e12e1422
Reviewed-on: https://go-review.googlesource.com/c/go/+/307910
Trust: Michael Pratt <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
gopherbot pushed a commit that referenced this issue Apr 16, 2021
Here tryWakeP can't already be true, so there is no need to combine the
values.

This CL should have no functional changes.

For #43997.
For #44313.

Change-Id: I640c7bb88a5f70c8d22f89f0b5b146b3f60c0136
Reviewed-on: https://go-review.googlesource.com/c/go/+/307911
Trust: Michael Pratt <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
gopherbot pushed a commit that referenced this issue Apr 16, 2021
The overview comments discuss readying goroutines, which is the most
common source of work, but timers and idle-priority GC work also require
the same synchronization w.r.t. spinning Ms.

This CL should have no functional changes.

For #43997
Updates #44313

Change-Id: I7910a7f93764dde07c3ed63666277eb832bf8299
Reviewed-on: https://go-review.googlesource.com/c/go/+/307912
Trust: Michael Pratt <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
gopherbot pushed a commit that referenced this issue Apr 20, 2021
findrunnable has grown very large and hard to follow over the years.
Parts we can split out into logical chunks should help make it more
understandable and easier to change in the future.

The work stealing loop is one such big chunk that is fairly trivial to
split out into its own function, and even has the advantage of
simplifying control flow by removing a goto around work stealing.

This CL should have no functional changes.

For #43997.
For #44313.

Change-Id: Ie69670c7bc60bd6c114e860184918717829adb22
Reviewed-on: https://go-review.googlesource.com/c/go/+/307913
Trust: Michael Pratt <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Chris Hines <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
gopherbot pushed a commit that referenced this issue Apr 20, 2021
Break the main components of the findrunnable spinning -> non-spinning
recheck out into their own functions, which simplifies both findrunnable
and the new functions, which can make use of fancy features like early
returns.

This CL should have no functional changes.

For #43997
For #44313

Change-Id: I6d3060fcecda9920a3471ff338f73d53b1d848a3
Reviewed-on: https://go-review.googlesource.com/c/go/+/307914
Trust: Michael Pratt <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
gopherbot pushed a commit that referenced this issue Apr 21, 2021
When rechecking for work after transitioning from a spinning to
non-spinning M, checking timers before GC isn't useful. That is, if
there is GC work available, it will run immediately and the updated
pollUntil is unused.

Move this check to just before netpoll, where pollUntil is used. While
this technically improves efficiency in the (rare) case that we find
GC work in this block, the primary motivation is simply to improve
clarity by moving the update closer to use.

For #43997

Change-Id: Ibc7fb308ac4a582875c200659c9e272121a89f3b
Reviewed-on: https://go-review.googlesource.com/c/go/+/308654
Trust: Michael Pratt <[email protected]>
Trust: Michael Knyszek <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
@golang deleted a comment from lubinszARM May 28, 2021
@golang locked as resolved and limited conversation to collaborators May 28, 2021
@prattmic
Member Author

prattmic commented Feb 28, 2022

Something that is not directly related, but perhaps worth investigating is whether we should even do all of the “delicate dance” checks on all spinning Ms, as it should not affect correctness to limit it to only the 1->0 nmspinning transition, which would further reduce usage.

One reason that this would not work today is that we can optimistically increment nmspinning without committing to actually starting a spinning M. e.g., wakep increments nmspinning and calls startm. But if startm doesn't find a P it will simply decrement nmspinning and return. If that was the 1->0 transition, it wouldn't be in a position to do the scheduler checks. We'd need to eliminate these optimistic cases to make this safe.
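
A sketch of that optimistic path, in the same toy terms as the sketches in the issue body (pidleget stands in for acquiring an idle P):

func wakepSketch() {
	if !nmspinning.CompareAndSwap(0, 1) {
		return // a spinning M already exists (or is being started)
	}
	startmSketch()
}

func startmSketch() {
	if !pidleget() {
		// No P: undo the optimistic increment and give up. If this
		// was the 1->0 transition, this M holds no P and is in no
		// position to run the delicate-dance rechecks.
		nmspinning.Add(-1)
		return
	}
	// ... hand the P to an M that starts out spinning ...
}

func pidleget() bool { return false } // stand-in: try to grab an idle P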

@gopherbot
Contributor

Change https://go.dev/cl/389014 mentions this issue: runtime: synchronize P wakeup and dropping Ps

@prattmic self-assigned this Jun 24, 2022
gopherbot pushed a commit that referenced this issue Aug 12, 2022
CL 310850 dropped work re-checks on non-spinning Ms to fix #43997.

This introduced a new race condition: a non-spinning M may drop its P
and then park at the same time a spinning M attempts to wake a P to
handle some new work. The spinning M fails to find an idle P (because
the non-spinning M hasn't quite made its P idle yet), and does nothing
assuming that the system is fully loaded. This results in loss of work
conservation. In the worst case we could have a complete deadlock if
injectglist fails to wake anything just as all Ps are going idle.

sched.needspinning adds new synchronization to cover this case. If work
submission fails to find a P, it sets needspinning to indicate that a
spinning M is required. When non-spinning Ms prepare to drop their P,
they check needspinning and abort going idle to become a spinning M
instead. This addresses the race without extra spurious wakeups. In the
normal (non-racing case), an M will become spinning via the normal path
and clear the flag.

injectglist must change in addition to wakep because it is a similar
form of work submission, notably used following netpoll at a point when
we might not have a P that would guarantee the work runs.

Fixes #45867

Change-Id: Ieb623a6d4162fb8c2be7b4ff8acdebcc3a0d69a8
Reviewed-on: https://go-review.googlesource.com/c/go/+/389014
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
Auto-Submit: Michael Pratt <[email protected]>
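
In outline, the needspinning handshake described in this commit looks something like the following hedged sketch (simplified names reusing the stand-ins from the sketches above, not the actual change; see https://go.dev/cl/389014 for the real code):

var needspinning atomic.Int32 // models sched.needspinning

// Submission side (wakep/injectglist): if no idle P can be found,
// record that a spinning M is still required instead of assuming the
// system is fully loaded.
func wakepWithNeedspinning() {
	if !nmspinning.CompareAndSwap(0, 1) {
		return
	}
	if !pidleget() {
		nmspinning.Add(-1)
		needspinning.Store(1) // a parking M must pick this up
		return
	}
	// ... start a spinning M for the P ...
}

// Parking side: a non-spinning M about to drop its P checks the flag
// and becomes the required spinning M instead of going idle.
func parkSketch() {
	if needspinning.CompareAndSwap(1, 0) {
		nmspinning.Add(1)
		return // retry the steal loop as a spinning M
	}
	// ... publish the P as idle, then park ...
}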