runtime: concurrent GC starves mutator and netpoller #16432
CC @RLH |
Given how much time is being spent in gcBgMarkWorker, I think this is actually unrelated to assists (or at least the gcBgMarkWorker is masking the assists). Given the duration of many of the gcBgMarkWorker runs and the amount of idle time shown outside of GC in the trace, I would guess that these are idle workers, which run for 10ms when invoked (which was probably a bad idea). @rhysh, could you try https://go-review.googlesource.com/c/24706/? That isn't really the "right" fix for this problem, but it should make it clear in the trace if this is the problem by chopping most of the gcBgMarkWorkers into much shorter windows, and it may help somewhat with latency. |
Hi @aclements, I tried CL 24706 PS 1. When I checked it out and built/ran my program, the program consumed a lot more CPU. An execution trace showed GC runs lasting for about 25ms with about 11ms of application execution between them. During the 1/3rd of the time for which the app was running, the user goroutines were busy doing sweep work. I cherry-picked CL 24706 PS 1 onto go1.7rc2, and it didn't seem to have the effect you describe in the commit message of preempting the gcBgMarkWorker every 1ms. It may well not apply to go1.7rc2 in a meaningful way, I haven't looked into the patches it's based on. Below is an execution trace of the program with go1.7rc2 with CL 24706 PS 1 cherry-picked on top. It doesn't exhibit the behavior of many Ps running gcBgMarkWorker (which sounds like it's not a bug in the first place?), and it also doesn't show the gcBgMarkWorker goroutines allowing themselves to be descheduled after 1ms of work. It still shows a decrease in the frequency of "Network" events, which (not pictured) still have a multi-ms delay (I've found up to 7ms) before the corresponding goroutines are scheduled, vs the <50µs delay typical when the GC is not running. |
Actually, that trace looks like CL 24706 is working just fine; it's just shifted the problem from long idle background marks to long assists (the work has to happen somewhere). I would expect 9 of the Ps to run a dedicated gcBgMarkWorker for the whole cycle, which probably agrees with the trace (clearly you need a bigger monitor :). What's weird (and what agrees with #14812) is that these assists take so long and that the pacer isn't pulling back the start time to reduce the CPU usage. Do you have the gctrace and gcpacertrace from the run with CL 24706? Even with an assist ratio of ~13 (which, you're right, means it will do about 13 million units of scan work per assist), I would only expect it to take ~10ms per assist. That's clearly still too long, but the trace is showing assists much longer than this. You mentioned that this is basically an RPC translator. Any idea how difficult it would be to distill this into a public benchmark that might exhibit a similar pattern of allocation and goroutine activation? |
I've run the program with go1.7rc3 with CL 24706 PS 1 cherry-picked on top. gctrace and gcpacertrace output are below, along with an execution trace zoomed out to show what all 36 Ps are doing. There are indeed 9 that run a gcBgMarkWorker goroutine. I think I'll be able to distill this into a public benchmark, though I don't yet know how the GC pacer works. Can you give me some direction on what sorts of stimuli cause the assist ratio to be high or low? The GC's behavior differs from what I'd expect in a few ways. Can you clarify what behavior I should expect?
I think that (1) is the root of my surprise. Is (4) what you mean by "the pacer pulling back"? Does the pacer divide the GC into two (or more) phases, during which it imposes different allocation taxes? Looking again at runtime.gcAssistAlloc:
I wonder if scanWork should be calculated differently. It looks like the goals the current code meets are:
I think the problems I'm seeing in my program are because:
I'd propose something like the following, possibly with reducing the value of gcOverAssistBytes.
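Concretely, here's a standalone sketch of the shape I have in mind (not the actual runtime code; the helper names and the 64<<10 floor are just illustrations of the idea, comparing the current byte over-assist to a scan-work floor):

```go
package main

import "fmt"

const (
	gcOverAssistBytes = 1 << 20  // current scheme: extra allocation debt, in bytes
	gcOverAssistWork  = 64 << 10 // proposed scheme: minimum assist, in scan-work units
)

// currentScanWork mirrors what go1.7rc3 does: always add a large byte
// over-assist to the debt, then convert the whole thing to scan work.
func currentScanWork(debtBytes int64, assistWorkPerByte float64) int64 {
	return int64(assistWorkPerByte * float64(debtBytes+gcOverAssistBytes))
}

// proposedScanWork converts only the real debt to scan work, then rounds it
// up to a small floor so the per-assist cost is still amortized.
func proposedScanWork(debtBytes int64, assistWorkPerByte float64) int64 {
	scanWork := int64(assistWorkPerByte * float64(debtBytes))
	if scanWork < gcOverAssistWork {
		scanWork = gcOverAssistWork
	}
	return scanWork
}

func main() {
	// A goroutine that just allocated 2KB, with an assist ratio of ~13.
	debt, ratio := int64(2<<10), 13.0
	fmt.Println("current: ", currentScanWork(debt, ratio))  // ~13.7 million units
	fmt.Println("proposed:", proposedScanWork(debt, ratio)) // 65536 units
}
```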
Would a change like that break other invariants or goals of the GC?
|
I tried out the change I proposed, and it seems to do the trick. Incoming network events continue at their usual rate during the concurrent GC, rather than becoming less frequent. The delay between incoming network events and the resumption of the goroutines they unblock is often several microseconds rather than several milliseconds. Goroutines that are scheduled during the concurrent GC often run for just a few hundred microseconds (I dialed down gcOverAssistBytes). I'll share an execution trace and some output of gctrace and gcpacertrace soon, and mail a CL. Are there any circumstances under which a fix for this would be part of go1.7? |
I mailed CL 25155. With that change (at PS 1) applied to go1.7rc3, the runtime execution trace of my program shows inbound network events continuing at the same rate whether or not a concurrent GC is executing. The delay between inbound network events and the execution of the goroutines they unblock is generally less than 10µs even when a garbage collection is in progress. With go1.7rc3 without the change, the delays are often near 10ms. The gctrace output shows that the heap grew more during the concurrent mark phase than with go1.7rc3. In the other gctrace outputs I shared, the GC cycles started when the goal was around 10MB away and completed after growing 1–2MB: the collection finished about 5–10x faster than expected. With this change, the gctrace output shows that the GC cycles start about 20MB before the goal and complete after about 10MB has been allocated: the collection finished only 2x faster than the pacer expected. Finally, the runtime execution trace shows the goroutine run queue growing and shrinking during the concurrent GC, rather than waiting until almost the end of the cycle before draining. It shows many goroutines each being scheduled for a few hundred microseconds, and yielding once their work is done to allow other goroutines to have some CPU time.
|
CL https://golang.org/cl/25155 mentions this issue. |
That's really interesting! I had tried a very similar change when I was looking at #14812: https://go-review.googlesource.com/c/24704/, but found that it didn't help. It's structured slightly differently, but should be equivalent except for two things: I only decreased the over-assist to 512k instead of 64k, and I always added the over-assist rather than doing min(over-assist, required work). Would you mind giving CL 24704 a try on your workload, perhaps both with the 512k over-assist and with the 64k over-assist? I want to make sure we really understand what's going on and what the effect of these changes is. |
It's not impossible. Code-wise it's pretty low risk. It clearly satisfies the invariants there. The only risk in my mind is that touching anything related to the pacer can have surprising consequences, but even there I think this is pretty low risk. I'd like to take a long stare at it and have at least one other person do so, too.
Is that true of all allocations? Given what's currently on master, I would expect some allocations to be very expensive, but for the mean cost of allocation to be only slightly higher when GC is running.
There's a fairness issue in the way assists currently work that's probably causing this. If an assist isn't able to fully satisfy itself, it goes on to the assist queue. But an assist that starts later may find more mark work and satisfy itself right away, even though there are older unsatisfied assists on the queue. Since only the background mark worker can satisfy queued assists, if it's not getting enough time because every P is busy doing assists, those queued assists just sit there. This is a lot like very bad lock barging. The result is that you get a pile of blocked assists, and there's a huge spike in runnable goroutines when those get unblocked at GC termination. This is on my list of things to fix for 1.8. My plan is to effectively queue all assists and do all mark work in the background mark worker. This will make the queuing fair, centralize all mark work in one place in the code, and make it possible to satisfy an assist even in the middle of an indivisible scanning block (which will, in turn, make it more practical to count root scanning work towards assists, which will help satisfy assists at the beginning of the GC cycle).
No, actually. What I meant is that the high CPU utilization over the whole GC cycle should cause the pacer to start the next cycle sooner. But you mentioned later that it's also finishing 5–10x sooner than expected, which is probably why this isn't happening.
Your proposed change is fine invariant-wise. The only requirement is that a goroutine is not allowed to return from the assist until gcAssistBytes is non-negative. And using the min(over-assist, required work) should accomplish the desired amortization just as well as (if not better than) the current approach of always adding the over-assist. |
I forgot (not enough coffee yet!): I wrote a little self-contained RPC-like benchmark yesterday that I think exhibits equivalent symptoms. It's at https://github.com/aclements/go-gcbench/blob/master/gcbench/progs/rpc.go. I run it with GOMAXPROCS set to half my CPUs (since the client and server are separate processes) and fiddle with -reqs-per-sec until I get about 1 GC/sec, which is about -reqs-per-sec 5000 on my laptop with GOMAXPROCS=2. I haven't tried any runtime changes with it yet. |
Trying with my RPC benchmark, I get the following client-observed latency tails:
P50 is 50–100µs in all cases. So, just setting over-assist to 64k helps but isn't sufficient, and just doing over-assist in terms of scan work instead of allocation bytes (cl/24704/1) also isn't sufficient. We need both. And it looks like your CL (25155) behaves very similarly to my CL (24704) with over-assist turned down to 64k, which is good, because it should. :) I think we should move forward with your CL (I'll suggest a few mostly documentation things to pull over from 24704). I'm going to run the benchmark battery against it. And I'll see what other people think of putting this in 1.7. |
I cherry-picked CL 24704 PS 1 onto go1.7rc3, and it helps my application. Objectively, the delay between inbound network events and the execution of the goroutines they unblock is reduced to only a few microseconds, down from several milliseconds. Subjectively, the frequency of inbound network traffic is still lower than when the GC isn't running, but it's not as much of a decrease as with go1.7rc3. With go1.7rc3 plus cherry-pick of CL 24704 PS 1 plus reducing the over-assist to 64k, the inbound network activity continues at just about the same rate whether or not the GC is running. This is somewhat subjective, so we might be in the realm of tuning here.
As far as I can tell, it's not true of all allocations—just the ones that happen early in the cycle. Specifically, it looks with go1.7rc3 like allocations near the beginning of concurrent mark are very expensive. There are some goroutines that usually run for a few hundred microseconds, but if they're scheduled at the beginning of concurrent mark then they'll end up taking 10ms to do their work. I don't know what would change for those goroutines other than the per-byte cost of allocation. Because the allocations early in the cycle cost so much, there isn't much time left for less expensive ones. It's possible that these average out to something reasonable, but the 10ms penalty for some goroutines early in the cycle is surprising, and doesn't seem to be in the spirit of a low-latency garbage collector. It seems in practice like each garbage collection ends up with three STW phases, with the new and unfortunate "gcAssistAlloc" phase being the longest by far (around 10ms, vs around 1–2ms for mark termination and 100–200µs for sweep termination). Each P will run goroutines (quickly) until it runs a goroutine that allocates, so it's easy for all Ps to end up busy like this, with no progress on interactive user code.
In preparing CL 25155, I tried an over-assist of 1<<20 (as in master) and 64<<10 (as in PS 2). I picked 64<<10 because I remember you saying that the GC can scan about 1MB per second, and some old recommendations (from Dmitry Vyukov?) about splitting work into pieces that can be done in about 100µs. Should the new over-assist figure be calculated more scientifically?
Here's what the "Network" row looks like during GCs for a few flavors of the Go runtime:
go1.7rc3: GC is around t=70–100ms. Note how network activity nearly stops during this time.
go1.7rc3 with cherry-pick of cl/24704/1: GC is around t=270–310ms.
go1.7rc3 with cherry-pick of cl/24704/1 plus reducing the over-assist to 64k: GC is around t=340–370ms. |
Great. It sounds like my RPC benchmark was getting similar results to the real thing.
Yes, these are the allocations I would expect to be more expensive, since nobody's built up credit at that point (Rick and I tried lots of "accounting tricks" to spread this out, but invariably mutators just blew through any slack we gave them and ran into the same problem). But you have an interesting point about these consuming so much of the cycle that there isn't much time to bring the average back down.
1MB per millisecond. :) But, yes, this is about right and splitting work up into 100µs chunks is generally pretty reasonable.
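To put those sizes in perspective (rough arithmetic, not a measurement): at ~1MB of scan work per millisecond, an over-assist of 64k expressed in scan work is roughly 64µs of assisting per assist, and 512k is roughly half a millisecond; the current 1MB byte over-assist, multiplied by an assist ratio of ~13, is on the order of 13MB of scanning, which is where the ~10ms assists come from.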
I tried 16k, 32k, and 128k with my RPC benchmark and the differences were in the noise. For Go 1.8, I'm considering tying assists to span acquisition, which will amortize the assist cost using a completely different mechanism. If that works, the over-assist knob will be gone anyway. |
Yes. I ran your RPC benchmark (replacing the "golang.org/x/benchmarks/gcbench.newer/heapgen" import with "github.com/aclements/go-gcbench/gcbench/heapgen") and its behavior is similar to what I see in my application. Its behavior also reopens one of my initial questions in this issue: is it right for the GC to consume all available resources at the expense of mutator latency? When run with go1.7rc3, the RPC benchmark shows a few Ps making very slow progress on a few Gs, as the run queue grows. After the GC, there's a flurry of mutator activity as the run queue drains.
When run with cl/25155/3, the execution trace shows all Ps running their gcBgMarkWorker for about 10ms, then a flurry of mutator activity, then another 10–20ms of all Ps running gcBgMarkWorker.
cl/25155/3 is very helpful for my production application, I'd love to see it merged for Go 1.7.
Whoops, yes! Work segments of 100µs sound good. I'm surprised that the gcBgMarkWorker goroutines are able to consume all Ps for 10ms at a time—and possibly longer, the cl/25155/3 screenshot at 1880–1890ms shows a non-zero run queue while all 4 Ps continue to run their gcBgMarkWorkers for more than 10ms. I know you have CL 24706 to change this to 1ms, which should help. A delay of 1ms is much less noticeable in most apps than one of 10ms, but it's still a constant delay that's not under the user's control. The runtime reserves 25% of CPU time for the GC; does it make sense to reserve 25% for the application, preventing the GC from using more than 75%? Here's the execution trace of cl/25155/3 with cl/24706/1 cherry-picked on top. Of the four Ps, three of them alternate between running gcBgMarkWorkers for short periods, and running mutator code. It attenuates the latency mode near 10ms by a factor of around 100–1000, all but eliminating it (although there's now a similarly-shaped lump in the response time population near 1ms).
|
Whoops! goimports gone wrong. I pushed a fix for this (and deleted the old gcbench.newer so it doesn't happen again). I also added printing of the 99th and 99.9th percentile numbers.
Well, it's not supposed to do this, but when it does, it doesn't have much choice. If there's any strict limit on GC CPU utilization, then allocation can outpace the garbage collector, leading to overshooting the goal heap size in the best case and never terminating GC in the worst case (until you OOM). It's supposed to prevent this by starting the GC cycle early enough that it can spread out the work over enough time, but for complicated reasons that I think I understand but haven't figured out how to fix yet, this doesn't work very well in situations with really high allocation rates.
Rick and I have agreed that it seems low risk. He's going to take a close look at it tomorrow morning.
What you're seeing are the idle mark workers kicking in because there's a moment on each P when it's not running anything. Unfortunately, these currently run for 10ms when they run, which was a bad design decision (expedient at the time; I didn't realize it would have this consequence). The 1ms preemption was a stop-gap experiment, which, oddly, didn't help the RPC application we were looking at at the time. But it's certainly not the right solution for the idle worker (it may be okay for the fractional worker). On my list for Go 1.8 is to figure out how to get the idle mark worker to really behave like an idle-priority task and give up the P the moment there's anything else to do. |
I may have been too imprecise. What I had in mind was a limit on the CPU usage of the idle workers. It seems like the assist is what ensures that the mutator doesn't allocate too quickly, which means that mechanism would have to stay in place. I was curious as to whether the runtime could prevent more than 3/4ths of the Ps from concurrently running their idle mark workers—separate from the mandatory assists. Your plan to preempt the idle mark worker once there's other work to do sounds much better all around.
Great! I'll try to be on-hand for edits. |
Oh, I see. Yes, that could work, though it may not work so well with GOMAXPROCS=1. One possible approach to preempting the idle worker as soon as there's work is to only run the idle worker on N-1 of the idle Ps, keeping one P truly idle. If that one gets woken up to do work, it would actively halt one of the idle workers to make another P truly idle. |
On a related note, I always wondered why GC credit/debit is per-g rather than per-p. What's the reasoning behind that? Have we tried to make it per-p? There are obviously pros and cons. But per-p looks more reasonable to me. Is it the case that per-g credit can be left unused for lots of goroutines? Or there is something that prevents that? |
It's now been pushed to master and should get cherry-picked in the last batch before the release. We're keeping a close eye on it and running internal tests, which I should have the results of tomorrow.
I don't think it would be bad to do it per P. The reason we do it per G is that the time a goroutine spends helping the garbage collector is supposed to be proportional to its own allocation rate, so that the cost of assisting roughly balances across the bytes you allocate. Whether or not this actually matters, I'm really not sure. Doing it per P would still maintain the really important invariant that allocation not outpace scanning. I have been considering moving the assist to when a span is acquired from the mcentral to reduce the overhead to the hot allocation path. In that case, it may make more sense to track it per P, along with the mcache. If I also change assists to always queue if there's not enough global credit (and not do any scan work of their own), then there might not be any need to track assist debt at all, per P or per G.
Currently, left over credit is recycled along with the g struct. This seems to work fine in practice, but of course, if lots of goroutines build up spare credit and then exit and new goroutines don't replace the exited goroutines, we'll lose track of this built-up credit. A while ago I wrote https://go-review.googlesource.com/c/24703/ to flush this credit to the global credit pool on goroutine exit, but it didn't seem to have any effect, so it never got merged. |
That's great news, thank you! Its behavior looks good in the few services I operate. I still see that nearly all Ps can end up running their gcBgMarkWorker goroutines, largely starving the mutator of CPU time. I've got an example of 35/36 Ps doing that for around 16ms while the run queue (per "Goroutines" row) is non-empty. It doesn't happen on every GC, so most collections (at ccca9c9, with CL 25155 merged) allow the mutator to continue running with low latency. @aclements do you have a separate issue open to track your plans for preempting the idle mark worker? |
I just created #16528 to track this. You've got some pretty great data on this problem here. If you've got time, feel free to add to the new issue. |
Great work @rhysh and @aclements ! |
@Dieterbe I expect it will be a noticeable improvement. Not sure if it will rise to the level of "solved" :) You should see fewer latency spikes at the cost of slightly more time spent in the GC. |
It would be great to give it a try. There are still some known latency issues. |
It would be great to try master. If there are problems, the sooner we know the better. |
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go1.7rc2
What operating system and processor architecture are you using (go env)?
linux/amd64
What did you do?
I've got a program that accepts data over 200–300 active network connections, unmarshals the data, reformats and remarshals the data, and sends it over a small number of HTTP connections. It usually has around 1300 goroutines and 200MB of live heap, and triggers a garbage collection about twice a second (the APIs for encoding/json and others force a lot of allocations). It runs on a 36-core machine with GOMAXPROCS=36.
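For a rough picture of the allocation pattern, here is a self-contained sketch with made-up message shapes, sizes, and counts (not the production code); the point is the unmarshal/transform/remarshal loop running on many goroutines:

```go
package main

import (
	"encoding/json"
	"log"
	"sync"
)

type inbound struct {
	ID     int               `json:"id"`
	Labels map[string]string `json:"labels"`
	Values []float64         `json:"values"`
}

type outbound struct {
	ID      int       `json:"id"`
	Summary float64   `json:"summary"`
	Values  []float64 `json:"values"`
}

// translate mimics the per-message work: decode, transform, re-encode.
// Both encoding/json calls allocate, which is the pressure described above.
func translate(raw []byte) ([]byte, error) {
	var in inbound
	if err := json.Unmarshal(raw, &in); err != nil {
		return nil, err
	}
	out := outbound{ID: in.ID, Values: in.Values}
	for _, v := range in.Values {
		out.Summary += v
	}
	return json.Marshal(out)
}

func main() {
	msg, err := json.Marshal(inbound{
		ID:     1,
		Labels: map[string]string{"host": "example"},
		Values: make([]float64, 256),
	})
	if err != nil {
		log.Fatal(err)
	}

	sink := make(chan []byte, 1024)
	go func() {
		// Stand-in for the small number of outbound HTTP connections.
		for range sink {
		}
	}()

	// Stand-in for the few hundred connection-reader goroutines.
	var wg sync.WaitGroup
	for i := 0; i < 250; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				b, err := translate(msg)
				if err != nil {
					log.Fatal(err)
				}
				sink <- b
			}
		}()
	}
	wg.Wait()
	close(sink)
}
```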
I took an execution trace from /debug/pprof/trace to try to track down a different GC issue.
What did you expect to see?
I expected the program's mutator goroutines to continue to be scheduled as usual during the concurrent mark phase. I'd expect most goroutines to be scheduled for close to the 50–2000µs they need to do their normal work. I expected the "Goroutines" row to indicate that no goroutines were getting stuck in the run queue for long periods of time.
I'd expect no more than 25% of threads to be executing runtime.gcBgMarkWorker goroutines.
I expected the "Network" row to show incoming network events continue to occur during GC, and for the goroutines that are unblocked by network events to be scheduled immediately afterwards.
What did you see instead?
Goroutines that are scheduled during garbage collection run for around 10ms. When runtime.gcBgMarkWorker goroutines run, they go for longer than 10ms. The "Goroutines" row grows and then stays steady, only draining when the GC is nearly complete.
Occasionally, as shown in the execution trace screenshot below, nearly all threads are running a runtime.gcBgMarkWorker goroutine. This one has 35 out of 36 threads running their copy of that goroutine—20 of the 21 pictured there, plus the 15 others offscreen.
The "Network" row shows less inbound network traffic during the GC, which seems to indicate that the netpoller is also getting starved—I don't know a lot yet about the netpoller's operation, but I don't think that there'd be any less inbound network traffic to this process during garbage collections. Additionally, when a network even is detected it takes a long time to run the goroutine that was unblocked by that network event. The red arrow in the execution tracer screenshot shows a delay of around 10ms between data coming in over the network and the correct goroutine running.
I've included some output from running the program with GODEBUG=gctrace=1,gcpacertrace=1. I'm not yet very familiar with the pacer, but it looks like "assist ratio=+1.306319e+001" means that when a goroutine goes into allocation debt it will end up scanning 13,000,000 scan-units before it's allowed to run again, per gcAssistAlloc and gcOverAssistBytes. The combination of high assist ratio and all goroutines on all Ps running slowly makes this look similar to #14812, with the added twist that sometimes nearly all Ps are running their gcBgMarkWorker goroutine.
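As a rough sanity check on that reading (my arithmetic): with gcOverAssistBytes at 1<<20, an assist ratio of ~13.06 applied to roughly 1MB of over-assisted debt is 13.06 × 1,048,576 ≈ 13.7 million units of scan work, which is where the 13,000,000 figure comes from.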
/cc @aclements