
proposal: runtime/pprof: make the CPU profile maximum stack size configurable #56029

Open
nsrip-dd opened this issue Oct 4, 2022 · 14 comments

@nsrip-dd
Contributor

nsrip-dd commented Oct 4, 2022

CPU profiles currently have a hard-coded maximum of 64 frames per call stack. However, this limit can be too low, especially when programs use middleware libraries or deep recursion. Call stacks deeper than 64 frames get truncated, which makes profiles difficult to interpret.

I propose making the maximum stack size configurable. Specifically, we can build on the API accepted in #42502 and add the following method to configure the maximum stack size:

// SetMaximumStackSize limits call stacks in the profile to no more than n frames.
// If no limit is set, the default is 64 frames.
func (*CPUProfile) SetMaximumStackSize(n int)

(Should n <= 0 mean no limit? Should it mean the default? Or should it be treated as invalid and panic, or cause CPUProfile.Start to return an error?)
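For concreteness, a hypothetical usage sketch, assuming the CPUProfile type from #42502 with a Start method that can return an error (as referenced above); the output handling and the Stop call are illustrative assumptions, not settled API:

f, err := os.Create("cpu.pprof")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

var p pprof.CPUProfile     // hypothetical type from #42502
p.SetMaximumStackSize(256) // keep up to 256 frames per sample
if err := p.Start(f); err != nil {
	log.Fatal(err) // e.g. if the configured limit were considered invalid
}
defer p.Stop() // assumed counterpart to Start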

Alternatively, the hard-coded limit could simply be increased, which would require no new public API. One reason to make the limit configurable, rather than just increasing it, is to manage the overhead CPU profiling incurs collecting call stacks: users who want to reduce CPU profiling overhead can keep the limit low, while users who want more detailed profiles can raise it.

See also #43669. I've limited this proposal to CPU profiles since this change is easier given the current profile implementations. Ideally the limit for all profile types could be increased, especially for the heap, mutex, and block profiles where the current limit is even lower.

@gopherbot added this to the Proposal milestone Oct 4, 2022
@dominikh
Member

dominikh commented Oct 4, 2022

It'd at least be nice to standardize on a number. CPU profiles use 64 frames, some profiles use 32 frames, and runtime/trace uses 128 frames, which can be particularly jarring because runtime/trace can include CPU profiling samples as events, mixing 64- and 128-frame stacks.

@felixge
Contributor

felixge commented Oct 4, 2022

@dominikh +1 on standardizing the default. FWIW runtime.Stack() (aka pprof.Lookup("goroutine").WriteTo(w, 2)) uses a limit of 100 frames.

But as outlined in #43669, this will be a bigger effort for the profile types that already have public API surface. Our hope (I'm working with @nsrip-dd) is to break the work up into smaller pieces, with this proposal being a nice self-contained step forward.
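To see the goroutine-dump limit in action, a small self-contained program along these lines should show the truncation (the recursion depth of 200 and the function names are just for illustration):

package main

import (
	"os"
	"runtime/pprof"
)

func deep(n int) {
	if n == 0 {
		// debug=2 prints full goroutine stacks, like runtime.Stack.
		pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
		return
	}
	deep(n - 1)
}

func main() {
	deep(200) // deep enough to run past the frame limit mentioned above
}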

@ianlancetaylor moved this to Incoming in Proposals Oct 4, 2022
@prattmic
Member

prattmic commented Oct 5, 2022

I also agree that we should at least be consistent across the various interfaces.

Stepping back from the proposal, I'm not sure that we need to add an explicit API, as opposed to just significantly increasing or eliminating the limit.

Others who remember the history better can correct me, but I don't think that avoiding poor performance is the primary reason for the fixed 64-frame limit. The internal traceback API, gentraceback, is a single monolithic function that writes all frames to an output slice (see also #54466). CPU profile stack samples are collected in the signal handler, where dynamic memory allocation is tricky. Putting a fixed-size [64]uintptr array on the signal handler's stack and passing that to gentraceback is a simple way to avoid complexity and have things 'just work'.
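As a rough illustration of that constraint (a user-level sketch, not the actual runtime code; the real path is the SIGPROF handler calling gentraceback, and runtime.Callers here is only a stand-in):

// Sketch only: the handler cannot allocate, so it unwinds into a
// fixed-size array. record stands in for copying the PCs into the
// profile buffer.
func takeSampleSketch(record func([]uintptr)) {
	var stk [64]uintptr             // fixed size, so it can live on the signal stack
	n := runtime.Callers(0, stk[:]) // stand-in for gentraceback; frames beyond 64 are dropped
	record(stk[:n])
}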

If we put in the effort to refactor the traceback code so that we could write the frames to the CPU profile buffer in batches, then we could theoretically support an arbitrary number of frames. I think #54466 would get most of the way there.

Of course, extremely deep stacks can still cause slowdown, as traceback time is proportional to the number of frames. The question I'd have is whether programs with deep stacks would ever prefer truncation plus bounded latency over visibility into all frames. Given that CPU profiling is opt-in, I suspect that the extra visibility would be nearly universally preferred [1].

Thus, I'd propose we don't add an API and simply eliminate the limit (or use a very high bound like 1024 frames).

[1] If we do keep an API, it seems like it could just be a binary DisableStackTruncation API. I don't think anyone has a specific reason to want to set a particular number; e.g., why would someone pass 128 vs 512?
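In the style of the signature in the original post, that binary alternative might look like this (sketch only, not an accepted or implemented API):

// DisableStackTruncation records call stacks with no fixed frame limit.
func (*CPUProfile) DisableStackTruncation()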

cc @golang/runtime

@nsrip-dd
Contributor Author

nsrip-dd commented Oct 6, 2022

Thanks @prattmic. It makes sense that the current implementation is simple because of the constraint of collecting stacks from a signal handler.

If we put in the effort to refactor the traceback code so that we could write the frames to the CPU profile buffer in batches, then we could theoretically support an arbitrary number of frames. I think #54466 would get most of the way there.

That would be great! Perhaps the current limit could be raised as an incremental improvement until such a change is made?

The question I'd have is whether programs with deep frames would ever prefer to have truncation + bounded latency vs visibility into all frames?

Personally, I'd prefer getting as much information as possible. I suspect that the overhead increase from raising or eliminating the limit would be very small for most applications. That said, it would have some performance impact on certain applications. Right now CPU profiling has low enough overhead with the default configuration that we've been able to run it continuously in production at Datadog with no problems. It would be great if it stays that way.

Going off your suggestion of a binary API, maybe unlimited/large stack traces could be opt-out rather than opt-in, perhaps through a GODEBUG variable? Or the limit could be made configurable through a GODEBUG variable (cpuprofilestackdepth=N?), with the default being no limit. That way profiles default to having more information, but there's still the option of retaining the current low, predictable overhead.
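For illustration, usage of such a hypothetical knob would look something like the following (the variable name is only a suggestion above, not an existing GODEBUG setting):

GODEBUG=cpuprofilestackdepth=64 ./myservice   # opt back into the current low, fixed limit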

@rsc
Contributor

rsc commented Oct 12, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc moved this from Incoming to Active in Proposals Oct 12, 2022
@rsc
Contributor

rsc commented Oct 20, 2022

Part of the reason for the limit is that the implementation stores that many frames for every sample. We are planning to remove that bad implementation decision, which will make it easier to handle a much larger default limit, maybe 256 or 1024. It sounds like we should try the higher limit first before deciding whether we need more configurability.

@nsrip-dd
Contributor Author

Thanks @rsc. Fortunately, I think you've already addressed that implementation decision with https://go.dev/cl/36712 :)

It makes sense to focus on improving the implementation before worrying about configurability. My overhead concerns are probably outweighed by the usefulness of having fewer truncated stacks. Given that, I'd be fine withdrawing this proposal since the motivating problem (truncated stacks) can be addressed without API changes.

I can send a small CL to change the current limit (maxCPUProfStack), which can be increased a good deal without requiring any other changes to the code. I think that would be a good incremental improvement until the implementation is changed.

@gopherbot
Contributor

Change https://go.dev/cl/445375 mentions this issue: runtime: increase CPU profile stack size limit

@rsc
Contributor

rsc commented Oct 26, 2022

@nsrip-dd As @prattmic mentioned above, the problem is that the [64]uintptr (now [512]uintptr in your CL) is stack-allocated, and that's kind of a lot to zero in the signal handler. (At least we're on the signal stack, so there's no worry about overflow.) We may need to find a better way to save more frames.

Maybe if profiling is enabled we make sure every m has a slice the signal handler can write to?
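A very rough sketch of that idea (runtime-internal pseudocode; the names only loosely follow the actual runtime, and this is not a real patch):

// Allocate the buffers outside the signal handler, so the handler only
// ever writes into memory that already exists.
type m struct {
	// ...existing fields...
	profStack []uintptr // set when CPU profiling is enabled
}

func cpuProfileOn(depth int) {
	for mp := allm; mp != nil; mp = mp.alllink {
		mp.profStack = make([]uintptr, depth)
	}
	// Ms created after this point would need the same setup, which is
	// the initialization question raised below.
}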

nsrip-dd added a commit to nsrip-dd/go that referenced this issue Oct 28, 2022
This CL increases the hard-coded limit for the number of frames in a CPU
profiler call stack to 512 frames. This makes CPU profiles more useful
for programs with deep call stacks, as the previous limit of 64 frames
could often lead to truncation. This limit is still small enough for the
call stack array to fit on the CPU profile signal handler's stack.

Updates golang#56029

Change-Id: Ib9edfc161b4f8eafe74f81a4df18feed9239e343
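In code terms, the change amounts to bumping a single constant in the runtime's CPU profiler code (a sketch of the effect, not the literal diff):

const maxCPUProfStack = 512 // previously 64; frames kept per CPU profile sample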
@nsrip-dd
Contributor Author

I hadn't considered that implication, thank you for pointing it out. I've done some quick benchmarks to see what the latency increase would be of going from a [64]uintptr to a [512]uintptr. On my Intel MacBook, going from 64 to 512 adds roughly 20 nanoseconds of latency to zeroing the array. So making the array bigger does come with a cost. However, call stack unwinding can take several microseconds (more benchmarks). Assuming the relative difference between zeroing the array and unwinding a call stack is consistent across other platforms, I believe the relative latency increase from making the array bigger would be small.
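The zeroing comparison was along these lines (a minimal sketch of that kind of microbenchmark; not the exact code used):

package zero_test

import "testing"

var sink64 [64]uintptr
var sink512 [512]uintptr

func BenchmarkZero64(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink64 = [64]uintptr{} // cost of zeroing the current array size
	}
}

func BenchmarkZero512(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink512 = [512]uintptr{} // cost of zeroing the proposed array size
	}
}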

That said, I've sketched out an implementation of per-m slices for CPU profile call stacks. It seems pretty simple, but I think that approach trades a small amount of overhead for the cognitive load of ensuring the slice is properly initialized for the right m before the CPU profiler needs it. If that tradeoff is OK, I can submit another CL.

@rsc
Contributor

rsc commented Nov 2, 2022

It sounds like we all agree that we should make the pprof handler record many more stack frames by default, one way or another, and that therefore we don't need to make the CPU profile maximum stack size configurable. Since the configuration has disappeared, we can remove this from the proposal process.

@rsc
Contributor

rsc commented Nov 2, 2022

Removed from the proposal process.
This was determined not to be a “significant change to the language, libraries, or tools”
or otherwise of significant importance or interest to the broader Go community.
— rsc for the proposal review group

@rsc removed this from Proposals Nov 2, 2022
@ianlancetaylor added the NeedsInvestigation label Nov 2, 2022
@ianlancetaylor modified the milestones: Proposal, Backlog Nov 2, 2022
@gopherbot
Contributor

Change https://go.dev/cl/458218 mentions this issue: runtime: implement traceback iterator

gopherbot pushed a commit that referenced this issue Mar 10, 2023
Currently, all stack walking logic is in one venerable, large, and
very, very complicated function: runtime.gentraceback. This function
has three distinct operating modes: printing, populating a PC buffer,
or invoking a callback. And it has three different modes of unwinding:
physical Go frames, inlined Go frames, and cgo frames. It also has
several flags. All of this logic is very interwoven.

This CL reimplements the monolithic gentraceback function as an
"unwinder" type with an iterator API. It moves all of the logic for
stack walking into this new type, and gentraceback is now a
much-simplified wrapper around the new unwinder type that still
implements printing, populating a PC buffer, and invoking a callback.
Follow-up CLs will replace uses of gentraceback with direct uses of
unwinder.

Exposing traceback functionality as an iterator API will enable a lot
of follow-up work such as simplifying the open-coded defer
implementation (which should in turn help with #26813 and #37233),
printing the bottom of deep stacks (#7181), and eliminating the small
limit on CPU stacks in profiles (#56029).

Fixes #54466.

Change-Id: I36e046dc423c9429c4f286d47162af61aff49a0d
Reviewed-on: https://go-review.googlesource.com/c/go/+/458218
Reviewed-by: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
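The iterator shape the commit describes looks roughly like this (a simplified sketch; the names approximate the new runtime-internal unwinder type and may not match it exactly):

// Instead of gentraceback filling one fixed-size output array, callers
// walk the stack a frame at a time and can consume as many frames as
// they like, e.g. flushing PCs to the profile buffer in batches.
var u unwinder
for u.init(gp, 0); u.valid(); u.next() {
	emit(u.frame.pc) // emit is a hypothetical consumer
}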
@aktau
Contributor

aktau commented Jan 31, 2024

AFAIK the new traceback generator is in. Can this issue be reconsidered?
