
proposal: runtime/pprof: make the CPU profile maximum stack size configurable #56029

Open
nsrip-dd opened this issue Oct 4, 2022 · 14 comments

@nsrip-dd
Contributor

nsrip-dd commented Oct 4, 2022

CPU profiles currently have a hard-coded maximum of 64 frames per call stack. However, this limit can be too low, especially when programs use middleware libraries or deep recursion. Call stacks deeper than 64 frames get truncated, which makes profiles difficult to interpret.

I propose making the maximum stack size configurable. Specifically, we can build on the API accepted in #42502 and add the following method to configure the maximum stack size:

// SetMaximumStackSize limits call stacks in the profile to no more than n frames.
// If no limit is set, the default is 64 frames.
func (*CPUProfile) SetMaximumStackSize(n int)

(Should n <= 0 mean no limit? Should it mean the default? Or should it be treated as invalid and panic, or cause CPUProfile.Start to return an error?)
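For concreteness, a hypothetical usage sketch, assuming the CPUProfile type from #42502 with a Start method that can return an error (as referenced above); the output handling and the Stop call are illustrative assumptions, not settled API:

f, err := os.Create("cpu.pprof")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

var p pprof.CPUProfile     // hypothetical type from #42502
p.SetMaximumStackSize(256) // keep up to 256 frames per sample
if err := p.Start(f); err != nil {
	log.Fatal(err) // e.g. if the configured limit were considered invalid
}
defer p.Stop() // assumed counterpart to Start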

Alternatively, the hard-coded limit could simply be increased, which would require no new public API. One reason to make the limit configurable, rather than just increasing it, is to manage the overhead CPU profiling incurs collecting call stacks: users who want to reduce CPU profiling overhead can keep the limit low, while users who want more detailed profiles can raise it.

See also #43669. I've limited this proposal to CPU profiles since this change is easier given the current profile implementations. Ideally the limit for all profile types could be increased, especially for the heap, mutex, and block profiles where the current limit is even lower.

@gopherbot added this to the Proposal milestone Oct 4, 2022
@dominikh
Member

dominikh commented Oct 4, 2022

It'd at least be nice to standardize on a number. CPU profiles use 64 frames, some profiles use 32 frames, and runtime/trace uses 128 frames, which can be particularly jarring because runtime/trace can include CPU profiling samples as events, mixing 64- and 128-frame stacks.

@felixge
Contributor

felixge commented Oct 4, 2022

@dominikh +1 on standardizing the default. FWIW runtime.Stack() (aka pprof.Lookup("goroutine").WriteTo(w, 2)) uses a limit of 100 frames.

But as outlined in #43669, this will be a bigger effort for the profile types that already have public API surface. Our hope (I'm working with @nsrip-dd) is to break the work up into smaller pieces, with this proposal being a nice self-contained step forward.
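To see the goroutine-dump limit in action, a small self-contained program along these lines should show the truncation (the recursion depth of 200 and the function names are just for illustration):

package main

import (
	"os"
	"runtime/pprof"
)

func deep(n int) {
	if n == 0 {
		// debug=2 prints full goroutine stacks, like runtime.Stack.
		pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
		return
	}
	deep(n - 1)
}

func main() {
	deep(200) // deep enough to run past the frame limit mentioned above
}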

@ianlancetaylor moved this to Incoming in Proposals Oct 4, 2022
@prattmic
Member

prattmic commented Oct 5, 2022

I also agree that we should at least be consistent across the various interfaces.

Stepping back from the proposal, I'm not sure that we need to add an explicit API, as opposed to just significantly increasing or eliminating the limit.

Others who remember the history better can correct me, but I don't think that avoiding poor performance is the primary reason for the fixed 64-frame limit. The internal traceback API, gentraceback, is a single monolithic function that writes all frames to an output slice (see also #54466). CPU profile stack samples are collected in the signal handler, where dynamic memory allocation is tricky. Putting a fixed-size [64]uintptr array on the signal handler's stack and passing that to gentraceback is a simple way to avoid complexity and have things 'just work'.
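As a rough illustration of that constraint (a user-level sketch, not the actual runtime code; the real path is the SIGPROF handler calling gentraceback, and runtime.Callers here is only a stand-in):

// Sketch only: the handler cannot allocate, so it unwinds into a
// fixed-size array. record stands in for copying the PCs into the
// profile buffer.
func takeSampleSketch(record func([]uintptr)) {
	var stk [64]uintptr             // fixed size, so it can live on the signal stack
	n := runtime.Callers(0, stk[:]) // stand-in for gentraceback; frames beyond 64 are dropped
	record(stk[:n])
}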

If we put in the effort to refactor the traceback code so that we could write the frames to the CPU profile buffer in batches, then we could theoretically support an arbitrary number of frames. I think #54466 would get most of the way there.

Of course, extremely deep stacks can still cause slowdown, as traceback time is proportional to the number of frames. The question I'd have is whether programs with deep stacks would ever prefer truncation plus bounded latency over visibility into all frames. Given that CPU profiling is opt-in, I suspect that the extra visibility would be nearly universally preferred [1].

Thus, I'd propose we don't add an API and simply eliminate the limit (or use a very high bound like 1024 frames).

[1] If we do keep an API, it seems like it could just be a binary DisableStackTruncation API. I don't think anyone has a specific reason to want to set a particular number; e.g., why would someone pass 128 vs 512?
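In the style of the signature in the original post, that binary alternative might look like this (sketch only, not an accepted or implemented API):

// DisableStackTruncation records call stacks with no fixed frame limit.
func (*CPUProfile) DisableStackTruncation()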

cc @golang/runtime

@nsrip-dd
Contributor Author

nsrip-dd commented Oct 6, 2022

Thanks @prattmic. It makes sense that the current implementation is simple because of the constraint of collecting stacks from a signal handler.

If we put in the effort to refactor the traceback code so that we could write the frames to the CPU profile buffer in batches, then we could theoretically support an arbitrary number of frames. I think #54466 would get most of the way there.

That would be great! Perhaps the current limit could be raised as an incremental improvement until such a change is made?

The question I'd have is whether programs with deep frames would ever prefer to have truncation + bounded latency vs visibility into all frames?

Personally, I'd prefer getting as much information as possible. I suspect that the overhead increase from raising or eliminating the limit would be very small for most applications. That said, it would have some performance impact on certain applications. Right now CPU profiling has low enough overhead with the default configuration that we've been able to run it continuously in production at Datadog with no problems. It would be great if it stays that way.

Going off your suggestion of a binary API, maybe unlimited/large stack traces could be opt-out rather than opt-in, perhaps through a GODEBUG variable? Or the limit could be made configurable through a GODEBUG variable (cpuprofilestackdepth=N?), with the default being no limit. That way profiles default to having more information, but there's still the option of retaining the current low, predictable overhead.
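For illustration, usage of such a hypothetical knob would look something like the following (the variable name is only a suggestion above, not an existing GODEBUG setting):

GODEBUG=cpuprofilestackdepth=64 ./myservice   # opt back into the current low, fixed limit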

@rsc
Contributor

rsc commented Oct 12, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc moved this from Incoming to Active in Proposals Oct 12, 2022
@rsc
Contributor

rsc commented Oct 20, 2022

Part of the reason for the limit is that the implementation stores that many frames for every sample. We are planning to remove that bad implementation decision, which will make it easier to handle a much larger default limit, maybe 256 or 1024. It sounds like we should try the higher limit first before deciding whether we need more configurability.

@nsrip-dd
Contributor Author

Thanks @rsc. Fortunately, I think you've already addressed that implementation decision with https://go.dev/cl/36712 :)

It makes sense to focus on improving the implementation before worrying about configurability. My overhead concerns are probably outweighed by the usefulness of having fewer truncated stacks. Given that, I'd be fine withdrawing this proposal since the motivating problem (truncated stacks) can be addressed without API changes.

I can send a small CL to change the current limit (maxCPUProfStack), which can be increased a good deal without requiring any other changes to the code. I think that would be a good incremental improvement until the implementation is changed.

@gopherbot
Contributor

Change https://go.dev/cl/445375 mentions this issue: runtime: increase CPU profile stack size limit

@rsc
Contributor

rsc commented Oct 26, 2022

@nsrip-dd As @prattmic mentioned above, the problem is that the [64]uintptr (now [512]uintptr in your CL) is stack-allocated, and that's kind of a lot to zero in the signal handler. (At least we're on the signal stack, so there's no worry about overflow.) We may need to find a better way to save more frames.

Maybe if profiling is enabled we make sure every m has a slice the signal handler can write to?
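A very rough sketch of that idea (runtime-internal pseudocode; the names only loosely follow the actual runtime, and this is not a real patch):

// Allocate the buffers outside the signal handler, so the handler only
// ever writes into memory that already exists.
type m struct {
	// ...existing fields...
	profStack []uintptr // set when CPU profiling is enabled
}

func cpuProfileOn(depth int) {
	for mp := allm; mp != nil; mp = mp.alllink {
		mp.profStack = make([]uintptr, depth)
	}
	// Ms created after this point would need the same setup, which is
	// the initialization question raised below.
}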

nsrip-dd added a commit to nsrip-dd/go that referenced this issue Oct 28, 2022
This CL increases the hard-coded limit for the number of frames in a CPU
profiler call stack to 512 frames. This makes CPU profiles more useful
for programs with deep call stacks, as the previous limit of 64 frames
could often lead to truncation. This limit is still small enough for the
call stack array to fit on the CPU profile signal handler's stack.

Updates golang#56029

Change-Id: Ib9edfc161b4f8eafe74f81a4df18feed9239e343
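In code terms, the change amounts to bumping a single constant in the runtime's CPU profiler code (a sketch of the effect, not the literal diff):

const maxCPUProfStack = 512 // previously 64; frames kept per CPU profile sample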
@nsrip-dd
Contributor Author

I hadn't considered that implication, thank you for pointing it out. I've done some quick benchmarks to see what the latency increase would be of going from a [64]uintptr to a [512]uintptr. On my Intel MacBook, going from 64 to 512 adds roughly 20 nanoseconds of latency to zeroing the array. So making the array bigger does come with a cost. However, call stack unwinding can take several microseconds (more benchmarks). Assuming the relative difference between zeroing the array and unwinding a call stack is consistent across other platforms, I believe the relative latency increase from making the array bigger would be small.
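The zeroing comparison was along these lines (a minimal sketch of that kind of microbenchmark; not the exact code used):

package zero_test

import "testing"

var sink64 [64]uintptr
var sink512 [512]uintptr

func BenchmarkZero64(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink64 = [64]uintptr{} // cost of zeroing the current array size
	}
}

func BenchmarkZero512(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink512 = [512]uintptr{} // cost of zeroing the proposed array size
	}
}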

That said, I've sketched out an implementation of per-m slices for CPU profile call stacks. It seems pretty simple, but I think that approach trades a small amount of overhead for the cognitive load of ensuring the slice is properly initialized for the right m before the CPU profiler needs it. If that tradeoff is OK, I can submit another CL.

@rsc
Contributor

rsc commented Nov 2, 2022

It sounds like we all agree that we should make the pprof handler record many more stack frames by default, one way or another, and that therefore we don't need to make the CPU profile maximum stack size configurable. Since the configuration has disappeared, we can remove this from the proposal process.

@rsc
Contributor

rsc commented Nov 2, 2022

Removed from the proposal process.
This was determined not to be a “significant change to the language, libraries, or tools”
or otherwise of significant importance or interest to the broader Go community.
— rsc for the proposal review group

@rsc removed this from Proposals Nov 2, 2022
@ianlancetaylor added the NeedsInvestigation label Nov 2, 2022
@ianlancetaylor modified the milestones: Proposal, Backlog Nov 2, 2022
@gopherbot
Contributor

Change https://go.dev/cl/458218 mentions this issue: runtime: implement traceback iterator

gopherbot pushed a commit that referenced this issue Mar 10, 2023
Currently, all stack walking logic is in one venerable, large, and
very, very complicated function: runtime.gentraceback. This function
has three distinct operating modes: printing, populating a PC buffer,
or invoking a callback. And it has three different modes of unwinding:
physical Go frames, inlined Go frames, and cgo frames. It also has
several flags. All of this logic is very interwoven.

This CL reimplements the monolithic gentraceback function as an
"unwinder" type with an iterator API. It moves all of the logic for
stack walking into this new type, and gentraceback is now a
much-simplified wrapper around the new unwinder type that still
implements printing, populating a PC buffer, and invoking a callback.
Follow-up CLs will replace uses of gentraceback with direct uses of
unwinder.

Exposing traceback functionality as an iterator API will enable a lot
of follow-up work such as simplifying the open-coded defer
implementation (which should in turn help with #26813 and #37233),
printing the bottom of deep stacks (#7181), and eliminating the small
limit on CPU stacks in profiles (#56029).

Fixes #54466.

Change-Id: I36e046dc423c9429c4f286d47162af61aff49a0d
Reviewed-on: https://go-review.googlesource.com/c/go/+/458218
Reviewed-by: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
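The iterator shape the commit describes looks roughly like this (a simplified sketch; the names approximate the new runtime-internal unwinder type and may not match it exactly):

// Instead of gentraceback filling one fixed-size output array, callers
// walk the stack a frame at a time and can consume as many frames as
// they like, e.g. flushing PCs to the profile buffer in batches.
var u unwinder
for u.init(gp, 0); u.valid(); u.next() {
	emit(u.frame.pc) // emit is a hypothetical consumer
}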
@aktau
Contributor

aktau commented Jan 31, 2024

AFAIK the new traceback generator is in. Can this issue be reconsidered?
