proposal: runtime/pprof: make the CPU profile maximum stack size configurable #56029
It'd at least be nice to standardize on a number. CPU profiles use 64 frames, some profiles use 32 frames, and runtime/trace uses 128 frames, which can be particularly jarring because runtime/trace can include CPU profiling samples as events, mixing 64- and 128-frame stacks.
@dominikh +1 on standardizing the default. FWIW, as outlined in #43669, this will be a bigger effort for the profiles with public API surface. Our hope for this proposal (I'm working with @nsrip-dd) is to break the work into smaller pieces, with this proposal being a nice self-contained step forward.
I also agree that we should at least be consistent across the various interfaces.

Stepping back on the proposal, I'm not sure that we need to add an explicit API versus just significantly increasing or eliminating the limit. Others who remember the history better can correct me, but I don't think that avoiding poor performance is the primary reason for the fixed 64-frame limit. If we put in the effort to refactor the internal traceback API so that we could write the frames to the CPU profile buffer in batches, then we could theoretically support an arbitrary number of frames. I think #54466 would get most of the way there.

Of course, extremely deep stacks can still cause slowdown, as tracing is proportional to the number of frames. The question I'd have is whether programs with deep stacks would ever prefer truncation plus bounded latency over visibility into all frames. Given that CPU profiling is opt-in, I suspect that the extra visibility would be nearly universally preferred [1]. Thus, I'd propose we don't add an API and simply eliminate the limit (or set a very high bound, like 1024 frames).

[1] If we do keep an API, it seems like it could just be a binary toggle (default limit vs. unlimited) rather than a numeric knob.

cc @golang/runtime
Thanks @prattmic. It makes sense that the current implementation is simple because of the constraint of collecting stacks from a signal handler.
That would be great! Perhaps the current limit could be raised as an incremental improvement until such a change is made?
Personally, I'd prefer getting as much information as possible. I suspect that the overhead increase after raising or eliminating the limit would be very small for most applications. That said, increasing the limit would add some overhead. Going off your suggestion of a binary API, maybe unlimited/big stack traces could be opt-out rather than opt-in, perhaps through a GODEBUG variable? Or the limit itself could be made configurable through a GODEBUG variable?
This proposal has been added to the active column of the proposals project.
Part of the reason for the limit is that the implementation stores that many frames for every sample. We are planning to remove that bad implementation decision, which will make it easier to handle a much larger default limit, like maybe 256 or 1024. It sounds like we should try that higher limit first before deciding whether we need more configurability.
Thanks @rsc. Fortunately, I think you've already addressed that implementation decision with https://go.dev/cl/36712 :) It makes sense to focus on improving the implementation before worrying about configurability. My overhead concerns are probably outweighed by the usefulness of having fewer truncated stacks. Given that, I'd be fine withdrawing this proposal, since the motivating problem (truncated stacks) can be addressed without API changes. I can send a small CL to raise the current limit.
Change https://go.dev/cl/445375 mentions this issue:
@nsrip-dd As @prattmic mentioned above, the problem is that the [64]uintptr (now [512]uintptr in your CL) is stack allocated, and that's kind of a lot to zero during the signal handler. (At least we're on the signal stack, so there's no worry about overflow.) We may need to find a better way to save more. Maybe if profiling is enabled we make sure every m has a slice the signal handler can write to?
This CL increases the hard-coded limit for the number of frames in a CPU profiler call stack to 512 frames. This makes CPU profiles more useful for programs with deep call stacks, as the previous limit of 64 frames could often lead to truncation. This limit is still small enough for the call stack array to fit on the CPU profile signal handler's stack. Updates golang#56029 Change-Id: Ib9edfc161b4f8eafe74f81a4df18feed9239e343
I hadn't considered that implication, thank you for pointing it out. I've done some quick benchmarks to see what the latency increase would be of going from a [64]uintptr to a [512]uintptr. On my Intel MacBook, going from 64 to 512 adds roughly 20 nanoseconds of latency to zeroing the array, so making the array bigger does come with a cost. However, call stack unwinding can take several microseconds (more benchmarks). Assuming the relative difference between zeroing the array and unwinding a call stack is consistent across other platforms, I believe the relative latency increase from making the array bigger would be small. That said, I've sketched out an implementation of per-m slices for CPU profile call stacks. It seems pretty simple, but I think that approach would trade a small amount of overhead for the cognitive load of determining whether the slice will be properly initialized for the right m before it's needed by the CPU profiler. If that tradeoff is OK, I can submit another CL.
It sounds like we all agree that we should make the pprof handler record many more stack frames by default, one way or another, and that therefore we don't need to make the CPU profile maximum stack size configurable. Since the configuration has disappeared, we can remove this from the proposal process. |
Removed from the proposal process. |
Change https://go.dev/cl/458218 mentions this issue: |
Currently, all stack walking logic is in one venerable, large, and very, very complicated function: runtime.gentraceback. This function has three distinct operating modes: printing, populating a PC buffer, or invoking a callback. And it has three different modes of unwinding: physical Go frames, inlined Go frames, and cgo frames. It also has several flags. All of this logic is very interwoven.

This CL reimplements the monolithic gentraceback function as an "unwinder" type with an iterator API. It moves all of the logic for stack walking into this new type, and gentraceback is now a much-simplified wrapper around the new unwinder type that still implements printing, populating a PC buffer, and invoking a callback. Follow-up CLs will replace uses of gentraceback with direct uses of unwinder.

Exposing traceback functionality as an iterator API will enable a lot of follow-up work, such as simplifying the open-coded defer implementation (which should in turn help with #26813 and #37233), printing the bottom of deep stacks (#7181), and eliminating the small limit on CPU stacks in profiles (#56029).

Fixes #54466.
Change-Id: I36e046dc423c9429c4f286d47162af61aff49a0d
Reviewed-on: https://go-review.googlesource.com/c/go/+/458218
Reviewed-by: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
AFAIK the new traceback generator is in. Can this issue be reconsidered? |
CPU profiles currently have a hard-coded maximum of 64 frames per call stack. However, this limit can be too low, especially when programs use middleware libraries or deep recursion. Call stacks deeper than 64 frames get truncated, which makes profiles difficult to interpret.
I propose making the maximum stack size configurable. Specifically, we could build on the API accepted in #42502 and add a method to configure the maximum stack size.
(Should n <= 0 imply no limit? Or use the default? Or should it be considered invalid and panic, or cause CPUProfile.Start to return an error?)
Alternatively, the hard-coded limit could be increased. This could be implemented with no new public API. One reason for making the limit configurable, as opposed to just increasing it, would be to manage the overhead of CPU profiling due to collecting call stacks. Users who want to reduce CPU profiling overhead can keep the limit low. Users who want more detailed profiles can raise the limit.
See also #43669. I've limited this proposal to CPU profiles since this change is easier given the current profile implementations. Ideally the limit for all profile types could be increased, especially for the heap, mutex, and block profiles where the current limit is even lower.