Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: add cpuprofile_dumper #75799

Closed
tbg opened this issue Feb 1, 2022 · 6 comments · Fixed by #95623
Closed

server: add cpuprofile_dumper #75799

tbg opened this issue Feb 1, 2022 · 6 comments · Fixed by #95623
Assignees
Labels
A-kv-observability A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs

Comments

@tbg
Copy link
Member

tbg commented Feb 1, 2022

Is your feature request related to a problem? Please describe.

We already dump heaps and goroutines when they grow. We haven't done the same for CPU profiles since CPU profiling has non-negligible runtime overhead. But it would be very useful anyway, as in support escalations in which high CPU utilization is observed, the node is often restarted as a first course of action and without taking the time to download a profile.

Describe the solution you'd like

Implement a CPU profile version of the heap dumper.
The overhead should be made small by some combination of a short sampling duration and a low sampling frequency (see #75801).

Describe alternatives you've considered

Additional context

Inspired by (internal) https://github.com/cockroachlabs/support/issues/1408 but also would have been helpful in multiple previous escalations.

Jira issue: CRDB-12840

gz#15620

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Feb 1, 2022
@tbg tbg added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Feb 1, 2022
@tbg
Copy link
Member Author

tbg commented Feb 1, 2022

Heads up when implementing this, only one CPU profile can be going on at any given point in time, but there are multiple endpoints that may request a profile (of which the cpuprofiler would be a new one).

We solve this by having all profiles go through (*statusServer).Profile, which is passed in here:

debugServer := debug.NewServer(cfg.BaseConfig.AmbientCtx, st, sqlServer.pgServer.HBADebugFn(), sStatus)

It uses a mutex to serialize profiling attempts. So we should make sure that we also use it in cpuprofiler.

@tbg
Copy link
Member Author

tbg commented Oct 13, 2022

Related: #86012 #82464 #60508

I'm here because we just had an escalation (https://github.com/cockroachlabs/support/issues/1840) where there were periodic load spikes and nobody can figure out what caused them. We didn't manage to take CPU profiles at the right time or maybe we did; the profiler tags don't cross RPC boundaries so the queries wouldn't be identifiable anyway.

Having profiles taken automatically when a node gets hot, and having useful labels in them even when the work is caused by distSQL leaves, would make a radical difference in these frequent escalations.

@thtruo
Copy link
Contributor

thtruo commented Dec 5, 2022

Heads up @nkodali pre-assigned to you for triage after the p99 latency sync

@daniel-crlabs
Copy link
Contributor

@bryan-yongwon-kwon is asking if there is a release timeline for this feature request.

@tbg
Copy link
Member Author

tbg commented Feb 21, 2023

It's in review now, and, without having the authority to commit to this, it seems very likely that this will be in 23.1, since it's close to landing now.

@daniel-crlabs
Copy link
Contributor

Thank you for the update, @bryan-yongwon-kwon please see above.

craig bot pushed a commit that referenced this issue Mar 2, 2023
95623: server: add cpu profiler r=Santamaura a=Santamaura

This PR adds a cpu profiler to the server package.
The following cluster settings have been added to configure
the cpu profiler:
- server.cpu_profile.cpu_usage_combined_threshold is the baseline
value for when cpu profiles should be taken
- server.cpu_profile.interval is when the high water mark resets to
the cpu_usage_combined_threshold value
- server.cpu_profile.duration is how long a cpu profile is taken
- server.cpu_profile.enabled is whether the on/off switch of the
cpu profiler

Fixes: #75799

Release note: None

Co-authored-by: Santamaura <[email protected]>
@craig craig bot closed this as completed in cf97b4f Mar 2, 2023
craig bot pushed a commit that referenced this issue Mar 8, 2023
This PR adds a cpu profiler to the server package.
The following cluster settings have been added to configure
the cpu profiler:
- server.cpu_profile.cpu_usage_combined_threshold is the baseline
value for when cpu profiles should be taken
- server.cpu_profile.interval is when the high water mark resets to
the cpu_usage_combined_threshold value
- server.cpu_profile.duration is how long a cpu profile is taken
- server.cpu_profile.enabled is whether the on/off switch of the
cpu profiler

Fixes: #75799

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-observability A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants