server: add cpuprofile_dumper #75799

tbg · 2022-02-01T16:35:42Z

Is your feature request related to a problem? Please describe.

We already dump heaps and goroutines when they grow. We haven't done the same for CPU profiles because CPU profiling has non-negligible runtime overhead. It would be very useful anyway: in support escalations where high CPU utilization is observed, the node is often restarted as a first course of action, without anyone taking the time to download a profile.

Describe the solution you'd like

Implement a CPU profile version of the heap dumper.
The overhead should be made small by some combination of a short sampling duration and a low sampling frequency (see #75801).

Describe alternatives you've considered

Additional context

Inspired by (internal) https://github.com/cockroachlabs/support/issues/1408 but also would have been helpful in multiple previous escalations.

Jira issue: CRDB-12840

gz#15620

tbg · 2022-02-01T16:40:01Z

Heads up when implementing this: only one CPU profile can be in progress at any given point in time, but there are multiple endpoints that may request a profile (of which the cpuprofiler would be a new one).

We solve this by having all profiles go through (*statusServer).Profile, which is passed in here:

cockroach/pkg/server/server.go, line 724 in 65db9cf:

    debugServer := debug.NewServer(cfg.BaseConfig.AmbientCtx, st, sqlServer.pgServer.HBADebugFn(), sStatus)

It uses a mutex to serialize profiling attempts. So we should make sure that we also use it in cpuprofiler.
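
To make the serialization idea concrete, here is a minimal sketch of a mutex-guarded entry point that every profile request would go through. The type profileGate and its run method are hypothetical names for illustration; this is not the actual (*statusServer).Profile implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// profileGate is a hypothetical stand-in for a shared profiling entry
// point. All callers (debug endpoints, an automatic CPU profiler, ...)
// go through run, so at most one profile is in progress at a time.
type profileGate struct {
	mu sync.Mutex
}

// run executes takeProfile while holding the mutex. Concurrent callers
// block until the previous profile has finished.
func (g *profileGate) run(takeProfile func() error) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	return takeProfile()
}

func main() {
	var gate profileGate
	var active int32 // number of profiles currently running
	maxActive := int32(0)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = gate.run(func() error {
				n := atomic.AddInt32(&active, 1)
				// maxActive is only touched while the gate's mutex is held.
				if n > maxActive {
					maxActive = n
				}
				time.Sleep(10 * time.Millisecond) // pretend to profile
				atomic.AddInt32(&active, -1)
				return nil
			})
		}()
	}
	wg.Wait()
	fmt.Println("max concurrent profiles:", maxActive)
}
```

With a blocking Lock, an automatic profiler would wait behind a manually requested profile. Whether it should wait or skip the attempt is a design choice this sketch does not settle.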

tbg · 2022-10-13T20:58:45Z

Related: #86012 #82464 #60508

I'm here because we just had an escalation (https://github.com/cockroachlabs/support/issues/1840) where there were periodic load spikes and nobody could figure out what caused them. We may or may not have taken CPU profiles at the right time; either way, the profiler tags don't cross RPC boundaries, so the queries wouldn't have been identifiable.

Having profiles taken automatically when a node gets hot, and having useful labels in them even when the work is caused by distSQL leaves, would make a radical difference in these frequent escalations.

thtruo · 2022-12-05T17:35:03Z

Heads up @nkodali pre-assigned to you for triage after the p99 latency sync

daniel-crlabs · 2023-02-13T15:59:05Z

@bryan-yongwon-kwon is asking if there is a release timeline for this feature request.

tbg · 2023-02-21T08:17:45Z

It's in review now, and, without having the authority to commit to this, it seems very likely that this will be in 23.1, since it's close to landing now.

daniel-crlabs · 2023-02-21T16:21:40Z

Thank you for the update. @bryan-yongwon-kwon, please see above.

95623: server: add cpu profiler r=Santamaura a=Santamaura

This PR adds a cpu profiler to the server package. The following cluster settings have been added to configure the cpu profiler:

- server.cpu_profile.cpu_usage_combined_threshold is the baseline value for when cpu profiles should be taken
- server.cpu_profile.interval is when the high water mark resets to the cpu_usage_combined_threshold value
- server.cpu_profile.duration is how long a cpu profile is taken
- server.cpu_profile.enabled is the on/off switch of the cpu profiler

Fixes: #75799

Release note: None

Co-authored-by: Santamaura <[email protected]>


tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Feb 1, 2022

tbg added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Feb 1, 2022

tbg mentioned this issue Feb 1, 2022

server: enable collecting CPU profiles at a lower rate to limit the CPU overhead of the profiling #75801

Open

thtruo added the T-observability-inf label Dec 5, 2022

blathers-crl bot added the A-observability-inf label Dec 5, 2022

thtruo assigned nkodali Dec 5, 2022

thtruo added the T-kv-observability label Dec 21, 2022

blathers-crl bot added the A-kv-observability label Dec 21, 2022

thtruo removed the T-observability-inf label Dec 21, 2022

exalate-issue-sync bot assigned Santamaura and unassigned nkodali Jan 17, 2023

Santamaura mentioned this issue Jan 20, 2023

server: add cpu profiler #95623

Merged

craig bot closed this as completed in cf97b4f Mar 2, 2023

Santamaura mentioned this issue Mar 2, 2023

release-22.2: server: add cpu profiler #97929

Merged
