server: add cpuprofile_dumper #75799
Comments
Heads up when implementing this: only one CPU profile can be running at any given time, but multiple endpoints may request a profile (the cpuprofiler would be a new one). We solve this by having all profiles go through cockroach/pkg/server/server.go (line 724 in 65db9cf).
It uses a mutex to serialize profiling attempts. So we should make sure that we also use it in
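To illustrate the serialization point above, here is a minimal sketch of a mutex-guarded capture. The `cpuProfileGate` type, `capture` method, and `errProfileBusy` error are hypothetical names for illustration and are not the code at the referenced line in server.go. The sketch also assumes that a background dumper would rather skip a capture than wait, so it uses `TryLock`; a caller that should queue would use `Lock` instead.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"runtime/pprof"
	"sync"
	"time"
)

// errProfileBusy reports that another caller is already profiling.
var errProfileBusy = errors.New("cpu profile already in progress")

// cpuProfileGate is a hypothetical stand-in for the shared mutex that
// serializes CPU profiling in the server; it is not CockroachDB's type.
type cpuProfileGate struct {
	mu sync.Mutex
}

// capture profiles the process for d and returns the raw pprof bytes.
// Assumption: a background dumper should skip when busy, so this uses
// TryLock (Go 1.18+). Use Lock instead to queue behind the current holder.
func (g *cpuProfileGate) capture(d time.Duration) ([]byte, error) {
	if !g.mu.TryLock() {
		return nil, errProfileBusy
	}
	defer g.mu.Unlock()

	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err // profiling was enabled by code that bypassed the gate
	}
	time.Sleep(d)
	pprof.StopCPUProfile() // returns only after the profile is fully written
	return buf.Bytes(), nil
}

func main() {
	var gate cpuProfileGate
	data, err := gate.capture(100 * time.Millisecond)
	if err != nil {
		fmt.Println("skipped:", err)
		return
	}
	fmt.Printf("captured %d bytes\n", len(data))
}
```

Routing every endpoint and the automatic dumper through one such gate means `pprof.StartCPUProfile` is never called while another profile is active, which is the failure mode the comment warns about.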
I'm here because we just had an escalation (https://github.com/cockroachlabs/support/issues/1840) where there were periodic load spikes and nobody could figure out what caused them. We didn't manage to take CPU profiles at the right time, or maybe we did; either way, the profiler tags don't cross RPC boundaries, so the queries wouldn't have been identifiable anyway. Having profiles taken automatically when a node gets hot, and having useful labels in them even when the work is caused by distSQL leaves, would make a radical difference in these frequent escalations.
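As a rough illustration of why labels stop at an RPC boundary: Go's profiler labels live on the `context.Context` and the goroutine, so a remote node only sees them if they are serialized with the request and re-applied on arrival. The sketch below simulates that with a plain map standing in for the wire format. The helper names (`exportLabels`, `runWithLabels`, `roundTrip`) and the `stmt` label key are hypothetical, and this is not how CockroachDB propagates labels.

```go
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
)

// exportLabels collects the pprof labels attached to ctx so that they can
// be sent along with a request (hypothetical helper).
func exportLabels(ctx context.Context) map[string]string {
	labels := map[string]string{}
	pprof.ForLabels(ctx, func(key, value string) bool {
		labels[key] = value
		return true // keep iterating
	})
	return labels
}

// runWithLabels re-applies received labels on the remote side and runs fn
// under them, so CPU samples taken during fn carry the labels.
func runWithLabels(ctx context.Context, labels map[string]string, fn func(context.Context)) {
	kv := make([]string, 0, 2*len(labels))
	for k, v := range labels {
		kv = append(kv, k, v)
	}
	pprof.Do(ctx, pprof.Labels(kv...), fn)
}

// roundTrip simulates a gateway attaching a label and a remote node
// restoring it from the "wire" map.
func roundTrip(key, value string) (got string, ok bool) {
	var wire map[string]string
	pprof.Do(context.Background(), pprof.Labels(key, value), func(ctx context.Context) {
		wire = exportLabels(ctx)
	})
	// The "remote" side starts from a fresh context that has no labels.
	runWithLabels(context.Background(), wire, func(ctx context.Context) {
		got, ok = pprof.Label(ctx, key)
	})
	return got, ok
}

func main() {
	v, ok := roundTrip("stmt", "SELECT 1")
	fmt.Println(v, ok)
}
```

Without the explicit export and re-apply step, work done on the remote side would appear unlabeled in a CPU profile, which matches the symptom described in the comment.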
Heads up @nkodali: pre-assigned to you for triage after the p99 latency sync.
@bryan-yongwon-kwon is asking if there is a release timeline for this feature request.
It's in review now. I don't have the authority to commit to this, but since it's close to landing, it seems very likely that it will be in 23.1.
Thank you for the update. @bryan-yongwon-kwon, please see above.
95623: server: add cpu profiler r=Santamaura a=Santamaura

This PR adds a CPU profiler to the server package. The following cluster settings have been added to configure the CPU profiler:

- server.cpu_profile.cpu_usage_combined_threshold is the baseline value for when CPU profiles should be taken
- server.cpu_profile.interval is when the high water mark resets to the cpu_usage_combined_threshold value
- server.cpu_profile.duration is how long a CPU profile is taken
- server.cpu_profile.enabled is the on/off switch of the CPU profiler

Fixes: #75799

Release note: None

Co-authored-by: Santamaura <[email protected]>
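The interaction between the threshold, the high water mark, and the reset interval described in the PR can be sketched as below. This is an illustration under stated assumptions, not the merged implementation: the `profilerConfig` and `highWaterMark` types are hypothetical, the threshold is assumed to be an integer percentage, a profile is assumed to trigger only when usage strictly exceeds the current mark, and the values in `main` are made up.

```go
package main

import (
	"fmt"
	"time"
)

// profilerConfig mirrors the four cluster settings named in the PR.
// The field types and units are assumptions for this sketch.
type profilerConfig struct {
	enabled   bool          // server.cpu_profile.enabled
	threshold int64         // server.cpu_profile.cpu_usage_combined_threshold (assumed percent)
	interval  time.Duration // server.cpu_profile.interval
	duration  time.Duration // server.cpu_profile.duration
}

// highWaterMark tracks the highest usage that has already triggered a profile.
type highWaterMark struct {
	cfg       profilerConfig
	mark      int64
	lastReset time.Time
}

func newHighWaterMark(cfg profilerConfig) *highWaterMark {
	return &highWaterMark{cfg: cfg, mark: cfg.threshold}
}

// shouldProfile reports whether a profile should be taken for the given
// usage sample, raising the mark when it returns true.
func (h *highWaterMark) shouldProfile(now time.Time, usage int64) bool {
	if !h.cfg.enabled {
		return false
	}
	// Once the interval has elapsed, fall back to the baseline threshold.
	if now.Sub(h.lastReset) >= h.cfg.interval {
		h.mark = h.cfg.threshold
		h.lastReset = now
	}
	if usage <= h.mark {
		return false
	}
	h.mark = usage
	return true
}

func main() {
	// Illustrative values only; these are not the settings' defaults.
	cfg := profilerConfig{enabled: true, threshold: 65, interval: 20 * time.Minute, duration: 10 * time.Second}
	h := newHighWaterMark(cfg)
	now := time.Now()
	for _, usage := range []int64{60, 70, 68, 90} {
		fmt.Printf("usage=%d%% profile=%t (for %s)\n", usage, h.shouldProfile(now, usage), cfg.duration)
		now = now.Add(time.Minute)
	}
}
```

Raising the mark after each capture keeps a sustained hot period from producing a profile on every sample, while the interval reset lets a later, smaller spike be captured again.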
Is your feature request related to a problem? Please describe.
We already dump heap and goroutine profiles when they grow. We haven't done the same for CPU profiles because CPU profiling has non-negligible runtime overhead. It would still be very useful: in support escalations where high CPU utilization is observed, the node is often restarted as a first course of action, before anyone takes the time to download a profile.
Describe the solution you'd like
Implement a CPU profile version of the heap dumper.
The overhead should be made small by some combination of a short sampling duration and a low sampling frequency (see #75801).
Describe alternatives you've considered
Additional context
Inspired by (internal) https://github.com/cockroachlabs/support/issues/1408 but also would have been helpful in multiple previous escalations.
Jira issue: CRDB-12840
gz#15620