cli: build in pprof-loop.sh for CPU profiles and Go execution traces #97174
Comments
For me, ideally there would be a place in the DB Console -> Advanced Debug page where I can input the types of profiles I want and the node(s) I want to gather the profiles on, hit a button, and continually gather profiles until I hit the button again (or have some length of time I can configure it for). And I would get a zip download that is a directory of node-specific folders with subdirectories for each of the profile types I gathered, with the timestamp in each profile's filename.
@kevinkokomani points out that the script likely doesn't work with secure clusters. That's another reason to build it into CRDB. We could change the script so that the user supplies a working "curl" invocation, but this adds even more friction (need to use …)
There is #102734, which is related though not quite the same, since pprof-loop also allows runtime traces, etc., and targets a single node. @kevinkokomani it would be helpful to get the TSEs' opinion on what gaps are most important after #102734.
@tbg Sorry, I'm just seeing this. Reading through #102734, it seems to propose a merged point-in-time CPU profile for troubleshooting cluster-wide issues. I'm not sure that addresses the same issue we want to address here. Having a merged cluster-wide CPU profile is nice, but what we're after in this issue is the ability to capture profiles in situations where the spikes are very short and sharp. When the CPU increases are sustained, we can easily grab a CPU profile at our leisure, even for multiple nodes. But when they're not sustained and instead "random" and spiky, the …

That said, is this all moot given that (AFAIK, at least) we are planning to implement automatic CPU profiling when spikes are detected, similar to heap profiles? If that is true but productionizing …

In terms of which profiling endpoints are the most important in general: we are normally well covered with heap profiles thanks to the automatic heap profiling plus the ability to gather them on demand. CPU profiles are the next most common thing we need to look at (if not the most common thing). All other endpoints are used much less often.

edit: I've also shared this to field some more opinions.
@tbg here are some of my thoughts about pprofs as a whole: …
To summarize, CPU and memory pprofs are incredibly powerful... but who are they most useful for? If we're going to make them much more usable for TSEs, readability and the freedom to quickly acquire pprofs are going to be the primary requirements to make this happen.
@NigelNavarro see the workaround for that issue in #101523 (comment); mind documenting this somewhere the TSEs can find it? Thanks for the other points. I think the CPU pprofs are supposedly much better in 23.1 because they contain the labels for the SQL statements. I think the jury is still out on how well the automatic CPU profiles work; for one data point, they default to off, so they would only be available after an additional round-trip to the customer and a recurrence.
I think we struggle to give the L2 teams the right profiles at the right time; that seems like a good plumbing problem to solve. If the profiles then also turn out to be good enough for TSEs, that would be an added bonus, but I think proper labels should go a very long way, at least when the spikes are workload-induced.
Sure thing, caaan do! I've added it to the "troubleshooting steps" section of this KB I created not too long ago. You're welcome to edit it and add any additional verbiage on what the hack is actually doing if you'd like. Thanks for reading all of my pprof observations/concerns. I'm excited to see what we come up with!
In #102734, @adityamaru added an endpoint to collect a cluster-wide CPU profile. So pointing today's pprof-loop at that endpoint should take the difficulty out of the process, at least when a CPU profile is requested. For traces, we could take a similar approach; filed #105035.
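As a rough illustration of what "pointing pprof-loop at that endpoint" amounts to, here is a minimal Go sketch of the loop. The URL, port, and query parameters are assumptions for illustration (the standard single-node /debug/pprof/profile path stands in for whatever URL the cluster-wide profile is served on), and a secure cluster would additionally need authentication.

```go
// Minimal sketch of the pprof-loop idea against a single HTTP endpoint.
// The endpoint path, port, and query parameters are assumptions for
// illustration; substitute the URL the cluster-wide CPU profile is
// actually served on.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Assumed URL; on a secure cluster this would also need credentials.
	const profileURL = "http://localhost:8080/debug/pprof/profile?seconds=10"

	for {
		resp, err := http.Get(profileURL)
		if err != nil {
			fmt.Fprintln(os.Stderr, "fetch failed:", err)
			time.Sleep(time.Second)
			continue
		}
		// Each 10s profile blocks for its duration, which paces the loop.
		name := fmt.Sprintf("cpu.%s.pb.gz", time.Now().Format("20060102_150405"))
		f, err := os.Create(name)
		if err == nil {
			_, _ = io.Copy(f, resp.Body)
			_ = f.Close()
		}
		_ = resp.Body.Close()
	}
}
```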
Had a quick offline conversation with @kevinkokomani about the desired UX for TSEs for sidestepping debug zips and getting access to CPU profiles in a more convenient manner. Noting it here so we don't lose track: …
Is your feature request related to a problem? Please describe.
We have the pprof-loop script [1], which helps us periodically collect cluster-wide profiles. This is necessary, for example, when we are experiencing rare events that need to be introspected with Go runtime support (NUMA issues, GC pressure, generally unexplainable latency in traces), or when there are intermittent spikes of high CPU activity that are difficult to catch in a manual profile [2].
In all such cases, we have customers run the script over a longer period of time until the event of interest occurs.
The script is hard to use, since it needs to be invoked on all nodes in the cluster simultaneously, followed by an artifacts collection step.
If we added an out-of-the-box solution that fanned out to the cluster (or a specified set of nodes) and collected the results in a single directory, this would be much easier.
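For concreteness, here is a hedged sketch of that fan-out idea: fetch a CPU profile from each node concurrently and write the results into one output directory, with a subdirectory per node and a timestamp in each filename. The node addresses, port, and endpoint path are placeholders rather than a real cluster topology, and a secure cluster would need credentials on top of this.

```go
// Sketch of the fan-out: collect a CPU profile from every node concurrently
// and drop the results into a single output directory, one subdirectory per
// node. Addresses and endpoint path are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"sync"
	"time"
)

func main() {
	// Placeholder node addresses; a real tool would discover these.
	nodes := []string{"http://node1:8080", "http://node2:8080", "http://node3:8080"}

	var wg sync.WaitGroup
	for i, base := range nodes {
		wg.Add(1)
		go func(i int, base string) {
			defer wg.Done()
			dir := filepath.Join("profiles", fmt.Sprintf("node%d", i+1), "cpu")
			if err := os.MkdirAll(dir, 0o755); err != nil {
				fmt.Fprintln(os.Stderr, err)
				return
			}
			resp, err := http.Get(base + "/debug/pprof/profile?seconds=10")
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				return
			}
			defer resp.Body.Close()
			name := fmt.Sprintf("cpu.%s.pb.gz", time.Now().Format("20060102_150405"))
			f, err := os.Create(filepath.Join(dir, name))
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				return
			}
			defer f.Close()
			_, _ = io.Copy(f, resp.Body)
		}(i, base)
	}
	wg.Wait()
}
```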
Describe the solution you'd like
Build that out-of-the-box solution, with the option to collect a CPU profile or a Go execution trace (both are important in different contexts, though CPU is easier since we're almost there). Here is a prototype: #96749
Replace the custom 10s fan-out CPU profile with an invocation of this tool, for a 10s CPU profile and a subsequent 1s runtime trace.
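A minimal sketch of that sequence, expressed against Go's standard net/http/pprof-style endpoints: a 10s CPU profile followed by a 1s execution trace from one node. The base URL is an assumption, and the built-in tool would presumably handle fan-out, retries, and authentication on top of this.

```go
// Sketch of the proposed sequence: a 10s CPU profile, then a 1s execution
// trace, fetched from the standard pprof endpoints of a single node.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetch downloads url into the file named out.
func fetch(url, out string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create(out)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	const base = "http://localhost:8080/debug/pprof" // assumed address
	if err := fetch(base+"/profile?seconds=10", "cpu.pb.gz"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	if err := fetch(base+"/trace?seconds=1", "trace.out"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```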
Describe alternatives you've considered
Additional context
Jira issue: CRDB-28055
Epic CRDB-32402
Footnotes
[1] https://github.com/cockroachdb/cockroach/blob/master/scripts/pprof-loop.sh
[2] Though in such cases, hopefully the about-to-be-introduced CPU profiler will get us there right away!