
cli: build in pprof-loop.sh for CPU profiles and Go execution traces #97174

Open
tbg opened this issue Feb 15, 2023 · 9 comments
Labels
A-observability-inf C-escalation-improvement Having this feature would have made an escalation easier O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-observability

Comments

@tbg
Member

tbg commented Feb 15, 2023

Is your feature request related to a problem? Please describe.

We have the pprof-loop script[^1], which helps us periodically collect cluster-wide profiles. This is necessary, for example, when we are experiencing rare events that need to be introspected with Go runtime support (NUMA issues, GC pressure, generally unexplainable latency in traces), or when there are intermittent spikes of high CPU activity that are difficult to catch with a manual profile.[^2]

In all such cases, we have customers run the script over a longer period of time until the event of interest occurs.

The script is hard to use, since it needs to be invoked on all nodes in the cluster simultaneously and followed by an artifact-collection step.
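
For concreteness, here is a minimal Go sketch of what the per-node loop boils down to. Everything specific here is an assumption for illustration (node address, file naming, 10s duration, insecure cluster); the real script is a shell loop around curl, and the endpoint is Go's standard net/http/pprof CPU profile handler:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	const node = "http://localhost:8080" // hypothetical node HTTP address
	for {
		// Go's net/http/pprof CPU endpoint blocks for ?seconds=N and
		// then returns the collected profile, so the loop naturally
		// produces back-to-back 10s profiles.
		resp, err := http.Get(node + "/debug/pprof/profile?seconds=10")
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			time.Sleep(time.Second)
			continue
		}
		// Timestamped filenames so consecutive profiles don't collide.
		name := fmt.Sprintf("cpu.%s.pprof", time.Now().Format("20060102T150405"))
		f, err := os.Create(name)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			resp.Body.Close()
			continue
		}
		if _, err := io.Copy(f, resp.Body); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
		f.Close()
		resp.Body.Close()
	}
}
```

Running one of these per node, by hand, and then gathering the output directories is exactly the friction described above.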

If we added an out-of-the-box solution that fanned out to the cluster (or a specified set of nodes) and collected the results in a single directory, this would be much easier.

Describe the solution you'd like

Build that out-of-the-box solution, with an option to collect either a CPU profile or a Go execution trace (both are important in different contexts, though CPU is easier since we're almost there). Here is a prototype: #96749

Replace the custom 10s fan-out CPU profile with an invocation of this tool, for a 10s CPU profile and a subsequent 1s runtime trace.
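
As a rough illustration (this is not the actual prototype in #96749; node addresses and file layout are assumptions, and an insecure cluster is assumed), a hedged sketch of the fan-out, using the standard net/http/pprof handlers for a 10s CPU profile followed by a 1s execution trace:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"sync"
)

// fetch downloads url into dest.
func fetch(url, dest string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Hypothetical node addresses; the real tool would discover these.
	nodes := []string{"http://n1:8080", "http://n2:8080", "http://n3:8080"}
	const outDir = "profiles" // single collection directory
	if err := os.MkdirAll(outDir, 0755); err != nil {
		panic(err)
	}
	var wg sync.WaitGroup
	for i, node := range nodes {
		wg.Add(1)
		go func(i int, node string) {
			defer wg.Done()
			// Per the proposal: a 10s CPU profile, then a 1s runtime trace.
			if err := fetch(node+"/debug/pprof/profile?seconds=10",
				filepath.Join(outDir, fmt.Sprintf("n%d.cpu.pprof", i+1))); err != nil {
				fmt.Fprintln(os.Stderr, err)
			}
			if err := fetch(node+"/debug/pprof/trace?seconds=1",
				filepath.Join(outDir, fmt.Sprintf("n%d.trace.out", i+1))); err != nil {
				fmt.Fprintln(os.Stderr, err)
			}
		}(i, node)
	}
	wg.Wait()
}
```

A built-in version would additionally handle node discovery, authentication, retries, and output naming, which is where most of the real work lies.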

Describe alternatives you've considered

Additional context

Jira issue: CRDB-28055

Epic CRDB-32402

Footnotes

[^1]: https://github.com/cockroachdb/cockroach/blob/master/scripts/pprof-loop.sh

[^2]: Though in such cases, hopefully the about-to-be-introduced CPU profiler will get us there right away!

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability-inf labels Feb 15, 2023
@kevinkokomani
Contributor

For me, ideally there would be a place on the DB Console -> Advanced Debug page where I can select the types of profiles I want and the node(s) to gather them from, hit a button, and continually gather profiles until I hit the button again (or for a configurable length of time). I would then get a zip download containing node-specific folders, with subdirectories for each profile type I gathered and the timestamp in each profile's filename.

@tbg
Member Author

tbg commented May 16, 2023

@kevinkokomani points out that the script likely doesn't work with secure clusters. Another reason to build it into CRDB. We could change the script so that the user supplies a working `curl` invocation, but this adds even more friction (they'd need to use `cockroach auth-session login <sql_user> --certs-dir=<certs_dir>`, etc.).
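
For illustration, a hedged sketch of what that friction looks like for any HTTP-based collector on a secure cluster; the cookie header and placeholder token below are assumptions for the sketch, not an exact recipe:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	req, err := http.NewRequest("GET",
		"https://n1:8080/debug/pprof/profile?seconds=10", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder: paste the session token printed by
	// `cockroach auth-session login <sql_user> --certs-dir=<certs_dir>`.
	req.Header.Set("Cookie", "session=<token-from-auth-session-login>")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	// Write the profile to stdout; redirect to a file in practice.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```

A built-in tool could reuse the node's own credentials and skip this round-trip entirely.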

@tbg
Member Author

tbg commented May 17, 2023

There is #102734, which is related though not quite the same, since pprof-loop also allows runtime traces, etc., and targets a single node.

@kevinkokomani it would be helpful to get TSEs' opinions on which gaps are most important after #102734.

@kevinkokomani
Contributor

kevinkokomani commented May 25, 2023

@tbg Sorry, I'm just seeing this. Reading through #102734, it seems to propose a merged point-in-time CPU profile for troubleshooting cluster-wide issues. I'm not sure that addresses the same issue we want to address here. Having a merged cluster-wide CPU profile is nice, but what we're after in this issue is the ability to capture profiles in situations where the spikes are very short and sharp. When the CPU increases are sustained, we can easily grab a CPU profile at our leisure, even for multiple nodes. But when they're not sustained and are instead "random" and spiky, the pprof-loop is our only recourse. Continually gathering CPU profiles in chunks over a period of time also gives us a continuous stream of data for easily comparing CPU valleys and peaks, which can be quite useful.

That said, is this all moot given that (AFAIK, at least) we are planning to implement automatic CPU profiling when spikes are detected, similar to heap profiling? If that is true but productionizing pprof-loop is still seen as useful, would that be because there are expected cases where spikes are short and severe enough not to be captured by an automatic profiler?

In terms of which profiling endpoints are most important in general: we are normally well covered on heap profiles thanks to automatic heap profiling plus the ability to gather them on demand. CPU profiles are the next most common thing we need to look at (if not the most common). All other endpoints are used much less often.

edit: I've also shared this to field some more opinions.

@NigelNavarro
Collaborator

@tbg here are some of my thoughts about pprofs as a whole:

  • As it currently stands, the "graph" option in the CPU/Heap pprof doesn't always load (it shows a text message instead, suggesting we install graphviz, even when the latest version is installed).
    • The thing is, this does not mean the pprof is corrupted: if you take the base URL (such as http://localhost:12345/ui/) and append a page other than the main "graph" page (such as http://localhost:12345/ui/flamegraph), the flamegraph loads successfully. I don't know who or what is in charge of rendering the pprof details, but this inconsistency needs to be addressed.
  • We have attempted before (see this draft TSE KB) to help ourselves understand the contents of a pprof. While those with much longer experience reading these graphs can generally decipher what may be going on based on historical analysis, we realistically don't have an encyclopedia or definition table to help us understand the patterns and behaviors in the pprofs themselves. That isn't sustainable long term and will continue to be something we page #KV to assist with.
  • Gathering pprofs the moment we request them is crucial for understanding issues as they happen. Times when we would like pprofs but cannot readily obtain them include (but are not limited to):
    • CPU peak saturation and near-OOM scenarios
    • Intermittent spikes in CPU that only last for seconds/minutes
    • Cluster unresponsive scenarios (DB Console/CLI unavailability, Asymmetric Network Partitions, etc.)
  • This is where automatically saved pprofs would be helpful, taken periodically and/or during a window of the user's choosing (thinking along the lines of a cluster setting for X profiles per second or something).
    • I'm not sure of the intricacies of profiling, but if disk space is a concern, perhaps there is a solution that uses pointers instead, allowing a pprof request to point to a certain timeframe, with the pprof generated afterwards from historical data (like from a snapshot).
  • Similar to the previous point, historical CPU/memory analysis is just as important as being able to capture a pprof ad hoc. As Kevin mentions above, comparing CPU behavior over time is quite helpful in establishing patterns that may point to a specific workload.
  • Tying pprof behavior back to the workload has always been a primary concern. In each of our cases where pprofs were gathered, we had to use the results of the analysis to approximate which query or set of transactions caused the CPU/memory spike. I know we have some things to address this in 23.1; however, we must keep making an active effort to make it easier to correlate CPU/memory pprofs with the offending transactions.

To summarize, CPU and memory pprofs are incredibly powerful... but who are they most useful for? If we're going to make them much more usable for TSEs, readability and the freedom to quickly acquire pprofs will be the primary requirements.

@tbg
Member Author

tbg commented May 26, 2023

As it currently stands, the "graph" option in the CPU/Heap pprof doesn't always load (it shows a text message instead, suggesting we install graphviz, even when the latest version is installed).
The thing is, this does not mean the pprof is corrupted [...]

@NigelNavarro see the workaround for that issue in #101523 (comment), mind documenting this somewhere the TSEs can find?

Thanks for the other points. I think the CPU pprofs should be much better in 23.1 because they contain labels for the SQL statements. I think the jury is still out on how well the automatic CPU profiles work; for one datapoint, they default to off, so they would only be available after an additional round-trip to the customer and a recurrence of the problem.

To summarize, CPU and Memory pprofs are incredibly powerful... but who are they most useful for? If we're going to make them much more usable for TSEs, readability and freedom to quickly acquire pprofs are going to be the primary requirements to make this happen.

I think we struggle to give the L2 teams the right profiles at the right time; that seems like a good plumbing problem to solve. If the profiles then also turn out to be good enough for TSEs, that would be an added bonus, but I think proper labels should go a very long way, at least when the spikes are workload-induced.

@NigelNavarro
Collaborator

@NigelNavarro see the workaround for that issue in #101523 (comment), mind documenting this somewhere the TSEs can find?

Sure thing, caaan do! I've added it to the "troubleshooting steps" section of this KB I created not too long ago. You're welcome to edit it and add any additional verbiage about what the hack is actually doing if you'd like.

Thanks for reading all of my pprof observations/concerns. I'm excited to see what we come up with!

@tbg
Member Author

tbg commented Jun 16, 2023

In #102734, @adityamaru added an endpoint to collect a cluster-wide CPU profile, so pointing today's pprof-loop at that endpoint should take the difficulty out of the process, at least when a CPU profile is requested. For traces, we could take a similar approach; filed #105035.

@maryliag maryliag added C-escalation-improvement Having this feature would have made an escalation easier O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs and removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Jul 6, 2023
@exalate-issue-sync exalate-issue-sync bot added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs C-escalation-improvement Having this feature would have made an escalation easier labels Jul 11, 2023
@jlinder jlinder added O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs C-escalation-improvement Having this feature would have made an escalation easier and removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Jul 11, 2023
@thtruo
Contributor

thtruo commented Aug 8, 2023

Had a quick offline conversation with @kevinkokomani about the desired TSE UX for sidestepping debug zips and getting access to CPU profiles in a more convenient manner. Noting it here so we don't lose track:

I imagine this would be implemented as a button in the DB Console which provides the following configuration options:

  • which profiles to get (heap, CPU, goroutine, etc)
  • which nodes to get profiles from
  • how long to get profiles for
    • one time
    • for a discrete amount of time (where this pprof loop idea comes in)
    • indefinitely, until I issue a cancel

@dhartunian dhartunian added P-2 Issues/test failures with a fix SLA of 3 months and removed P-2 Issues/test failures with a fix SLA of 3 months labels Jan 16, 2024
@dhartunian dhartunian added the P-3 Issues/test failures with no fix SLA label Jan 19, 2024