rpc,sql: propagate pprof labels for SQL queries to KV #86012

Closed
tbg opened this issue Aug 12, 2022 · 12 comments
Labels
A-kv-observability A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Member

tbg commented Aug 12, 2022

Is your feature request related to a problem? Please describe.

While CPU profiling, we usually generate pprof labels that can help pin work to specific queries. This is all pretty half-baked (I think that, as is, it misses lots of queries due to the code path it annotates), but it has been helpful in some investigations.

// Attach statement metadata as pprof labels for the duration of the
// statement's execution on this goroutine and its descendants.
labels := pprof.Labels(
	"appname", ex.sessionData().ApplicationName,
	"addr", remoteAddr,
	"stmt.tag", ast.StatementTag(),
	"stmt.no.constants", stmtNoConstants,
)
pprof.Do(ctx, labels, func(ctx context.Context) {
	ev, payload, err = ex.execStmtInOpenState(ctx, parserStmt, prepared, pinfo, res, canAutoCommit)
})

However, these profiler labels don't propagate across RPC boundaries, so for any query that fans out, only the local portion will have the labels (since there is a direct goroutine descendancy relationship between the goroutine we annotated and the goroutines that do the work). In particular, if the expensive portion of the processing happens on a remote node, it will not be annotated. The investigator may then be misled into thinking that another, local (but maybe not expensive) query is driving the load, which could lead to a lengthy detour.

Describe the solution you'd like

Propagate pprof labels across RPC boundaries. The right way to do this is likely via gRPC interceptors. I wrote them as part of a past prototype, #60508, so some code exists but isn't merged.
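A minimal sketch of what such interceptors could look like, assuming a unary RPC path and a hypothetical crdb-pprof- metadata key prefix (the actual prototype in #60508 may differ):

// Hypothetical sketch (not the #60508 code): forward pprof labels over unary
// RPCs via gRPC metadata and re-apply them on the server's handler goroutine.
package rpcutil

import (
	"context"
	"runtime/pprof"
	"strings"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// pprofLabelPrefix is an illustrative metadata key prefix for forwarded labels.
const pprofLabelPrefix = "crdb-pprof-"

// UnaryClientInterceptor copies the caller's pprof labels into outgoing metadata.
func UnaryClientInterceptor(
	ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption,
) error {
	pprof.ForLabels(ctx, func(key, value string) bool {
		ctx = metadata.AppendToOutgoingContext(ctx, pprofLabelPrefix+key, value)
		return true
	})
	return invoker(ctx, method, req, reply, cc, opts...)
}

// UnaryServerInterceptor re-applies forwarded labels around the handler, so that
// CPU samples taken on the receiving node carry the sender's labels.
func UnaryServerInterceptor(
	ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler,
) (interface{}, error) {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok {
		return handler(ctx, req)
	}
	var kvs []string
	for k, vs := range md {
		if strings.HasPrefix(k, pprofLabelPrefix) && len(vs) > 0 {
			kvs = append(kvs, strings.TrimPrefix(k, pprofLabelPrefix), vs[0])
		}
	}
	if len(kvs) == 0 {
		return handler(ctx, req)
	}
	var resp interface{}
	var err error
	pprof.Do(ctx, pprof.Labels(kvs...), func(ctx context.Context) {
		resp, err = handler(ctx, req)
	})
	return resp, err
}

In this sketch the client only pays for the metadata copy when labels are actually set, and the server only re-labels the handler goroutine when forwarded labels are present.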

Ideally, while revisiting this, we'd also move the label annotation to a more useful place, noting what was done (but not merged) in #85977, as I realized there that we weren't capturing all classes of statements. If we are to use this feature, it can't have big blind spots; wrong information is worse than no information in this case.

Describe alternatives you've considered

We could decide that we don't wish to leverage profiler labels to find offending queries, and instead rely on something else, for example #85977. However, pprof labels seem like a good stopgap measure and I think most of the work is already done.

Additional context

Jira issue: CRDB-18526

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability-inf labels Aug 12, 2022
@irfansharif
Contributor

Would we always annotate contexts with profiler labels containing statement data, even if that node isn't currently being profiled? We might be grabbing a CPU profile on a remote node where the statement didn't originate, and would be relying on propagation having already happened. We should evaluate the overhead, if any, of always annotating. #85948 describes some techniques we use in KV to make this more alloc-efficient.

@tbg
Member Author

tbg commented Aug 12, 2022

That's a good point. If the originator of the work sits elsewhere, we need to tell it to attach the labels.
The cheap way to do it is to profile the entire cluster at the same time, over a long enough interval (so that the overlap is ~100% of the time); I think 10s will already be enough.

We could also try to always add labels, but that involves a longer performance fight, and I tend to think that fleet-wide profiling is a good enough workaround.

@sumeerbhola
Collaborator

I would really like for us to build per-statement distributed CPU profiling. In a past life this has been invaluable in understanding slow queries and driving system performance improvements, in a way that tracing was inadequate for. It is very hard to understand which code paths are responsible for a slow query in a production environment, since there are other queries running concurrently (this can be non-trivial even in complex benchmarks where many queries are running) and it is hard to even know which nodes a query will touch. See https://github.com/cockroachlabs/support/issues/2033#issuecomment-1409654095 for an example where traces did not help us understand the root cause of the slowness.

This would roughly work as follows (a rough code sketch follows the list):

  • At most one query profile can be running at a node at a time. This is fine since this will be used infrequently, and we can assume human coordination to typically only run one of these in a cluster at a time.

  • At the node level there is a refcount of the "root" goroutines that have "joined" the currently running profile, and the statement ID of that running profile. When a new "root" goroutine (say incoming RPC's goroutine) sees that its payload is requesting profiling it compares its payload's statement ID with the running profile and either (a) starts a cpu profile (if no profile is running), and joins it by setting a go profiler label, (b) joins the existing profiling by setting a profiler label, (c) remembers that it could not profile so it can return a profiling-error when it eventually returns the results. (a) and (b) increment the refcount.

    • The profile for a statement can be started and stopped at a node multiple times (due to multiple passes, e.g. multiple BatchRequests, etc.).
  • When the "root" goroutine is done, it decrements the refcount and if the refcount drops to 0, it stops the profile, fetches the result, filters out the part that does not match the profiler label, and packages this to piggyback in the RPC response.

  • The "real" root is at the gateway node that started executing the statement, and it aggregates the responses received and its own profile, buckets it by node, and writes it to files when the statement execution is complete. This allows us to look at profiles per node (to discover which node was responsible for the slowness) and also as an aggregate.
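A minimal sketch of the per-node bookkeeping described in the list above. All names (registry, join, leave, the placeholder start/stop functions, the stmt.id label) are hypothetical; error handling and the actual profile collection and filtering are elided:

// Hypothetical sketch of the node-level refcounted "join the running
// statement profile" logic. Not actual CockroachDB code.
package stmtprofiler

import (
	"context"
	"errors"
	"runtime/pprof"
	"sync"
)

var errProfileBusy = errors.New("a different statement is already being profiled on this node")

type registry struct {
	mu       sync.Mutex
	stmtID   string // statement ID of the currently running profile, if any
	refCount int    // number of "root" goroutines that have joined it
}

// join is called by a "root" goroutine (e.g. an incoming RPC handler) whose
// payload requests profiling for stmtID.
func (r *registry) join(stmtID string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	switch {
	case r.refCount == 0:
		// (a) No profile is running: start one and join it.
		r.stmtID = stmtID
		startNodeCPUProfile() // placeholder for starting a CPU profile into a buffer
	case r.stmtID == stmtID:
		// (b) Same statement: join the existing profile.
	default:
		// (c) A different statement holds the profiler; report this in the response.
		return errProfileBusy
	}
	r.refCount++
	return nil
}

// leave is called when the "root" goroutine finishes. When the refcount drops
// to zero, the profile is stopped and its samples (filtered to this statement's
// label) would be packaged to piggyback on the RPC response.
func (r *registry) leave() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refCount--
	if r.refCount == 0 {
		stopNodeCPUProfileAndCollect(r.stmtID) // placeholder
		r.stmtID = ""
	}
}

// run wraps the root goroutine's work in a pprof label so that samples can be
// attributed to (and later filtered down to) this statement.
func (r *registry) run(ctx context.Context, stmtID string, work func(context.Context)) {
	if err := r.join(stmtID); err != nil {
		work(ctx) // could not profile; a real implementation would record err for the response
		return
	}
	defer r.leave()
	pprof.Do(ctx, pprof.Labels("stmt.id", stmtID), work)
}

func startNodeCPUProfile()                   {} // placeholder
func stopNodeCPUProfileAndCollect(id string) {} // placeholder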

@kevin-v-ngo

kevin-v-ngo commented Feb 7, 2023

After chatting with @maryliag: SQL Obs will need to collaborate with KV or potentially other subject matter experts. FYI @thtruo

@thtruo
Contributor

thtruo commented Feb 7, 2023

Thanks for the heads up Kevin! cc @nkodali for your awareness

@jordanlewis
Member

DistSQL flow requests carry their initiating SQL in the request.

So, to enable this for DistSQL, it should be straightforward: we can apply pprof labels when the DistSQL flow begins execution on the remote node, tagging the stack with stmt.no.constants (and maybe also with distsql.remote or something, so it's clear which samples are remote).
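For illustration, a sketch of what that tagging could look like at the point where the remote flow starts running; the helper name, the stmtNoConstants parameter, and the runFlow callback are stand-ins, not the actual DistSQL API:

package distsqlprof

import (
	"context"
	"runtime/pprof"
)

// runRemoteFlowWithLabels is a hypothetical helper: it labels the remote
// DistSQL flow's goroutines with the statement fingerprint carried in the
// flow setup request, plus a marker that the samples come from a remote flow.
func runRemoteFlowWithLabels(ctx context.Context, stmtNoConstants string, runFlow func(context.Context)) {
	labels := pprof.Labels(
		"stmt.no.constants", stmtNoConstants, // same label the gateway sets
		"distsql.remote", "true",             // distinguishes remote flow samples
	)
	pprof.Do(ctx, labels, runFlow)
}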

@jordanlewis
Member

jordanlewis commented Feb 13, 2023

Update: #82464 solved the missing DistSQL label propagation in 22.2, but there's still a need to propagate the pprof labels into the KV layer as per the original issue.

@tbg
Member Author

tbg commented Feb 15, 2023

Update: #82464 solved this in 22.2.

That's great! This is a significant improvement over the status quo. Are we aware of cases where this wouldn't have been enough? I think if we had a node which had lots of leases for write-heavy ranges, but the writes originated from a remote gateway, we would still be flying blind (if there were lots of reads, it's likely though not guaranteed that there'd be a local DistSQL leaf, which would have the tags).

@jordanlewis
Member

Yeah, there are still plenty of cases where we'd need the KV tag propagation: the cases you mention, as well as point reads (which don't use DistSQL), index joins, zigzag joins, lookup joins, etc. My comment about "solving" was just about DistSQL, but it wasn't entirely clear; I've edited it to prevent confusion.

@tbg
Member Author

tbg commented Jun 16, 2023

Summarizing the current state.

The motivating example: two nodes A and B, where A drives all of the load (i.e. is the SQL gateway) for B, and we want to understand which query/user/app drives the load. Currently, if we profile on B, we don't get any labels (note that we don't necessarily know which node A is; there might be 100s of nodes in this cluster).
Even if we profile on A and B concurrently, we don't get any labels, because labels on A are not encoded on the wire in RPCs (and would not be added to the goroutine's labels in the receiving handler on B even if they were).

Scope of this issue:

  • the scope of this issue continues to be propagating pprof labels (if they are set) from the sender side of an RPC to the receiver side (if it is also profiling).
  • we have "server: add debug/pprof/ui/cpu/?node=all endpoint for cluster-wide CPU profile" #102734, so we can easily get all nodes to profile simultaneously and obtain a merged profile with complete labelling; this approach just needs to be adopted by Support for releases that support it (23.2+).

Related ideas (please discuss them on their respective tracking issues):

Footnotes

  1. https://github.com/cockroachdb/cockroach/issues/86012#issuecomment-1411004870

@adityamaru
Contributor

the scope of this issue continues to be propagating pprof labels (if they are set) from the sender side of an RPC to the receiver side (if it is also profiling).

@tbg have you seen #101404? The hypothesis was that handling this for BatchRequests gets us most of the way there.

@tbg
Member Author

tbg commented Jun 16, 2023

I hadn't! Great, that's the missing piece. We could always do this "more holistically", but this is fine; let's close this issue, since all follow-up items are tracked separately.

@tbg tbg closed this as completed Jun 16, 2023