
kvclient,server: propagate pprof labels for BatchRequests #101404

Merged (2 commits) · Apr 22, 2023

Conversation

adityamaru
Contributor

This change teaches DistSender to populate a BatchRequest's header with the pprof labels set on the sender's context. If the node processing the request on the server side has CPU profiling with labels enabled, then the labels stored in the BatchRequest will be applied to the root context of the goroutine executing the request on the server. Doing so ensures that all server-side CPU samples of the root goroutine and all of its spawned goroutines are labeled correctly.

Propagating these labels across RPC boundaries is useful for correlating server-side samples with the sender. For example, in a CPU profile generated with this change, we will be able to identify which backup job sent an ExportRequest that is causing a CPU hotspot on a remote node.

Fixes: #100166
Release note: None

@adityamaru adityamaru requested review from dt, tbg and miretskiy April 13, 2023 00:55
@adityamaru adityamaru requested review from a team as code owners April 13, 2023 00:55
@cockroach-teamcity
Member

This change is Reviewable

@adityamaru
Contributor Author

I wanted to get some feedback on this approach, and on whether we need a benchmark of how expensive it is to check for pprof labels on each sent BatchRequest. I tried this patch out on a roachprod cluster where the majority of ExportRequests were sent from n7, n8, and n9 but were processed on n1, n2, and n3, and it seems to work as expected. This is a profile on remote n1 while the coordinator of the job is on n9:

❯ go tool pprof profile.pb.gz
File: cockroach
Type: cpu
Time: Apr 12, 2023 at 12:52pm (EDT)
Duration: 5.14s, Total samples = 1.79s (34.85%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) tags
 distsql.gateway: Total 610.0ms
                  610.0ms (  100%): 8

 f: Total 610.0ms
    610.0ms (  100%): 73b7a58d

 job: Total 610.0ms
      610.0ms (  100%): BACKUP id=856056997655871496

 n: Total 620.0ms
    610.0ms (98.39%): 9
     10.0ms ( 1.61%): 3

 pebble: Total 240.0ms
         180.0ms (75.00%): compact
          60.0ms (25.00%): wal-sync

 r: Total 10.0ms
    10.0ms (  100%): 58

 range_str: Total 610.0ms
            260.0ms (42.62%): 579/6:/Table/108/1/167{3/5/…-5/3/…}
            240.0ms (39.34%): 1352/2:/Table/108/1/16{84/8/…-96/3/…}
            110.0ms (18.03%): 1348/2:/Table/108/1/167{5/3/…-9/2/…}

 remote_node_id: Total 50.0ms
                 40.0ms (80.00%): 7
                 10.0ms (20.00%): 9

 s: Total 10.0ms
    10.0ms (  100%): 3

Profile with -tagfocus set to our BACKUP job.

[Two screenshots omitted: the profile graph with -tagfocus applied to the BACKUP job.]

pkg/server/node.go (review thread, outdated; resolved)
@adityamaru
Contributor Author

adityamaru commented Apr 18, 2023

@tbg I'm keen to get your take on this approach. We wanted something that we could potentially backport to older releases, and thought that this change gave us enough coverage to begin with (all BatchRequests would propagate labels). The alternative would be writing gRPC interceptors to get complete coverage, but that would probably be harder to reason about for a backport. Are there any benchmarks or optimizations that you would like to see as part of this change?

Contributor

@knz knz left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @miretskiy, and @tbg)


pkg/server/node.go line 1251 at r2 (raw file):

	} else {
		// We had this tag before the ResetAndAnnotateCtx() call above.
		ctx = logtags.AddTag(ctx, "tenant", tenantID.String())

This should really be AddTag(ctx, "tenant", tenantID) so it doesn't become redactable. Feel free to add a commit to do that.

Contributor Author

@adityamaru adityamaru left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @knz, @miretskiy, and @tbg)


pkg/server/node.go line 1251 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

This should really be AddTag(ctx, "tenant", tenantID) so it doesn't become redactable. Feel free to add a commit to do that.

Done.

@adityamaru
Contributor Author

TFTR!

bors r=dt

@craig
Contributor

craig bot commented Apr 22, 2023

Build succeeded.

@craig craig bot merged commit 1d8a8a0 into cockroachdb:master Apr 22, 2023
@adityamaru
Contributor Author

blathers backport 23.1 22.2

@adityamaru
Contributor Author

I will let this bake on master for a couple of weeks before merging the backports

@blathers-crl

blathers-crl bot commented Apr 22, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from e9883f5 to blathers/backport-release-22.2-101404: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Successfully merging this pull request may close these issues.

kvserver: propagate job pprof labels to KV requests sent on behalf of the job
4 participants