server: add debug/pprof/ui/cpu/?node=all endpoint for cluster-wide CPU profile #102734
Conversation
friendly ping @dhartunian, just so that this is on your radar!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dhartunian)
pkg/server/status.go
line 1496 at r1 (raw file):
}
resps := make([]profData, len(nodeList))
I think this wants to use memory accounting.
pkg/server/status.go
line 1500 at r1 (raw file):
nodeID := roachpb.NodeID(nodeList[i])
wg.Add(1)
go func(ctx context.Context, i int) {
I feel this needs a semaphore.
Why not use `iterateNodes()` here?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/server/status.go
line 1496 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I think this wants to use memory accounting.
Hmm I don't think the status server has a monitor hanging off it today, or am I just missing it? The admin server seems to have one but it's an unlimited monitor that logs above some constant threshold (line 249 in 3c5c368: `server.memMonitor = mon.NewUnlimitedMonitor(`). Are you envisioning something similar but for the status server?
pkg/server/status.go
line 1500 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I feel this needs a semaphore.
Why not use `iterateNodes()` here?
Nice! I did not know about this method. I'll use this.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/server/status.go
line 1496 at r1 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
Hmm I don't think the status server has a monitor hanging off it today, or am I just missing it? The admin server seems to have one but it's an unlimited monitor that logs above some constant threshold (line 249 in 3c5c368: `server.memMonitor = mon.NewUnlimitedMonitor(`). Are you envisioning something similar but for the status server?
it doesn't appear that the existing profile path has monitoring either:
var buf bytes.Buffer
I'd say adding monitoring -- to it and this new path -- sounds like a separate, standalone change rather than something to pull into this change?
@knz would you like to see anything more done here before merging?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dhartunian)
pkg/server/status.go
line 1500 at r1 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
Nice! I did not know about this method. I'll use this.
- I checked the docs + implementation and it appears you cannot use pprof.Merge unless all the profiles come from the same binary. So you need to check the node ExecutableVersion on each remote node before taking a profile there.
- I feel icky about collecting all the profiles in RAM. I disagree with David that it's a separate project; the other profiling we do only collects a profile for just 1 node, not for all of them. Maybe a compromise would be to store them in temp files and never keep more than 1-2 of them at a time for the merge operation.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/server/status.go
line 1500 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I checked the docs + implementation and it appears you cannot use pprof.Merge unless all the profiles come from the same binary. So you need to check the node ExecutableVersion in each remote node before taking a profile there.
I feel icky about collecting all the profiles in RAM. I disagree with David it's a separate project; the other profiling we do only collects a profile for just 1 node, not for all of them. Maybe a compromise would be to store them into temp files and never keep more than 1-2 of them at a time for the merge operation.
My argument for doing these separately is that we're no worse off merging this as-is now -- nothing that is currently working will break or start to OOM. Then we'll have this available in cases where we determine it is OK to use (either limited node counts or plenty of memory headroom, which, fwiw, the majority of larger clusters have).
Then we can follow up to add monitoring or disk spilling or whatever, and at that point we might feel comfortable documenting it, recommending it in more cases, etc. But the diff as-is is nice and small and contained, which, if we need to backport it (for a customer in the aforementioned cases where we know it is safe), would make it easier than if we do more refactoring at the same time.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/server/status.go
line 1500 at r1 (raw file):
Then we'll have this available in cases where we determine it is OK to use
I also added a warning in the text on the Advanced Debug page calling out that this is an expensive operation with memory overhead. Similar to how logspy calls out its impact on performance.
I checked the docs + implementation and it appears you cannot use pprof.Merge unless all the profiles come from the same binary. So you need to check the node ExecutableVersion in each remote node before taking a profile there.
Nice catch, I added a check. I'm going to play around with this in a mixed version cluster as well to make sure it works as expected.
Reviewed 1 of 3 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dhartunian)
pkg/server/status.go
line 1496 at r3 (raw file):
// as the node that is going to merge the profiles.
statusClient := client.(serverpb.StatusClient)
resp, err := statusClient.Node(ctx, &serverpb.NodeRequest{NodeId: "local"})
You have an atomicity error between the call to Node here and the call to Profile below. They could be called across an upgrade of the node. IMHO you can either:
- make `Profile` return the executable version alongside the profile in the response payload; or
- pass the required executable version in the request and make the remote Profile handler refuse to do anything if its version does not match what is requested.
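Option 2 from the review can be sketched as pure request/handler logic. This is an illustrative stand-alone Go sketch under assumed names: `ProfileRequest`, its `SenderServerVersion` field, and the `node`/`profile` types here are hypothetical stand-ins for the serverpb messages and the RPC handler, not the actual CockroachDB code.

```go
package main

import "fmt"

// ProfileRequest is an illustrative stand-in for the serverpb request;
// the sender includes its own executable version in the request.
type ProfileRequest struct {
	SenderServerVersion string
}

// node stands in for the remote status server handling the RPC.
type node struct {
	serverVersion string
}

// profile implements option 2: the remote handler refuses to collect a
// profile when the sender's executable version does not match its own,
// so profiles handed to pprof.Merge always come from one binary.
func (n *node) profile(req ProfileRequest) ([]byte, error) {
	if req.SenderServerVersion != "" && req.SenderServerVersion != n.serverVersion {
		return nil, fmt.Errorf("version mismatch: sender is %s, this node is %s",
			req.SenderServerVersion, n.serverVersion)
	}
	return []byte("cpu-profile-bytes"), nil
}

func main() {
	same := &node{serverVersion: "v23.1"}
	if _, err := same.profile(ProfileRequest{SenderServerVersion: "v23.1"}); err == nil {
		fmt.Println("same version: profile collected")
	}

	mixed := &node{serverVersion: "v22.2"}
	if _, err := mixed.profile(ProfileRequest{SenderServerVersion: "v23.1"}); err != nil {
		fmt.Println("mixed version:", err)
	}
}
```

Because the check happens inside the same handler invocation that collects the profile, there is no window for the node to be upgraded between the version check and the collection, which is what made the separate Node-then-Profile call sequence racy.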
78937f3 to fe2e8fc
@knz I went with option 2 of rejecting the profile if the remote binary version isn't the same as the sender version. fwiw a mixed-version 22.2/23.1 cluster seems to pass
Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dhartunian)
pkg/server/status.go
line 1597 at r4 (raw file):
// same server version before collecting a profile.
if req.SenderServerVersion != nil {
	resp, err := s.Node(ctx, &serverpb.NodeRequest{NodeId: "local"})
lol, no you do not need to call the `Node` RPC handler to retrieve the server version. That's paying 8 layers of Go calls to access data that's already available locally.
Just use the Version cluster setting for the local node and its BinaryVersion() method.
server: add debug/pprof/ui/cpu/?node=all endpoint for cluster-wide CPU profile

This change adds a debug endpoint to the status server that collects CPU profiles from all nodes in the cluster and merges them before rendering the profile. The merge is performed using `pprof.Merge` (https://pkg.go.dev/github.com/google/pprof/profile#Merge).

A future patch will teach the jobs page to filter the cluster-wide profile down to only those samples labelled with the job's job ID. In this way a user can get the cluster-wide CPU profile for a particular job, making it easier to identify the cause of a hotspot or a performance regression.

Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/server/status.go
line 1597 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
lol, no you do not need to call the
Node
RPC handler to retrieve the server version. That's paying 8 layers of Go calls to access data that's already available locally.Just use the Version cluster setting for the local node and its BinaryVersion() method.
doh, I should've scrolled through what was hanging off of `s`. Fixed.
Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
TFTR! bors r=knz

Build succeeded.
This change adds a debug endpoint to the status server that collects CPU profiles from all nodes in the cluster and merges them before rendering the profile. The merge is performed using `pprof.Merge` (https://pkg.go.dev/github.com/google/pprof/profile#Merge).

A future patch will teach the jobs page to filter the cluster-wide profile down to only those samples labelled with the job's job ID. In this way a user can get the cluster-wide CPU profile for a particular job, making it easier to identify the cause of a hotspot or a performance regression.
Informs: #102735
Release note: None