
server: add debug/pprof/ui/cpu/?node=all endpoint for cluster-wide CPU profile #102734

Merged
merged 1 commit into from
May 23, 2023

Conversation

adityamaru
Contributor

@adityamaru adityamaru commented May 2, 2023

This change adds a debug endpoint to the status server that collects CPU profiles from all nodes in the cluster and merges them before rendering the profile. The merge is performed using pprof.Merge (https://pkg.go.dev/github.com/google/pprof/profile#Merge).

A future patch will teach the jobs page to filter this cluster-wide profile down to only the samples labelled with the job's job ID. In this way a user can get a cluster-wide CPU profile for a particular job, making it easier to identify the cause of a hotspot or a performance regression.

Informs: #102735
Release note: None
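
For readers unfamiliar with the pprof API, a minimal sketch of the merge step, assuming the per-node profiles have already been fetched as raw pprof bytes (mergeNodeProfiles and its [][]byte input are illustrative placeholders, not this PR's code):

```go
package example

import (
	"github.com/google/pprof/profile"
)

// mergeNodeProfiles parses each node's raw pprof data and merges the
// results into a single profile via pprof's Merge.
func mergeNodeProfiles(perNode [][]byte) (*profile.Profile, error) {
	parsed := make([]*profile.Profile, 0, len(perNode))
	for _, data := range perNode {
		p, err := profile.ParseData(data)
		if err != nil {
			return nil, err
		}
		parsed = append(parsed, p)
	}
	// Merge requires the profiles to be compatible (same sample types, etc.).
	return profile.Merge(parsed)
}
```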

@cockroach-teamcity
Member

This change is Reviewable

@adityamaru adityamaru marked this pull request as ready for review May 3, 2023 14:52
@adityamaru adityamaru requested review from a team May 3, 2023 14:52
@adityamaru adityamaru requested a review from a team as a code owner May 3, 2023 14:52
@adityamaru adityamaru requested a review from dhartunian May 3, 2023 14:52
@adityamaru
Contributor Author

friendly ping @dhartunian, just so that this is on your radar!

Contributor

@knz knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dhartunian)


pkg/server/status.go line 1496 at r1 (raw file):

	}

	resps := make([]profData, len(nodeList))

I think this wants to use memory accounting.


pkg/server/status.go line 1500 at r1 (raw file):

		nodeID := roachpb.NodeID(nodeList[i])
		wg.Add(1)
		go func(ctx context.Context, i int) {

I feel this needs a semaphore.

Why not use iterateNodes() here?
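
For context, a generic sketch of the semaphore-bounded fan-out being suggested, assuming a hypothetical fetchProfile RPC helper and concurrency limit; this is not CockroachDB's iterateNodes helper, just the pattern:

```go
package example

import (
	"context"
	"log"
	"sync"
)

// collectProfiles fans out one goroutine per node but caps how many run
// concurrently with a buffered channel used as a semaphore.
func collectProfiles(
	ctx context.Context,
	nodeIDs []int32,
	fetchProfile func(context.Context, int32) ([]byte, error),
) [][]byte {
	const maxConcurrent = 8 // assumed cap on simultaneous profile RPCs
	sem := make(chan struct{}, maxConcurrent)
	results := make([][]byte, len(nodeIDs))

	var wg sync.WaitGroup
	for i, id := range nodeIDs {
		wg.Add(1)
		go func(i int, id int32) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			data, err := fetchProfile(ctx, id)
			if err != nil {
				log.Printf("node %d: %v", id, err)
				return
			}
			results[i] = data
		}(i, id)
	}
	wg.Wait()
	return results
}
```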

Contributor Author

@adityamaru adityamaru left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/server/status.go line 1496 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I think this wants to use memory accounting.

Hmm, I don't think the status server has a monitor hanging off it today, or am I just missing it? The admin server seems to have one, but it's an unlimited monitor that logs above some constant threshold:

server.memMonitor = mon.NewUnlimitedMonitor(

Are you envisioning something similar but for the status server?


pkg/server/status.go line 1500 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I feel this needs a semaphore.

Why not use iterateNodes() here?

Nice! I did not know about this method. I'll use this.

Member

@dt dt left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/server/status.go line 1496 at r1 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

Hmm, I don't think the status server has a monitor hanging off it today, or am I just missing it? The admin server seems to have one, but it's an unlimited monitor that logs above some constant threshold:

server.memMonitor = mon.NewUnlimitedMonitor(

Are you envisioning something similar but for the status server?

It doesn't appear that the existing profile path has monitoring either.

I'd say adding monitoring -- to it and this new path -- sounds like a separate, standalone change rather than something to pull into this change?

@adityamaru
Contributor Author

@knz would you like to see anything more done here before merging?

Contributor

@knz knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dhartunian)


pkg/server/status.go line 1500 at r1 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

Nice! I did not know about this method. I'll use this.

  1. I checked the docs + implementation and it appears you cannot use pprof.Merge unless all the profiles come from the same binary. So you need to check the ExecutableVersion of each remote node before taking a profile there.

  2. I feel icky about collecting all the profiles in RAM. I disagree with David that it's a separate project; the other profiling we do only collects a profile for a single node, not for all of them. Maybe a compromise would be to store them in temp files and never keep more than 1-2 of them in memory at a time for the merge operation.
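
A minimal sketch of the compromise in point 2, assuming the per-node profiles have already been spilled to temp files; they are then parsed and merged pairwise so at most the running result and one parsed profile sit in memory (mergeFromFiles is a hypothetical helper, not code from this PR):

```go
package example

import (
	"os"

	"github.com/google/pprof/profile"
)

// mergeFromFiles merges spilled profiles one at a time, holding only the
// running merge result and one freshly parsed profile in memory.
func mergeFromFiles(paths []string) (*profile.Profile, error) {
	var merged *profile.Profile
	for _, path := range paths {
		data, err := os.ReadFile(path)
		if err != nil {
			return nil, err
		}
		p, err := profile.ParseData(data)
		if err != nil {
			return nil, err
		}
		if merged == nil {
			merged = p
			continue
		}
		// Merge pairwise so at most two profiles are held at once.
		m, err := profile.Merge([]*profile.Profile{merged, p})
		if err != nil {
			return nil, err
		}
		merged = m
	}
	return merged, nil
}
```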

Member

@dt dt left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/server/status.go line 1500 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…
  1. I checked the docs + implementation and it appears you cannot use pprof.Merge unless all the profiles come from the same binary. So you need to check the ExecutableVersion of each remote node before taking a profile there.

  2. I feel icky about collecting all the profiles in RAM. I disagree with David that it's a separate project; the other profiling we do only collects a profile for a single node, not for all of them. Maybe a compromise would be to store them in temp files and never keep more than 1-2 of them in memory at a time for the merge operation.

My argument for doing these separately is that we're no worse off with merging this as-is now -- nothing that is currently working will break or start to OOM. Then we'll have this, available in cases where we determine it is OK to use (either limited node counts or plenty of memory headroom, which, fwiw, the majority of larger clusters have).

Then we can follow up to add monitoring or disk spilling or whatever, and at that point we might feel comfortable documenting it, recommending it in more cases, etc. But the diff as-is is nice, small, and contained, which, if we need to backport it (for a customer in the aforementioned cases where we know it is safe), would make that easier than if we do more refactoring at the same time.

Contributor Author

@adityamaru adityamaru left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/server/status.go line 1500 at r1 (raw file):

Then we'll have this, available in cases where we determine it is OK to use

I also added a warning to the text on the Advanced Debug page calling out that this is an expensive operation with memory overhead, similar to how logspy calls out its impact on performance.

I checked the docs + implementation and it appears you cannot use pprof.Merge unless all the profiles come from the same binary. So you need to check the ExecutableVersion of each remote node before taking a profile there.

Nice catch, I added a check. I'm going to play around with this in a mixed-version cluster as well to make sure it works as expected.

Contributor

@knz knz left a comment

Reviewed 1 of 3 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dhartunian)


pkg/server/status.go line 1496 at r3 (raw file):

		// as the node that is going to merge the profiles.
		statusClient := client.(serverpb.StatusClient)
		resp, err := statusClient.Node(ctx, &serverpb.NodeRequest{NodeId: "local"})

You have an atomicity error between the call to Node here and the call to Profile below. They could be called across an upgrade of the node. IMHO you can either:

  1. make Profile return the executable version alongside the profile in the response payload; or
  2. pass the required executable version in the request and make the remote Profile handler refuse to do anything if its version does not match what is requested.
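
A hedged sketch of option 2, where the fan-out node sends its binary version in the request and the remote handler refuses on mismatch; the ProfileRequest type and field names here are simplified assumptions, not the actual serverpb definitions:

```go
package example

import (
	"context"
	"fmt"
)

// ProfileRequest is a simplified stand-in for the RPC request; the real
// proto would carry the sender's version alongside the profile options.
type ProfileRequest struct {
	SenderServerVersion string
}

// handleProfile refuses to collect a profile when the requesting node's
// binary version differs from the local one, so profiles from mismatched
// binaries are never produced, let alone merged.
func handleProfile(ctx context.Context, req *ProfileRequest, localVersion string) ([]byte, error) {
	if req.SenderServerVersion != "" && req.SenderServerVersion != localVersion {
		return nil, fmt.Errorf(
			"refusing to collect profile: sender version %q does not match local version %q",
			req.SenderServerVersion, localVersion)
	}
	// ... collect the local CPU profile and return the raw pprof bytes ...
	return nil, nil
}
```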

@adityamaru
Contributor Author

adityamaru commented May 17, 2023

@knz I went with option 2 of rejecting the profile if the remote binary version isn't the same as the sender version. fwiw, a mixed-version 22.2/23.1 cluster seems to pass pprof.Merge successfully, so I'm not sure what makes two samples incompatible for a merge 🤔. I believe this is the only place merge throws an error - https://github.com/golang/go/blob/master/src/internal/profile/merge.go#L440. Note, the newly added check won't protect us when running in a 23.1/23.2 mixed-version cluster until this is backported, because 23.1 won't know about the SenderServerVersion field on the request. If pprof.Merge does fail, it will not lead to a panic, because of the error handling logic around it.

@adityamaru adityamaru requested a review from knz May 17, 2023 19:49
Contributor

@knz knz left a comment

Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dhartunian)


pkg/server/status.go line 1597 at r4 (raw file):

	// same server version before collecting a profile.
	if req.SenderServerVersion != nil {
		resp, err := s.Node(ctx, &serverpb.NodeRequest{NodeId: "local"})

lol, no you do not need to call the Node RPC handler to retrieve the server version. That's paying 8 layers of Go calls to access data that's already available locally.

Just use the Version cluster setting for the local node and its BinaryVersion() method.

…CPU profile

This change adds a debug endpoint to the status server that collects
CPU profiles from all nodes in the cluster and merges them before
rendering the profile. The merge is performed using `pprof.Merge`
https://pkg.go.dev/github.com/google/pprof/profile#Merge.

A future patch will teach the jobs page to filter this cluster-wide
profile down to only the samples labelled with the job's job ID.
In this way a user can get a cluster-wide CPU profile for a particular
job, making it easier to identify the cause of a hotspot or a
performance regression.

Release note: None
Contributor Author

@adityamaru adityamaru left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/server/status.go line 1597 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

lol, no you do not need to call the Node RPC handler to retrieve the server version. That's paying 8 layers of Go calls to access data that's already available locally.

Just use the Version cluster setting for the local node and its BinaryVersion() method.

doh, I should've scrolled through what was hanging off of s. Fixed.

@adityamaru adityamaru requested a review from knz May 23, 2023 21:16
Contributor

@knz knz left a comment

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)

@adityamaru
Contributor Author

TFTR!

bors r=knz

@craig
Contributor

craig bot commented May 23, 2023

Build succeeded:

@craig craig bot merged commit 775b005 into cockroachdb:master May 23, 2023
@adityamaru adityamaru deleted the pprof-all branch May 24, 2023 17:54