
stats.Handler's HandleRPC is called with invalid context #6928

Closed

mx-psi opened this issue Jan 18, 2024 · 12 comments

Comments

@mx-psi

mx-psi commented Jan 18, 2024

What version of gRPC are you using?

This happens with v1.60.1 but not with v1.59.0

What version of Go are you using (go version)?

v1.21.0

What operating system (Linux, Windows, …) and version?

Happens at least on Amazon Linux, but I don't think this is specific to a particular operating system.

What did you do?

The opentelemetry-go-contrib project provides instrumentation for grpc-go that defines a stats.Handler. We use this in the OpenTelemetry Collector, and we have seen reports of crashes on its latest version; see open-telemetry/opentelemetry-collector/issues/9296.

We unfortunately don't have a minimal example to reproduce this at this time.

What did you expect to see?

No crashes :) I would expect the context value set on the TagRPC call to always be recoverable on the HandleRPC call.
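
For reference, the contract we rely on looks roughly like the sketch below. The key type, struct fields, and handler name are illustrative placeholders rather than the actual otelgrpc code; the point is only that HandleRPC assumes the value stored by TagRPC is still reachable from the context it receives.

package statshandlersketch

import (
	"context"

	"google.golang.org/grpc/stats"
)

// rpcInfoKey is an illustrative context key; otelgrpc uses its own unexported key type.
type rpcInfoKey struct{}

type rpcInfo struct {
	fullMethod string
	// ... spans, metrics state, etc.
}

type handler struct{}

var _ stats.Handler = (*handler)(nil)

// TagRPC attaches per-RPC state to the context. gRPC is expected to pass a
// context derived from this one to every subsequent HandleRPC call for the
// same RPC.
func (h *handler) TagRPC(ctx context.Context, info *stats.RPCTagInfo) context.Context {
	return context.WithValue(ctx, rpcInfoKey{}, &rpcInfo{fullMethod: info.FullMethodName})
}

// HandleRPC expects the value stored by TagRPC to still be present. The crash
// shown below happens when that assumption is violated.
func (h *handler) HandleRPC(ctx context.Context, rs stats.RPCStats) {
	ri, _ := ctx.Value(rpcInfoKey{}).(*rpcInfo)
	if ri == nil {
		return // the situation this issue is about
	}
	_ = ri // record spans/metrics here
}

func (h *handler) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context { return ctx }
func (h *handler) HandleConn(context.Context, stats.ConnStats)                       {}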

What did you see instead?

The context does not have this value. See details in open-telemetry/opentelemetry-collector/issues/9296; the crash trace is:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1235929]

goroutine 167 [running]:
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*config).handleRPC(0xc002a0e9c0, {0x902ac80, 0xc002aaa510}, {0x8ff2fd0?, 0xc0020fba88?})
	go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/stats_handler.go:144 +0xa9
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*serverHandler).HandleRPC(0xc00232a340?, {0x902ac80?, 0xc002aaa510?}, {0x8ff2fd0?, 0xc0020fba88?})
	go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/stats_handler.go:88 +0x2a
google.golang.org/grpc/internal/transport.(*http2Server).WriteStatus(0xc00232a340, 0xc002a0c480, 0xc002b3c340)
	google.golang.org/[email protected]/internal/transport/http2_server.go:1071 +0xaf2
google.golang.org/grpc.(*Server).handleStream(0xc002858c00, {0x905cfe0, 0xc00232a340}, 0xc002a0c480)
	google.golang.org/[email protected]/server.go:1749 +0x575
google.golang.org/grpc.(*Server).serveStreams.func2.1()
	google.golang.org/[email protected]/server.go:1016 +0x59
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 166
	google.golang.org/[email protected]/server.go:1027 +0x115

Additional details

I think this is a bug because of the comment here:

if ri == nil {
	// Shouldn't happen because TagRPC populates this information.
	return
}

The opentelemetry-go-contrib maintainers also think this is a bug. I filed open-telemetry/opentelemetry-go-contrib/pull/4825 to protect the code against this, but it still seems like something worth looking into in grpc-go.

I did a first pass to try to narrow down which change could have caused this; my guess is that it would be #6716 or, less likely, #6750. Maybe @zasweq can help here?

@zasweq
Contributor

zasweq commented Jan 18, 2024

I just wrote a test locally, and I don't see any callouts missing the information propagated in the context. This includes the event that triggered this panic, OutTrailer. Still trying to figure out what is going on in the OpenTelemetry Collector, though.

@zasweq
Contributor

zasweq commented Jan 18, 2024

I read the stats handler code in the collector repo and wrote a test locally that has no problem server side, in the case of trailers, pulling a value out of the context populated in TagRPC. I have to ask: is there a mini repro I can run? I see it runs behind layers of deployments in the top-level issue and can't seem to be reproduced locally, but a repro would be super useful here. I'd love to get to the bottom of this and am here to help. Thank you!
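
A test in that spirit can be sketched without any generated protos by registering the health service. This is a rough sketch rather than the exact test; the handler and test names (checkingHandler, TestTagSurvivesToHandleRPC) are made up for illustration.

package statshandlertest

import (
	"context"
	"net"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/stats"
)

type tagKey struct{}

// checkingHandler records any server-side HandleRPC call whose context has
// lost the value that TagRPC stored.
type checkingHandler struct {
	missing chan stats.RPCStats
}

func (h *checkingHandler) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
	return context.WithValue(ctx, tagKey{}, "tagged")
}

func (h *checkingHandler) HandleRPC(ctx context.Context, rs stats.RPCStats) {
	if ctx.Value(tagKey{}) == nil {
		select {
		case h.missing <- rs:
		default:
		}
	}
}

func (h *checkingHandler) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}

func (h *checkingHandler) HandleConn(context.Context, stats.ConnStats) {}

func TestTagSurvivesToHandleRPC(t *testing.T) {
	h := &checkingHandler{missing: make(chan stats.RPCStats, 16)}

	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		t.Fatal(err)
	}
	srv := grpc.NewServer(grpc.StatsHandler(h))
	healthpb.RegisterHealthServer(srv, health.NewServer())
	go srv.Serve(lis)
	defer srv.Stop()

	cc, err := grpc.Dial(lis.Addr().String(), grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		t.Fatal(err)
	}
	defer cc.Close()

	// A unary RPC drives the full server-side sequence of stats events,
	// including the OutTrailer event emitted from WriteStatus.
	if _, err := healthpb.NewHealthClient(cc).Check(context.Background(), &healthpb.HealthCheckRequest{}); err != nil {
		t.Fatal(err)
	}

	select {
	case rs := <-h.missing:
		t.Fatalf("HandleRPC called without the tagged context for %T", rs)
	default:
	}
}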

@rpadaki

rpadaki commented Jan 19, 2024

My team is experiencing the exact same issue in deployment, but we are unable to reproduce it locally. I can also confirm that v1.59.0 does not cause the panic but v1.60.1 does. I'll let you know if we're able to get a mini repro.

@zasweq
Contributor

zasweq commented Jan 23, 2024

Hmm, ok @rpadaki. Is this the same repository, or are you on a different project that also runs a stats handler and sees this same issue?

@rpadaki

rpadaki commented Jan 23, 2024

We are a different project with the same issue, using the stats handler behind a load balancer. Still no success with a reproduction, unfortunately.

@zasweq
Contributor

zasweq commented Jan 23, 2024

Is it a problem just in the status stats callout, or in all of the stats callouts server side?
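
(By "callouts" I mean the stats.RPCStats events delivered to HandleRPC. A quick way to check would be a temporary diagnostic along the lines of the sketch below; tagKey stands in for whatever the real handler stores in TagRPC.)

package statshandlersketch

import (
	"context"
	"log"

	"google.golang.org/grpc/stats"
)

// tagKey is an illustrative stand-in for the context key the real handler
// stores in TagRPC.
type tagKey struct{}

// logUntagged can be called from HandleRPC to report which stats events
// arrive with a context that no longer carries the TagRPC value. Server-side
// events include *stats.Begin, *stats.InHeader, *stats.InPayload,
// *stats.OutHeader, *stats.OutPayload, *stats.OutTrailer and *stats.End; the
// panic in this issue fired on the OutTrailer event emitted from WriteStatus.
func logUntagged(ctx context.Context, rs stats.RPCStats) {
	if ctx.Value(tagKey{}) != nil {
		return // tag survived; nothing to report
	}
	log.Printf("HandleRPC received %T without the TagRPC value (IsClient=%v)", rs, rs.IsClient())
}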

@mx-psi
Author

mx-psi commented Jan 24, 2024

For increased visibility: we have a comparison of grpcdump logs with and without the issue on the OpenTelemetry Collector at open-telemetry/opentelemetry-collector#9296 (comment).

@zasweq
Contributor

zasweq commented Jan 24, 2024

Ok, thanks for that. Looking through the logs, it looks like there is a health checking issue. How does that relate to the configured stats handler having a problem pulling something out of the context on a WriteStatus call, a value it expects to have been populated by the Tag call?

@zasweq
Contributor

zasweq commented Jan 24, 2024

@rpadaki, are you using the same OTel repro as the issue that came up? Also, we just discussed this as a team, and in order to narrow it down to our repository (there are a lot of layers here) we would need a reproduction, as my local test cannot reproduce this problem.

github-actions bot commented

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Jan 30, 2024
@mx-psi
Author

mx-psi commented Jan 30, 2024

For clarity: I doubt I will have time to do any further research on this in the next 7 days.

@github-actions github-actions bot removed the stale label Jan 30, 2024

github-actions bot commented Feb 5, 2024

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Feb 5, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Feb 12, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 11, 2024