query: Panic when 2 thanos-query are connected? #4743

Closed
ahurtaud opened this issue Oct 6, 2021 · 7 comments · Fixed by #4754

@ahurtaud (Contributor) commented Oct 6, 2021

What happened:
Thanos v0.23.1:
We have one thanos-query connected to a list of other thanos-query instances, registered as --store targets over secured gRPC.

(Screenshots attached: Screen Shot 2021-10-06 at 15 03 02, Screen Shot 2021-10-06 at 15 03 09)

What you expected to happen:
As in versions before v0.23.1 (e.g. 0.22.0), the Thanos stores page should list the available thanos-query endpoints.
Querying itself is working fine, however; data can still be queried from the registered stores (query).

How to reproduce it (as minimally and precisely as possible):
I think (rough sketch below):
1. Have one thanos-query registering another thanos-query with the --store=x.x.x.x:10901 flag.
2. Open the Thanos stores page to list the registered store components.
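
For anyone wanting to try this locally, a rough sketch of such a setup follows. The addresses and ports are illustrative, the secured gRPC parts of the original setup are omitted, and the /api/v1/stores path is assumed to be what backs the stores page:

```sh
# Start a "leaf" thanos-query exposing gRPC on 10901 (illustrative addresses).
thanos query \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# Start a second thanos-query that registers the first one as a store target.
thanos query \
  --grpc-address=0.0.0.0:11901 \
  --http-address=0.0.0.0:11902 \
  --store=127.0.0.1:10901

# Then open the stores page of the second instance in a browser,
# or hit its stores API directly (path assumed, not taken from the report):
curl http://127.0.0.1:11902/api/v1/stores
```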

Full logs to relevant components:
Please expand the following panic logs:

Panic Logs

2021/10/06 13:00:04 http: panic serving 10.225.5.184:38686: runtime error: invalid memory address or nil pointer dereference
goroutine 728472 [running]:
net/http.(*conn).serve.func1(0xc00101c000)
	/usr/local/go/src/net/http/server.go:1804 +0x153
panic(0x1b54060, 0x2f4a640)
	/usr/local/go/src/runtime/panic.go:971 +0x499
github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).stores(0xc0007f20d0, 0xc0019aa100, 0x0, 0x0, 0x0, 0x6000106, 0x0, 0xffffffffffffffff)
	/home/circleci/project/pkg/api/query/v1.go:717 +0xcb
github.com/thanos-io/thanos/pkg/api.GetInstr.func1.1(0x2148fe0, 0xc00ef50690, 0xc0019aa100)
	/home/circleci/project/pkg/api/api.go:211 +0x42
net/http.HandlerFunc.ServeHTTP(0xc0002811b8, 0x2148fe0, 0xc00ef50690, 0xc0019aa100)
	/usr/local/go/src/net/http/server.go:2049 +0x44
github.com/thanos-io/thanos/pkg/server/http/middleware.RequestID.func1(0x2148fe0, 0xc00ef50690, 0xc0019aa000)
	/home/circleci/project/pkg/server/http/middleware/request_id.go:40 +0x20c
net/http.HandlerFunc.ServeHTTP(0xc0002811d0, 0x2148fe0, 0xc00ef50690, 0xc0019aa000)
	/usr/local/go/src/net/http/server.go:2049 +0x44
github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1(0x214c400, 0xc012766b40, 0xc0019aa000)
	/home/circleci/go/pkg/mod/github.com/!n!y!times/[email protected]/gzip.go:338 +0x299
net/http.HandlerFunc.ServeHTTP(0xc000450780, 0x214c400, 0xc012766b40, 0xc0019aa000)
	/usr/local/go/src/net/http/server.go:2049 +0x44
github.com/thanos-io/thanos/pkg/logging.(*HTTPServerMiddleware).HTTPMiddleware.func1(0x214c400, 0xc012766b40, 0xc0019aa000)
	/home/circleci/project/pkg/logging/http.go:68 +0x399
net/http.HandlerFunc.ServeHTTP(0xc0004507b0, 0x214c400, 0xc012766b40, 0xc0019aa000)
	/usr/local/go/src/net/http/server.go:2049 +0x44
github.com/thanos-io/thanos/pkg/extprom/http.(*defaultInstrumentationMiddleware).NewHandler.func1(0x7f4137adf938, 0xc008e7e280, 0xc0019aa000)
	/home/circleci/project/pkg/extprom/http/instrument_server.go:108 +0x10f
net/http.HandlerFunc.ServeHTTP(0xc000450870, 0x7f4137adf938, 0xc008e7e280, 0xc0019aa000)
	/usr/local/go/src/net/http/server.go:2049 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1(0x7f4137adf938, 0xc008e7e230, 0xc0019aa000)
	/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:198 +0xee
net/http.HandlerFunc.ServeHTTP(0xc000450b70, 0x7f4137adf938, 0xc008e7e230, 0xc0019aa000)
	/usr/local/go/src/net/http/server.go:2049 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1(0x7f4137adf938, 0xc008e7e1e0, 0xc0019aa000)
	/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:101 +0xdf
net/http.HandlerFunc.ServeHTTP(0xc000450de0, 0x7f4137adf938, 0xc008e7e1e0, 0xc0019aa000)
	/usr/local/go/src/net/http/server.go:2049 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerRequestSize.func1(0x214e860, 0xc0019a8000, 0xc0019aa000)
	/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:165 +0xee
net/http.HandlerFunc.ServeHTTP(0xc0004510e0, 0x214e860, 0xc0019a8000, 0xc0019aa000)
	/usr/local/go/src/net/http/server.go:2049 +0x44
github.com/thanos-io/thanos/pkg/tracing.HTTPMiddleware.func1(0x214e860, 0xc0019a8000, 0xc01057ff00)
	/home/circleci/project/pkg/tracing/http.go:46 +0x54c
github.com/prometheus/common/route.(*Router).handle.func1(0x214e860, 0xc0019a8000, 0xc010970500, 0x0, 0x0, 0x0)
	/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/route/route.go:83 +0x27f
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc000a6a060, 0x214e860, 0xc0019a8000, 0xc010970500)
	/home/circleci/go/pkg/mod/github.com/julienschmidt/[email protected]/router.go:387 +0xc7e
github.com/prometheus/common/route.(*Router).ServeHTTP(0xc000a78060, 0x214e860, 0xc0019a8000, 0xc010970500)
	/home/circleci/go/pkg/mod/github.com/prometheus/[email protected]/route/route.go:121 +0x4c
net/http.(*ServeMux).ServeHTTP(0xc000936840, 0x214e860, 0xc0019a8000, 0xc010970500)
	/usr/local/go/src/net/http/server.go:2428 +0x1ad
net/http.serverHandler.ServeHTTP(0xc000898000, 0x214e860, 0xc0019a8000, 0xc010970500)
	/usr/local/go/src/net/http/server.go:2867 +0xa3
net/http.(*conn).serve(0xc00101c000, 0x2155c70, 0xc0025a0240)
	/usr/local/go/src/net/http/server.go:1932 +0x8cd
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2993 +0x39b

Anything else we need to know:

I suspect the recently changed Info gRPC metadata endpoint; I only quickly followed the discussion on Slack :/

@yeya24 (Contributor) commented Oct 6, 2021

> I think:
> Having one thanos-query registering another thanos-query with the --store=x.x.x.x:10901 flag.
> Open the Thanos stores page to list the registered store components.

I tried this setup but cannot reproduce this issue.

@ahurtaud (Contributor, Author) commented Oct 7, 2021

Ok, thanks. I will dig further and try to reproduce with minimal steps.

@matej-g (Collaborator) commented Oct 7, 2021

It definitely seems like we're trying to access something that's not there at https://github.com/thanos-io/thanos/blob/release-0.23/pkg/api/query/v1.go#L717.

My guess is that ComponentType is not always available, judging from the updateEndpointStatus method.

@hitanshu-mehta any idea if this makes sense?
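
For illustration only, here is a minimal, self-contained sketch of the kind of guard being discussed; the types and field names below are hypothetical stand-ins inferred from the stack trace, not Thanos' actual code:

```go
package main

import "fmt"

// Hypothetical stand-in for a store/endpoint component type; in the real
// code this is reached through a pointer that may be nil.
type componentType struct{ name string }

func (c *componentType) String() string { return c.name }

// Hypothetical endpoint status: ComponentType stays nil if the endpoint
// only ever reported an error (e.g. nothing is listening on the address).
type endpointStatus struct {
	Name          string
	ComponentType *componentType
}

// groupByComponent groups statuses by component name, skipping endpoints
// without metadata. Calling s.ComponentType.String() without the nil check
// is the kind of dereference that would panic as in the trace above.
func groupByComponent(statuses []endpointStatus) map[string][]endpointStatus {
	out := make(map[string][]endpointStatus)
	for _, s := range statuses {
		if s.ComponentType == nil {
			continue
		}
		out[s.ComponentType.String()] = append(out[s.ComponentType.String()], s)
	}
	return out
}

func main() {
	statuses := []endpointStatus{
		{Name: "query-a:10901", ComponentType: &componentType{name: "query"}},
		{Name: "10.0.0.1:10901"}, // unreachable endpoint: no component metadata
	}
	fmt.Println(groupByComponent(statuses))
}
```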

@ahurtaud (Contributor, Author) commented Oct 7, 2021

Hum, so I found a misconfiguration where Thanos was trying to target an IP:port where no thanos-query was running.
However, I don't know what was behind that IP and whether it answered anything or not.
Maybe it is fine to close this as a config error.
I don't know if you want to dig further.

@matej-g (Collaborator) commented Oct 7, 2021

@ahurtaud the scenario you mention would actually make sense to me: if you specify a host:port address where nothing is running (or something unexpected), I would guess the status of the endpoint would only include an error, hence the panic when trying to obtain the component type.

Whether it is an accidental misconfiguration or not, we should not be panicking; this is a valid bug.

@matej-g (Collaborator) commented Oct 8, 2021

So I was able to reproduce it fairly reliably: I adjusted the E2E test for query and threw in a couple of made-up store addresses which were always returning errors. If I then tried to call the /stores endpoint, I was getting a panic. So it seems to be more of an edge case, but it's still a small fix. Have a look at #4754.

@MrYueQ commented Oct 11, 2021


{"address":"10.0.0.248:10901","caller":"endpointset.go:525","component":"endpointset","err":"getting metadata: fallback fetching info from 10.0.0.248:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded","level":"warn","msg":"update of node failed","ts":"2021-10-11T12:44:35.206298133Z"}
{"address":"10.0.0.212:10901","caller":"endpointset.go:525","component":"endpointset","err":"getting metadata: fallback fetching info from 10.0.0.212:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded","level":"warn","msg":"update of node failed","ts":"2021-10-11T12:44:35.206377728Z"}
{"address":"10.0.0.248:10901","caller":"endpointset.go:525","component":"endpointset","err":"getting metadata: fallback fetching info from 10.0.0.248:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded","level":"warn","msg":"update of node failed","ts":"2021-10-11T12:44:40.206936877Z"}
{"address":"10.0.0.212:10901","caller":"endpointset.go:525","component":"endpointset","err":"getting metadata: fallback fetching info from 10.0.0.212:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded","level":"warn","msg":"update of node failed","ts":"2021-10-11T12:44:40.207256941Z"}
{"address":"10.0.0.212:10901","caller":"endpointset.go:525","component":"endpointset","err":"getting metadata: fallback fetching info from 10.0.0.212:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded","level":"warn","msg":"update of node failed","ts":"2021-10-11T12:44:45.209680178Z"}
{"address":"10.0.0.248:10901","caller":"endpointset.go:525","component":"endpointset","err":"getting metadata: fallback fetching info from 10.0.0.248:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded","level":"warn","msg":"update of node failed","ts":"2021-10-11T12:44:45.209708408Z"}
{"address":"10.0.0.248:10901","caller":"endpointset.go:525","component":"endpointset","err":"getting metadata: fallback fetching info from 10.0.0.248:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded","level":"warn","msg":"update of node failed","ts":"2021-10-11T12:44:50.211450445Z"}
{"address":"10.0.0.212:10901","caller":"endpointset.go:525","component":"endpointset","err":"getting metadata: fallback fetching info from 10.0.0.212:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded","level":"warn","msg":"update of node failed","ts":"2021-10-11T12:44:50.211469304Z"}
