Queries don't timeout #2314
Comments
Yes, I will try that tomorrow.
I had missed that the number of goroutines goes up too as they get stuck. So I have dumped them with
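(The exact dump command isn't preserved in this thread. For reference, here is a minimal sketch of one way to capture a full goroutine dump from a Go process via `runtime/pprof`; the same profile is typically also reachable over a component's `/debug/pprof/goroutine?debug=2` HTTP endpoint.)

```go
// A minimal sketch, not the commenter's exact command: capture a full
// goroutine dump from inside a Go process using runtime/pprof.
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=2 prints every goroutine with its full stack trace, which is
	// what makes senders stuck in proxy.go visible.
	if err := pprof.Lookup("goroutine").WriteTo(os.Stdout, 2); err != nil {
		panic(err)
	}
}
```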
I have spent some time on this and it seems like this is some tricky edge case. All of our tests on slow StoreAPI nodes also pass on the latest master. I have made some adjustments to them in my own branch but couldn't catch anything. Does it help if you apply the following patch?

```diff
diff --git a/pkg/store/proxy.go b/pkg/store/proxy.go
index 974a2521..bb77bd09 100644
--- a/pkg/store/proxy.go
+++ b/pkg/store/proxy.go
@@ -443,7 +443,7 @@ func startStreamSeriesSet(
 }
 func (s *streamSeriesSet) handleErr(err error, done chan struct{}) {
-	defer close(done)
+	close(done)
 	s.closeSeries()
 	if s.partialResponse {
```
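(To spell out what that patch probes, here is a minimal sketch, not the Thanos code: with `defer`, `done` is closed only after the rest of the function has run, whereas moving the `close` to the top releases anything waiting on `done` before the cleanup happens.)

```go
// Illustrative only: the ordering difference between deferring close(done)
// and closing it immediately at the top of the function.
package main

import "fmt"

func withDefer(done chan struct{}) {
	defer close(done)          // runs last: done closes after the cleanup below
	fmt.Println("cleanup (A)") // happens before done is closed
}

func withoutDefer(done chan struct{}) {
	close(done)                // runs first: waiters on done are released immediately
	fmt.Println("cleanup (B)") // happens after done is already closed
}

func main() {
	a, b := make(chan struct{}), make(chan struct{})
	withDefer(a)
	<-a
	withoutDefer(b)
	<-b
}
```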
I think the culprit is this: Line 437 in cc6c5b5
It does not handle context cancellation like the original version did: Lines 413 to 418 in f034581
What is happening is that many proxies are waiting there with responses ready to be merged, but some of them are errors. The request then fails, the merge is aborted, and the request context is canceled while the rest of the proxies stay stuck.
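(For illustration, a self-contained sketch of that failure shape; this is an assumed simplification, not the actual proxy code: a producer doing a bare send on an unbuffered channel is stranded forever once the consumer abandons the request.)

```go
// The producer never watches the request context, so once the consumer
// aborts, the producer blocks on the send forever - a leaked goroutine.
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	recvCh := make(chan int) // unbuffered, like the proxy's receive channel

	go func() {
		for i := 0; ; i++ {
			recvCh <- i // no select on ctx.Done(): nothing ever unblocks this send
		}
	}()

	fmt.Println("got:", <-recvCh) // consumer reads one response...
	cancel()                      // ...then the request fails and is canceled
	_ = ctx                       // the producer never looks at ctx, so cancel() cannot help

	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine()) // producer is stuck
}
```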
I'm using v0.11.0 with this patch and did not encounter any problems in the last two days.

```diff
diff --git a/pkg/store/proxy.go b/pkg/store/proxy.go
index 974a2521..8700e1be 100644
--- a/pkg/store/proxy.go
+++ b/pkg/store/proxy.go
@@ -436,7 +436,14 @@ func startStreamSeriesSet(
 				s.warnCh.send(storepb.NewWarnSeriesResponse(errors.New(w)))
 				continue
 			}
-			s.recvCh <- rr.r.GetSeries()
+
+			select {
+			case s.recvCh <- rr.r.GetSeries():
+				continue
+			case <-ctx.Done():
+				s.handleErr(errors.Wrapf(ctx.Err(), "failed to receive any data from %s", s.name), done)
+				return
+			}
 		}
 	}()
 	return s
```
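(The general pattern the patch applies, as a standalone sketch; the `sendOrDone` helper name is mine, not from the Thanos code base: any send that could block is raced against `ctx.Done()`, so a canceled request releases the sender instead of leaving it parked on the channel.)

```go
package main

import (
	"context"
	"fmt"
)

// sendOrDone tries to deliver v on ch but gives up as soon as ctx is canceled,
// so the sending goroutine can never be left blocked after the request ends.
// It reports whether the value was actually delivered.
func sendOrDone(ctx context.Context, ch chan<- string, v string) bool {
	select {
	case ch <- v:
		return true
	case <-ctx.Done():
		return false
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	ch := make(chan string) // unbuffered, like recvCh in the patch

	cancel() // simulate the request being aborted before anyone reads
	fmt.Println("delivered:", sendOrDone(ctx, ch, "series")) // false: the sender returned instead of blocking
}
```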
Thank you for your patch! I believe that the problem is a bit more nuanced. My stupid way of reproducing it is by making this change:

```diff
diff --git a/pkg/store/proxy.go b/pkg/store/proxy.go
index 8440de2f..e801d98b 100644
--- a/pkg/store/proxy.go
+++ b/pkg/store/proxy.go
@@ -430,11 +430,13 @@ func startStreamSeriesSet(
 			}
 			numResponses++
-			if w := rr.r.GetWarning(); w != "" {
-				s.warnCh.send(storepb.NewWarnSeriesResponse(errors.New(w)))
-				continue
+			for i := 0; i < 15; i++ {
+				if w := rr.r.GetWarning(); w != "" {
+					s.warnCh.send(storepb.NewWarnSeriesResponse(errors.New(w)))
+					continue
+				}
+				s.recvCh <- rr.r.GetSeries()
 			}
-			s.recvCh <- rr.r.GetSeries()
 		}
 	}()
 	return s
```

At the very least the backtraces are the same, but that's expected since we shouldn't send more things if it has errored out due to partial response being disabled. (I deleted a lot of stuff because I'm still not sure.)
Finally came up with a proper test case: #2411 😸
Nice. I was trying to write up a test case, but somehow I got the fix first. I will try 0.12.0 a day or two after release to confirm that the problem is gone.
Thanos, Prometheus and Golang version used:
Thanos 0.11.0, go1.13.7

Object Storage Provider:
S3 / Ceph

What happened:
Queries don't time out after the specified `--query.timeout=180s` and keep running. The metric `prometheus_engine_queries` goes up and never comes back down until the `query` instance is restarted. Queries come from `rule` and can run for hours instead of seconds. It happens on all instances of `query` at the same time, so it's probably related to temporary problems with some stores.

Anything else we need to know:
I have rolled back `query` to 0.10.1 and kept the rest of the components on 0.11.0, and this is working for me for now. Looking at `git log`, I have picked and tested these two commits:

2e3ece1 works
c39ddb2 broken

I suspect this commit is the cause: a354bfb