Queries don't timeout #2314

Closed
roastiek opened this issue Mar 24, 2020 · 10 comments · Fixed by #2411

Comments

@roastiek
Contributor

Thanos, Prometheus and Golang version used:

thanos 0.11.0, go1.13.7
Object Storage Provider:
s3 / ceph

What happened:
Queries don't time out after the specified --query.timeout=180s and keep running. The metric prometheus_engine_queries goes up and never comes back down until the query instance is restarted. The queries come from the rule component and can run for hours instead of seconds.

It happens on all query instances at the same time, so it is probably related to temporary problems with some stores.

Anything else we need to know:
I have rolled back query to 0.10.1, kept the rest of the components on 0.11.0, and this is working for me for now.

Looking at the git log, I have picked and tested these two commits:
2e3ece1 works
c39ddb2 broken

I suspect this commit is the cause: a354bfb

@GiedriusS
Member

Hi! Sorry for the problems. I can't say anything definite about this specific problem, but I wouldn't be surprised if it is related to #2213. Thanks for narrowing this down almost to a specific commit. Could you try reverting that commit on top of c39ddb2 and see if it helps?

@roastiek
Contributor Author

Yes, I will try that tomorrow.

@roastiek
Contributor Author

So I was only able to run a354bfb, and the problem was present.

I am now running v0.11.0 on query with both a354bfb and c39ddb2 reverted, and it is perfectly stable.

@roastiek
Contributor Author

roastiek commented Apr 3, 2020

I had missed that the number of goroutines goes up too as they get stuck, so I have dumped them with /debug/pprof/: goroutine.txt
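
For reference, a goroutine dump like the attached one comes from Go's standard net/http/pprof handlers. A minimal sketch of how such an endpoint is typically exposed (the localhost:6060 address here is only an assumption for illustration; Thanos serves this on its own HTTP port):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// A full goroutine dump (the kind attached above) is then available at
	// http://localhost:6060/debug/pprof/goroutine?debug=2
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}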

@GiedriusS
Member

I had missed that the number of goroutines goes up too as they get stuck, so I have dumped them with /debug/pprof/: goroutine.txt

I have spent some time on this and it seems like this is a tricky edge case. All of our tests with slow StoreAPI nodes also pass on the latest master. I have made some adjustments to them in my own branch but couldn't catch anything.

Does it help if you apply the following patch?

diff --git a/pkg/store/proxy.go b/pkg/store/proxy.go
index 974a2521..bb77bd09 100644
--- a/pkg/store/proxy.go
+++ b/pkg/store/proxy.go
@@ -443,7 +443,7 @@ func startStreamSeriesSet(
 }
 
 func (s *streamSeriesSet) handleErr(err error, done chan struct{}) {
-       defer close(done)
+       close(done)
        s.closeSeries()
 
        if s.partialResponse {


@roastiek
Contributor Author

roastiek commented Apr 6, 2020

I think the culprit is this:

s.recvCh <- rr.r.GetSeries()

It does not handle context cancellation like the original version does:

thanos/pkg/store/proxy.go, lines 413 to 418 in f034581:

select {
case s.recvCh <- r.GetSeries():
continue
case <-ctx.Done():
return
}

What is happening is that many proxies are waiting there with responses ready to be merged, but some of those responses are errors. Then the request fails, the merge is aborted, and the request context is canceled, leaving the rest of the proxies stuck.
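
To illustrate the pattern (this is a standalone sketch with made-up names, not Thanos code): a goroutine that does a bare send on an unbuffered channel keeps blocking after the consumer gives up, while the select-on-ctx.Done() variant, as in the original code quoted above, can exit:

package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// leakyProducer mimics the regressed pattern: a bare channel send that
// ignores the request context. If the consumer stops reading (e.g. the
// merge was aborted), this goroutine blocks on the send forever.
func leakyProducer(ctx context.Context, out chan<- int) {
	go func() {
		out <- 42
	}()
}

// safeProducer mimics the original pattern: the send races against
// ctx.Done(), so the goroutine exits once the request context is canceled.
func safeProducer(ctx context.Context, out chan<- int) {
	go func() {
		select {
		case out <- 42:
		case <-ctx.Done():
		}
	}()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	leakyProducer(ctx, make(chan int)) // nobody ever receives from this channel
	safeProducer(ctx, make(chan int))  // nor from this one

	cancel() // simulate the merge being aborted and the context canceled
	time.Sleep(100 * time.Millisecond)

	// The leaky producer is still blocked on its send; the safe one has exited,
	// which is exactly the difference showing up in the goroutine dump.
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}

The patch in the next comment restores exactly this kind of select around the send into s.recvCh.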

@roastiek
Contributor Author

roastiek commented Apr 8, 2020

I'm running v0.11.0 with this patch and have not encountered any problems in the last two days.

diff --git a/pkg/store/proxy.go b/pkg/store/proxy.go
index 974a2521..8700e1be 100644
--- a/pkg/store/proxy.go
+++ b/pkg/store/proxy.go
@@ -436,7 +436,14 @@ func startStreamSeriesSet(
 				s.warnCh.send(storepb.NewWarnSeriesResponse(errors.New(w)))
 				continue
 			}
-			s.recvCh <- rr.r.GetSeries()
+
+			select {
+			case s.recvCh <- rr.r.GetSeries():
+				continue
+			case <-ctx.Done():
+				s.handleErr(errors.Wrapf(ctx.Err(), "failed to receive any data from %s", s.name), done)
+				return
+			}
 		}
 	}()
 	return s

@GiedriusS
Member

GiedriusS commented Apr 9, 2020

I'm running v0.11.0 with this patch and have not encountered any problems in the last two days.

diff --git a/pkg/store/proxy.go b/pkg/store/proxy.go
index 974a2521..8700e1be 100644
--- a/pkg/store/proxy.go
+++ b/pkg/store/proxy.go
@@ -436,7 +436,14 @@ func startStreamSeriesSet(
 				s.warnCh.send(storepb.NewWarnSeriesResponse(errors.New(w)))
 				continue
 			}
-			s.recvCh <- rr.r.GetSeries()
+
+			select {
+			case s.recvCh <- rr.r.GetSeries():
+				continue
+			case <-ctx.Done():
+				s.handleErr(errors.Wrapf(ctx.Err(), "failed to receive any data from %s", s.name), done)
+				return
+			}
 		}
 	}()
 	return s

Thank you for your patch! I believe that the problem is a bit more nuanced.

My crude way of reproducing it is by making this change:

diff --git a/pkg/store/proxy.go b/pkg/store/proxy.go
index 8440de2f..e801d98b 100644
--- a/pkg/store/proxy.go
+++ b/pkg/store/proxy.go
@@ -430,11 +430,13 @@ func startStreamSeriesSet(
                        }
                        numResponses++
 
-                       if w := rr.r.GetWarning(); w != "" {
-                               s.warnCh.send(storepb.NewWarnSeriesResponse(errors.New(w)))
-                               continue
+                       for i := 0; i < 15; i++ {
+                               if w := rr.r.GetWarning(); w != "" {
+                                       s.warnCh.send(storepb.NewWarnSeriesResponse(errors.New(w)))
+                                       continue
+                               }
+                               s.recvCh <- rr.r.GetSeries()
                        }
-                       s.recvCh <- rr.r.GetSeries()
                }
        }()
        return s

At the very least the backtraces are the same, but that's expected since we shouldn't send anything more once it has errored out due to partial response being disabled.

(deleted a lot of stuff because I'm still not sure)

@GiedriusS
Member

Finally came up with a proper test case: #2411 😸

@roastiek
Contributor Author

Nice. I was trying to write a test case, but somehow I got the fix first.

I will try 0.12.0 a day or two after the release to confirm that the problem is gone.
