query: add store.read-timeout parameter to avoid partial response failure when one of the stores times out #895

Closed · 19 commits
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -39,6 +39,7 @@ This to have SRV resolution working on [Golang 1.11+ with KubeDNS below v1.14](h
- [#986](https://github.com/improbable-eng/thanos/pull/986) Store index cache files in object storage; reduces store start-up time by skipping generating the index cache for all blocks and only doing this for recently created, uncompacted blocks.

### Changed
- [#895](https://github.com/improbable-eng/thanos/pull/895) Changed Thanos Query `--query.timeout` to actually time out requests on slow Thanos Store APIs. The old timeout is renamed to `--promql.timeout` and sets the maximum time to execute PromQL in the query node.
- [#970](https://github.com/improbable-eng/thanos/pull/970) Deprecated partial_response_disabled proto field. Added partial_response_strategy instead. Both in gRPC and Query API.
- [#970](https://github.com/improbable-eng/thanos/pull/970) No `PartialResponseStrategy` field for `RuleGroups` by default means the `abort` strategy (old PartialResponse disabled), as this is the recommended option for Rules and alerts.

1 change: 1 addition & 0 deletions benchmark/cmd/thanosbench/resources.go
@@ -231,6 +231,7 @@ func createPrometheus(opts *opts, name string, bucket string) *appsv1.StatefulSe
"--storage.tsdb.max-block-duration=2h",
"--storage.tsdb.retention=2h",
"--query.timeout=10m",
"--promql.timeout=10m",
},
VolumeMounts: []v1.VolumeMount{
{
19 changes: 12 additions & 7 deletions cmd/thanos/query.go
@@ -61,9 +61,6 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application, name string
webExternalPrefix := cmd.Flag("web.external-prefix", "Static prefix for all HTML links and redirect URLs in the UI query web interface. Actual endpoints are still served on / or the web.route-prefix. This allows thanos UI to be served behind a reverse proxy that strips a URL sub-path.").Default("").String()
webPrefixHeaderName := cmd.Flag("web.prefix-header", "Name of HTTP request header used for dynamic prefixing of UI links and redirects. This option is ignored if web.external-prefix argument is set. Security risk: enable this option only if a reverse proxy in front of thanos is resetting the header. The --web.prefix-header=X-Forwarded-Prefix option can be useful, for example, if Thanos UI is served via Traefik reverse proxy with PathPrefixStrip option enabled, which sends the stripped prefix value in X-Forwarded-Prefix header. This allows thanos UI to be served on a sub-path.").Default("").String()

queryTimeout := modelDuration(cmd.Flag("query.timeout", "Maximum time to process query by query node.").
Member commented:
I would leave this as it was, for two reasons:

  1. `query` sounds like it is related to the Query API, which is why it has this name, and that has always included the PromQL part. The fact that the query also queries something else underneath is hidden. It also matches the Prometheus query flags. And yet another thing: compatibility. Even though we are not 1.0, this API change is a major hit to compatibility, as suddenly the query timeout means something different from what it was ):

  2. `promql` sounds like only PromQL, however we know it involves the proxy StoreAPI Series() invocation. Moreover, not just one, but sometimes more than one! PromQL can run multiple Selects per single Query.

I really like this idea, but IMO it should be:
query.timeout (as it was)
store.timeout for the proxy.go client timeout.

Alternatively, if we want to be more explicit, maybe query.select-timeout? That would then also include the deduplication process.
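The multiple-Selects point is easy to see concretely. Below is a hedged sketch using the Prometheus PromQL parser package (`github.com/prometheus/prometheus/promql/parser`, which postdates this PR) to count the vector selectors, i.e. Selects, issued by a single query:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

func main() {
	// A single query with a binary expression: evaluating it needs one
	// Select (one StoreAPI Series() fan-out) per vector selector.
	expr, err := parser.ParseExpr(`rate(http_requests_total[5m]) / rate(http_requests_failed_total[5m])`)
	if err != nil {
		panic(err)
	}

	selects := 0
	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		if _, ok := node.(*parser.VectorSelector); ok {
			selects++
		}
		return nil
	})
	fmt.Println("selects per query:", selects) // prints 2
}
```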

Member commented:
agreed

Default("2m"))

maxConcurrentQueries := cmd.Flag("query.max-concurrent", "Maximum number of queries processed concurrently by query node.").
Default("20").Int()

@@ -97,6 +94,12 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application, name string
enablePartialResponse := cmd.Flag("query.partial-response", "Enable partial response for queries if no partial_response param is specified.").
Default("true").Bool()

promqlTimeout := modelDuration(cmd.Flag("promql.timeout", "Maximum time to execute PromQL in query node.").
Default("2m"))

queryTimeout := modelDuration(cmd.Flag("query.timeout", "Maximum time to process request by query node. If a request to one of the stores has timed out and query.timeout < promql.timeout then a partial response will be returned. If query.timeout >= promql.timeout then only timeout error will be returned.").
Member commented:
This is quite confusing; I am having a hard time understanding what exactly we time out on here.

Default("2m"))

defaultEvaluationInterval := modelDuration(cmd.Flag("query.default-evaluation-interval", "Set default evaluation interval for sub queries.").Default("1m"))

storeResponseTimeout := modelDuration(cmd.Flag("store.response-timeout", "If a Store doesn't send any data in this specified duration then a Store will be ignored and partial data will be returned if it's enabled. 0 disables timeout.").Default("0ms"))
@@ -150,8 +153,9 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application, name string
*webExternalPrefix,
*webPrefixHeaderName,
*maxConcurrentQueries,
time.Duration(*queryTimeout),
time.Duration(*promqlTimeout),
time.Duration(*storeResponseTimeout),
time.Duration(*queryTimeout),
*replicaLabel,
peer,
selectorLset,
@@ -268,8 +272,9 @@ func runQuery(
webExternalPrefix string,
webPrefixHeaderName string,
maxConcurrentQueries int,
queryTimeout time.Duration,
promqlTimeout time.Duration,
storeResponseTimeout time.Duration,
queryTimeout time.Duration,
replicaLabel string,
peer cluster.Peer,
selectorLset labels.Labels,
@@ -327,7 +332,7 @@ func runQuery(
dialOpts,
unhealthyStoreTimeout,
)
proxy = store.NewProxyStore(logger, stores.Get, component.Query, selectorLset, storeResponseTimeout)
proxy = store.NewProxyStore(logger, stores.Get, component.Query, selectorLset, storeResponseTimeout, queryTimeout)
queryableCreator = query.NewQueryableCreator(logger, proxy, replicaLabel)
engine = promql.NewEngine(
promql.EngineOpts{
@@ -336,7 +341,7 @@
MaxConcurrent: maxConcurrentQueries,
// TODO(bwplotka): Expose this as a flag: https://github.com/improbable-eng/thanos/issues/703
MaxSamples: math.MaxInt32,
Timeout: queryTimeout,
Timeout: promqlTimeout,
},
)
)
8 changes: 7 additions & 1 deletion docs/components/query.md
@@ -269,7 +269,6 @@ Flags:
stripped prefix value in X-Forwarded-Prefix
header. This allows thanos UI to be served on a
sub-path.
--query.timeout=2m Maximum time to process query by query node.
--query.max-concurrent=20 Maximum number of queries processed
concurrently by query node.
--query.replica-label=QUERY.REPLICA-LABEL
@@ -301,6 +300,13 @@ Flags:
if no max_source_resolution param is specified.
--query.partial-response Enable partial response for queries if no
partial_response param is specified.
--promql.timeout=2m Maximum time to execute PromQL in query node.
--query.timeout=2m Maximum time to process request by query node.
If a request to one of the stores has timed out
and query.timeout < promql.timeout then a
partial response will be returned. If
query.timeout >= promql.timeout then only
timeout error will be returned.
--query.default-evaluation-interval=1m
Set default evaluation interval for sub
queries.
1 change: 0 additions & 1 deletion pkg/query/querier_test.go
@@ -7,7 +7,6 @@ import (
"math"
"math/rand"
"testing"

"time"

"github.com/fortytw2/leaktest"
60 changes: 47 additions & 13 deletions pkg/store/proxy.go
@@ -33,6 +33,7 @@ type Client interface {
TimeRange() (mint int64, maxt int64)

String() string

// Addr returns address of a Client.
Addr() string
}
@@ -44,7 +45,11 @@ type ProxyStore struct {
component component.StoreAPI
selectorLabels labels.Labels

// responseTimeout is a timeout for any GRPC operation during series query
Member commented:
missing trailing period in comment, same below

responseTimeout time.Duration

// queryTimeout is a timeout for entire request
queryTimeout time.Duration
Member commented:
let's kill it IMO

}
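The difference between the two fields can be shown with a small self-contained sketch (hypothetical code, not the Thanos implementation): the per-message timer re-arms on every receive, while the whole-request deadline keeps ticking:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// readStream is a sketch, not Thanos code: responseTimeout re-arms for
// every received message, while queryTimeout caps the entire stream.
func readStream(msgs <-chan string, responseTimeout, queryTimeout time.Duration) error {
	queryCtx, cancel := context.WithTimeout(context.Background(), queryTimeout)
	defer cancel()

	for {
		perRecv := time.NewTimer(responseTimeout)
		select {
		case m := <-msgs:
			perRecv.Stop()
			fmt.Println("recv:", m)
		case <-perRecv.C:
			return errors.New("a single recv exceeded responseTimeout")
		case <-queryCtx.Done():
			perRecv.Stop()
			return fmt.Errorf("whole request exceeded queryTimeout: %w", queryCtx.Err())
		}
	}
}

func main() {
	msgs := make(chan string)
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(30 * time.Millisecond)
			msgs <- fmt.Sprintf("series-%d", i)
		}
		// The stream then stalls, simulating a store that stops sending.
	}()
	fmt.Println(readStream(msgs, 100*time.Millisecond, 500*time.Millisecond))
}
```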

// NewProxyStore returns a new ProxyStore that uses the given clients that implements storeAPI to fan-in all series to the client.
@@ -55,6 +60,7 @@ func NewProxyStore(
component component.StoreAPI,
selectorLabels labels.Labels,
responseTimeout time.Duration,
queryTimeout time.Duration,
) *ProxyStore {
if logger == nil {
logger = log.NewNopLogger()
Expand All @@ -66,6 +72,7 @@ func NewProxyStore(
component: component,
selectorLabels: selectorLabels,
responseTimeout: responseTimeout,
queryTimeout: queryTimeout,
}
return s
}
@@ -119,6 +126,8 @@ func newRespCh(ctx context.Context, buffer int) (*ctxRespSender, <-chan *storepb
return &ctxRespSender{ctx: ctx, ch: respCh}, respCh, func() { close(respCh) }
}

// send writes response to sender channel
Member commented:
Please use full sentence comments. They are for humans (:

// or just returns if sender context was timed out
func (s ctxRespSender) send(r *storepb.SeriesResponse) {
select {
case <-s.ctx.Done():
@@ -147,6 +156,9 @@ func (s *ProxyStore) Series(r *storepb.SeriesRequest, srv storepb.Store_SeriesSe
respSender, respRecv, closeFn = newRespCh(gctx, 10)
)

storeCtx, cancel := s.contextWithTimeout(gctx)
Member commented:
OK, what's the difference between this timeout and the overall gRPC request timeout or the gRPC server timeout? I believe the Querier can control the same thing by just specifying a timeout here: https://github.com/improbable-eng/thanos/blob/1cd9ddd14999d6b074f34a4328e03f7ac3b7c26a/pkg/query/querier.go#L183

I would remove this from proxy.go completely and set this timeout on the querier client side. What do you think?

The effect is the same, if not better, as you missed passing this context to gctx = errgroup.WithContext(srv.Context())

Contributor (author) replied:
> Also, can I ask what's the point of this logic, what's the use case? E.g. WHEN would you set the (old) query timeout differently from the proxy/select/store timeout underneath PromQL? I cannot find any immediate idea. Do you want to time out on select, but allow more time for PromQL?

Imagine that we have two stores: the first is very slow, the second is fast, and partial response is enabled.
The behaviour we want to see: if the slow store times out, we still receive data from the second one.
So, to make that possible, store.read-timeout should be less than the PromQL timeout.
You're right, we can move it to querier.go and the effect will be the same at the moment (and it was my first implementation :) )

I think it doesn't matter where we have this timeout; more important is that at the moment it doesn't work perfectly because of how reading from stores happens in mergedSet: the stores are read serially there, so a slow first store still blocks reading from the second. We need some improvements there...
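A standalone sketch of that two-store scenario (hypothetical code, not the Thanos implementation): with a read timeout shorter than the overall evaluation budget, the slow store degrades into a warning while the fast store's data still comes back:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// queryStore is a hypothetical store: it responds after `latency`,
// or fails once the shared context expires.
func queryStore(ctx context.Context, name string, latency time.Duration) (string, error) {
	select {
	case <-time.After(latency):
		return name + " data", nil
	case <-ctx.Done():
		return "", fmt.Errorf("receive series from %s: %w", name, ctx.Err())
	}
}

func main() {
	// store.read-timeout stand-in: smaller than the PromQL timeout, so a
	// slow store becomes a warning instead of failing the whole query.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	stores := []struct {
		name    string
		latency time.Duration
	}{
		{"fast", 10 * time.Millisecond},
		{"slow", time.Second},
	}

	var (
		mu       sync.Mutex
		results  []string
		warnings []string
		wg       sync.WaitGroup
	)
	for _, st := range stores {
		st := st
		wg.Add(1)
		go func() {
			defer wg.Done()
			data, err := queryStore(ctx, st.name, st.latency)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				warnings = append(warnings, err.Error()) // partial response
				return
			}
			results = append(results, data)
		}()
	}
	wg.Wait()
	fmt.Println("results:", results)
	fmt.Println("warnings:", warnings)
}
```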

defer cancel()

g.Go(func() error {
var (
seriesSet []storepb.SeriesSet
@@ -177,8 +189,8 @@
}
storeDebugMsgs = append(storeDebugMsgs, fmt.Sprintf("store %s queried", st))

// This is used to cancel this stream when one operations takes too long.
seriesCtx, closeSeries := context.WithCancel(gctx)
// This is used to cancel this stream when one operation takes too long.
seriesCtx, closeSeries := context.WithCancel(storeCtx)
defer closeSeries()

sc, err := st.Series(seriesCtx, r)
@@ -295,19 +307,12 @@ func startStreamSeriesSet(
}

if ctx.Err() != nil {
s.writeWarningOrErrorResponse(partialResponse, errors.Wrapf(ctx.Err(), "receive series from %s", s.name))
Member commented:
This is nice 👍

return
}

if err != nil {
wrapErr := errors.Wrapf(err, "receive series from %s", s.name)
if partialResponse {
s.warnCh.send(storepb.NewWarnSeriesResponse(wrapErr))
return
}

s.errMtx.Lock()
s.err = wrapErr
s.errMtx.Unlock()
s.writeWarningOrErrorResponse(partialResponse, errors.Wrapf(err, "receive series from %s", s.name))
return
}

@@ -356,6 +361,24 @@ func (s *streamSeriesSet) Next() (ok bool) {
}
}

// writeWarningOrErrorResponse sends warning if partial response enabled or sets error otherwise
func (s *streamSeriesSet) writeWarningOrErrorResponse(partialResponse bool, err error) {
Member commented:
We can make this return nothing since we aren't using the return values anywhere and we aren't even setting them 😄

if partialResponse {
s.warnCh.send(storepb.NewWarnSeriesResponse(err))
return
}

s.setError(err)
return
}

// setError sets error (thread-safe)
Member commented:
full sentence comments

func (s *streamSeriesSet) setError(err error) {
s.errMtx.Lock()
s.err = err
s.errMtx.Unlock()
}

func (s *streamSeriesSet) At() ([]storepb.Label, []storepb.AggrChunk) {
if s.currSeries == nil {
return nil, nil
@@ -411,11 +434,14 @@ func (s *ProxyStore) LabelValues(ctx context.Context, r *storepb.LabelValuesRequ
g, gctx = errgroup.WithContext(ctx)
)

storeCtx, cancel := s.contextWithTimeout(gctx)
defer cancel()

for _, st := range s.stores() {
store := st
g.Go(func() error {
resp, err := store.LabelValues(gctx, &storepb.LabelValuesRequest{
Label: r.Label,
resp, err := store.LabelValues(storeCtx, &storepb.LabelValuesRequest{
Label: r.Label,
PartialResponseDisabled: r.PartialResponseDisabled,
})
if err != nil {
Expand Down Expand Up @@ -448,3 +474,11 @@ func (s *ProxyStore) LabelValues(ctx context.Context, r *storepb.LabelValuesRequ
Warnings: warnings,
}, nil
}

func (s *ProxyStore) contextWithTimeout(ctx context.Context) (context.Context, context.CancelFunc) {
if s.queryTimeout > 0 {
return context.WithTimeout(ctx, s.queryTimeout)
}

return ctx, func() {}
}
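This zero-means-disabled helper is a common Go idiom: the no-op CancelFunc lets callers always write `defer cancel()` without branching. A minimal standalone illustration (assumed example code, not from the PR):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// contextWithTimeout mirrors the pattern above: a zero duration means
// "timeout disabled", and the returned CancelFunc is always safe to call.
func contextWithTimeout(ctx context.Context, d time.Duration) (context.Context, context.CancelFunc) {
	if d > 0 {
		return context.WithTimeout(ctx, d)
	}
	return ctx, func() {}
}

func main() {
	ctx, cancel := contextWithTimeout(context.Background(), 0)
	defer cancel() // safe no-op when the timeout is disabled

	_, hasDeadline := ctx.Deadline()
	fmt.Println("deadline set:", hasDeadline) // false: timeout disabled
}
```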