
query: add store.read-timeout parameter to avoid partial response failure when one of stores timed out #895

Closed · wants to merge 19 commits

Conversation

d-ulyanov (Contributor):

This PR fixes the partial response failure that occurs when one of the stores times out (if partial response is enabled).
Issue: #835

Changes

Added a store.read-timeout argument to Thanos Query. By default, store.read-timeout = 2m (same as query.timeout). If store.read-timeout > query.timeout, it is overridden by the query.timeout value.
This store.read-timeout is propagated to the gRPC StoreClient via context using seriesServer; see the querier.go diff.
The timeout is ultimately handled inside the gRPC client.
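As a sketch of that wiring (an illustration only, using the generated storepb client interface quoted later in this thread; the helper name and its parameters are not part of the PR):

```go
import (
	"context"
	"time"

	"github.com/improbable-eng/thanos/pkg/store/storepb"
)

// seriesWithReadTimeout is an illustrative helper (not the PR's code): it bounds a
// single store's Series call with its own read timeout, so one slow store can be
// dropped with a warning instead of failing the whole query.
func seriesWithReadTimeout(
	parent context.Context,
	client storepb.StoreClient,
	req *storepb.SeriesRequest,
	readTimeout time.Duration,
) (storepb.Store_SeriesClient, context.CancelFunc, error) {
	ctx, cancel := context.WithTimeout(parent, readTimeout)
	stream, err := client.Series(ctx, req)
	if err != nil {
		cancel()
		return nil, nil, err
	}
	// The caller keeps calling stream.Recv() and invokes cancel() when done. If the
	// store stalls past readTimeout, Recv() returns a deadline error that the querier
	// can turn into a partial-response warning.
	return stream, cancel, nil
}
```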

Also, mock structures were added to the store package so they can be reused in unit tests in different packages.

Verification

Changes covered by unit tests.

@d-ulyanov (Contributor Author):

Hi, Thanos team!
Review pls :)

@bwplotka (Member) left a comment:


Generally I love the idea, but I think we need this timeout in a different place. (:

Plus some minor suggestions on the mocking. Let me know what you think.

Outdated review comments (resolved) on: cmd/thanos/query.go (2), pkg/query/querier.go, pkg/store/storepb/mock.go, pkg/store/mock.go
@d-ulyanov (Contributor Author):

Currently we still have one strange place: mergeSet.
mergeSet.Next() can get stuck if one of the streams returns data slowly.
Essentially this means the whole request can fail with a timeout when one stream (store) returns data slowly but all the other streams are working well (see the test case where partialResponse: true and queryTimeout < storeReadTimeout).

But I think that's an idea for the next pull request.

@bwplotka (Member):

@povilasv continued your PR, @d-ulyanov, in #920. I hope you are ok with this (:

@povilasv (Member) commented Mar 15, 2019:

Currently we still have one strange place: mergeSet.
mergeSet.Next() can get stuck if one of the streams returns data slowly.
Essentially this means the whole request can fail with a timeout when one stream (store) returns data slowly but all the other streams are working well (see the test case where partialResponse: true and queryTimeout < storeReadTimeout).

But I think that's an idea for the next pull request.

@d-ulyanov this is solved by #928

@GiedriusS (Member):

As another version of this has been merged, I think it's safe to close this PR. Please shout at me if it needs to be reopened or I made a mistake somewhere. Thank you a lot for your contribution! :)

@GiedriusS closed this Mar 29, 2019
@povilasv (Member) commented Mar 29, 2019:

@GiedriusS it's a different PR, so we still need this. I.e. this timeout is essentially a query timeout and cancels the context for the whole query. The current timeout doesn't work correctly, because it sets the timeout on the PromQL engine.

@GiedriusS reopened this Mar 29, 2019
@d-ulyanov (Contributor Author):

Hi, @povilasv
Is it possible to include this PR in the nearest release?
Also, if you need any help, please let me know :)

@povilasv (Member) commented Mar 30, 2019:

@d-ulyanov sure thing, could you resolve the merge conflicts? I've lost track of the current state on this one; is it ready for review? Once it is, I can reread it and do another round of review :)

@d-ulyanov (Contributor Author):

Hey @povilasv
All conflicts are resolved and I've also added some tests.

@d-ulyanov (Contributor Author) left a comment:


@povilasv It's cool that this PR will appear in v0.4.0.
I have time to finish this PR, but anyway, feel free to make any changes if you think something is wrong :) Edits from maintainers are allowed.
Also, please review my comments about testing.

@@ -93,6 +93,9 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application, name string

storeResponseTimeout := modelDuration(cmd.Flag("store.response-timeout", "If a Store doesn't send any data in this specified duration then a Store will be ignored and partial data will be returned if it's enabled. 0 disables timeout.").Default("0ms"))

storeReadTimeout := modelDuration(cmd.Flag("store.read-timeout", "Maximum time to read response from store. If request to one of stores is timed out and store.read-timeout < query.timeout partial response will be returned. If store.read-timeout >= query.timeout one of stores is timed out the client will get no data and timeout error.").
@d-ulyanov (Contributor Author):

Agree partially.
In the code we have a query layer which also does some logic (deduplication), so we can't really say that storeReadTimeout guarantees the request will be killed in time. We are just limiting the store request time. That's why I think it's not honest to use a promql-engine timeout here, or we would also need to add a ctx to the query layer.

Outdated review comments (resolved) on: pkg/store/proxy.go (3), pkg/store/proxy_test.go
@povilasv (Member):

@d-ulyanov FYI I've changed the flag names in your branch

@povilasv requested a review from bwplotka on April 13, 2019
@d-ulyanov (Contributor Author):

@povilasv cool, many thanks!

@bwplotka (Member) left a comment:


Cool, thanks for the good work, but I think we can still improve it. Sorry for the review delay!

Some suggestions below. Please be careful on the compatibility side. We are not 1.0 API stable, BUT we don't want to totally confuse our users either.

Also, can I ask what's the point of this logic, what's the use case? E.g. WHEN would you set the (old) query timeout different from the proxy/select/store timeout underneath PromQL?

I cannot find any immediate use case. Do you want to time out on select, but allow more time for PromQL?

@@ -44,7 +45,11 @@ type ProxyStore struct {
component component.StoreAPI
selectorLabels labels.Labels

// responseTimeout is a timeout for any GRPC operation during series query
Member:

missing trailing period in comment, same below

@@ -119,6 +126,8 @@ func newRespCh(ctx context.Context, buffer int) (*ctxRespSender, <-chan *storepb
return &ctxRespSender{ctx: ctx, ch: respCh}, respCh, func() { close(respCh) }
}

// send writes response to sender channel
Member:

Please use full sentence comments. They are for humans (:

return
}

// setError sets error (thread-safe)
Member:

full sentence comments

@@ -147,6 +156,9 @@ func (s *ProxyStore) Series(r *storepb.SeriesRequest, srv storepb.Store_SeriesSe
respSender, respRecv, closeFn = newRespCh(gctx, 10)
)

storeCtx, cancel := s.contextWithTimeout(gctx)
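For context, a plausible shape of the contextWithTimeout helper referenced in this hunk (an assumption for illustration; the PR's actual implementation may differ):

```go
import (
	"context"
	"time"
)

// contextWithTimeout derives a child context bounded by the configured read
// timeout, or returns the parent unchanged when no timeout is set.
func contextWithTimeout(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	if timeout <= 0 {
		// Zero means "no per-store timeout": keep the parent context as-is.
		return parent, func() {}
	}
	return context.WithTimeout(parent, timeout)
}
```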
Member:

Ok, what's the difference between this timeout and the overall gRPC request timeout or the gRPC server timeout? I believe the Querier can control the same thing by just specifying a timeout here: https://github.com/improbable-eng/thanos/blob/1cd9ddd14999d6b074f34a4328e03f7ac3b7c26a/pkg/query/querier.go#L183

I would remove this from proxy.go completely and set this timeout on the querier client side. What do you think?

The effect is the same, if not better, since you missed passing this context to gctx = errgroup.WithContext(srv.Context()).

@d-ulyanov (Contributor Author):

Also, can I ask what's the point of this logic, what's the use case? E.g. WHEN would you set the (old) query timeout different from the proxy/select/store timeout underneath PromQL?
I cannot find any immediate use case. Do you want to time out on select, but allow more time for PromQL?

Imagine that we have two stores: the first is very slow, the second is fast. Partial response is enabled.
The behaviour that we want to see: if the slow store times out, we still receive data from the second.
So, to make this possible, store.read-timeout should be less than the PromQL timeout.
You're right, we can move it to querier.go and the effect will be the same at the moment (and it was my first implementation :) )

I think it doesn't matter where we have this timeout; more important is that at the moment it doesn't work perfectly because of how we read from stores in mergedSet: reading from the stores happens serially there, and the first slow store still blocks reading from the second. We need some improvements there...

responseTimeout time.Duration

// queryTimeout is a timeout for entire request
queryTimeout time.Duration
Member:

let's kill it IMO

promqlTimeout := modelDuration(cmd.Flag("promql.timeout", "Maximum time to execute PromQL in query node.").
Default("2m"))

queryTimeout := modelDuration(cmd.Flag("query.timeout", "Maximum time to process request by query node. If a request to one of the stores has timed out and query.timeout < promql.timeout then a partial response will be returned. If query.timeout >= promql.timeout then only timeout error will be returned.").
Member:

This is quite confusing; I am having a hard time understanding what exactly we time out on here.

@@ -61,9 +61,6 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application, name string
webExternalPrefix := cmd.Flag("web.external-prefix", "Static prefix for all HTML links and redirect URLs in the UI query web interface. Actual endpoints are still served on / or the web.route-prefix. This allows thanos UI to be served behind a reverse proxy that strips a URL sub-path.").Default("").String()
webPrefixHeaderName := cmd.Flag("web.prefix-header", "Name of HTTP request header used for dynamic prefixing of UI links and redirects. This option is ignored if web.external-prefix argument is set. Security risk: enable this option only if a reverse proxy in front of thanos is resetting the header. The --web.prefix-header=X-Forwarded-Prefix option can be useful, for example, if Thanos UI is served via Traefik reverse proxy with PathPrefixStrip option enabled, which sends the stripped prefix value in X-Forwarded-Prefix header. This allows thanos UI to be served on a sub-path.").Default("").String()

queryTimeout := modelDuration(cmd.Flag("query.timeout", "Maximum time to process query by query node.").
Member:

I would leave this as it was, for two reasons:

  1. query sounds like it is related to the QueryAPI, which is why it has this name, and it has always included the PromQL part. That's why I would leave it as it was; the fact that query also queries something else underneath is hidden. It also matches the Prometheus query flags. And yet another thing: compatibility. Even though we are not 1.0, this API change is a major hit to compatibility, as suddenly query timeout would mean something opposite to what it was ):

  2. promql sounds like only PromQL, however we know it involves the proxy StoreAPI Series() invocation. Moreover, not just one, but sometimes more than one: PromQL can run multiple Selects per single Query.

I really like this idea, but IMO it should be:
query.timeout (as it was)
store.timeout for the proxy.go client timeout.

Alternatively, if we want to be more explicit, maybe query.select-timeout? That would then also include the deduplication process.

Member:

agreed

@bwplotka (Member):

After an offline discussion with @povilasv: I had missed the fact that @povilasv meant to fix #919, whereas this PR is specified as addressing #835 ...

So let's organize this a bit better:

Sorry for the confusion. @povilasv @d-ulyanov does this make sense to you?

@povilasv (Member):

After an offline discussion with @povilasv: I had missed the fact that @povilasv meant to fix #919, whereas this PR is specified as addressing #835 ...

So let's organize this a bit better:

Sorry for the confusion. @povilasv @d-ulyanov does this make sense to you?

I'll try to fix the context propagation in PromQL eval and let's see where it goes.

Let's keep this open for now.

Also, this PR has some cool stuff like writeWarningOrErrorResponse, so we can use part of this.
Also some good tests 👍

@bwplotka (Member):

Also, this PR has some cool stuff like writeWarningOrErrorResponse, so we can use part of this.
Also some good tests +1

agree

@d-ulyanov (Contributor Author) commented Apr 15, 2019:

Hey, @povilasv @bwplotka

Thanks for your review.

I agree that our current solution is not perfect.
We tested timeouts today with different combinations of slow/fast stores, and it seems that the current logic with store.response-timeout from #928 works a bit incorrectly:

Store A - slow, store B - fast:

  1. store.response-timeout waits for data from store.Next(). The first batch of series is not returned immediately, because it can take some time on the Prometheus side (finding the index, reading data from disk, etc.). In this case this timeout fails the request.

  2. In mergedSet we read synchronously from all stores. This means that if store.response-timeout=30s and store A is not responding, we wait 30s before we start reading from store B (its first recv() has already finished, but the samples are still not merged). And when we do start reading from B, the 30s have already passed and we receive nothing.

I propose the following:

  • make the reading and merging process asynchronous
  • keep only one store.read-timeout that limits the time to read from a store, OR move this timeout to the querier layer entirely
  • remove store.response-timeout (we can do this a bit later)

@bwplotka (Member):

Yes, a very related explanation of the confusing slow/fast store behaviour can be found here: #900 (comment)

@bwplotka (Member) commented Apr 15, 2019:

make the reading and merging process asynchronous

Can we talk about this more? It's probably the most critical part of Thanos and might need some design first.

So essentially, taking the ready responses first instead of an arbitrary Next() on the SeriesSets that are merged inside iterators, right?

I think we might need some work on the interface side, as the SeriesSet Next() approach does not allow it easily. We would need some benchmarks for it as well, but I feel like this is an essential change, yes. Not for 0.4.0, but for the next one.

@povilasv (Member) commented Apr 16, 2019:

* make the reading and merging process asynchronous

* keep only one `store.read-timeout` that limits the time to read from a store, OR move this timeout to the querier layer entirely

* remove `store.response-timeout` (we can do this a bit later)

I disagree; we still need store.response-timeout. As I told you before, it is used for kicking out slow stores and making queries respond faster. I.e. if query-timeout is 10 minutes, you would always wait 10 minutes for every query if one store node is super slow. And if store-response-timeout is set to 1s, that store would be kicked out after 1s + delta, so the query responds much faster.

@povilasv (Member) commented Apr 16, 2019:

@d-ulyanov if you think --store.response-timeout doesn't do what it is supposed to do, file an issue, add logs, tell me what behaviours you see, etc., and I can investigate. I will fix the bug if there is one. So let's not discuss this here.

But --store.response-timeout is here to stay in one shape or another, as it reduces query duration by a lot.

@povilasv (Member) commented Apr 16, 2019:

I propose the following:

* make the reading and merging process asynchronous

Currently we are already reading from a buffered channel, which is asynchronous, so I'm not sure what you mean here.

Re merging: please file an issue and write down what you want to do. IMO this is not the best place to discuss it. Thanks :)

@povilasv (Member) commented Apr 16, 2019:

@d-ulyanov @bwplotka So today I spent like 2 hours trying to figure out where we lose the context and I couldn't find it.

But I think #1035 potentially simplifies the code in the context-passing case.

As the Series call on the server doesn't take a context as an argument:

Series(*SeriesRequest, Store_SeriesServer) error

But on the client it does:

Series(ctx context.Context, in *SeriesRequest, opts ...grpc.CallOption) (Store_SeriesClient, error) 

So the code becomes a bit cleaner and we can see how the context propagates :)

@d-ulyanov (Contributor Author):

Hi @povilasv
Okay, let's discuss --store.response-timeout in another issue later; we'll prepare logs/traces.

  • make the reading and merging process asynchronous
    About this:

Initial issue:
We have 2 stores: A (slow) and B (fast). Partial response is enabled, store.request-timeout=5s, store.read-timeout=30s, query.timeout=1m.

Expected behaviour: we receive data from store B.
Actual behaviour: after the timeout we receive nothing.

At the moment it seems that we have a bottleneck here:

mergedSet := storepb.MergeSeriesSets(seriesSet...)

https://github.com/improbable-eng/thanos/blob/master/pkg/store/proxy.go#L214

On init it calls Next() on each store in newMergedSeriesSet(a, b SeriesSet). And it blocks, because it calls Next() on all stores: Next() can wait for a response from store A (because the gRPC read can block in the startStreamSeriesSet func: r, err := s.stream.Recv()), and at that moment we won't collect anything from store B even if store B has already returned some data.
https://github.com/improbable-eng/thanos/blob/master/pkg/store/storepb/custom.go#L102

Further, in the for mergedSet.Next() { loop, blocking can happen again.
Also, we can't use a short store.response-timeout because the time to the first data can be significant.

My proposal:

  1. Read as much data as possible from all stores independently and write it to a shared channel.
  2. Read series from this channel into a buffer in proxy.go.
  3. After all stores have returned data OR the context is closed, merge the received data and write it back to querier.go.
  4. It seems that we don't need the MergeSeriesSet structure in this case (we only need the part of its logic related to merging), so the MergeSeriesSet structure could be removed.

We will allocate additional memory for this buffering, but I think that's not a big problem, is it? (A rough sketch of this approach follows below.)

PS: we already have some buffering in seriesServer:
https://github.com/improbable-eng/thanos/blob/master/pkg/query/querier.go#L118
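A rough sketch of the fan-in idea proposed above, assuming the generated storepb stream client interface (an illustration only, not the PR's implementation):

```go
import (
	"context"
	"sync"

	"github.com/improbable-eng/thanos/pkg/store/storepb"
)

// fanInSeries drains every store's stream in its own goroutine into a shared
// channel, so a slow store never blocks reading from a fast one. The real
// implementation would k-way merge the buffered responses by labels; here we
// simply collect them.
func fanInSeries(ctx context.Context, streams []storepb.Store_SeriesClient) []*storepb.SeriesResponse {
	out := make(chan *storepb.SeriesResponse, 64)

	var wg sync.WaitGroup
	for _, st := range streams {
		wg.Add(1)
		go func(st storepb.Store_SeriesClient) {
			defer wg.Done()
			for {
				resp, err := st.Recv() // io.EOF or a deadline error ends this store's stream
				if err != nil {
					return
				}
				select {
				case out <- resp:
				case <-ctx.Done():
					return // overall query deadline reached: stop reading this store
				}
			}
		}(st)
	}
	go func() {
		wg.Wait()
		close(out)
	}()

	// Buffer whatever arrived before the streams finished or the context closed.
	var buf []*storepb.SeriesResponse
	for resp := range out {
		buf = append(buf, resp)
	}
	return buf
}
```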

@povilasv (Member):

  1. Read as much data as possible from all stores independently and write it to a shared channel.

@d-ulyanov please create a separate issue for this. And let's stop bloating this conversation :D

@bwplotka (Member) commented Apr 16, 2019:

As the Series call on the server doesn't take a context as an argument:
Series(*SeriesRequest, Store_SeriesServer) error

Store_SeriesServer.Context() has it
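A minimal sketch of that point; the implementing type below is hypothetical, not Thanos's ProxyStore, and only illustrates where the server-side context comes from:

```go
import "github.com/improbable-eng/thanos/pkg/store/storepb"

// exampleStore is a hypothetical StoreAPI server used only for illustration.
type exampleStore struct{}

func (s *exampleStore) Series(r *storepb.SeriesRequest, srv storepb.Store_SeriesServer) error {
	// The request context is carried by the stream, even though Series itself
	// has no ctx argument. It reflects the client's deadline and cancellation.
	ctx := srv.Context()

	select {
	case <-ctx.Done():
		// The client gave up (deadline exceeded or cancelled): stop early.
		return ctx.Err()
	default:
	}

	// ... stream series via srv.Send(...) while periodically checking ctx ...
	return nil
}
```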

@bwplotka (Member):

@d-ulyanov Can we take this offline to avoid confusion? It sounds like we need to discuss this first. Slack or any other chat? Are you on the improbable-eng Slack maybe? (:

@bwplotka closed this Apr 18, 2019
@bwplotka (Member):

We decided offline that this flag doesn't help much; we need to fix the SeriesSet iterator first.
