Ruler: err="no query peer reachable" with no details. #2020

Closed
hawran opened this issue Jan 21, 2020 · 11 comments

Comments

@hawran

hawran commented Jan 21, 2020

Thanos, Prometheus and Golang version used:
Thanos: 0.10.0
Prometheus: 2.15.2
Golang: 1.13.1

Object Storage Provider:

What happened:
From time to time we're seeing a bunch of error messages from the ruler, as follows:

... caller=manager.go:525 component=rules group=... ...err="no query peer reachable"

What you expected to happen:
I'd like to ask whether the error could include more details.
At the very least: what was the reason, and which peer?

I've noticed the commit 1a419c2#diff-8b6a7e1ae18dc0ef5f768a537dff89f5L772, however I'm not sure if it solves my problem...

Thank you,
hawran

How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components:
Anything else we need to know:

@FUSAKLA
Member

FUSAKLA commented Jan 22, 2020

Hi, this error message can mean two things:

  • there are no queriers resolved (for example from service discovery), or
  • all of the resolved queriers returned an error; in that case you should see those errors logged before this line.

So the answer to "which peer?" is: all of them.
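
Roughly, the rule evaluation path behaves like this simplified sketch (my own illustration of the two cases, not the actual Thanos source; names and addresses are made up):

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// queryOne stands in for the real HTTP call to a querier's /api/v1/query.
func queryOne(addr, query string) error {
	return errors.New("EOF") // pretend every querier fails
}

// queryAny tries each resolved querier in turn. Only when none were resolved,
// or every single one failed, does the caller end up with the
// "no query peer reachable" error you see in the ruler logs.
func queryAny(addrs []string, query string) error {
	for _, addr := range addrs {
		if err := queryOne(addr, query); err != nil {
			// These are the per-querier errors logged before the final line.
			log.Printf("querying %s failed: %v", addr, err)
			continue
		}
		return nil
	}
	return fmt.Errorf("no query peer reachable")
}

func main() {
	if err := queryAny([]string{"thanos-query:10902"}, "up"); err != nil {
		log.Println(err) // -> no query peer reachable
	}
}
```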

Do you have any preceding error log lines?

@hawran
Author

hawran commented Jan 22, 2020

Hi, this error message can mean two things:
...

Sorry, my mistake: I had overlooked some filters in use that were hiding other lines.
So there is a preceding log line, for example:

... err="perform GET request against http://.../api/v1/query?dedup=true&partial_response=false&query=...&time=2020-01-22T08%3A19%3A13.766329475Z: EOF" query="..." ...

Well, that answers the "which peer?" part.
But what was the reason?

@hawran
Author

hawran commented Jan 22, 2020

...

... err="perform GET request against http://.../api/v1/query?dedup=true&partial_response=false&query=...&time=2020-01-22T08%3A19%3A13.766329475Z: EOF" query="..." ...

...

I'm sorry, I accidentally picked an example from a time when the ruler got OOM-killed :-/.
However, the original point of this issue/question still stands, I'd say.
Hopefully I'll find a better example...

@hawran
Author

hawran commented Jan 24, 2020

Hi,
before I go on with the log lines I've managed to gather, one question has crossed my mind:

  • regarding the error line I mentioned before: is that EOF actually the underlying error that the err="..." value describes?
    In other words, the string after the last colon of the err string is the root cause, right?
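
Just to illustrate what I mean, a hypothetical Go error-wrapping example (not actual Thanos code):

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	// The outermost context comes first and the wrapped root cause ends up
	// after the last colon, e.g. an io.EOF-style error from the HTTP client.
	root := errors.New("EOF")
	wrapped := fmt.Errorf("perform GET request against http://.../api/v1/query: %w", root)

	fmt.Println(wrapped)
	// prints: perform GET request against http://.../api/v1/query: EOF

	fmt.Println(errors.Unwrap(wrapped) == root) // true
}
```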

@hawran
Author

hawran commented Jan 24, 2020

So, the log lines I'd like to present are as follows (just snippets to make it as concise as possible):

thanos-compactor: level=info ... 2020-01-22T11:44:04.595706111Z caller=compact.go:441 msg="compact blocks" count=4 mint=1579651200000 maxt=1579680000000 ulid=01DZ6GWZPKQJCHS4C5JENGCB61 sources="[01DZ5K2FWDGK64AZ818CQ7TNVA ...
...

thanos-compactor: level=info ... 2020-01-22T11:45:28.619330697Z caller=compact.go:834 compactionGroup=0@6203230819750641309 msg="deleting compacted block" old_block=01DZ5K2FWDGK64AZ818CQ7TNVA ...
...

thanos-querier: level=error ... 2020-01-22T11:45:36.062016291Z caller=engine.go:617 msg="error selecting series set" err="proxy Series(): Addr: ... LabelSets: [...] Mint: 1563523200000 Maxt: 1579664733046: rpc error: code = Aborted desc = fetch series for block 01DZ5K2FWDGK64AZ818CQ7TNVA: preload chunks: read range for 0: get range reader: The specified key does not exist."
... a couple of similar errors with the same block ID ...

thanos-ruler: level=warn ... 2020-01-22T11:45:36.06306059Z caller=manager.go:525 component=rules group=... msg="Evaluating rule failed" rule="record: ..." err="no query peer reachable"
... a couple of similar warnings ...

Please note the 01DZ5K2FWDGK64AZ818CQ7TNVA block.

My questions:

  1. Why does the querier try to use that deleted block?
  2. As you can see, it's very hard to correlate the requests reported by the ruler with the requests inside the other components. Would it be possible to propagate some ID (originated by the querier) throughout the whole chain?

@FUSAKLA
Member

FUSAKLA commented Jan 27, 2020

Hi, regarding correlating the logs: I'd recommend using some kind of distributed tracing. Thanos supports most of the common providers, so that would definitely make things easier for you.

I believe there is a plan to log the trace IDs, which would make it possible to correlate those related logs.

Regarding the issue itself, I personally bumped into the exact same problem, reported here: #2022

It is a known issue; the interesting part is that it only occurred just now. There is an open PR which should mitigate this.

Can you share what changes led to this? In my case it was an upgrade from Thanos 0.7.0 to 0.9.0, along with an upgrade of the Minio it uses.

@hawran
Author

hawran commented Jan 30, 2020

Hi,
my comments inline...

Hi, regarding correlating the logs: I'd recommend using some kind of distributed tracing. Thanos supports most of the common providers, so that would definitely make things easier for you.

Well, there is Jaeger in use here at the moment and it hasn't worked out well (tuning the sampling could probably help).
I just had the feeling that it would work best if the marking were done by Thanos's internals.

I believe there is a plan to log the trace IDs, which would make it possible to correlate those related logs.

Good, that would be nice.

Regarding the issue itself, I personally bumped into the exact same problem, reported here: #2022

It is a known issue; the interesting part is that it only occurred just now. There is an open PR which should mitigate this.

Can you share what changes led to this? In my case it was an upgrade from Thanos 0.7.0 to 0.9.0, along with an upgrade of the Minio it uses.

According to our remaining "historical" logs, we experienced the "The specified key does not exist" messages even with Thanos 0.9.0.
(And we've been using Ceph (S3 API) as the Object Storage Provider.)

@AlexDCraig

I'm having the same problem. When I configure Ruler to communicate with Query, it gives me:

level=error ts=2020-01-31T21:04:08.047237876Z caller=rule.go:761 err="perform GET request against http://[Internal IP of Query pod]:10901/api/v1/query?dedup=true&partial_response=false&query=vector%281%29&time=2020-01-31T21%3A04%3A08.043993972Z: Get http://10.174.10.65:10901/api/v1/query?dedup=true&partial_response=false&query=vector%281%29&time=2020-01-31T21%3A04%3A08.043993972Z: net/http: HTTP/1.x transport connection broken: malformed HTTP response \"\\x00\\x00\\x06\\x04\\x00\\x00\\x00\\x00\\x00\\x00\\x05\\x00\\x00@\\x00\"" query=vector(1)

I'm running Thanos 0.10.1 on Kubernetes version 1.16.3. Query DNS discovery is enabled on Ruler and Rule DNS Discovery is enabled on Query. Query has gRPC client TLS data (cert, key, and CA) and Ruler is TLS-secured via an nginx ingress.

Communication works perfectly fine if I tell Ruler to hit the Query HTTP endpoints, but universally fails if I try to get Ruler to communicate with it over gRPC.

@yeya24
Contributor

yeya24 commented Jan 31, 2020

Hi @AlexDHoffer,
The Ruler can only get data from Query via its HTTP API, so you should connect to the HTTP port (10902 by default). If you get that error message when connecting to port 10901, it is expected behavior: that port speaks gRPC (HTTP/2), and the garbage bytes in the error are simply the HTTP/2 response that the plain HTTP/1.x client cannot parse.
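
For a quick way to see the difference, here is a hypothetical check (assuming the default Thanos ports, 10902 for HTTP and 10901 for gRPC, and a reachable thanos-query host; adjust the names for your setup):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// A plain HTTP GET only works against the HTTP port; against the gRPC
	// port the server answers with HTTP/2 frames and the HTTP/1.x client
	// reports the "malformed HTTP response" error shown above.
	for _, base := range []string{"http://thanos-query:10902", "http://thanos-query:10901"} {
		resp, err := http.Get(base + "/api/v1/query?query=vector(1)")
		if err != nil {
			fmt.Println(base, "=>", err)
			continue
		}
		fmt.Println(base, "=>", resp.Status)
		resp.Body.Close()
	}
}
```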

@AlexDCraig

Hi @AlexDHoffer,
The Ruler can only get data from Query via its HTTP API, so you should connect to the HTTP port. If you get that error message when connecting to port 10901, it is expected behavior.

Thanks for the clarification. For some reason I thought we could query that data over the gRPC port. Will fix.

@stale

stale bot commented Mar 1, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
