
DeadlineExceeded in 2.10.2 #6385

Closed
hamishforbes opened this issue Oct 16, 2023 · 18 comments

@hamishforbes

Describe the bug

I'm getting a constant low level of DeadlineExceeded errors from the distributor after upgrading from 2.10.1 to 2.10.2.
Reverting to 2.10.1 fixes this; no other changes were made.

To Reproduce

Upgrade to 2.10.2

Expected behavior

No errors!

Environment

  • Infrastructure: Kubernetes
  • Deployment tool: Jsonnet

My setup is:
Vanilla Prometheus remote write -> Envoy as an HTTP gateway/load balancer -> Distributors

Additional Context

The same issue appeared in two environments for me, and both were resolved by reverting to 2.10.1.

Logs:

ts=2023-10-16T01:52:11.72323819Z caller=logging.go:126 level=warn traceID=544e742c02a21389 msg="POST /api/v1/push (500) 2.122306606s Response: \"failed pushing to ingester: rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Encoding: snappy; Content-Length: 126508; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.47.1; X-Envoy-Expected-Rq-Timeout-Ms: 15000; X-Forwarded-Proto: http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Request-Id: 27e07fcf-c5fd-4eff-b683-e79aa2e89540; X-Scope-Orgid: core; " |  

(Screenshot attached.)

From the Mimir / Writes dashboard:

(Screenshot attached.)

@langerma

I have the same problem with Grafana Agent.

@sviatlo
Contributor

sviatlo commented Oct 16, 2023

We had exactly the same issue on k8s, deployed with Helm. I think it is related to this PR, and the issue was seen in the integration tests.

@czhfe

czhfe commented Oct 16, 2023

I'm having the same problem.

@trexx

trexx commented Oct 16, 2023

I've also rolled back; however, the issue wasn't resolved until all the ingesters were slowly rolled back. They take a very long time to terminate and then to replay the WAL.

@beatkind
Contributor

We are seeing the same issue here

@bboreham
Contributor

bboreham commented Oct 17, 2023

FWIW we did not see these errors when the code was rolled out at Grafana Labs.
We don't have a lot to go on to debug what you are seeing. More detail would be helpful, e.g. a capture of more of the 'Writes' dashboard (showing the ingesters too) and of 'Writes resources' at the time of the problem.

They take a very long time to terminate and then to replay the wal.

Consider "zone-aware replication" which lets you take down a third of ingesters at a time.
https://github.com/grafana/mimir/blob/main/docs/sources/mimir/configure/configure-zone-aware-replication.md

EDIT: Apologies, I did not understand when I wrote the above that "long time to terminate" is a new symptom of the bug.

@chewrocca

Experiencing the same issue after upgrading to 2.10.2.

The Prometheus agents send a lot of these two warnings 10-15 minutes after the upgrade, before remote writes stop altogether. The vast majority of the warnings are "use of closed network connection".

ts=2023-10-17T14:22:29.012Z caller=dedupe.go:112 component=remote level=warn remote_name=mimir-distributor.dev.silabs.net url=http://127.0.0.1:8080/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: failed pushing to ingester: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

and

ts=2023-10-17T14:22:27.016Z caller=dedupe.go:112 component=remote level=warn remote_name=mimir-distributor.dev.silabs.net url=http://127.0.0.1:8080/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: failed pushing to ingester: rpc error: code = Unavailable desc = error reading from server: read tcp 172.26.64.126:34612->10.51.157.201:22240: use of closed network connection"

@ying-jeanne
Contributor

@hamishforbes, what is your setting for server.grpc-max-concurrent-streams? The default value is 100 if you didn't set it before. Could you try setting it higher, say 500, and see whether that solves the problem? I think the problem comes from this commit: grpc/grpc-go@6a1400d
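
For anyone not familiar with that flag: a minimal grpc-go sketch (not Mimir's actual code) of what a concurrent-stream limit looks like on the server side. The port and the value 500 are arbitrary, and the mapping from -server.grpc-max-concurrent-streams to this option is my understanding of how the flag is wired up.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	// Arbitrary listen address for the sketch.
	lis, err := net.Listen("tcp", ":9095")
	if err != nil {
		log.Fatal(err)
	}

	// MaxConcurrentStreams caps how many RPCs a single client connection may
	// have in flight at once; the server advertises it to clients via HTTP/2
	// SETTINGS. A value of 100 matches the flag's documented default.
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(500))

	log.Fatal(srv.Serve(lis))
}
```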

@ZeidAqemia

ZeidAqemia commented Oct 17, 2023

@hamishforbes, what is your setting for server.grpc-max-concurrent-streams? The default value is 100 if you didn't set it before. Could you try setting it higher, say 500, and see whether that solves the problem? I think the problem comes from this commit: grpc/grpc-go@6a1400d

Mine is set to 1000

Here are a few more graphs:
https://grafana.slack.com/archives/C039863E8P7/p1697555754641479

@ying-jeanne
Contributor

@ZeidAqemia, have you considered setting server.grpc-max-concurrent-streams to 0 as a test? This could help us pinpoint the root cause of the issue.

@ZeidAqemia

@ZeidAqemia, have you considered setting server.grpc-max-concurrent-streams to 0 as a test? This could help us pinpoint the root cause of the issue.

I applied the change and I'm rotating out the ingesters. I'll report back when it's done.

@ZeidAqemia

First ingester removed at 16:41 and last one at 16:54.

  • In-flight connections are back to normal levels.
  • TCP connections on the ingester went from 25-30 back to 8.
  • The added latency seems to be gone.
  • Ruler errors are gone too.

@bboreham
Contributor

I believe we have tracked the underlying problem to a bad backport in grpc-go.
grpc/grpc-go#6737

We'll do some more testing and, if it checks out, issue a new release.
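
To illustrate the failure mode being described (not the backport bug itself, just the general mechanism): once a connection has exhausted the server's advertised stream limit, additional RPCs queue on the client side and fail with DeadlineExceeded when their context expires. A self-contained Go sketch using the stock gRPC health service as a stand-in; the limit of 1 and the timeouts are arbitrary, chosen only to make the symptom reproducible.

```go
package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/status"
)

// blockingHealth holds every Check call open, pinning a stream slot.
type blockingHealth struct {
	healthpb.UnimplementedHealthServer
}

func (blockingHealth) Check(ctx context.Context, _ *healthpb.HealthCheckRequest) (*healthpb.HealthCheckResponse, error) {
	<-ctx.Done() // never answer; hold the stream until the client gives up
	return nil, ctx.Err()
}

func main() {
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(1)) // deliberately tiny limit
	healthpb.RegisterHealthServer(srv, blockingHealth{})
	go srv.Serve(lis)

	conn, err := grpc.Dial(lis.Addr().String(), grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	client := healthpb.NewHealthClient(conn)

	// The first RPC occupies the connection's only stream slot and never completes.
	go client.Check(context.Background(), &healthpb.HealthCheckRequest{})
	time.Sleep(100 * time.Millisecond)

	// The second RPC waits for a free slot and hits its deadline instead: the
	// same error shape as "failed pushing to ingester: ... DeadlineExceeded".
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	_, err = client.Check(ctx, &healthpb.HealthCheckRequest{})
	log.Printf("second call: %v (code=%s)", err, status.Code(err))
}
```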

@colega colega mentioned this issue Oct 18, 2023
colega added a commit that referenced this issue Oct 18, 2023
This includes fix from grpc/grpc-go#6737

Should fix #6385

Signed-off-by: Oleg Zaytsev <[email protected]>
@colega
Contributor

colega commented Oct 18, 2023

I have just published v2.10.3 that should fix this issue.

@colega colega closed this as completed in 222b8e8 Oct 18, 2023
@pstibrany
Member

I think it is related to this PR, and the issue was seen in the integration tests.

I am sorry for this. I saw the failing integration tests, but was convinced they were flakes :-( I should have investigated further.

@hamishforbes
Author

I have just published v2.10.3 that should fix this issue.

Can confirm it's fixed for me! Thank you

@colega
Contributor

colega commented Oct 19, 2023

I believe we owe a better explanation of what happened when @bboreham said:

FWIW we did not see these errors when the code was rolled out at Grafana Labs.

We didn't see those errors because we run the weekly releases at Grafana Labs in order to catch bugs earlier, and the weekly releases are already using grpc v1.58.x. Since the bug was in the backport of the fix to v1.57.x, which is the minor version that Mimir 2.10.x uses, we didn't see those problems.

Because we didn't see those problems on the weekly release, where we had tested the grpc security fix, and because it was a security fix, we did not wait to test the 2.10.2 release candidate before releasing it.

Sorry for the inconvenience; we'll work on action items to improve the testing of our OSS releases.

@chakradhar-zs

I have just published v2.10.3 that should fix this issue.
@colega We are using the mimir-distributed 5.1.3 Helm chart. We are facing the same issue, with DeadlineExceeded logs, and metrics are not getting pushed. Can you please let me know which version to use?
