
DeadlineExceeded in 2.10.2 #6385

Closed
hamishforbes opened this issue Oct 16, 2023 · 18 comments

@hamishforbes

Describe the bug

I'm getting a constant low level of DeadlineExceeded errors from the distributor after upgrading from 2.10.1 to 2.10.2.
Reverting to 2.10.1 fixes this; no other changes were made.

To Reproduce

Upgrade to 2.10.2

Expected behavior

No errors!

Environment

  • Infrastructure: Kubernetes
  • Deployment tool: Jsonnet

My setup is:
Vanilla Prometheus remote write -> Envoy as an HTTP gateway/load balancer -> Distributors

Additional Context

The same issue appeared in two environments for me, and both were resolved by reverting to 2.10.1.

Logs:

ts=2023-10-16T01:52:11.72323819Z caller=logging.go:126 level=warn traceID=544e742c02a21389 msg="POST /api/v1/push (500) 2.122306606s Response: \"failed pushing to ingester: rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Encoding: snappy; Content-Length: 126508; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.47.1; X-Envoy-Expected-Rq-Timeout-Ms: 15000; X-Forwarded-Proto: http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Request-Id: 27e07fcf-c5fd-4eff-b683-e79aa2e89540; X-Scope-Orgid: core; " |  

(Screenshot attached.)

From the Mimir / Writes dashboard:

(Screenshot attached.)

@langerma

I have the same problem with Grafana Agent.

@sviatlo
Contributor

sviatlo commented Oct 16, 2023

We had exactly the same issue on k8s, deployed with Helm. I think it is related to this PR, and the issue was seen in the integration tests.

@czhfe

czhfe commented Oct 16, 2023

I'm having the same problem.

@trexx

trexx commented Oct 16, 2023

I've also rolled back; however, the issue wasn't resolved until all the ingesters were slowly rolled back. They take a very long time to terminate and then to replay the WAL.

@beatkind
Contributor

We are seeing the same issue here

@bboreham
Contributor

bboreham commented Oct 17, 2023

FWIW we did not see these errors when the code was rolled out at Grafana Labs.
We don't have a lot to go on to debug what you are seeing. More detail would be helpful, e.g. a capture of more of the 'Writes' dashboard (showing the ingesters too) and of 'Writes resources' at the time of the problem.

They take a very long time to terminate and then to replay the wal.

Consider "zone-aware replication" which lets you take down a third of ingesters at a time.
https://github.com/grafana/mimir/blob/main/docs/sources/mimir/configure/configure-zone-aware-replication.md

EDIT: Apologies, I did not understand when I wrote the above that "long time to terminate" is a new symptom of the bug.

@chewrocca

Experiencing the same issue after upgrading to 2.10.2.

The Prometheus agents send a lot of these two warnings 10-15 minutes after the upgrade, before remote writes stop altogether. The vast majority of the warnings are "use of closed network connection".

ts=2023-10-17T14:22:29.012Z caller=dedupe.go:112 component=remote level=warn remote_name=mimir-distributor.dev.silabs.net url=http://127.0.0.1:8080/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: failed pushing to ingester: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

and

ts=2023-10-17T14:22:27.016Z caller=dedupe.go:112 component=remote level=warn remote_name=mimir-distributor.dev.silabs.net url=http://127.0.0.1:8080/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: failed pushing to ingester: rpc error: code = Unavailable desc = error reading from server: read tcp 172.26.64.126:34612->10.51.157.201:22240: use of closed network connection"

@ying-jeanne
Contributor

@hamishforbes, what is your setting for server.grpc-max-concurrent-streams? The default value is 100 if you didn't set it before. Could you try setting it higher, say 500, and see whether that solves the problem? I think the problem comes from this commit: grpc/grpc-go@6a1400d
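
For anyone not familiar with that flag: a minimal grpc-go sketch (not Mimir's actual code) of what a concurrent-stream limit looks like on the server side. The port and the value 500 are arbitrary, and the mapping from -server.grpc-max-concurrent-streams to this option is my understanding of how the flag is wired up.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	// Arbitrary listen address for the sketch.
	lis, err := net.Listen("tcp", ":9095")
	if err != nil {
		log.Fatal(err)
	}

	// MaxConcurrentStreams caps how many RPCs a single client connection may
	// have in flight at once; the server advertises it to clients via HTTP/2
	// SETTINGS. A value of 100 matches the flag's documented default.
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(500))

	log.Fatal(srv.Serve(lis))
}
```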

@ZeidAqemia

ZeidAqemia commented Oct 17, 2023

@hamishforbes, what is your setting for server.grpc-max-concurrent-streams? The default value is 100 if you didn't set it before. Could you try setting it higher, say 500, and see whether that solves the problem? I think the problem comes from this commit: grpc/grpc-go@6a1400d

Mine is set to 1000

Here are a few more graphs:
https://grafana.slack.com/archives/C039863E8P7/p1697555754641479

@ying-jeanne
Contributor

@ZeidAqemia, have you considered setting server.grpc-max-concurrent-streams to 0 as a test? This could help us pinpoint the root cause of the issue.

@ZeidAqemia

@ZeidAqemia, have you considered setting server.grpc-max-concurrent-streams to 0 as a test? This could help us pinpoint the root cause of the issue.

I applied the change and I'm rotating out the ingesters. I'll report back when it's done.

@ZeidAqemia

First ingester removed at 16:41 and last one at 16:54.

  • In-flight connections are back to normal levels.
  • TCP connections on the ingester went from 25-30 back to 8.
  • The added latency seems to be gone.
  • Ruler errors are gone too.

@bboreham
Contributor

I believe we have tracked the underlying problem to a bad backport in grpc-go.
grpc/grpc-go#6737

We'll do some more testing and, if it checks out, issue a new release.
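
To illustrate the failure mode being described (not the backport bug itself, just the general mechanism): once a connection has exhausted the server's advertised stream limit, additional RPCs queue on the client side and fail with DeadlineExceeded when their context expires. A self-contained Go sketch using the stock gRPC health service as a stand-in; the limit of 1 and the timeouts are arbitrary, chosen only to make the symptom reproducible.

```go
package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/status"
)

// blockingHealth holds every Check call open, pinning a stream slot.
type blockingHealth struct {
	healthpb.UnimplementedHealthServer
}

func (blockingHealth) Check(ctx context.Context, _ *healthpb.HealthCheckRequest) (*healthpb.HealthCheckResponse, error) {
	<-ctx.Done() // never answer; hold the stream until the client gives up
	return nil, ctx.Err()
}

func main() {
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(1)) // deliberately tiny limit
	healthpb.RegisterHealthServer(srv, blockingHealth{})
	go srv.Serve(lis)

	conn, err := grpc.Dial(lis.Addr().String(), grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	client := healthpb.NewHealthClient(conn)

	// The first RPC occupies the connection's only stream slot and never completes.
	go client.Check(context.Background(), &healthpb.HealthCheckRequest{})
	time.Sleep(100 * time.Millisecond)

	// The second RPC waits for a free slot and hits its deadline instead: the
	// same error shape as "failed pushing to ingester: ... DeadlineExceeded".
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	_, err = client.Check(ctx, &healthpb.HealthCheckRequest{})
	log.Printf("second call: %v (code=%s)", err, status.Code(err))
}
```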

@colega colega mentioned this issue Oct 18, 2023
colega added a commit that referenced this issue Oct 18, 2023
This includes fix from grpc/grpc-go#6737

Should fix #6385

Signed-off-by: Oleg Zaytsev <[email protected]>
@colega
Contributor

colega commented Oct 18, 2023

I have just published v2.10.3 that should fix this issue.

@colega colega closed this as completed in 222b8e8 Oct 18, 2023
@pstibrany
Member

I think it is related to this PR, and the issue was seen in the integration tests.

I am sorry for this. I saw the failing integration tests, but was convinced they were flakes :-( I should have investigated further.

@hamishforbes
Author

I have just published v2.10.3 that should fix this issue.

Can confirm it's fixed for me! Thank you

@colega
Contributor

colega commented Oct 19, 2023

I believe we owe a better explanation of what happened when @bboreham said:

FWIW we did not see these errors when the code was rolled out at Grafana Labs.

We didn't see those errors because we run the weekly releases at Grafana Labs in order to catch bugs earlier, and the weekly releases are already using grpc v1.58.x. Since the bug was in the backport of the fix to v1.57.x, which is the minor version that Mimir 2.10.x uses, we didn't see those problems.

Because we didn't see those problems on the weekly release, where we had tested the grpc security fix, and because it was a security fix, we did not wait to test the 2.10.2 release candidate before releasing it.

Sorry for the inconvenience; we'll work on action items to improve the testing of our OSS releases.

@chakradhar-zs

I have just published v2.10.3 that should fix this issue.
@colega We are using the mimir-distributed 5.1.3 Helm chart. We are facing the same issue, with DeadlineExceeded logs, and metrics are not getting pushed. Can you please let me know which version to use?
