DeadlineExceeded in 2.10.2 #6385
Comments
I have the same problem with grafana-agent.
We had exactly the same issue on k8s, deployed with Helm. I think it is related to this PR, and the issue was seen in the integration tests.
I'm having the same problem.
I've also rolled back; however, the issue wasn't resolved until all the ingesters had been slowly rolled back. They take a very long time to terminate and then to replay the WAL.
We are seeing the same issue here.
FWIW we did not see these errors when the code was rolled out at Grafana Labs.
Consider "zone-aware replication", which lets you take down a third of the ingesters at a time. EDIT: Apologies, I did not understand when I wrote the above that "long time to terminate" is a new symptom of the bug.
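For anyone wanting to try that, here is a minimal sketch of what zone-aware replication could look like in the Mimir configuration. The key names are assumed from the ingester ring settings and the zone name is a placeholder; please check the configuration reference for your Mimir version.

```yaml
# Sketch only: enable zone-aware replication so a whole zone of ingesters
# can be rolled at once. Key names assumed from the ingester ring config;
# "zone-a" is a placeholder for whichever zone this instance runs in.
ingester:
  ring:
    zone_awareness_enabled: true
    instance_availability_zone: zone-a
```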
Experiencing the same issue after upgrading to 2.10.2. The Prometheus Agents send a lot of these two warnings 10-15 minutes after the upgrade, before remote writes stop altogether. The vast majority of the warnings are "use of closed network connection".
@hamishforbes, what is your setting for server.grpc-max-concurrent-streams? The default value is 100 if you didn't set it before. Could you try setting it higher, say 500, and see whether that solves the problem? I think the problem comes from this commit: grpc/grpc-go@6a1400d
Here are a few more graphs.
@ZeidAqemia, have you considered setting server.grpc-max-concurrent-streams to 0 as a test? This could help us pinpoint the root cause of the issue.
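If it helps, a rough sketch of where that setting lives in the Mimir config; the YAML key name is assumed from the server block, and the CLI equivalent would be `-server.grpc-max-concurrent-streams`:

```yaml
# Sketch only: raise the per-connection gRPC stream limit on the server.
# The default is 100; 500 was suggested above as a higher bound, and 0
# (no limit) as a test to help pinpoint the root cause.
server:
  grpc_server_max_concurrent_streams: 500
```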
I applied the change and I'm rotating out the ingesters. I'll report back when it's done.
I believe we have tracked the underlying problem to a bad backport in grpc-go. We'll do some more testing and, if it checks out, issue a new release.
This includes the fix from grpc/grpc-go#6737. Should fix #6385. Signed-off-by: Oleg Zaytsev <[email protected]>
I have just published a new release with the fix.
I am sorry for this. I saw the integration test failures, but was convinced they were flakes :-( I should have investigated this further.
Can confirm it's fixed for me! Thank you
I believe we owe a better explanation of what happened when @bboreham said:
We didn't see those errors because we run the weekly releases at Grafana Labs in order to catch bugs earlier, and the weekly releases are already using grpc @ v1.58.x. Since the bug was in the backport of the fix to v1.57.x, which is the minor version that Mimir 2.10.x uses, we didn't see those problems there. Because we didn't see them on the weekly release, where we had tested the security fix for grpc, we did not wait to test the 2.10.2 release candidate before releasing it (it was a security fix). Sorry for the inconvenience; we'll work on action items to improve the testing of our OSS releases.
Describe the bug
I'm getting a constant low level of DeadlineExceeded errors from the distributor after upgrading from 2.10.1 to 2.10.2.
Reverting to 2.10.1 fixes this; no other changes were made.
To Reproduce
Upgrade to 2.10.2
Expected behavior
No errors!
Environment
My setup is: Vanilla Prometheus remote write -> Envoy as an HTTP gateway/load balancer -> Distributors
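A minimal sketch of the Prometheus side of that chain; the hostname is a placeholder, and /api/v1/push is the distributor push endpoint:

```yaml
# Sketch only: Prometheus remote write pointing at the Envoy gateway,
# which load balances to the Mimir distributors. Hostname is hypothetical.
remote_write:
  - url: https://envoy-gateway.example.com/api/v1/push
```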
Additional Context
The same issue presented in two environments for me, and both were resolved by reverting to 2.10.1.
Logs:
From the Mimir / Writes dashboard: