Stuck in RetryableSendRequest #1519
Comments
I think I may have seen this myself, but didn't realize the issue, or rather, assumed it would clean itself up pretty quickly. Seems not!
You should be able to keep keep_alive enabled; disabling retry_canceled_requests should be enough.
Disabled both as a start, to test again and confirm what I have found, but I think it's enough to disable retry_canceled_requests. I will test this with keep-alive enabled under some load as well, but I can't say when I'll be able to confirm it, given how hard this is to reproduce. BTW, the app is still running (~5h). We decided it will pass internally if it runs for at least a week, because it will be under heavier load later.
I found that this had been fixed on master, so I've backported the changes to the 0.11.x branch. A fix for this has just been published in v0.11.17.
Thanks! For future readers, it was published as v0.11.27. |
Description
We have an application named recorder, which handles 0.5-1 Gbps of incoming data, processes it (basically creating TAR archives) and uploads it to S3. The network load is roughly the same for tx and rx. If anything bad happens, logs are sent to Loggly via our open-sourced rust-slog-loggly crate. The application runs on EC2 instances (m5.large, r4.large). The load is generated by cameras and quite a large number of requests.
From time to time, the application stops. It looks frozen, but it is still running and CPU user time goes up while the rest goes down. More info from top -H -p ..., and here is what the app is doing ...
Dependencies
Relevant dependencies.
Tokio
99% of processing is done on the main thread within one Tokio loop. There's also a CPU pool to which slow operations are offloaded (TAR archive creation, ...).
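For readers unfamiliar with this setup, here is a minimal sketch of that offloading pattern (not code from the issue), assuming futures 0.1 and the futures-cpupool crate, which were contemporary with hyper 0.11; the archive payload is a placeholder.

```rust
// Sketch only: offload CPU-heavy work (e.g. TAR creation) to a thread pool
// so the main Tokio event loop stays free to drive network I/O.
extern crate futures;
extern crate futures_cpupool;

use futures::Future;
use futures_cpupool::CpuPool;

fn main() {
    // One shared pool for blocking / CPU-heavy work.
    let pool = CpuPool::new_num_cpus();

    // The closure stands in for the app's own TAR-building logic.
    let archive = pool.spawn_fn(|| -> Result<Vec<u8>, std::io::Error> {
        Ok(Vec::new()) // placeholder payload
    });

    // The returned CpuFuture can be chained with the upload future on the
    // event loop; here we just block on it to keep the example standalone.
    let result = archive.wait();
    println!("archive bytes: {:?}", result.map(|v| v.len()));
}
```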
Every single request is wrapped with a Tokio timer timeout.
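A minimal sketch of that per-request timeout, assuming the tokio 0.1 runtime and its tokio::timer::Timeout; the stand-in future below takes the place of a hyper 0.11 client request, and the 30-second duration is purely illustrative.

```rust
// Sketch only: cap a request future with a deadline.
extern crate futures;
extern crate tokio;

use std::time::Duration;
use futures::{future, Future};
use tokio::timer::Timeout;

fn main() {
    // Stand-in for a hyper client request future (e.g. client.request(req)).
    let fake_request = future::ok::<&'static str, ()>("response");

    // Bound the whole request at 30 seconds; the timeout error type wraps
    // either the inner error or the elapsed deadline.
    let bounded = Timeout::new(fake_request, Duration::from_secs(30))
        .map(|resp| println!("got: {}", resp))
        .map_err(|err| eprintln!("request failed or timed out: {:?}", err));

    tokio::run(bounded);
}
```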
Reproducibility
I can "reproduce" this on Linux machines (EC2, with kernel 4.4.0-1054-aws) and on my local macOS machine.
Why is "reproduce" in quotes? Because it's hard: sometimes the application runs for a day, sometimes for 5 hours, and sometimes for just 5 minutes without issues.
Workaround
We have to disable keep_alive and retry_canceled_requests. No more issues (for several hours) since this change. I'll confirm this again later once the app has been up for at least two days.
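For reference, a minimal sketch of this workaround (not taken from the issue), assuming hyper 0.11's Client::configure() builder on a tokio-core reactor; the builder methods keep_alive(bool) and retry_canceled_requests(bool) exist in 0.11.x, while the rest of the setup is illustrative.

```rust
// Sketch only: build a hyper 0.11 client with the workaround applied.
extern crate hyper;
extern crate tokio_core;

use tokio_core::reactor::Core;

fn main() {
    let core = Core::new().expect("failed to create reactor");
    let handle = core.handle();

    // Workaround from the issue: turn off connection reuse and disable
    // retrying requests whose pooled connection was closed early.
    let _client = hyper::Client::configure()
        .keep_alive(false)
        .retry_canceled_requests(false)
        .build(&handle);

    // The client is then used exactly as before; only its configuration changes.
}
```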