Stuck in RetryableSendRequest #1519
Comments
I think I may have seen this myself, but didn't realize the issue, or rather, assumed it would clean itself up pretty quickly. Seems not!
You should be able to keep keep_alive enabled; disabling retry_canceled_requests should be enough.
Disabled both as a start, to test again and confirm what I have found, but I think it's enough to disable retry_canceled_requests. I will test this with keep-alive enabled under some load as well, but I can't say when I'll be able to confirm it, given how hard this is to reproduce. BTW, the app is still running (~5h). We decided it will pass internally if it runs for at least a week, because it will be under heavier load later.
I found that this had been fixed on master, so I've backported the changes to the 0.11.x branch. A fix for this has just been published in v0.11.17.
Thanks! For future readers, it was published as v0.11.27. |
Description
We have an application named recorder, which handles 0.5-1 Gbps of incoming data, processes it (basically creating TAR archives) and uploads it to S3. The network load is roughly the same for tx and rx. If anything bad happens, logs are sent to Loggly via our open-sourced rust-slog-loggly crate. The application runs on EC2 instances (m5.large, r4.large). The load is generated by cameras and quite a large number of requests.
From time to time, the application stops. It looks frozen, but it is still running and CPU user time goes up while the rest goes down. More info from top -H -p ..., and here is what the app is doing ...
Dependencies
Relevant dependencies.
Tokio
99% of processing is done on the main thread within one Tokio loop. There's also a CPU pool to which slow operations are offloaded (TAR archive creation, ...).
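For readers unfamiliar with this setup, here is a minimal sketch of that offloading pattern (not code from the issue), assuming futures 0.1 and the futures-cpupool crate, which were contemporary with hyper 0.11; the archive payload is a placeholder.

```rust
// Sketch only: offload CPU-heavy work (e.g. TAR creation) to a thread pool
// so the main Tokio event loop stays free to drive network I/O.
extern crate futures;
extern crate futures_cpupool;

use futures::Future;
use futures_cpupool::CpuPool;

fn main() {
    // One shared pool for blocking / CPU-heavy work.
    let pool = CpuPool::new_num_cpus();

    // The closure stands in for the app's own TAR-building logic.
    let archive = pool.spawn_fn(|| -> Result<Vec<u8>, std::io::Error> {
        Ok(Vec::new()) // placeholder payload
    });

    // The returned CpuFuture can be chained with the upload future on the
    // event loop; here we just block on it to keep the example standalone.
    let result = archive.wait();
    println!("archive bytes: {:?}", result.map(|v| v.len()));
}
```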
Every single request is wrapped with a Tokio timer timeout.
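A minimal sketch of that per-request timeout, assuming the tokio 0.1 runtime and its tokio::timer::Timeout; the stand-in future below takes the place of a hyper 0.11 client request, and the 30-second duration is purely illustrative.

```rust
// Sketch only: cap a request future with a deadline.
extern crate futures;
extern crate tokio;

use std::time::Duration;
use futures::{future, Future};
use tokio::timer::Timeout;

fn main() {
    // Stand-in for a hyper client request future (e.g. client.request(req)).
    let fake_request = future::ok::<&'static str, ()>("response");

    // Bound the whole request at 30 seconds; the timeout error type wraps
    // either the inner error or the elapsed deadline.
    let bounded = Timeout::new(fake_request, Duration::from_secs(30))
        .map(|resp| println!("got: {}", resp))
        .map_err(|err| eprintln!("request failed or timed out: {:?}", err));

    tokio::run(bounded);
}
```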
Reproducibility
I can "reproduce" this on Linux machines (EC2, with kernel 4.4.0-1054-aws) and on my local macOS machine.
Why is "reproduce" in quotes? Because it's hard: sometimes the application runs for a day, sometimes for 5 hours, and sometimes for just 5 minutes without issues.
Workaround
We have to disable keep_alive and retry_canceled_requests. No more issues (for several hours) since this change. I'll confirm this again later once the app has been up for at least two days.
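For reference, a minimal sketch of this workaround (not taken from the issue), assuming hyper 0.11's Client::configure() builder on a tokio-core reactor; the builder methods keep_alive(bool) and retry_canceled_requests(bool) exist in 0.11.x, while the rest of the setup is illustrative.

```rust
// Sketch only: build a hyper 0.11 client with the workaround applied.
extern crate hyper;
extern crate tokio_core;

use tokio_core::reactor::Core;

fn main() {
    let core = Core::new().expect("failed to create reactor");
    let handle = core.handle();

    // Workaround from the issue: turn off connection reuse and disable
    // retrying requests whose pooled connection was closed early.
    let _client = hyper::Client::configure()
        .keep_alive(false)
        .retry_canceled_requests(false)
        .build(&handle);

    // The client is then used exactly as before; only its configuration changes.
}
```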