Success rate, throughput, and latency issues with HTTP/1 #1353

Closed
siggy opened this issue Jul 19, 2018 · 5 comments · Fixed by linkerd/linkerd2-proxy#26

Comments

@siggy (Member) commented Jul 19, 2018

With linkerd2-proxy, we observed an 80% success rate and high latency when testing HTTP/1.

Test environment

  • compares 3 configurations
    • linkerd2-proxy:git-565c1dad
    • linkerd1 1.4.5
    • baseline (no proxy)
  • HTTP/1
  • 1000 qps total
  • 10 connections
  • slow-cooker frontend
  • helloworld backend

Proxy metrics:
https://gist.github.com/siggy/2708cdff73c3e25463d80fc10feac45a

Kubernetes config:
https://gist.github.com/siggy/21ecc89162c23f1690baf29ab4cd2b5a

Seeing lots of these in the proxy log:

ERR! proxy={server=in listen=0.0.0.0:4143 remote=127.0.0.1:52052} linkerd2_proxy turning Error caused by underlying HTTP/2 error: protocol error: unexpected internal error encountered into 500

Steps to reproduce

  1. Deploy

    kubectl apply -f https://gist.githubusercontent.com/siggy/21ecc89162c23f1690baf29ab4cd2b5a/raw/100493dc1e4fd2181c4f474fa7b4c52116dc71bd/linkerd2-h1.yaml
  2. Observe in Grafana

    kubectl -n linkerd2-h1 port-forward $(kubectl -n linkerd2-h1 get po --selector=app=grafana -o jsonpath='{.items[*].metadata.name}') 3000:3000
    

(Screenshot: Grafana dashboard, 2018-07-19 1:16 PM)

@seanmonstar (Contributor) commented:

After enabling a bunch of logs, I noticed this:

http/1 client error: an error occurred trying to connect: Cannot assign requested address (os error 99)

Which is interesting! The "trying to connect" part means the error came from the C: Connect piece, for which we create a custom implementation in the proxy in transport::connect.

Digging deeper...
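
For reference, os error 99 is EADDRNOTAVAIL: the kernel couldn't assign a free local address/port for the outgoing connection. A minimal sketch (the loopback target here is hypothetical) of where that message comes from at the plain-TCP level:

```rust
// Minimal sketch: where "Cannot assign requested address" (os error 99,
// EADDRNOTAVAIL) surfaces from a plain TCP connect. The target address is
// hypothetical; it only stands in for the proxy's outbound connect.
use std::net::TcpStream;

fn main() {
    match TcpStream::connect("127.0.0.1:7777") {
        Ok(_stream) => println!("connected"),
        // When the kernel has no free local ephemeral port to assign
        // (e.g. under heavy connection churn), connect fails with os error 99.
        Err(e) => eprintln!("connect error: {} (os error {:?})", e, e.raw_os_error()),
    }
}
```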

@hawkw (Contributor) commented Jul 19, 2018

Cannot assign requested address (os error 99)

Huh, is SO_REUSEADDR not being set in some cases, or something?
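
For context on the question, a minimal sketch (assuming the net2 crate's TcpBuilder API) of what setting SO_REUSEADDR on the inbound listener would look like:

```rust
// Minimal sketch, assuming the net2 crate: SO_REUSEADDR lets a listener
// rebind an address that still has sockets lingering in TIME_WAIT.
extern crate net2;

use net2::TcpBuilder;
use std::io;
use std::net::TcpListener;

fn bind_reuse(addr: &str) -> io::Result<TcpListener> {
    let builder = TcpBuilder::new_v4()?;
    builder.reuse_address(true)?; // SO_REUSEADDR
    builder.bind(addr)?;
    builder.listen(1024)
}

fn main() -> io::Result<()> {
    // 0.0.0.0:4143 is the inbound listen address from the log above.
    let listener = bind_reuse("0.0.0.0:4143")?;
    println!("listening on {}", listener.local_addr()?);
    Ok(())
}
```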

@seanmonstar (Contributor) commented Jul 19, 2018

Well, this error occurs when connecting, so we don't set that option at all. But it does suggest that a lot of connection churn is happening, and many connections are sitting in TIME_WAIT. There was a patch to hyper that should significantly reduce this; the upgrade is in linkerd/linkerd2-proxy#24.
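
As a quick way to confirm that on a node, a minimal sketch (Linux-only; it reads /proc/net/tcp, where state 06 is TIME_WAIT) that counts sockets currently stuck in TIME_WAIT:

```rust
// Minimal sketch (Linux-only): count sockets in TIME_WAIT by scanning
// /proc/net/tcp. The fourth whitespace-separated column is the socket
// state; "06" corresponds to TCP_TIME_WAIT.
use std::fs;
use std::io;

fn main() -> io::Result<()> {
    let tcp = fs::read_to_string("/proc/net/tcp")?;
    let time_wait = tcp
        .lines()
        .skip(1) // skip the header row
        .filter(|line| line.split_whitespace().nth(3) == Some("06"))
        .count();
    println!("sockets in TIME_WAIT: {}", time_wait);
    Ok(())
}
```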

@seanmonstar (Contributor) commented:

It turns out the real problem was that every single one of these requests resulted in a new connection. Some optimizations had been added to hyper to reduce the number of operations needed when the size of a body was known, but because of those optimizations the internal read state wasn't polled to the end, so hyper assumed the body wasn't wanted and had to close the connection. The fix to hyper was merged in hyperium/hyper#1610; a new PR for the proxy is incoming!
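
To illustrate the contract the fix restores, a minimal sketch (hyper 0.12 / futures 0.1-era API assumed; the target URI is hypothetical): hyper only returns a connection to its pool once the response body has been polled to the end, so draining the body is what lets the next request reuse the connection:

```rust
// Minimal sketch, hyper 0.12 / futures 0.1-era API assumed: a connection is
// only checked back into hyper's pool after the response body is read to
// the end; dropping the body early forces the connection to be closed.
extern crate futures;
extern crate hyper;

use futures::{Future, Stream};
use hyper::{Client, Uri};

fn main() {
    let client = Client::new();
    // Hypothetical target, standing in for the helloworld backend.
    let uri: Uri = "http://helloworld:7777/".parse().unwrap();

    let fut = client
        .get(uri)
        .and_then(|res| {
            // Drain the body to completion so the connection can be reused.
            res.into_body().concat2()
        })
        .map(|body| println!("read {} bytes", body.len()))
        .map_err(|e| eprintln!("request error: {}", e));

    hyper::rt::run(fut);
}
```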

@siggy (Member, Author) commented Jul 30, 2018

Confirmed: tested with Linkerd2 v18.7.2, and the issue is no longer observable.
