
skipper times out before clients on slow connections can finish their requests #795

Closed
ptrf opened this issue Sep 18, 2018 · 16 comments

@ptrf

ptrf commented Sep 18, 2018

We have found a bug in skipper that affects connections where a client attempts to send a large request payload over a slow connection. Specifically, skipper errors out due to an i/o timeout:

error while proxying, route [...] with backend http://[...]:8080, status code 504: dialing failed false: read tcp [...]:9999->[...]:28972: i/o timeout

Our setup contains a Kubernetes-deployed service that exposes a POST endpoint. This endpoint can accept payloads of several hundred MB.

When a client attempts to POST a large payload on a slow connection, skipper aborts the request after reaching the timeout specified by -read-timeout-server.

It seems that skipper times out because the backend has not responded, even though the client's request has yet to complete.

We attempted a workaround where the service would send multiple 100 Continue informational responses, one after reading each 8 MB chunk, but there we in turn ran into golang/go#26089 and golang/go#26088.

Steps to reproduce:

  1. Set up a service that exposes a POST or PUT endpoint behind skipper - the endpoint may simply discard the request payload after reading it (a minimal sketch of such an endpoint is shown below).
  2. Set skipper's -read-timeout-server to, say, 10 seconds.
  3. Use curl(1) to POST a file using the --limit-rate option:
curl -X POST [skipper-endpoint-url] --data-binary @somelargefile --limit-rate 2k

If somelargefile is 50 KB, it should take 25+ seconds for curl(1) to finish the request, which exceeds the 10-second -read-timeout-server, and you should see the request aborted even though the client hasn't finished sending it.
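
For completeness, a minimal sketch of the discard endpoint from step 1 (the path, port, and response code are arbitrary placeholders, not part of our actual service):

```go
package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	// Accept POST/PUT requests and throw the payload away, so the only
	// variable in the reproduction is the upload speed.
	http.HandleFunc("/upload", func(w http.ResponseWriter, r *http.Request) {
		n, err := io.Copy(io.Discard, r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("discarded %d bytes", n)
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```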

cc @roffe

@szuecs
Member

szuecs commented Sep 18, 2018

Thanks for creating this detailed issue!

Did you also try changing -read-timeout-server to 0 or to a higher number, to see if the problem stops happening?

@roffe

roffe commented Sep 18, 2018

If we raised the number, we could transfer the upload until the timeout happened, i.e. if we set it to 15 min and the transfer took 30 min, it got cut off at the timeout. I verified it with curl rate-limited to different transfer durations.

We never tried 0, because having no timeout at all doesn't sound too healthy, and raising the timeout really high would affect other ingress rules on the same controller.

It might be worth adding a per-ingress option to configure this timeout, or maybe resetting the timeout while the client is still sending data.

@szuecs
Member

szuecs commented Sep 19, 2018

A per-ingress option is not scalable (it would mean a connection pool per ingress).
We would need to change a bit more, which we will do anyway in the future, but maybe not to fix this issue.
We use the Go default timeouts, and if you want you can set the read timeout to 0, because there are a lot of other timeouts, for example ones catching TCP/IP-level problems.
I built the client/server timeouts based on https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/amp/, which explains them well.
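
For reference, these flags map onto the standard http.Server timeout fields that the article covers; a minimal sketch with illustrative values, not skipper's actual defaults:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":9090",
		// Time allowed for reading the request headers only
		// (what -read-header-timeout-server maps to in skipper).
		ReadHeaderTimeout: 60 * time.Second,
		// Time allowed for reading the entire request, body included
		// (what -read-timeout-server maps to in skipper); this is the
		// timeout that cuts off slow uploads.
		ReadTimeout: 60 * time.Second,
		// The remaining server-side timeouts described in the article.
		WriteTimeout: 60 * time.Second,
		IdleTimeout:  60 * time.Second,
		Handler:      http.NotFoundHandler(),
	}
	srv.ListenAndServe()
}
```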

@tkrop
Member

tkrop commented Sep 19, 2018

Can you ensure that skipper/Go closes the TCP connections on the other timeouts? Otherwise you might end up with the too-many-open-file-descriptors problem when setting the read timeout to 0.

@szuecs
Member

szuecs commented Sep 19, 2018

@tkrop please prove that it doesn't. I would appreciate it if you could show that the assumption is wrong, because that would be another golang issue we could file upstream. I am pretty sure that my assumption from reading the docs is right.

ReadHeaderTimeout is HTTP, and if a timeout at the TCP level didn't close the connection, what would be the purpose of a TCP timeout?
I hope you read the article, and you might also consider reading the godoc, which is more current than the article.

@szuecs
Member

szuecs commented Sep 19, 2018

@roffe a timeout of 15 minutes is already easy to DoS. Why bother with the timeout if you have to ratelimit anyway?

@roffe

roffe commented Sep 19, 2018

The ratelimit is to simulate a slow connection

@roffe

roffe commented Sep 19, 2018

99% of all uploads would come from reliable connections, but sometimes people are on a flaky cellphone connection and upload slowly, so the total transfer time would exceed -read-timeout-server.

@szuecs
Member

szuecs commented Sep 19, 2018

I am not sure we are talking about the same problem.
I mean you can set the read timeout to 0, because the DoS protection you need against malicious or unhealthy clients can be provided by a client-side ratelimit.
For timeouts you can use the TCP (transport) timeouts, which don't care whether the HTTP payload is transferred in time.

Am I missing something?
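
As a sketch of that idea in an eskip route (assuming the clientRatelimit filter is available in the skipper version in use; the route name, path, limits, and backend here are placeholders):

uploads: Path("/upload") -> clientRatelimit(10, "1m") -> "http://my-backend.default.svc:8080";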

@ptrf
Author

ptrf commented Sep 20, 2018

@szuecs If I'm not mistaken in reading the article above, "The complete guide to Go net/http timeouts", the maximum amount of time that skipper will wait for a client to finish sending its request is controlled by http.Server.ReadTimeout, which is set in skipper via the -read-timeout-server parameter?

If that is understood correctly, then the behaviour we're seeing is intentional.

@szuecs
Member

szuecs commented Sep 20, 2018

@ptrf yes, exactly. In the code you can find this at https://github.com/zalando/skipper/blob/master/skipper.go#L588-L596

You could also set http.ReadHeaderTimeout via -read-header-timeout-server=30s to time out client connections, and use -read-timeout-server=0 to get the intended behaviour, I would expect.
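
For example, the suggested combination would look something like this on the command line (all other flags as in the existing deployment):

skipper -read-header-timeout-server=30s -read-timeout-server=0 [other flags]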

@ptrf
Author

ptrf commented Sep 20, 2018

@roffe Well, I guess we should close this issue?

@ptrf
Author

ptrf commented Sep 20, 2018

@szuecs And thanks for looking at our inquiry :-)

@tkrop
Member

tkrop commented Sep 20, 2018

@szuecs you are probably right. I can't prove this; it is just based on the intuition that Go propagates the 0 down to the socket. This may be no problem if Go's http terminates the receive operation by other means, e.g. by closing the socket. I'm not enough of a Go expert to say how this is handled, but it might be worth keeping in mind, or testing, before using this setup.

@roffe

roffe commented Sep 20, 2018

@ptrf yes, we can close the issue. Thanks for looking into this @szuecs, and for elaborating on the behaviour.

@ptrf
Author

ptrf commented Sep 20, 2018

closing then, thanks again

@ptrf ptrf closed this as completed Sep 20, 2018