Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: don't reuse connections that are experiencing errors [1.20 backport] #60301

Closed
btasker opened this issue May 19, 2023 · 12 comments
Closed
Labels
WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@btasker
Copy link

btasker commented May 19, 2023

Requesting that #59690 be considered for backport into the next 1.20 minor release

Relevant change in Gerrit is https://go-review.googlesource.com/c/net/+/486156

Rationale:

When encountered, this is a serious issue: requests are blocked until the kernel gives up on the socket (with default settings on a Linux host, that's around 17 minutes). If the operator restarts the application as a result, data loss may occur.

Although the fix is in x/net v0.10.0 it will not be available to most developers who likely import net/http (and so rely on the h2 bundle rolled into Go's releases)

The only viable workaround is to disable the H2 client entirely via environment variable.

@gopherbot gopherbot added this to the Go1.20.5 milestone May 19, 2023
@seankhliao seankhliao added the CherryPickCandidate Used during the release process for point releases label May 19, 2023
@lidel lidel moved this to 📋 Backlog in bifrost-gateway May 22, 2023
@dmitshur dmitshur changed the title net/http2: don't reuse connections that are experiencing errors [1.20 backport] x/net/http2: don't reuse connections that are experiencing errors [1.20 backport] May 24, 2023
@mdempsky
Copy link
Contributor

mdempsky commented May 24, 2023

@neild What do you think about this?

Also, is this applicable to 1.19?

@dmitshur dmitshur changed the title x/net/http2: don't reuse connections that are experiencing errors [1.20 backport] net/http: don't reuse connections that are experiencing errors [1.20 backport] May 24, 2023
@mknyszek
Copy link
Contributor

Friendly ping @neild. :)

@ianlancetaylor
Copy link
Contributor

Opened #60662 for a 1.19 backport.

@Yuhjiang
Copy link

hello, btasker, your description inspired me. I've also encountered the problem with go version 1.18.6, and the problem would persist for 15-17min at every time. So I'm curious about the description of "with default settings on a Linux host, that's around 17 minutes". How can I get some related information about it?

@btasker
Copy link
Author

btasker commented Jun 20, 2023

Hi @Yuhjiang

Sure. I don't know your level, so apologies if any of this is teaching you to suck eggs.

So, the bit you know:

  1. Your application establishes a TCP connection to $server and negotiates to use HTTP/2
  2. At some point, packets start getting lost and the connection breaks
  3. The application keeps trying to send new requests over that connection, which fail
  4. About 15 minutes later, a new connection is established and the application recovers

That 15 minute delay in recovery is a product of the way that the Linux kernel handles TCP retransmissions between 2 and 4 above.

When things are fully broken (i.e. the other end doesn't seem to be responding at all), there's an exponential back-off between each retry, for example

Send
wait 200ms
retry 1
wait 400ms
retry 2
wait 800ms
retry 3
wait 1.6s

etc

The minimum and maximum length of delay between the retries is defined by TCP_RTO_MIN (default 200ms) and TCP_RTO_MAX (default 120s). These hardcoded into the kernel (although you can technically change them with BPF) here.

The number of retries, however, is governed by a sysctl - net.ipv4.tcp_retries2. The default value is 15 (so with each pause doubling in length, you get up to that 15 minute mark).

Eventually, the limit in net.ipv4.tcp_retries2 is hit and the kernel decides the connection is dead and closes it (causing it to drop out of the application's connection pool, with a new connection getting opened instead).

Mitigation

So, if you're looking to mitigate this through system config changes, you could do so via net.ipv4.tcp_retries2. If you drop it to 8 the maximum connection hang time will be around 1 minute (which still isn't great, but is in line with timeout thresholds that often get set at application level).

If you wanted to go a step further and adjust TCP_RTO_MIN and TCP_RTO_MAX using BPF there's (what looks like) a good write up at https://arthurchiao.art/blog/customize-tcp-initial-rto-with-bpf/.

Does that help? Happy to go into more (or less) depth if needed

@Yuhjiang
Copy link

@btasker Thank you very much for you detailed reply, I learn a lot from it. I've figured out the problem and the reason. Finally, you've been a big help. I will try x/[email protected] soon to verify if it could solve the problem.

@dmitshur
Copy link
Contributor

Issue #60818 may also be relevant here.

@dmitshur
Copy link
Contributor

Given the report in #60818 of this fix causing a problem, moving back to CherryPickCandidate state so we can take another look.

@dmitshur dmitshur added CherryPickCandidate Used during the release process for point releases WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed CherryPickApproved Used during the release process for point releases labels Jun 21, 2023
@gopherbot gopherbot modified the milestones: Go1.20.6, Go1.20.7 Jul 11, 2023
@alkmc
Copy link

alkmc commented Jul 20, 2023

@dmitshur seems that #60818 is closed now.
Perhaps it is the right time to look again into this issue?

@dmitshur
Copy link
Contributor

@alkmc Note that #60818 was fixed by reverting the very change that's being requested for backport here, and its issue has been reopened. There's not much we can do, other than reject this cherry-pick candidate, until #59690 becomes resolved again.

@btasker
Copy link
Author

btasker commented Jul 20, 2023

@alkmc the way we've mitigated is to instead enable H2 healthchecks.

If you're importing net/http and relying on the H2 bundle, you can't directly access ReadIdleTimeout on the H2 object.

But, net/http2 includes ConfigureTransports which accepts a http.Transport and returns the h2 transport, so you can do

http2Trans, err := http2.ConfigureTransports(transport)
if err == nil {
   http2Trans.ReadIdleTimeout = time.Duration(ReadIdleTimeout)
}

The count-down to ReadIdleTimeout starts when the last frame is received (so is reset everytime the server responds). If nothing is received for that period, a H2 Ping is sent - if that's not answered (there's a PingTimeout attribute if you want to change the threshold for that) the connection is torn down.

One thing I did find though, there's something about setting ReadIdleTimeout which fixes/prevents the race that stopped closeIdleConnections() from working so you may find out your normal timeout setting starts working again (I haven't dug into why)

@gopherbot gopherbot removed this from the Go1.20.7 milestone Aug 1, 2023
@gopherbot gopherbot added this to the Go1.20.8 milestone Aug 1, 2023
@dmitshur dmitshur modified the milestones: Go1.20.8, Go1.20.9 Sep 1, 2023
@gopherbot gopherbot modified the milestones: Go1.20.9, Go1.20.10, Go1.20.11 Oct 5, 2023
@gopherbot gopherbot modified the milestones: Go1.20.11, Go1.20.12 Nov 7, 2023
@gopherbot gopherbot modified the milestones: Go1.20.12, Go1.20.13 Dec 5, 2023
@gopherbot gopherbot modified the milestones: Go1.20.13, Go1.20.14 Jan 9, 2024
@gopherbot gopherbot modified the milestones: Go1.20.14, Go1.20.15 Feb 6, 2024
@dmitshur
Copy link
Contributor

The upstream issue isn't resolved yet, and Go 1.20 has exited its support window per https://go.dev/doc/devel/release#policy.

@dmitshur dmitshur closed this as not planned Won't fix, can't repro, duplicate, stale Mar 28, 2024
@dmitshur dmitshur removed the CherryPickCandidate Used during the release process for point releases label Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

11 participants