
fix(outputs.influxdb_v2): expose HTTP/2 client timeouts #13562

Merged · 7 commits · Jul 10, 2023

Conversation

@powersj (Contributor) commented Jul 6, 2023

No description provided.

@telegraf-tiger bot added the area/influxdb, fix, and plugin/output labels on Jul 6, 2023
@btasker (Contributor) commented Jul 7, 2023

Have given this a quick test using a 30s read_idle_timeout

[[outputs.influxdb_v2]]
  urls = ["https://us-central1-1.gcp.cloud2.influxdata.com"]
  token = "<redacted>"
  organization = "<redacted>"
  bucket = "woo"
  timeout = "15s"
  alias = "woo"
  read_idle_timeout = "30s"

Running:

$ ./telegraf-1.28.0/usr/bin/telegraf --config telegraf_repro_c2_timeout.conf

Killed comms to the client port that Telegraf had acquired

$ sudo iptables -I INPUT -p tcp --dport 34774 -j DROP

Telegraf recovered nicely:

2023-07-07T10:16:30Z D! [outputs.influxdb_v2::woo] Wrote batch of 7696 metrics in 1.213010468s
2023-07-07T10:16:30Z D! [outputs.influxdb_v2::woo] Buffer fullness: 1870 / 100000 metrics
2023-07-07T10:16:39Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval
2023-07-07T10:16:39Z D! [outputs.influxdb_v2::woo] Buffer fullness: 15407 / 100000 metrics
2023-07-07T10:16:44Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval
2023-07-07T10:16:44Z D! [outputs.influxdb_v2::woo] Buffer fullness: 23147 / 100000 metrics
2023-07-07T10:16:49Z E! [outputs.influxdb_v2::woo] When writing to [https://us-central1-1.gcp.cloud2.influxdata.com]: Post "https://us-central1-1.gcp.cloud2.influxdata.com/api/v2/write?bucket=woo&org=fd76a6a54498de0a": context deadline exceeded
2023-07-07T10:16:49Z D! [outputs.influxdb_v2::woo] Buffer fullness: 30973 / 100000 metrics
2023-07-07T10:16:49Z E! [agent] Error writing to outputs.influxdb_v2::woo: failed to send metrics to any configured server(s)
2023-07-07T10:16:51Z D! [outputs.influxdb_v2::woo] Wrote batch of 10000 metrics in 2.158238188s
2023-07-07T10:16:51Z D! [outputs.influxdb_v2::woo] Buffer fullness: 24296 / 100000 metrics
2023-07-07T10:16:53Z D! [outputs.influxdb_v2::woo] Wrote batch of 10000 metrics in 1.8990725s
2023-07-07T10:16:54Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval
2023-07-07T10:16:54Z D! [outputs.influxdb_v2::woo] Buffer fullness: 18681 / 100000 metrics
2023-07-07T10:16:55Z D! [outputs.influxdb_v2::woo] Wrote batch of 10000 metrics in 1.788305659s
2023-07-07T10:16:56Z D! [outputs.influxdb_v2::woo] Wrote batch of 9993 metrics in 1.543969249s

@btasker (Contributor) commented Jul 7, 2023

There's something off with the timings though.

ReadIdleTimeout defines how long a connection can go without receiving frames, so with a setting of 30s it'll generally trigger less than 30s after a write request appears to fail: the last response was likely received flush_interval seconds earlier (unless metric_batch_size was hit first).
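
For reference, these options correspond to the ReadIdleTimeout and PingTimeout fields on x/net/http2's Transport. A minimal sketch of the wiring (the helper name is illustrative, not the plugin's actual code):

package httpconfig

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// enableHTTP2HealthChecks upgrades an existing HTTP/1 transport for HTTP/2
// and sets the two keep-alive knobs discussed above.
func enableHTTP2HealthChecks(t1 *http.Transport, readIdle, ping time.Duration) error {
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		return err
	}
	// Send a PING frame if no frame has been received on the connection
	// for this long; zero disables the health check entirely.
	t2.ReadIdleTimeout = readIdle
	// Close the connection if the PING goes unanswered for this long;
	// when left at zero, x/net/http2 defaults it to 15s.
	t2.PingTimeout = ping
	return nil
}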

We can inspect the timings by putting the H2 client into debug mode

export GODEBUG=http2debug=2
./telegraf-1.28.0/usr/bin/telegraf --config telegraf_repro_c2_timeout.conf

After locating the first "did not complete within its flush interval" message, we need to look above it for the most recent indication that the transport received something.

For example, in my log I have

2023-07-07T10:22:56Z I! http2: Transport received HEADERS flags=END_STREAM|END_HEADERS stream=3 len=168
2023-07-07T10:22:56Z D! [outputs.influxdb_v2::woo] Wrote batch of 7741 metrics in 1.299565621s
... snip ...
2023-07-07T10:23:05Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval

So, although a write failure is logged at 10:23:05, the ReadIdleTimeout counter started counting at 10:22:56 (because that's when we last received a frame from the server).

A little further down, we see that a ping attempt has timed out (i.e. hit the threshold defined by ping_timeout, which defaults to 15s in the current Go code):

2023-07-07T10:23:15Z I! RoundTrip failure: context deadline exceeded

The result of this is that the connection is closed

2023-07-07T10:23:15Z I! http2: Framer 0xc0033fe7e0: wrote RST_STREAM stream=5 len=4 ErrCode=CANCEL
2023-07-07T10:23:15Z I! http2: Transport closing idle conn 0xc000e18180 (forSingleUse=false, maxStream=5)

So, although we expected to see:

  • Packet received @ 10:22:56
  • Wait 30s (10:23:26)
  • Trigger healthcheck
  • Wait 15s for ping timeout (10:23:41)
  • Transport closed

The teardown actually happened much more quickly:

  • Packet received @ 10:22:56
  • +19s Ping timed out @ 10:23:15
  • Connection torn down

I re-ran with read_idle_timeout set to 50s and got the following timings:

  • Last logged frame @ 10:56:29
  • did not complete within flush interval message @ 10:56:38
  • Connection tear down @ 10:56:48Z

So this time, it's about 19s.

For completeness, I set ping_timeout = "40s" and read_idle_timeout = "30s" and got the following timings:

  • Last logged frame @ 11:02:56Z
  • Flush interval message @ 11:03:05Z
  • Connection tear down @ 11:03:15Z

So, although the presence of these settings is clearly activating the healthchecks (things don't recover without these options present), for whatever reason their values don't seem to be taken into account.

I don't currently have a good explanation for why that might be; I need to probe some more.

@btasker (Contributor) commented Jul 7, 2023

Ah, now this is interesting.

It's not actually the health check that's tearing the connection down in the tests above.

When a health check is triggered, there's a specific log line written:

http2: Transport sending health check

That doesn't appear in the debug output.

The connection is being closed because closeIdleConns() is managing to identify the connection as idle and so is willing to close it (essentially addressing the race that started us down this path in the first place).

So the bit that matters in the tests above is the value of timeout in the outputs.influxdb_v2 section.

I've extended that to give the ping timeout room to work

[[outputs.influxdb_v2]]
  urls = ["https://us-central1-1.gcp.cloud2.influxdata.com"]
  token = "<redacted>"
  organization = "<org>"
  bucket = "woo"
  timeout = "90s"
  alias = "woo"
  read_idle_timeout = "30s"
  ping_timeout = "40s"

This time we get the following timings:

  • last response received @ 11:21:51Z
  • Flush interval warnings start @ 11:22:00Z

Then we get our ping (30s after the last read):

2023-07-07T11:22:21Z I! http2: Transport sending health check
2023-07-07T11:22:21Z I! http2: Framer 0xc0035c3a40: wrote PING len=8 ping="2\xa8q}u\x1d/\x98"

40s later, that ping times out:

2023-07-07T11:23:01Z I! http2: Transport health check failure: context deadline exceeded
2023-07-07T11:23:01Z I! http2: Transport readFrame error on conn 0xc000579080: (*net.OpError) read tcp 192.168.3.217:39378->35.202.191.0:443: use of closed network connection
2023-07-07T11:23:01Z I! http2: Framer 0xc0035c3a40: wrote RST_STREAM stream=7 len=4 ErrCode=CANCEL
2023-07-07T11:23:01Z I! RoundTrip failure: http2: client connection lost

and the next write is successful.

So, I'm not yet sure how, but by setting ReadIdleTimeout or PingTimeout to a non-zero value, we've also inadvertently prevented (rather than simply mitigated) the behaviour that led to golang/go#59690.

@btasker (Contributor) commented Jul 7, 2023

So, actually, with this build, it's not even necessary to set the options - it addresses the zombie connection issue either way.

The config

[[outputs.influxdb_v2]]
  urls = ["https://us-central1-1.gcp.cloud2.influxdata.com"]
  token = "<redacted>"
  organization = "<org>"
  bucket = "woo"
  timeout = "90s"
  alias = "woo"
#  read_idle_timeout = "30s"
# ping_timeout = "40s"

will still recover.

The bad news is, this is probably only temporary.

The reason this happens (I think) is that we call x/net/http2's configureTransports(), which returns an H2 transport (using the definition in x/net/http2 rather than the one in the core net/http h2_bundle).

Telegraf pulls in x/net v0.11.0, which still has the fix for golang/go#59690 in it. That fix unfortunately needed to be reverted, so it won't be present in versions > v0.11.0 (and at the time of writing, there isn't a replacement fix in place).

So, the timeout recovery I saw earlier in testing is actually expected, but will break again.

@btasker (Contributor) commented Jul 7, 2023

Sorry @powersj, I took a very long way around, but this seems to be working as it should.

Whilst Telegraf depends on x/net v0.11.0, the pings probably won't trigger unless the user has timeout set higher than read_idle_timeout. This might cause a little confusion at first.

But once v0.12.0 is used, it should work almost exactly as expected.

@powersj (Contributor, Author) commented Jul 7, 2023

@btasker,

That was quite a roller coaster of responses ;)

The reason this happens (I think) is because we call x/net/http2's configureTransports() , which returns a H2 transport (using the definition in net/http2 rather than the one in the core net/http h2_bundle).

Should we only call this if the user has set one of the timeouts then? That way we ensure the behavior does not change at all unless someone sets one of these values?

values to TODO for ping timeout and TODO for read idle timeout and retry.

Do you have suggestions on the TODO values for the readme and sample config?

But, once v0.12.0 is used it should work almost exactly as expected

I have updated the PR to use that version. Can you verify?

@btasker (Contributor) commented Jul 7, 2023

That was quite a roller coaster of responses ;)

Hah, yes, it's definitely been a day of mixed emotions :)

Should we only call this if the user has set one of the timeouts then? That way we ensure the behavior does not change at all unless someone sets one of these values?

Yep, that sounds like a good shout to me - means we can guarantee that nothing changes without these set.
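
A sketch of that guard (helper name hypothetical; the final diff may differ):

package httpconfig

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// maybeConfigureHTTP2 leaves the transport untouched unless the user set at
// least one of the new options, so default behaviour is guaranteed to be
// unchanged when neither option is present.
func maybeConfigureHTTP2(t1 *http.Transport, readIdle, ping time.Duration) error {
	if readIdle == 0 && ping == 0 {
		return nil
	}
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		return err
	}
	t2.ReadIdleTimeout = readIdle
	t2.PingTimeout = ping
	return nil
}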

I have updated the PR to use that version. Can you verify?

Yep, so using the newer artefact:

[[outputs.influxdb_v2]]
  urls = ["https://us-central1-1.gcp.cloud2.influxdata.com"]
  token = "<redacted>"
  organization = "<redacted>"
  bucket = "woo"
  timeout = "15s"
  alias = "woo"
#  read_idle_timeout = "30"
#  ping_timeout = "25"

Exhibits the expected broken behaviour - when the connection is broken, we have to wait for the kernel to exhaust its retries.

[[outputs.influxdb_v2]]
  urls = ["https://us-central1-1.gcp.cloud2.influxdata.com"]
  token = "<redacted>"
  organization = "<redacted>"
  bucket = "woo"
  timeout = "15s"
  alias = "woo"
  read_idle_timeout = "30"
#  ping_timeout = "25"

Strangely, this appears to restore the behaviour I had earlier where timeout will lead to the connection being abandoned.

What's interesting is that it seems (assuming I haven't just been very lucky) to happen read_idle_timeout seconds after the last read.

Further testing suggests that this was just blind luck in terms of the thresholds I chose! Running with:

  timeout = "15s"
  alias = "woo"
  read_idle_timeout = "55"
#  ping_timeout = "25"

leads to a different experience:

2023-07-07T14:33:32Z I! RoundTrip failure: context deadline exceeded
2023-07-07T14:33:32Z E! [outputs.influxdb_v2::woo] When writing to [https://us-central1-1.gcp.cloud2.influxdata.com]: Post "https://us-central1-1.gcp.cloud2.influxdata.com/api/v2/write?bucket=woo&org=fd76a6a54498de0a": context deadline exceeded
2023-07-07T14:33:32Z D! [outputs.influxdb_v2::woo] Buffer fullness: 54206 / 100000 metrics
2023-07-07T14:33:32Z E! [agent] Error writing to outputs.influxdb_v2::woo: failed to send metrics to any configured server(s)
2023-07-07T14:33:32Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval
2023-07-07T14:33:32Z D! [outputs.influxdb_v2::woo] Buffer fullness: 54213 / 100000 metrics
2023-07-07T14:33:33Z W! [inputs.cpu] Collection took longer than expected; not complete after interval of 5ms
2023-07-07T14:33:33Z D! [inputs.cpu] Previous collection has not completed; scheduled collection skipped
2023-07-07T14:33:37Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval
2023-07-07T14:33:37Z D! [outputs.influxdb_v2::woo] Buffer fullness: 61941 / 100000 metrics
2023-07-07T14:33:42Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval
2023-07-07T14:33:42Z D! [outputs.influxdb_v2::woo] Buffer fullness: 69691 / 100000 metrics
2023-07-07T14:33:47Z I! RoundTrip failure: context deadline exceeded
2023-07-07T14:33:47Z E! [outputs.influxdb_v2::woo] When writing to [https://us-central1-1.gcp.cloud2.influxdata.com]: Post "https://us-central1-1.gcp.cloud2.influxdata.com/api/v2/write?bucket=woo&org=fd76a6a54498de0a": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2023-07-07T14:33:47Z D! [outputs.influxdb_v2::woo] Buffer fullness: 77426 / 100000 metrics
2023-07-07T14:33:47Z E! [agent] Error writing to outputs.influxdb_v2::woo: failed to send metrics to any configured server(s)
2023-07-07T14:33:57Z I! http2: Transport sending health check

That then ultimately recovers

So there's a slight race possible here.

I need to dig back into net/http2 to verify, but I think the answer to "why" is: if read_idle_timeout is hit around the same time as (but very slightly before) the connection timeout, it'll mark the connection as idle, but before the healthcheck triggers we'll have called closeIdleConnections(), resulting in the connection being torn down.

Whatever the cause, in terms of behaviour it's probably beneficial - it hastens recovery rather than delaying it. The idle count is only reset by frames, so there shouldn't be any adverse effect if closeIdleConnections() gets called first (or at the same time) - it'd just end up waiting for the ping to fail before closing.

With that knowledge: if timeout is greater than read_idle_timeout + ping_timeout, we should see the recovery happen based on the ping alone (with the config below, 60s > 30s + the 15s default ping_timeout):

[[outputs.influxdb_v2]]
  urls = ["https://us-central1-1.gcp.cloud2.influxdata.com"]
  token = "<redacted>"
  organization = "<redacted>"
  bucket = "woo"
  timeout = "60s"
  alias = "woo"
  read_idle_timeout = "30s"
#  ping_timeout = "25"

Results in:

2023-07-07T14:42:38Z I! http2: Transport received HEADERS flags=END_STREAM|END_HEADERS stream=11 len=169
2023-07-07T14:42:38Z D! [outputs.influxdb_v2::woo] Wrote batch of 7741 metrics in 1.061120408s

2023-07-07T14:42:47Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval
2023-07-07T14:42:47Z D! [outputs.influxdb_v2::woo] Buffer fullness: 15542 / 100000 metrics
2023-07-07T14:42:52Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval
2023-07-07T14:42:52Z D! [outputs.influxdb_v2::woo] Buffer fullness: 23267 / 100000 metrics
2023-07-07T14:42:57Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval
2023-07-07T14:42:57Z D! [outputs.influxdb_v2::woo] Buffer fullness: 31002 / 100000 metrics
2023-07-07T14:43:02Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval
2023-07-07T14:43:02Z D! [outputs.influxdb_v2::woo] Buffer fullness: 38721 / 100000 metrics
2023-07-07T14:43:04Z D! [inputs.cpu] Previous collection has not completed; scheduled collection skipped
2023-07-07T14:43:07Z W! [agent] ["outputs.influxdb_v2::woo"] did not complete within its flush interval
2023-07-07T14:43:07Z D! [outputs.influxdb_v2::woo] Buffer fullness: 46463 / 100000 metrics
2023-07-07T14:43:08Z I! http2: Transport sending health check
2023-07-07T14:43:08Z I! http2: Framer 0xc00362c700: wrote PING len=8 ping="~VM.\x91\xac\n\xbb

2023-07-07T14:43:23Z I! http2: Transport health check failure: context deadline exceeded
2023-07-07T14:43:23Z I! http2: Transport readFrame error on conn 0xc000afe180: (*net.OpError) read tcp 192.168.3.217:47504->35.202.191.0:443: use of closed network connection
2023-07-07T14:43:23Z I! http2: Framer 0xc00362c700: wrote RST_STREAM stream=13 len=4 ErrCode=CANCEL
2023-07-07T14:43:23Z I! RoundTrip failure: http2: client connection lost

2023-07-07T14:43:23Z I! http2: Transport failed to get client conn for us-central1-1.gcp.cloud2.influxdata.com:443: http2: no cached connection was available
2023-07-07T14:43:23Z I! http2: Transport failed to get client conn for us-central1-1.gcp.cloud2.influxdata.com:443: http2: no cached connection was available
2023-07-07T14:43:23Z I! http2: Transport creating client conn 0xc000afe000 to 35.202.191.0:443

Looks like success:

  • Healthcheck fires when expected (30s after the last received frame)
  • Ping times out when expected (15s later)
  • The connection is torn down (moments later)
  • A new write attempt establishes a new connection

Do you have suggestions on the TODO values for the readme and sample config?

I think we probably want to recommend 15s for ping_timeout (or leaving it unset) - that aligns with the default used by Go, and nothing in my testing suggests that tweaking it yields much benefit.

read_idle_timeout needs to be longer than flush_interval (otherwise, unless metric_batch_size is hit, healthchecks will automatically trigger between writes - potentially leading to a change in behaviour during normal operation).

It's probably also wise to size it to try and limit its interaction with timeout (if only because the resulting behaviour can be a little confusing).

The README and sample.conf both use a value of 5s for timeout, so with that in mind, I'd suggest:

  read_idle_timeout = "30"
  ping_timeout = "15"

(assuming flush_interval is less than 30s or there's strong confidence in metric_batch_size being hit).

If we want to hedge a little, we probably could get away with much higher values though - the underlying Go issue can lead to connections being blocked for ~17 minutes, so a read_idle_timeout of even 2 minutes would still deliver a significant improvement.


TL;DR:

Looks good to me; there are just some nuances we might need to be aware of around its interaction with timeout. They shouldn't cause any issues, other than "it seems to have recovered faster than expected".

@powersj (Contributor, Author) commented Jul 7, 2023

OK, I have done the following:

  • Added your note about the timer
  • Added the 15s and 30s suggestions
  • Only configure the http2 transport when either of the http2 settings are set
  • Rebased on master

@btasker (Contributor) commented Jul 7, 2023

LGTM, thanks!

The only extra thing I can think of that we might want to add is a test which uses the new config items.

I don't think we can make a test force the type of failure needed to trigger the underlying behaviour that this change addresses, but having a test which forces use of http2.ConfigureTransports() would help detect if x/net is ever updated to some future version with a breaking change.
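
A minimal sketch of such a test (name hypothetical; it only asserts that ConfigureTransports still wires up HTTP/2, not the failure mode itself):

package influxdb_v2

import (
	"net/http"
	"testing"
	"time"

	"golang.org/x/net/http2"
)

func TestHTTP2TransportConfigured(t *testing.T) {
	t1 := &http.Transport{}
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		t.Fatalf("ConfigureTransports failed: %v", err)
	}
	t2.ReadIdleTimeout = 30 * time.Second
	t2.PingTimeout = 15 * time.Second

	// ConfigureTransports registers the h2 upgrade on the HTTP/1 transport;
	// a breaking x/net update would surface here.
	if _, ok := t1.TLSNextProto["h2"]; !ok {
		t.Error("expected h2 to be registered in TLSNextProto")
	}
}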

@powersj marked this pull request as ready for review on Jul 7, 2023
@powersj added the ready for final review label on Jul 7, 2023
@srebhan (Member) left a comment:

Thanks for the fix @powersj! I have one question, just out of curiosity, about why you suggest the values you do.
The other comment I have is the (seemingly) unrelated change to outputs.instrumental... Why is this required?

## The following values control the HTTP/2 client's timeouts. These settings
## are generally not required unless a user is seeing issues with client
## disconnects. If a user does see issues, then it is suggested to set these
## values to "15s" for ping timeout and "30s" for read idle timeout and
@srebhan (Member): Just curious why these values!?

@powersj (Contributor, Author): These came out of the conversation with @btasker, specifically #13562 (comment)

@powersj (Contributor, Author) commented Jul 10, 2023

The other comment I have is the (seemingly) unrelated change to outputs.instrumental... Why is this required?

This came out of failing tests; you have seen the other PR for instrumental as well. I wanted to get artifacts to test. If we land that other PR, I can rebase and remove these changes.

@telegraf-tiger (bot) commented


⚠️ This pull request increases the Telegraf binary size by 3.37 % for linux amd64 (new size: 193.7 MB, nightly size 187.4 MB)


@srebhan (Member) left a comment:

Thanks @powersj! LGTM.

@srebhan srebhan merged commit d2ebc8e into influxdata:master Jul 10, 2023
@srebhan srebhan added this to the v1.27.2 milestone Jul 10, 2023
powersj added a commit that referenced this pull request Jul 10, 2023