-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(outputs.influxdb_v2): expose HTTP/2 client timeouts #13562
Conversation
Have given this a quick test using a 30s [[outputs.influxdb_v2]]
urls = ["https://us-central1-1.gcp.cloud2.influxdata.com"]
token = "<redacted>"
organization = "<redacted>"
bucket = "woo"
timeout = "15s"
alias = "woo"
read_idle_timeout = "30s" Running:
Killed comms to the client port that Telegraf had acquired
Telegraf recovered nicely:
|
There's something off with the timings though.
We can inspect the timings by putting the H2 client into debug mode
After locating the first For example, in my log I have
So, although a write failure is logged at 10:23:05, the A little further down, we see that a ping attempt has timed out (i.e. hit the threshold defined by
The result of this is that the connection is closed
So, although we expected to see
The teardown actually happened much more quickly:
I re-ran with
So this time, it's about 19s. For completeness, I set
So, although the presence of these settings is clearly activating the healthchecks (things don't recover without these options present), for whatever reason their values don't seem to be being taken into account. I don't currently have a good explanation of why that might be, need to probe some more |
Ah, now this is interesting. It's not actually the health check that's tearing the connection down in the tests above. When a health check is triggered, there's a specific log line written:
That doesn't appear in the debug output. The connection is being closed because So the bit that matters in the tests above is the value of I've extended that to give the ping timeout room to work
This time we get the following timings
Then we get out ping (30s after the last read)
40s later that ping times out
and the next write is successful. So, I'm not yet sure how, but by setting |
So, actually, with this build, it's not even necessary to set the options - it addresses the zombie connection issue either way. The config
will still recover. The bad news is, this is probably only temporary. The reason this happens (I think) is because we call Telegraf pulls in So, the timeout recovery I saw earlier in testing is actually expected, but will break again. |
Sorry @powersj, I took a very long way around, but this seems to be working as it should. Whilst Telegraf depends on But, once |
That was quite a roller coaster of responses ;)
Should we only call this if the user has set one of the timeouts then? That way we ensure the behavior does not change at all unless someone sets one of these values?
Do you have suggestions on the TODO values for the readme and sample config?
I have updated the PR to use that version. Can you verify? |
Hah, yes, it's definitely been a day of mixed emotions :)
Yep, that sounds like a good shout to me - means we can guarantee that nothing changes without these set.
Yep, so using the newer artefact: [[outputs.influxdb_v2]]
urls = ["https://us-central1-1.gcp.cloud2.influxdata.com"]
token = "<redacted>"
organization = "<redacted>"
bucket = "woo"
timeout = "15s"
alias = "woo"
# read_idle_timeout = "30"
# ping_timeout = "25" Exhibits the expected broken behaviour - when the connection is broken, we have to wait for the kernel to exhaust its retries. [[outputs.influxdb_v2]]
urls = ["https://us-central1-1.gcp.cloud2.influxdata.com"]
token = "<redacted>"
organization = "<redacted>"
bucket = "woo"
timeout = "15s"
alias = "woo"
read_idle_timeout = "30"
# ping_timeout = "25" Strangely, this appears to restore the behaviour I had earlier where What's interesting, is it seems (assuming I haven't just been very lucky) to happen Further testing suggests that this was just blind luck in terms of the thresholds I chose! Running with timeout = "15s"
alias = "woo"
read_idle_timeout = "55"
# ping_timeout = "25" leads to a different experience:
That then ultimately recovers So there's a slight race possible here. I need to dig back into Whatever the cause, in terms of behaviour, it's probably reasonably beneficial - it hastens recovery rather than delaying it. The idle count is only reset by frames, so there shouldn't be any adverse affect if With that knowledge: If [[outputs.influxdb_v2]]
urls = ["https://us-central1-1.gcp.cloud2.influxdata.com"]
token = "<redacted>"
organization = "<redacted>"
bucket = "woo"
timeout = "60s"
alias = "woo"
read_idle_timeout = "30s"
# ping_timeout = "25" Results in
Looks like success:
I think we probably want to recommend 15s for
It's probably also wise to size it to try and limit its interaction with The read_idle_timeout = "30"
ping_timeout = "15"
(assuming If we want to hedge a little, we probably could get away with much higher values though - the underlying Go issue can lead to connections being blocked for ~17 minutes, so a TL:DRLooks good to me, there are just some nuances we might need to be aware of around its interaction with |
ok I have done the following:
|
LGTM, thanks! The only extra thing I can think of that we might want to add is a test which uses the new config items. I don't think we can make a test force the type of failure needed to trigger the underlying behaviour that this change addresses, but having a test which forces use of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @powersj! I have one question regarding why you suggest the values you suggest. Just out of curiosity.
The other comment I have is the (seemingly) unrelated change to outputs.instrumental
... Why is this required?
## The following values control the HTTP/2 client's timeouts. These settings | ||
## are generally not required unless a user is seeing issues with client | ||
## disconnects. If a user does see issues, then it is suggested to set these | ||
## values to "15s" for ping timeout and "30s" for read idle timeout and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious why these values!?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These came out of the conversation with @btasker, specifically #13562 (comment)
This came out of failing tests, which you have seen the other PR for instrumental as well. I wanted to get artifacts to test. If we land that other PR I can rebase and remove these changes. |
Download PR build artifacts for linux_amd64.tar.gz, darwin_amd64.tar.gz, and windows_amd64.zip. 📦 Click here to get additional PR build artifactsArtifact URLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @powersj! LGTM.
(cherry picked from commit d2ebc8e)
No description provided.