Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http_response does not measure anything on dns-error #2784

Closed
froth opened this issue May 11, 2017 · 11 comments
Closed

http_response does not measure anything on dns-error #2784

froth opened this issue May 11, 2017 · 11 comments
Labels
bug unexpected problem or unintended behavior
Milestone

Comments

@froth
Copy link
Contributor

froth commented May 11, 2017

Bug report

If a service is unavailable because of misconfiguration dns, the http_response plugin does not send measurement data.

Relevant telegraf.conf:

[[inputs.http_response]]
address = "http://does.not.exist/health"
response_timeout = "5s"
method = "GET"

System info:

Telegraf v1.2.1 (git: release-1.2 3b6ffb3)

Steps to reproduce:

  1. configure http_response to check a url with domain which can not be resolved

Expected behavior:

response_time=0.16079805900000002,http_connection_failure=1

Actual behavior:

2017-05-11T11:36:20Z E! ERROR in input [inputs.http_response]: Get http://does.not.exist/health: dial tcp: lookup does.not.exist on 8.8.8.8:53: no such host

@danielnelson danielnelson added this to the 1.4.0 milestone May 11, 2017
@danielnelson danielnelson added the bug unexpected problem or unintended behavior label May 11, 2017
@danielnelson
Copy link
Contributor

related #2624

@danielnelson
Copy link
Contributor

I think we should use a new field response_timeout. This way we don't pollute the field containing actual times with values that have a different meaning. I would include the units except it doesn't fit with the format of the existing fields. All the fields that are not applicable should just be left out, so no http_response_code, response_string_match.

Happy Path:

http_response,method=GET,server=http://www.github.com http_response_code=200i,response_time=6.223266528 1459419354977857955

Timeout:

http_response,method=GET,server=http://www.github.com response_timeout=10 1459419354977857955

@phemmer
Copy link
Contributor

phemmer commented May 17, 2017

Are we sure we want to call it "timeout"? The request can fail for numerous reasons other than a timeout. Connection refused, dns lookup failure, connection closed without response, etc.

@froth
Copy link
Contributor Author

froth commented May 17, 2017

I second phemmers comment. Basically what the error state says is "something below the http layer went wrong". It is hard to come up with a good metric name for that. If we call it "http_connection_failed" or something similar there will me confusion, whether it should be 1 or 0, when the server returns 500. The http connection was established but there was an http-error. Maybe one of you guys have a good idea for a name? If we create a new metric, we should at least come up with something less confusing than http_response returning -1 ;)

@danielnelson
Copy link
Contributor

That's a good point, I'm confusing this with that other timeout issue.

If we did return a timeout field only for timeouts, would we still want another field for connection failed? The distinction between a missing point and sending an point with an error field seems to be that you know that telegraf is trying. If we want to know there was an attempt, does that mean we need at least one new field for every measurement in the case that there was an error?

@froth
Copy link
Contributor Author

froth commented May 18, 2017

I am not quite sure if this is necessary for each type of measurement but why not?

For http_response it is quite important to be able to measure these "error attempts" I think. We(and most users I think) use the plugin to check whether a Service under a specific URL is answering correctly. We are interested whether the service is responding and whether the reponse given is 200(or other). Currently however we only get data if the http service is reachable and responds. So our alarm bell rings when a 500 is responded but not when connections timeout or dns does not work correctly etc. However this is information I think most people expect from a check for a http endpoint.

So it is quite important to fix http_response, I think.

Would it be OK for you if I submit a patch that introduces a generic connection_failed metric?

After that I would suggest we discuss whether and which more specific metrics we need for http_response. That way we would can a basic solution first and then we can concentrate on making it perfect (and think about, whether we need a generic solution for all measurements).

@danielnelson
Copy link
Contributor

Yes, lets do that. So we add connection_failed (float, count) and response_timeout (float, seconds) fields.

@froth
Copy link
Contributor Author

froth commented May 22, 2017

See my pull request #2814
We already have a field response_time containing the request time in seconds. Therefore I added response_timeout as 0/1 boolean flag instead of seconds to avoid redundancy.

@danielnelson
Copy link
Contributor

I think we should not use the response_time field. Technically, there was no response so it doesn't match, and it would dirty the data if you want to perform an aggregation over the field.

@froth
Copy link
Contributor Author

froth commented May 23, 2017

I agree, pull request has been updated. Please have a look :)

@danielnelson
Copy link
Contributor

PR has been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants