http_response does not measure anything on dns-error #2784

froth · 2017-05-11T11:39:12Z

Bug report

If a service is unavailable because of misconfiguration dns, the http_response plugin does not send measurement data.

Relevant telegraf.conf:

[[inputs.http_response]]
address = "http://does.not.exist/health"
response_timeout = "5s"
method = "GET"

System info:

Telegraf v1.2.1 (git: release-1.2 3b6ffb3)

Steps to reproduce:

configure http_response to check a url with domain which can not be resolved

Expected behavior:

response_time=0.16079805900000002,http_connection_failure=1

Actual behavior:

2017-05-11T11:36:20Z E! ERROR in input [inputs.http_response]: Get http://does.not.exist/health: dial tcp: lookup does.not.exist on 8.8.8.8:53: no such host

The text was updated successfully, but these errors were encountered:

danielnelson · 2017-05-12T22:13:00Z

related #2624

danielnelson · 2017-05-16T23:01:11Z

I think we should use a new field response_timeout. This way we don't pollute the field containing actual times with values that have a different meaning. I would include the units except it doesn't fit with the format of the existing fields. All the fields that are not applicable should just be left out, so no http_response_code, response_string_match.

Happy Path:

http_response,method=GET,server=http://www.github.com http_response_code=200i,response_time=6.223266528 1459419354977857955

Timeout:

http_response,method=GET,server=http://www.github.com response_timeout=10 1459419354977857955

phemmer · 2017-05-17T02:25:05Z

Are we sure we want to call it "timeout"? The request can fail for numerous reasons other than a timeout. Connection refused, dns lookup failure, connection closed without response, etc.

froth · 2017-05-17T06:16:24Z

I second phemmers comment. Basically what the error state says is "something below the http layer went wrong". It is hard to come up with a good metric name for that. If we call it "http_connection_failed" or something similar there will me confusion, whether it should be 1 or 0, when the server returns 500. The http connection was established but there was an http-error. Maybe one of you guys have a good idea for a name? If we create a new metric, we should at least come up with something less confusing than http_response returning -1 ;)

danielnelson · 2017-05-18T00:56:06Z

That's a good point, I'm confusing this with that other timeout issue.

If we did return a timeout field only for timeouts, would we still want another field for connection failed? The distinction between a missing point and sending an point with an error field seems to be that you know that telegraf is trying. If we want to know there was an attempt, does that mean we need at least one new field for every measurement in the case that there was an error?

froth · 2017-05-18T06:05:58Z

I am not quite sure if this is necessary for each type of measurement but why not?

For http_response it is quite important to be able to measure these "error attempts" I think. We(and most users I think) use the plugin to check whether a Service under a specific URL is answering correctly. We are interested whether the service is responding and whether the reponse given is 200(or other). Currently however we only get data if the http service is reachable and responds. So our alarm bell rings when a 500 is responded but not when connections timeout or dns does not work correctly etc. However this is information I think most people expect from a check for a http endpoint.

So it is quite important to fix http_response, I think.

Would it be OK for you if I submit a patch that introduces a generic connection_failed metric?

After that I would suggest we discuss whether and which more specific metrics we need for http_response. That way we would can a basic solution first and then we can concentrate on making it perfect (and think about, whether we need a generic solution for all measurements).

danielnelson · 2017-05-18T21:54:12Z

Yes, lets do that. So we add connection_failed (float, count) and response_timeout (float, seconds) fields.

froth · 2017-05-22T12:44:56Z

See my pull request #2814
We already have a field response_time containing the request time in seconds. Therefore I added response_timeout as 0/1 boolean flag instead of seconds to avoid redundancy.

danielnelson · 2017-05-22T21:26:25Z

I think we should not use the response_time field. Technically, there was no response so it doesn't match, and it would dirty the data if you want to perform an aggregation over the field.

froth · 2017-05-23T07:33:59Z

I agree, pull request has been updated. Please have a look :)

danielnelson · 2017-06-07T00:07:20Z

PR has been merged.

danielnelson added this to the 1.4.0 milestone May 11, 2017

danielnelson added the bug unexpected problem or unintended behavior label May 11, 2017

ceseuron mentioned this issue May 12, 2017

Input net_response does not return measurements when target port is not open. #2794

Closed

froth mentioned this issue May 16, 2017

Add result_type field for http_response input #2814

Merged

2 tasks

danielnelson closed this as completed Jun 7, 2017

danielnelson mentioned this issue Jun 8, 2017

Adding running parameter to stats #2687

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

http_response does not measure anything on dns-error #2784

http_response does not measure anything on dns-error #2784

froth commented May 11, 2017

danielnelson commented May 12, 2017

danielnelson commented May 16, 2017

phemmer commented May 17, 2017

froth commented May 17, 2017 •

edited

Loading

danielnelson commented May 18, 2017

froth commented May 18, 2017

danielnelson commented May 18, 2017

froth commented May 22, 2017

danielnelson commented May 22, 2017

froth commented May 23, 2017

danielnelson commented Jun 7, 2017

http_response does not measure anything on dns-error #2784

http_response does not measure anything on dns-error #2784

Comments

froth commented May 11, 2017

Bug report

Relevant telegraf.conf:

System info:

Steps to reproduce:

Expected behavior:

Actual behavior:

danielnelson commented May 12, 2017

danielnelson commented May 16, 2017

phemmer commented May 17, 2017

froth commented May 17, 2017 • edited Loading

danielnelson commented May 18, 2017

froth commented May 18, 2017

danielnelson commented May 18, 2017

froth commented May 22, 2017

danielnelson commented May 22, 2017

froth commented May 23, 2017

danielnelson commented Jun 7, 2017

froth commented May 17, 2017 •

edited

Loading