
service crash with panic: slice bounds out of range #3001

Closed
mattwwarren opened this issue Jul 10, 2017 · 10 comments · Fixed by #3003

@mattwwarren

Bug report

Relevant telegraf.conf:

Apologies if not all of this is relevant.

[global_tags]
  env = "prod"

[agent]
  interval = "10s"
  round_interval = false
  metric_batch_size = 2000
  metric_buffer_limit = 6000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "3s"
  precision = ""
  debug = false
  quiet = false
  hostname = ""
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://myinflux:8086"]
  database = "api"
  retention_policy = ""
  namedrop = ["ops_*", "logparser*"]
  write_consistency = "any"
  timeout = "5s"

[[outputs.influxdb]]
  urls = ["http://myinflux:8086"]
  database = "telegraf_prod"
  retention_policy = "telegraf_prod_default"
  namepass = ["ops_*", "logparser*"]
  write_consistency = "any"
  timeout = "5s"

[[inputs.logparser]]
  files = [
    "/var/log/httpd/*access.log",
    "/var/log/httpd/*combined.log",
    "/var/log/httpd/*forwarded.log",
  ]

  [inputs.logparser.grok]
  patterns = [
    "%{COMBINED_LOG_FORMAT_WITH_RESPONSE_TIME_US}"
  ]

### service input plugins ###
  custom_patterns = '''
  COMBINED_LOG_FORMAT_WITH_RESPONSE_TIME_US %{CLIENT:client_ip} %{NOTSPACE:ident} %{NOTSPACE:auth} \[%{HTTPDATE:ts:ts-httpd}\] "(?:%{WORD:verb:tag} %{NOTSPACE:request}(?: HTTP/%{NUMBER:http_version:float})?|%{DATA})" %{NUMBER:resp_code:int} (?:%{NUMBER:resp_bytes:int}|-) %{QS:referrer} %{QS:agent} %{NUMBER:response_time_us:int}
  '''
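
For reference, a synthetic log line this pattern is meant to match looks like the following; every value here is made up, but the shape is Apache's combined log format with the request time in microseconds appended as the final field:

10.20.0.208 - - [10/Jul/2017:14:23:01 +0000] "GET /v2/auth.json HTTP/1.1" 204 512 "https://example.com/start" "Mozilla/5.0 (X11; Linux x86_64)" 1133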

System info:

Telegraf v1.3.3 (git: release-1.3 46db92a)
Amazon Linux latest

Steps to reproduce:

  1. Set up a high-volume application with Apache logs running through Telegraf
  2. Wait?

Expected behavior:

Telegraf does not crash

Actual behavior:

After running for some number of hours, Telegraf crashes with panic: runtime error: slice bounds out of range.

Additional info:

goroutine 30 [running]:
github.com/influxdata/telegraf/metric.(*metric).Fields(0xc420789800, 0xc420e36fe0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/metric/metric.go:318 +0x5c8
github.com/influxdata/telegraf/plugins/inputs/logparser.(*LogParserPlugin).parser(0xc420328000)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/logparser/logparser.go:224 +0x25f
created by github.com/influxdata/telegraf/plugins/inputs/logparser.(*LogParserPlugin).Start
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/logparser/logparser.go:129 +0x56d

I glanced over the code but couldn't quite tell where the error bottomed out. I thought perhaps our custom log pattern was the source of the issue, but the only float field is the HTTP version, and they are all 1.1:

# cat /var/log/httpd/api.access.log | awk -F'HTTP/' '{print $2}' | awk -F'"' '{print $1}' | sort | uniq -c
 954368 1.1
@mattwwarren

Also, this panic may have been referenced in #2488, but the requested follow-up issue appears to have never been opened.

@danielnelson

Can you check if it panics again if you replay the same logs? You will probably need to use from_beginning = true.
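
For reference, that option goes in the logparser section of the config; a minimal sketch, reusing the file glob from the config above:

[[inputs.logparser]]
  files = ["/var/log/httpd/*access.log"]
  ## Read the file from the beginning instead of only tailing new lines,
  ## so the existing log is replayed through the parser.
  from_beginning = true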

@danielnelson

I'm making an instrumented build that will improve the logging here if it panics; it will be a little slower than 1.3.3 but otherwise the same. Would it be possible for you to use it until you get another crash?

@mattwwarren

I am trying to replay the same log from a different server with debug logging. I will report back with the results of that.

Since this is crashing pretty regularly for us, I would be happy to run a custom build.

@danielnelson

I added this code: 5a56291. I think this should be enough to determine whether the bug is in logparser or in the metrics code.

Here is a Linux amd64 build: https://7429-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.cD2V8ru/telegraf.gz
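
(The linked commit is not reproduced here. Judging by the panic message later in this thread, the instrumentation amounts to a recover that re-panics with the raw field data attached. A standalone Go sketch of that pattern, with a made-up parseFields and sample input:)

package main

import "fmt"

// parseFields is a made-up stand-in for the real field-parsing code, not the
// linked commit. The deferred recover only attaches the raw input to the panic
// message before re-panicking, so a crash report shows which data triggered it.
func parseFields(raw []byte) map[string]interface{} {
    defer func() {
        if r := recover(); r != nil {
            panic(fmt.Sprintf("Recovered in parseFields(); raw: %q (%v)", raw, r))
        }
    }()
    _ = raw[len(raw)] // deliberately index past the end to exercise the panic path
    return nil
}

func main() {
    parseFields([]byte(`resp_code=204i,http_version=1.1`))
}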

@mattwwarren

Thank you. I will try to get that installed (though perhaps not until tomorrow).

Some additional good news: I was able to recreate the panic on a second machine with the same log and the config pasted above.

@danielnelson

Great. As long as we can reproduce it, it shouldn't be too hard to fix.

@mattwwarren

mattwwarren commented Jul 10, 2017

I had just enough time to kick off a run before I left for the day.

panic: Recovered in metric.Fields(); m.fields: "agent=\"Mozilla/5.0 (Linux; Android 5.1; Bush Spira D2 5.5\\\\\" Smartphone Build/LMY47D) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36\",resp_code=204i,referrer=\"https://www.wanderu.com/en/tripsummary/GBNXP%2CGBCPIGBNXP%2CGBUBWGBNXP3%2C2017-07-13T01%3A40%3A00%2B01%3A00%2C2017-07-13T02%3A20%3A00%2B01%3A00%2C?query=anonymized\",response_time_us=1133i,client_ip=\"10.20.0.208\",auth=\"-\",ident=\"-\",request=\"/v2/auth.json\",http_version=1.1"
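
(Note that the agent value ends a quoted segment with an escaped quote: 5.5\\\" Smartphone. A scanner that skips escaped characters while looking for a closing quote can step past the end of the buffer on input like that. The following standalone Go sketch is not Telegraf's actual parsing code, only an illustration of how that produces this class of panic:)

package main

import "fmt"

// findStringEnd is a deliberately naive scanner: it looks for the closing
// quote of a quoted field value and skips escaped characters by jumping
// two bytes at a time.
func findStringEnd(s string, start int) int {
    i := start
    for i < len(s) {
        switch s[i] {
        case '\\':
            i += 2 // skip over the escaped character
        case '"':
            return i
        default:
            i++
        }
    }
    return i
}

func main() {
    // A field value that ends in an escaped quote: the scanner jumps over the
    // escape, never finds a closing quote, and returns an index past the end
    // of the string, so the slice below panics with "slice bounds out of range".
    raw := `agent="5.5\"`
    end := findStringEnd(raw, 7)
    fmt.Println(raw[7 : end+1])
}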

@mattwwarren

Right on! Thanks for the quick turnaround. I look forward to 1.3.4!

@danielnelson

I hope to have it out tomorrow. If you need a build of the 1.3 branch with the fix, you can use https://7437-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.XAvjpJX/telegraf.gz

Thanks for the help with this bug!

@danielnelson danielnelson added this to the 1.3.4 milestone Jul 11, 2017