
emit transaction failed: error_class=NoMethodError error="undefined method `[]' for nil:NilClass" #54

Closed
emptywee opened this issue Apr 1, 2018 · 18 comments

Comments

@emptywee
Contributor

emptywee commented Apr 1, 2018

After upgrading from v1.6 to v1.8, we began to receive the following errors:

fluentd-2tl8n fluentd 2018-04-01 18:44:35 +0000 [warn]: #0 emit transaction failed: error_class=NoMethodError error="undefined method `[]' for nil:NilClass" tag="containers.mnt.log.containers.calico-node-hxmb6_kube-system_calico-node-703da71b5a5aa7a72c4eef2653357809f11cb03bfb6fab74daa7e9db92b4acc3.log"
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:297:in `filter_stream_from_files'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/event_router.rb:177:in `block in filter_stream'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/event_router.rb:177:in `each'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/event_router.rb:177:in `reduce'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/event_router.rb:177:in `filter_stream'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/event_router.rb:158:in `emit_events'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/event_router.rb:96:in `emit_stream'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_relabel.rb:29:in `process'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:721:in `emit_sync'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/event_router.rb:159:in `emit_events'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/event_router.rb:96:in `emit_stream'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/in_tail.rb:348:in `receive_lines'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/in_tail.rb:465:in `wrap_receive_lines'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/in_tail.rb:666:in `block in on_notify'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/in_tail.rb:707:in `with_io'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/in_tail.rb:644:in `on_notify'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/in_tail.rb:496:in `on_notify'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/timer.rb:77:in `on_timer'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/cool.io-1.5.0/lib/cool.io/loop.rb:88:in `run_once'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/cool.io-1.5.0/lib/cool.io/loop.rb:88:in `run'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/event_loop.rb:84:in `block in start'
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
fluentd-2tl8n fluentd 2018-04-01 18:44:35 +0000 [warn]: #0 emit transaction failed: error_class=NoMethodError error="undefined method `[]' for nil:NilClass" tag="containers.mnt.log.containers.calico-node-hxmb6_kube-system_calico-node-703da71b5a5aa7a72c4eef2653357809f11cb03bfb6fab74daa7e9db92b4acc3.log"
fluentd-2tl8n fluentd   2018-04-01 18:44:35 +0000 [warn]: #0 suppressed same stacktrace

Any clues what that might be?

Thank you.

@frankreno
Contributor

@emptywee: can you provide the following?

  1. What version of k8s?
  2. Have you modified the environment variables in the Daemonset or are you running with the defaults?
  3. Is it consistent or intermittent? Is it all pods or just some?

@emptywee
Contributor Author

emptywee commented Apr 4, 2018

@frankreno absolutely.

  1. I tried it on Kubernetes 1.9.6
  2. I did, but nothing changed between v1.6 and v1.8; here they are anyway:
      containers:
        - image: sumologic/fluentd-kubernetes-sumologic:v1.6
          name: fluentd
          resources:
            requests:
              memory: "1.5Gi"
            limits:
              memory: "2Gi"
          imagePullPolicy: IfNotPresent
          volumeMounts:
          - name: host-logs
            mountPath: /mnt/log/
            readOnly: true
          - name: host-logs
            mountPath: /var/log/
            readOnly: true
          - name: docker-logs
            mountPath: /var/lib/docker/
            readOnly: true
          - name: pos-files
            mountPath: /mnt/pos/
          - name: fluentd-confd-file
            mountPath: /fluentd/conf.d/file
          env:
          - name: DUMMY
            value: DUMMY2
          - name: FLUENTD_OPT
            value: ""
          - name: COLLECTOR_URL
            valueFrom:
              secretKeyRef:
                name: sumologic
                key: collector-url
          - name: EXCLUDE_NAMESPACE_REGEX
            value: "(dev|st2|qa|spinnaker|kube-system)"
          - name: MULTILINE_START_REGEXP
            value: '/^(\[)*\d{2,4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2}/'
          - name: NUM_THREADS 
            value: "2"
          - name: READ_FROM_HEAD
            value: "false"
          - name: FLUSH_INTERVAL
            value: "30s"
  3. It is consistent and feels like it's thrown for every log file. I posted only an excerpt from the fluentd logs to show the stack trace, but it happens for logs from other pods as well; for those I'm only seeing "suppressed same stacktrace".

I can temporarily upgrade again and post more of them, but they are exactly the same, just the tag changes.

Please let me know if you require more details; I'd be more than happy to provide them.
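
For what it's worth, here is a quick Ruby check of what that MULTILINE_START_REGEXP matches (purely illustrative, not part of the manifest): lines that start with a timestamp begin a new event, and anything else should be treated as a continuation of the previous one:

```ruby
# Illustrative only: what the MULTILINE_START_REGEXP above matches.
start_regexp = /^(\[)*\d{2,4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2}/

puts !!("2018-04-01 18:44:35 some app message" =~ start_regexp)    # => true  (new event)
puts !!("[2018-04-01 18:44:35] bracketed variant" =~ start_regexp) # => true  (new event)
puts !!("    at com.example.Foo.bar(Foo.java:42)" =~ start_regexp) # => false (continuation line)
```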

@frankreno
Contributor

@emptywee I am still looking into this and will let you know what I find out ASAP. The good news is that I can reproduce it, so I can now start to debug and figure out the issue.

@frankreno
Contributor

Just wanted to provide an update: the stack trace is not coming from this plugin itself but from one of the plugins we package with it.

https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/master/lib/fluent/plugin/filter_kubernetes_metadata.rb

Based on the error, I suspect `es` is nil at line 297. I still need to determine why this is happening; the issue seems to have been introduced with commit 7b10030.
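
For reference, a trivially reduced Ruby example (illustrative only, not the plugin's code) produces exactly the error class and message from the stack trace, which is why a nil `es` is my main suspect:

```ruby
# Calling [] on nil is exactly what produces the message in the stack trace above.
es = nil
begin
  es[0]
rescue NoMethodError => e
  puts e.message   # => undefined method `[]' for nil:NilClass  (on Ruby 2.3, as in the gem paths above)
end
```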

@emptywee
Contributor Author

Thanks for the update. I used to add something similar to my configs (@timeout label) in v1.6 and it worked fine.

@frankreno
Contributor

Yeah, it is definitely odd, but in the environment I have, the stack trace was never thrown before adding the @timeout label to the config. Just to confirm: you never saw this stack trace before v1.8, correct? I'm going to add some additional logging to the metadata filter and deploy a custom image with it to see what more insight I can get.

@frankreno
Contributor

OK, so these logs are only happening when the concat plugin kicks in. In my environment, I am running with mostly default values. The default multiline pattern for the concat plugin is Julian dates, and I have one pod whose logging matches that pattern. Every time there is a multiline event in that pod, this stack trace is thrown:

[screenshot (2018-04-13 8:46 am): the concatenated multiline message as it arrives in Sumo]

The above is the final product sent to Sumo. If we go look at the pod logs directly, we can see that this is the concatenation of 2 log lines.

[screenshot (2018-04-13 8:48 am): the same event in the pod's logs, showing the two original log lines]

And the stack trace is occurring at the exact same time; we get two because there are two messages being concatenated.

[screenshot (2018-04-13 8:49 am): the fluentd stack traces logged at the same time, one per concatenated message]

The good news is that this appears to be benign, other than generating noise in the logs. No data is being dropped, and the data makes it into Sumo concatenated together as expected.
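
To make the mechanics concrete, here is a toy model of the concatenation (illustrative only, not the actual fluent-plugin-concat code): lines that do not match the start pattern are folded into the previous event, so the two raw lines above become a single event downstream:

```ruby
# Toy concatenator: lines matching START begin a new event; anything else is
# treated as a continuation of the previous event.
START = /^\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}/

def concat(lines, buffer)
  completed = []
  lines.each do |line|
    if line =~ START
      completed << buffer.join("\n") unless buffer.empty?
      buffer.clear
    end
    buffer << line
  end
  completed
end

buffer = []
p concat(["2018-04-13 08:46:42 first line of a multiline event"], buffer) # => []  (still buffering)
p concat(["  continuation line"], buffer)                                 # => []  (still buffering)
p concat(["2018-04-13 08:48:12 next event starts"], buffer)               # => one concatenated event
```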

@emptywee can you confirm that you are seeing the same behavior, i.e. that the stack traces occur while multi-line log messages are being ingested and that the behavior above mirrors what you are seeing?

@emptywee
Contributor Author

I'll check whether any logs are being lost. It's kinda tricky, since we have a really busy cluster and I need to find a way to make sure I am seeing what you are seeing :)

@frankreno
Contributor

Totally get it. I am going to try a few things in parallel and let you know what happens.

@frankreno
Contributor

I think I have a fix; it is working locally, but I want to let it run overnight to ensure all is good. The fix will not be to this repo but to the repo we depend on, so the timeline will depend on when they can merge. If all continues to work as expected, I will submit a PR to that library. We will leave this issue open for tracking until it is accepted and we can upgrade that library.

@emptywee
Contributor Author

So it's really a fix? Not just a suppression of the error message?

@frankreno
Contributor

I traced the issue. When these stack traces appear, the event stream is actually empty in the k8s metadata filter plugin, which is downstream of the concat plugin; I think it is a symptom of the concat plugin merging messages together. Every time I get the stack trace you are seeing, the stream is empty (I let it run with some debug logging for 24 hours and confirmed that every time it happened, the stream was empty). The k8s metadata filter never expects an empty stream, so when it gets one, it raises this error. So I submitted a PR that ensures an empty stream is not processed (as there is nothing to process anyway).
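
The guard itself is simple. As a minimal, self-contained sketch (hypothetical names, not the plugin's actual code), the idea is just to pass an empty stream through untouched instead of trying to enrich records that are not there:

```ruby
# Model of the fix: skip enrichment entirely when there is nothing in the stream.
def enrich_with_metadata(events)
  return events if events.nil? || events.empty?   # empty stream: pass through, no error

  events.map do |time, record|
    # Placeholder enrichment standing in for the real Kubernetes metadata lookup.
    [time, record.merge('kubernetes' => { 'namespace_name' => 'example-ns' })]
  end
end

p enrich_with_metadata([])                                        # => []  (nothing to process)
p enrich_with_metadata([[Time.now.to_i, { 'log' => 'hello' }]])   # => one enriched record
```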

@frankreno
Contributor

Update: the fix was merged in the other repo. We are waiting on a new gem to be published with the fix; then I will update the version we use and push a new image. Hang tight, almost there.

@frankreno
Contributor

@emptywee - I've pushed v1.11, which addressed the issue in my testing. Before I push the latest tag, want to give it a go?

@emptywee
Contributor Author

Absolutely. I will do it now and post results!

@emptywee
Contributor Author

@frankreno everything looks solid. No issues have been observed so far, so I think we can close this one. If anything comes up in the near future, I'll open a new issue, or we'll re-open this one if it's related.

@emptywee
Contributor Author

Thank you for fixing this 👍

@frankreno
Contributor

Great, glad to hear and happy to help!
