
[filebeat] Sometimes Pod logs are not collected by Filebeat #17396

Closed
vasrem opened this issue Apr 1, 2020 · 5 comments · Fixed by #20084
Labels: autodiscovery, bug, Team:Platforms (Integrations - Platforms team)

Comments


vasrem commented Apr 1, 2020

As @mkirkevo already observed, Filebeat sometimes doesn't collect some of the logs when a pod is going into the Terminating state. I tried to reproduce it and I think I found at least one case in which this can happen: a pod handles the SIGTERM sent when someone deletes it, sleeps for some time, and in the end writes another log entry. As you can see in my findings, the log lines written after SIGTERM was triggered never reach our backend (in this case Elasticsearch), but we can still see them using kubectl. As far as I can tell, the harvester is not closed prematurely.

It might be related to #14259 but I'm not sure as I don't know the codebase.

Running configuration:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      hints.enabled: true
      templates:
      - condition:
          regexp:
            kubernetes.pod.name: ".*"
        config:
        - type: container
          paths:
            - /var/lib/docker/containers/${data.kubernetes.container.id}/*.log
          fields:
            kubernetes.cluster.type: eks
          fields_under_root: true

Steps to reproduce:
To reproduce it you need a running Kubernetes cluster with Filebeat up and running using the configuration above (please also configure an output; we use Elasticsearch as the backend, as mentioned above).
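
(For completeness, a minimal output section could look like the sketch below; the host and credentials are placeholders for illustration, not the actual setup used here. Any working output is enough to reproduce.)

# Minimal output sketch (placeholder host and credentials, not the
# setup used in this issue); any configured output will do.
output.elasticsearch:
  hosts: ["https://elasticsearch.example.com:9200"]
  username: "elastic"
  password: "${ES_PASSWORD}"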

  1. Apply this:
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  labels:
    app: test
spec:
  containers:
  - name: test
    image: ubuntu:latest
    command:
    - bash
    - -c
    - |
      #!/bin/bash
      trap "{ echo '$(date): SIGTERM triggered'; sleep 5; echo '$(date): Bye bye'; sleep 2; echo '$(date): bye'; exit 1; }" SIGINT SIGTERM
      echo "$(date): started the process"

      while :
      do
        echo "$(date): sleeping 5 seconds"
        sleep 5
      done

This is just a long-running pod that will not be killed immediately after you run kubectl delete pod test-pod.
2. Check the logs of this pod using kubectl logs -f test-pod in a new shell.
3. Kill the pod by running kubectl delete pod test-pod.
4. Check your backend (depending on what you have set in the Filebeat output configuration).

You will see something like this (please ignore the wrong timestamps after SIGTERM triggered; I didn't invest much time in figuring out how to provide a correct date inside the trap):

$ kubectl logs -f test-pod
Wed Apr  1 12:03:30 UTC 2020: started the process
Wed Apr  1 12:03:30 UTC 2020: sleeping 5 seconds
Wed Apr  1 12:03:35 UTC 2020: sleeping 5 seconds
Wed Apr  1 12:03:40 UTC 2020: sleeping 5 seconds
Wed Apr  1 12:03:45 UTC 2020: sleeping 5 seconds
Wed Apr  1 12:03:30 UTC 2020: SIGTERM triggered
Wed Apr  1 12:03:30 UTC 2020: Bye bye
Wed Apr  1 12:03:30 UTC 2020: bye
rpc error: code = Unknown desc = Error: No such container: d5bfd58d4f473a991f88b016079a75c9f96c4bcf0117a88798d641dfc3732c91

[Screenshot from 2020-04-01 14-05-42]

In the meantime, I've checked the Filebeat logs and it seems that the harvester is terminated only after the pod has finished completely. You can see that the last message arrived at 14:03:50.853, and if you add the 5+2 seconds from the trap mechanism you end up close to the timestamp at which the file was removed.

2020-04-01T12:03:31.996Z        INFO    log/input.go:152        Configured paths: [/var/lib/docker/containers/d5bfd58d4f473a991f88b016079a75c9f96c4bcf0117a88798d641dfc3732c91/*.log]
2020-04-01T12:03:31.996Z        INFO    input/input.go:114      Starting input of type: container; ID: 809693904145508800
2020-04-01T12:03:31.997Z        INFO    log/harvester.go:297    Harvester started for file: /var/lib/docker/containers/d5bfd58d4f473a991f88b016079a75c9f96c4bcf0117a88798d641dfc3732c91/d5bfd58d4f473a991f88b016079a75c9f96c4bcf0117a88798d641dfc3732c91-json.log
2020-04-01T12:03:59.000Z        INFO    log/harvester.go:316    File was removed: /var/lib/docker/containers/d5bfd58d4f473a991f88b016079a75c9f96c4bcf0117a88798d641dfc3732c91/d5bfd58d4f473a991f88b016079a75c9f96c4bcf0117a88798d641dfc3732c91-json.log. Closing because close_removed is enabled.
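
(Side note: that last line points at the close_removed behaviour of the log-based inputs. As a possible stop-gap, not a fix for the underlying autodiscover issue, the harvester can be told to hang on to deleted files for a while. The sketch below is illustrative only, assuming the container input accepts the standard log-input close_* / clean_* options; the values are arbitrary.)

        - type: container
          paths:
            - /var/lib/docker/containers/${data.kubernetes.container.id}/*.log
          # Illustrative stop-gap: keep reading after the file is deleted so the
          # lines written during shutdown can still be shipped.
          close_removed: false
          clean_removed: false   # keep registry cleanup consistent with close_removed: false
          close_timeout: 5m      # still release the file handle eventually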
@ChrsMark added the autodiscovery, bug, and Team:Platforms labels on Apr 2, 2020
@elasticmachine (Collaborator)

Pinging @elastic/integrations-platforms (Team:Platforms)


mkirkevo commented Apr 2, 2020

We eventually just put a buffer timer in our app to wait xx seconds (currently set to 10) after the app has completed all processing upon receiving the Kubernetes SIGTERM. This allows any harvesting to complete before the pod is terminated, while the app still terminates within the default Kubernetes 30-second grace period. The only line of logging we miss with this is the very last line from the app acknowledging shutdown with a "Shutting down" INFO message.

Not the most ideal solution, but it works for our current needs; going forward it is a much larger problem altogether for high-transaction apps deployed to Kubernetes.
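
(For illustration, the same idea expressed in the style of the reproduction manifest above: the SIGTERM handler finishes its work and then sleeps before exiting, so the container's log file outlives the shutdown long enough for the harvester to catch up. The pod name, messages, and the 10-second value below are placeholders, not the actual application.)

apiVersion: v1
kind: Pod
metadata:
  name: test-pod-delayed-exit   # placeholder name
spec:
  terminationGracePeriodSeconds: 30   # must stay larger than the extra sleep below
  containers:
  - name: test
    image: ubuntu:latest
    command:
    - bash
    - -c
    - |
      # On SIGTERM: log, finish any remaining work here, then sleep before
      # exiting so Filebeat has time to ship the remaining lines.
      trap 'echo "shutting down"; sleep 10; exit 0' TERM
      echo "started"
      while :; do echo "working"; sleep 5; done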


vasrem commented May 4, 2020

Is there an update on this topic?

@blakerouse removed their assignment on May 15, 2020
@gbates101

We're having the same issue, but in our case it's not collecting the stack traces that led to a pod restart. Despite having the stack trace printed in my terminal (from tailing the pod logs), I can't find the same message when searching in Kibana. We're at a point now where we can't reproduce bugs because the stack traces that document them just don't exist.

Until this issue is resolved, I really can't recommend using filebeat for forwarding important logs. I'll likely write a new fluentd config and switch back to using it.

@ChrsMark (Member)

Hey folks! Sorry for the delay here.

I just opened a patch PR for this which hopefully fixes the issue: #20084
