Use filestream input as default for hints autodiscover. #36950
…c.logs/json* in hints to the ndjson parser of filestream
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
Changes lgtm!
Consider adding this to a manual testing phase for when the BCs are out.
    @@ -123,15 +123,19 @@ data:
     logs_path: "/var/log/containers/"

     # To enable hints based autodiscover, remove `filebeat.inputs` configuration and uncomment this:
    -#filebeat.autodiscover:
    +# filebeat.autodiscover:
is this extra space intentional?
By default, when I commented in and out the autodiscover block, it added this space. TBH it looks more readable
    @@ -19,15 +19,19 @@ data:
     logs_path: "/var/log/containers/"

     # To enable hints based autodiscover, remove `filebeat.inputs` configuration and uncomment this:
    -#filebeat.autodiscover:
    +# filebeat.autodiscover:
same: is this needed?
I would remove the spaces, because otherwise uncommenting this block can leave it with incorrect spacing.
@elastic/elastic-agent-data-plane team, as you are the code owners, could you review this PR?
Since we're already switching to another input type with a state loss, I'd highly recommend to use fingerprint file identity on Kubernetes. We have a lot of reports that inode values in containerized environments are not stable.
For details please refer to https://www.elastic.co/blog/introducing-filestream-fingerprint-mode
@rdner thanks for your comment. Although this new option is officially recommended to be used in cases where a customer is facing data loss or duplication, I understand the value of us setting it as default recommendation. I got a bit confused from the configuration documentation.
or should it be under |
@MichaelKatsoulis there are 2 things here:
The correct snippet would be something like this:

    - type: filestream
      id: kubernetes-container-logs-${data.kubernetes.pod.name}-${data.kubernetes.container.id}
      prospector:
        scanner:
          fingerprint.enabled: true
          symlinks: true
      file_identity.fingerprint: ~
      paths:
        - /var/log/containers/*-${data.kubernetes.container.id}.log
@rdner I played around with fingerprint using defaults in a local kind cluster and I get constant errors for most of the log files
So TBH I don't know whether setting different defaults makes sense, or whether we should leave it to the users to decide if they want this feature.
@MichaelKatsoulis but what's the issue with the message? It clearly communicates what's happening, and it will pick up the file once it grows in size. We're talking about the choice between a non-working file identity that leads to data duplication and data loss, and a working file identity that addresses this issue. We have quite a high number of support tickets related to this on Kubernetes; the fingerprint file identity was created to address it.
@MichaelKatsoulis by the way, these messages are not errors. They're warnings, so the customer knows why their files are not being ingested yet.
@rdner Those logs confused me because, on top of them, I could not find the logs of some test pods I had running. That turned out to be because their log files were too small, which was expected: these are services that log nothing unless they are used (like Redis or nginx). I updated my PR accordingly. Could you take a final look?
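To make the warnings discussed in this thread concrete: the fingerprint scanner hashes a fixed byte range at the start of each file, so a file shorter than that range is skipped until it grows. The fragment below is a minimal sketch of the relevant scanner settings. The `offset` and `length` values are written out on the assumption that they match the documented defaults (0 and 1024 bytes); verify them against the filestream documentation for the version in use.

```yaml
# Sketch only: scanner settings behind the "file too small" warnings.
# offset/length are assumed to be the documented defaults, shown explicitly.
prospector:
  scanner:
    symlinks: true        # /var/log/containers/*.log entries are symlinks
    fingerprint:
      enabled: true
      offset: 0           # start hashing at the beginning of the file
      length: 1024        # files smaller than offset+length are skipped until they grow
file_identity.fingerprint: ~
```

With these values, a container that has written only a few lines is not ingested yet, which matches the behaviour seen with the idle Redis and nginx pods.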
    if inputType == harvester.FilestreamType {
        // json options should be under ndjson parser in filestream input
        parsersTempCfg := []mapstr.M{}
        ndjsonTempCfg := mapstr.M{}
Should we check if this is empty before calling next line?
Ignore this. I just realised that those are empty mapstr.M
    json.add_error_key: "true"
    -----

    NOTE: `keys_under_root` json option of `log` input is replaced with `target` option in filestream input. Read the documentation on how to use it correctly.
We should put a link to filestream input here
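As an illustration of the NOTE in the hunk above: in the filestream `ndjson` parser, an empty `target` places the decoded keys at the root of the event, which is the closest equivalent of `keys_under_root: true` on the `log` input. This is a minimal sketch of the parser fragment only; the `add_error_key` line is included as an assumption, to mirror the hint shown in the hunk.

```yaml
# Sketch only: filestream counterpart of `json.keys_under_root: true`.
parsers:
  - ndjson:
      target: ""            # empty target: decoded keys go to the event root
      add_error_key: true   # assumption, mirrors the json.add_error_key hint above
```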
Co-authored-by: Andrew Gizas <[email protected]>
    @@ -112,9 +112,16 @@ metadata:
     data:
       filebeat.yml: |-
         filebeat.inputs:
    -    - type: container
    +    - type: filestream
@MichaelKatsoulis shouldn't an `id` be defined in the input here? Or will it be generated automatically in that case?
The docs say:
> Each filestream input must have a unique ID. Omitting or changing the filestream ID may cause data duplication. Without a unique ID, filestream is unable to correctly track the state of files.

So for all files matching `/var/log/containers/*.log` we have one filestream with a unique id, correct? Do you know what that implies compared to autodiscover, where a dedicated filestream is created per container?
@tetianakravchenko Yes, an id will be generated automatically. When `filebeat.inputs` is used instead of autodiscover, there is a single `filestream` input watching all files in the path. When autodiscover is used, there is one input per discovered container, each watching a single log file.
For metadata in the first scenario, the processor is used; it requires the `logs_path` matcher so that it can extract the container id from the log file name and add that container's metadata.
In the autodiscover case, the metadata is added by the kubernetes provider.
So yes, we have one filestream with one id for all the log collection. Both options work just fine, but with the first approach we cannot enable hints.
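The static (non-autodiscover) variant described in this reply can be sketched as follows: one `filestream` input for every container log on the node, with `add_kubernetes_metadata` deriving the container id from the file path. This is an illustrative fragment, not the manifest from the PR. The explicit `id` value is an assumption added here because the filestream documentation quoted above asks for a unique ID.

```yaml
# Sketch only: a single filestream input for all container logs on the node.
filebeat.inputs:
  - type: filestream
    id: kubernetes-container-logs      # assumption: explicit id, not taken from the PR
    paths:
      - /var/log/containers/*.log
    parsers:
      - container: ~                   # decode the container runtime log format
    prospector:
      scanner:
        fingerprint.enabled: true
        symlinks: true
    file_identity.fingerprint: ~
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
```

Because every file shares this one input, per-pod hints cannot be applied to it; that requires the autodiscover provider, which creates a separate input per container.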
* Use filestream input as default for hints autodiscover. Map co.elastic.logs/json* in hints to the ndjson parser of filestream
* Update filebeat-kubernetes.yaml
* Map co.elastic.logs/multiline.* hints to multiline parser of filestream input
* Update documentation
* Use file_identity.fingerprint as default way of file unique id creation

---------

Co-authored-by: Andrew Gizas <[email protected]>
What does this PR do
This PR is the code resolution of issue #35984.
It updates the Filebeat hints autodiscover config and the proposed Filebeat k8s manifest (`filebeat-kubernetes.yaml`) to use the `filestream` input instead of the `container` input for `hints.default_config`.
It allows users to keep using the same `co.elastic.logs/*` hints in pod annotations by:
- Mapping `co.elastic.logs/json*` hints to the ndjson parser in case of filestream.
- Mapping `co.elastic.logs/multiline*` hints to the multiline parser in case of filestream.

Users can still choose the `container` input in `hints.default_config`. Everything will work as before in that case.
Example
User has the following filebeat.yml configuration with hints autodiscover enabled and filestream set as hints.default_config
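A minimal sketch of such a configuration, assuming the fingerprint file identity recommended in the review discussion and a `NODE_NAME` environment variable set on the Filebeat pod. It is not necessarily the exact snippet from the PR description.

```yaml
# Sketch only: hints autodiscover with filestream as the default input.
filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}               # assumption: provided via the pod spec
      hints.enabled: true
      hints.default_config:
        type: filestream
        id: kubernetes-container-logs-${data.kubernetes.pod.name}-${data.kubernetes.container.id}
        paths:
          - /var/log/containers/*-${data.kubernetes.container.id}.log
        parsers:
          - container: ~
        prospector:
          scanner:
            fingerprint.enabled: true
            symlinks: true
        file_identity.fingerprint: ~
```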
User sets the following hints in the Filebeat pods' annotations
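For illustration, annotations of this kind might look like the fragment below. The specific keys and values are assumptions chosen to exercise both the `json` and `multiline` mappings; they are not taken from the PR description.

```yaml
# Sketch only: hypothetical pod annotations using co.elastic.logs/* hints.
metadata:
  annotations:
    co.elastic.logs/json.message_key: "log"
    co.elastic.logs/json.add_error_key: "true"
    co.elastic.logs/multiline.pattern: '^\['
    co.elastic.logs/multiline.negate: "true"
    co.elastic.logs/multiline.match: "after"
```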
The produced configuration for filebeat pod should look like this:
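A hedged sketch of the shape of that result, assuming hypothetical `json.message_key`, `json.add_error_key` and `multiline.*` hints on the pod. Only the hint-derived parsers are shown; the `<pod-name>` and `<container-id>` placeholders stand for the resolved values, and how the generated parsers combine with any `parsers` already present in `hints.default_config` is not shown here.

```yaml
# Sketch only: json.* hints land under the ndjson parser and
# multiline.* hints under the multiline parser of the filestream input.
- type: filestream
  id: kubernetes-container-logs-<pod-name>-<container-id>   # placeholders
  paths:
    - /var/log/containers/*-<container-id>.log
  parsers:
    - ndjson:
        message_key: log
        add_error_key: "true"
    - multiline:
        pattern: '^\['
        negate: "true"
        match: after
```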
IMPORTANT NOTE:
Because the default input type changes, a user already running Filebeat with the `container` input will experience Filebeat state loss. As a result, all files available at that moment will be re-ingested.
Checklist
- `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

How to test this PR locally
1. `PLATFORMS=linux/amd64 TYPES=docker mage package`
2. `cd build/package/filebeat-oss/filebeat-oss-linux-amd64.docker/docker-build && docker build -t myfilebeat .`
3. `kind load docker-image myfilebeat`
4. Edit `beats/deploy/kubernetes/filebeat-kubernetes.yaml` as in the example section.
5. `kubectl apply -f beats/deploy/kubernetes/filebeat-kubernetes.yaml`
Related issues
Use cases
Screenshots