Rate limit "Cannot index event" log messages #40157
Comments
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
@pierrehilbert bumping the priority on this one as it recently had an impact on some users. |
@cmacknz, one question, should the report summarise how many events per status code? E.g.:
|
It's useful if it is easy to do; if it adds significant complexity I wouldn't bother. The status code will be in the event logs. |
After speaking with @pierrehilbert, we would love to celebrate the benefits of the outcomes from this issue. Are we able to quantify the reduction in events/logs sent? |
Yes, if we have access to old logs, we can quantify it. I was actually thinking about quantifying it as well and adding it to the PR, but I got busy with other tasks. Let me try to make a quick and rough estimate. |
I came across this error today in a Metricbeat log and am wondering where to find the mentioned event log? |
You need to be on 8.15+, by default they are next to the regular logs. If you are below 8.15, you can see the cause by turning on debug logging in the regular log files. See beats/metricbeat/metricbeat.reference.yml Line 2458 in e345f28
|
beats/libbeat/outputs/elasticsearch/client.go
Lines 487 to 491 in 032a4cf
The "Cannot index event" log messages are a useful signal in the logs that events are being dropped and (as of 8.15.0) you should look at the local event log for the reason.
Since this log message does not contain any useful debugging information, and has the potential to be generated for every event that flows through the pipeline, there is no value in logging it for each event.
Instead we should rate limit it so that it only appears once in a fixed interval when events are being dropped. The rate limit is initially proposed to be one message every 10 seconds.
The rate limited message should include the number of events that were dropped in the current interval. The message can be changed to something like "Failed to index N events in last M seconds. Look at the event log to view the events and cause."