[Elastic Agent] Default processors created per input can result in high agent CPU usage #35000
Comments
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
It does not seem like the processors are marked as global in #34149; they're just added to the config, right? We would need a way to distinguish them from the rest. However, I'm against using anything global. Global processors are very dangerous, as we learnt in #34761. Perhaps instead we should modify the processors themselves to better manage their connections (singleton connections shared by all the processors of this type) instead of hacking the global processors. This makes way more sense to me.
These processors were global in every release prior to 8.6. Global processors are a standard feature of Beats, and a commonly requested feature for agent (https://github.com/elastic/ingest-dev/issues/2442). They aren't something we can get away with not supporting. I agree that where the processors are instantiated shouldn't matter, but in this case we have a critical performance degradation that can easily be fixed by reverting to a known good configuration. This is significantly less risk and effort than rewriting each of the default add_x_metadata processors. I don't see a reason why we shouldn't take the easiest path and restore the behaviour we had before, even if we don't like the way these processors work architecturally.
This write-up might be helpful for the person working on this issue: #34716 (comment). Ignore the issue itself; the comment contains a lot of processor-related code that will help with navigation.
Background
Starting from 8.6 the default global processors for Beats that are run by agent are configured in code instead of being read from the default Beat configuration file. Beats managed by agent no longer read a configuration file at startup and instead wait for their initial configuration to be sent by agent. This change was done in #34149.
The implementation from #34149 makes the processors global by configuring them for each input run by Elastic Agent. For example, the beat-rendered-config.yml file available in the agent diagnostics shows the processors at the input level:
This is in contrast to the default configuration file, which defines a single instance of each processor for the process by defining them at the top level of the configuration (see Where are processors valid for details):
beats/x-pack/filebeat/filebeat.yml
Lines 167 to 172 in 91906c9
Problem
Similarly, in 8.6 the aws-s3 input was changed in #33658 to create a new beat.Client for each new SQS worker to improve performance. This results in a new input pipeline being constructed for each SQS worker, each of which gets its own instance of the per-input processors as of 8.6.
This was not a problem until 8.7, when it was discovered that each instance of a Beat input pipeline was referencing an accidentally global instance of the per-input processors. This was fixed in #34761. The change in #34761 now results in each input pipeline constructing a new instance of the global processors.
Each of these global processors is expensive to create and includes code that tries to perform the expensive work only at initialization time. The problem is that this only works if there is a single instance of the processor; otherwise each unique instance of the processor attempts to reinitialize itself, often performing the exact same request multiple times. For example:
For add_cloud_metadata, there was an inadvertent change that introduces a 3s worst-case construction cost ([libbeat] add_cloud_metadata - startup blocked by AWS IMSDv2 token fetch #33058).
Impact
As of 8.7 we are observing extremely high CPU usage for Beats run under agent, and for the agent itself, in situations where inputs are frequently created. For example, in the case of the add_cloud_metadata processor we are observing the agent itself being spammed by repeated log messages from the add_cloud_metadata initialization sequence:
This comes from the code below, which includes a sync.Once block that is being defeated by a new instance of the processor being created for each individual input:
beats/libbeat/processors/add_cloud_metadata/add_cloud_metadata.go
Lines 98 to 112 in 91906c9
Solution
When Beats run under agent, we need to create the default global processors at the Beat process level instead of the input level, to match what is done in the global configuration file.