Fix performance issues with processors scaling under agent #35031
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
This pull request does not have a backport label. To fixup this pull request, you need to add the backport labels for the needed branches.
Can you make it so that the default processors appear in the beat-rendered-config.yml dumped out in the diagnostics?
I don't see anything in here that would add them, and people expect that file to be a complete list of everything the beat is running.
```diff
@@ -15,8 +15,7 @@ import (
 )

 func filebeatCfg(rawIn *proto.UnitExpectedConfig, agentInfo *client.AgentInfo) ([]*reload.ConfigWithMeta, error) {
-	procs := defaultProcessors()
```
Probably worth moving the defaultProcessors function into the file that uses it, since it isn't actually used here anymore.
This comment applies to each of the beats.
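For illustration, colocating the defaults with their only caller might look roughly like this. This is a sketch: the processor names match the usual fleet defaults, but the package name, types, and raw-map config shape are assumptions, not the actual beats code.

```go
package fbcmd

// defaultProcessors returns this beat's fleet-mode global processors,
// defined next to the code that actually uses them. The raw map form
// is a stand-in for the real libbeat config types.
func defaultProcessors() []map[string]interface{} {
	return []map[string]interface{}{
		{"add_host_metadata": nil},
		{"add_cloud_metadata": nil},
		{"add_docker_metadata": nil},
		{"add_kubernetes_metadata": nil},
	}
}
```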
Yah, good point, was kinda unsure of how to clean those up.
I'm not sure there's a particularly "clean" way to do that anymore; there's no real state sharing between the central management components and the core beat anymore, at least not that I can see. The easiest way to do that would be to expand the
This sounds like a reasonable approach. This is always going to feel a bit ugly because we are dealing with global state far away from the management interface, but if we want proper input and output status reporting back to agent, we are going to be doing a lot more of this in the future regardless. I think it's important not to regress on what is visible in the diagnostics. The complete state of the Beat should be visible, or someone is extremely likely to waste time assuming it is complete when it is not.
Change looks good; I'm just thinking, is there a way to benchmark the pipeline? I'd like to see that perf actually improved after merging this.
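For a rough idea of what such a benchmark could look like, here is a minimal sketch using Go's `testing` package. Everything in it (`Event`, `Processor`, `newExpensiveProcessor`) is a stand-in, not the actual beats pipeline API; it just contrasts building a processor per input start (the old behavior) with sharing one instance (this PR).

```go
package pipeline_test

import "testing"

// Event and Processor stand in for beat.Event and processors.Processor.
type Event struct{ Fields map[string]interface{} }

type Processor interface {
	Run(*Event) (*Event, error)
}

type hostMetadata struct{ cache map[string]interface{} }

func (p hostMetadata) Run(e *Event) (*Event, error) {
	for k, v := range p.cache {
		e.Fields[k] = v
	}
	return e, nil
}

// newExpensiveProcessor simulates a processor with costly setup,
// e.g. collecting host metadata at construction time.
func newExpensiveProcessor() Processor {
	cache := make(map[string]interface{}, 1)
	for i := 0; i < 1024; i++ {
		cache["host.meta"] = i // placeholder for real metadata collection
	}
	return hostMetadata{cache: cache}
}

// Old behavior: construct the processor for every input start.
func BenchmarkPerInputChain(b *testing.B) {
	for i := 0; i < b.N; i++ {
		p := newExpensiveProcessor()
		if _, err := p.Run(&Event{Fields: map[string]interface{}{}}); err != nil {
			b.Fatal(err)
		}
	}
}

// New behavior: construct once, share across all inputs.
func BenchmarkSharedChain(b *testing.B) {
	p := newExpensiveProcessor()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := p.Run(&Event{Fields: map[string]interface{}{}}); err != nil {
			b.Fatal(err)
		}
	}
}
```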
@alexsapran Do your benchmarks have some processors applied? If yes, this should be easily testable.
I fear not. From what I understand, the test case here has many inputs with some processors, right?
So, my understanding (@cmacknz might be able to be more specific) is that the performance bottleneck occurs in cases where the beat creates large numbers of new inputs, which will happen most often with
And what is the performance indicator to identify the improvement? Is it EPS, or CPU?
@alexsapran I assume CPU; in my experience, just starting a beat with the default processor set results in a noticeable startup delay, so it probably won't be subtle.
Alright, added a diagnostics callback; it now reports the debug data that the individual processors make available:
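Roughly, the idea is a hook that serializes the active global processors into the diagnostics bundle, so they show up alongside the rendered config. A minimal sketch, assuming a registration surface modeled loosely on the elastic-agent client; the names and signature here are assumptions, not the exact API:

```go
package diag

import (
	"bytes"
	"fmt"
)

// Processor stands in for processors.Processor; beats processors
// expose a debug representation via String().
type Processor interface {
	fmt.Stringer
}

// DiagnosticRegistry models the hook-registration surface (assumed).
type DiagnosticRegistry interface {
	RegisterDiagnosticHook(name, description, filename, contentType string, hook func() []byte)
}

// registerProcessorDiagnostics reports the running global processors
// so the diagnostics dump reflects the complete state of the beat.
func registerProcessorDiagnostics(reg DiagnosticRegistry, procs []Processor) {
	reg.RegisterDiagnosticHook(
		"global_processors",
		"the currently running global processors",
		"global_processors.txt",
		"text/plain",
		func() []byte {
			var buf bytes.Buffer
			for _, p := range procs {
				fmt.Fprintln(&buf, p.String())
			}
			return buf.Bytes()
		},
	)
}
```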
Annoyingly, I can't seem to reproduce the test failures locally. Not sure if they're related.
/test
LGTM, but there is a comment regarding a debug message.
I'm also not quite sure how to test it. If I understood it correctly:

- If I run any Beat under Agent and there are no global processors defined in the policy, the default ones defined at `x-pack/filebeat/cmd/root.go:40` (for Filebeat) will be used.
- If any global processor is defined in the policy, those will be used.

Is that it?
@fearful-symmetry I also took a quick look at the failing tests; they seem to be failing due to some processors not being added when running a standalone Filebeat.

Comparing the output on main and on your branch:

Did you make sure to rebuild the test binary with
Just changing the review to 'request changes' because the failing tests seem to be related to this PR.
Ah, I didn't know that was a thing I needed to do; that might explain why I couldn't reproduce the failures locally.
Okay, FINALLY managed to reproduce the failures. Turns out I just misread the Jenkins UI and ran the wrong tests, argh.
Alright, let's see what that does...
Okay, did some very quick and unscientific tests: it looks like with the patch, Metricbeat under agent running 17 inputs uses about half the CPU?
I managed to run some benchmarks myself. I will touch base with @fearful-symmetry on the results, but from the initial reading there is indeed a difference in the metadata CPU usage, which results in higher EPS.
The changelog is missing, but I can't contribute on this fork, so I will add it separately.
Adding changelog entry
LGTM. I didn't manage to manually test it, but the system tests seem to have good coverage for this PR.
Fix performance issues with processors scaling under agent (#35031)

* fix performance issues with processors scaling under agent
* make linter happy
* fix test
* add comment
* move around defaultProcessors
* fix tests, add diagnostics
* fix default processor on filebeat
* change log line
* Update CHANGELOG.next.asciidoc: adding changelog entry

Co-authored-by: Pierre HILBERT <[email protected]>

Backport … under agent (#35066): cherry picked from commit ea1293f, with a conflict in libbeat/management/management.go resolved ("Fixing conflict").

Co-authored-by: Alex K <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
What does this PR do?
Fixes #35000
Related to #34149
This fixes a performance issue where we were previously starting agent-mode global processors per input, which could be expensive in cases where we have lots of inputs starting and stopping.
It does this by adding a `fleetDefaultProcessors` argument to the `MakeDefaultSupport` function that's used throughout the beats to instantiate processors. Under fleet mode, this function will now use the specified default global processors unless processors have been manually specified. This extra bit of logic allows us to disable global processors under fleet, which previously wasn't possible.

While I have tested this, there are a few conditions I haven't tested:
I'm not sure how to reliably test these two cases, so if someone wants to tell me, or test it themselves, go ahead.
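In sketch form, the selection logic described above looks something like the following. The types here are stand-ins, not the real libbeat signatures; only the `fleetDefaultProcessors` parameter and the precedence rule come from this PR's description.

```go
package management

// Processor and Config stand in for the real libbeat types.
type Processor interface{ String() string }

type Config struct {
	Processors []Processor // global processors defined in the agent policy
}

// DefaultProcessorsFactory builds the beat's default global processors
// once, instead of once per input.
type DefaultProcessorsFactory func() []Processor

// globalProcessors applies the precedence this PR introduces:
// policy-defined processors win; otherwise fall back to the beat's
// defaults. Passing a nil factory disables global processors under
// fleet entirely, which previously wasn't possible.
func globalProcessors(cfg *Config, fleetDefaultProcessors DefaultProcessorsFactory) []Processor {
	if len(cfg.Processors) > 0 {
		return cfg.Processors
	}
	if fleetDefaultProcessors == nil {
		return nil
	}
	return fleetDefaultProcessors()
}
```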
Why is it important?
This is a major performance issue.
Checklist
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.