Fix performance issues with processors scaling under agent #35031
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
This pull request does not have a backport label. To fixup this pull request, you need to add the backport labels for the needed branches.
Can you make it so that the default processors appear in the beat-rendered-config.yml dumped out in the diagnostics?
I don't see anything in here that would add them, and people expect that file to be a complete list of everything the beat is running.
```diff
@@ -15,8 +15,7 @@ import (
 )

 func filebeatCfg(rawIn *proto.UnitExpectedConfig, agentInfo *client.AgentInfo) ([]*reload.ConfigWithMeta, error) {
-	procs := defaultProcessors()
```
Probably worth moving the defaultProcessors function into the file that uses it, since it isn't actually used here anymore.
This comment applies to each of the beats.
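For illustration, colocating the defaults with their only caller might look roughly like this. This is a sketch: the processor names match the usual fleet defaults, but the package name, types, and raw-map config shape are assumptions, not the actual beats code.

```go
package fbcmd

// defaultProcessors returns this beat's fleet-mode global processors,
// defined next to the code that actually uses them. The raw map form
// is a stand-in for the real libbeat config types.
func defaultProcessors() []map[string]interface{} {
	return []map[string]interface{}{
		{"add_host_metadata": nil},
		{"add_cloud_metadata": nil},
		{"add_docker_metadata": nil},
		{"add_kubernetes_metadata": nil},
	}
}
```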
Yah, good point, was kinda unsure of how to clean those up.
I'm not sure there's a particularly "clean" way to do that anymore; there's no real state sharing between the central management components and the core beat anymore, at least not that I can see. The easiest way to do that would be to expand the
This sounds like a reasonable approach. This is always going to feel a bit ugly because we are dealing with global state far away from the management interface, but if we want proper input and output status reporting back to agent, we are going to be doing a lot more of this in the future regardless. I think it's important not to regress on what is visible in the diagnostics. The complete state of the Beat should be visible, or someone is extremely likely to waste time assuming it is complete when it is not.
Change looks good; I'm just thinking, is there a way to benchmark the pipeline? I'd like to see that perf actually improved after merging this.
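For a rough idea of what such a benchmark could look like, here is a minimal sketch using Go's `testing` package. Everything in it (`Event`, `Processor`, `newExpensiveProcessor`) is a stand-in, not the actual beats pipeline API; it just contrasts building a processor per input start (the old behavior) with sharing one instance (this PR).

```go
package pipeline_test

import "testing"

// Event and Processor stand in for beat.Event and processors.Processor.
type Event struct{ Fields map[string]interface{} }

type Processor interface {
	Run(*Event) (*Event, error)
}

type hostMetadata struct{ cache map[string]interface{} }

func (p hostMetadata) Run(e *Event) (*Event, error) {
	for k, v := range p.cache {
		e.Fields[k] = v
	}
	return e, nil
}

// newExpensiveProcessor simulates a processor with costly setup,
// e.g. collecting host metadata at construction time.
func newExpensiveProcessor() Processor {
	cache := make(map[string]interface{}, 1)
	for i := 0; i < 1024; i++ {
		cache["host.meta"] = i // placeholder for real metadata collection
	}
	return hostMetadata{cache: cache}
}

// Old behavior: construct the processor for every input start.
func BenchmarkPerInputChain(b *testing.B) {
	for i := 0; i < b.N; i++ {
		p := newExpensiveProcessor()
		if _, err := p.Run(&Event{Fields: map[string]interface{}{}}); err != nil {
			b.Fatal(err)
		}
	}
}

// New behavior: construct once, share across all inputs.
func BenchmarkSharedChain(b *testing.B) {
	p := newExpensiveProcessor()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := p.Run(&Event{Fields: map[string]interface{}{}}); err != nil {
			b.Fatal(err)
		}
	}
}
```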
@alexsapran Do your benchmarks have some processors applied? If yes, this should be easily testable.
I fear not. From what I understand, the test case here has many inputs with some processors, right?
So, my understanding (@cmacknz might be able to be more specific) is that the performance bottleneck occurs in cases where the beat creates large numbers of new inputs, which will happen most often with
And what is the performance indicator to identify the improvement? Is it EPS, or CPU?
@alexsapran I assume CPU; in my experience, just starting a beat with the default processor set results in a noticeable startup delay, so it probably won't be subtle.
Alright, added a diagnostics callback; it now reports the debug data that the individual processors make available:
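Roughly, the idea is a hook that serializes the active global processors into the diagnostics bundle, so they show up alongside the rendered config. A minimal sketch, assuming a registration surface modeled loosely on the elastic-agent client; the names and signature here are assumptions, not the exact API:

```go
package diag

import (
	"bytes"
	"fmt"
)

// Processor stands in for processors.Processor; beats processors
// expose a debug representation via String().
type Processor interface {
	fmt.Stringer
}

// DiagnosticRegistry models the hook-registration surface (assumed).
type DiagnosticRegistry interface {
	RegisterDiagnosticHook(name, description, filename, contentType string, hook func() []byte)
}

// registerProcessorDiagnostics reports the running global processors
// so the diagnostics dump reflects the complete state of the beat.
func registerProcessorDiagnostics(reg DiagnosticRegistry, procs []Processor) {
	reg.RegisterDiagnosticHook(
		"global_processors",
		"the currently running global processors",
		"global_processors.txt",
		"text/plain",
		func() []byte {
			var buf bytes.Buffer
			for _, p := range procs {
				fmt.Fprintln(&buf, p.String())
			}
			return buf.Bytes()
		},
	)
}
```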
Annoyingly, I can't seem to reproduce the test failures locally. Not sure if they're related.
/test
LGTM, but there is a comment regarding a debug message.
I'm also not quite sure how to test it. If I understood it correctly:

- If I run any Beat under Agent and there are no global processors defined in the policy, the default ones defined at `x-pack/filebeat/cmd/root.go:40` (for Filebeat) will be used.
- If any global processor is defined in the policy, those will be used.

Is that it?
@fearful-symmetry I also took a quick look at the failing tests; they seem to be failing due to some processors not being added when running a standalone Filebeat.

Comparing the output on main and on your branch:

Did you make sure to rebuild the test binary with
Just changing the review to 'request changes' because the failing tests seem to be related to this PR.
Ah, I didn't know that was a thing I needed to do; that might explain why I couldn't reproduce the failures locally.
Okay, FINALLY managed to reproduce the failures. Turns out I just misread the Jenkins UI and ran the wrong tests, argh.
Alright, let's see what that does...
Okay, did some very quick and unscientific tests: it looks like with the patch, Metricbeat under agent running 17 inputs uses about half the CPU?
I managed to run some benchmarks myself. I will touch base with @fearful-symmetry on the results, but from the initial reading there is indeed a difference in the metadata CPU usage, which results in higher EPS.
The changelog is missing, but I can't contribute on this fork, so I will add it separately.
Adding changelog entry
LGTM. I didn't manage to manually test it, but the system tests seem to have good coverage for this PR.
Fix performance issues with processors scaling under agent (#35031)

* fix performance issues with processors scaling under agent
* make linter happy
* fix test
* add comment
* move around defaultProcessors
* fix tests, add diagnostics
* fix default processor on filebeat
* change log line
* Update CHANGELOG.next.asciidoc: adding changelog entry

Co-authored-by: Pierre HILBERT <[email protected]>

Backport … under agent (#35066): cherry picked from commit ea1293f, with a conflict in libbeat/management/management.go resolved ("Fixing conflict").

Co-authored-by: Alex K <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
What does this PR do?
Fixes #35000
Related to #34149
This fixes a performance issue where we were previously starting agent-mode global processors per input, which could be expensive in cases where we have lots of inputs starting and stopping.
It does this by adding a `fleetDefaultProcessors` argument to the `MakeDefaultSupport` function that's used throughout the beats to instantiate processors. Under fleet mode, this function will now use the specified default global processors unless processors have been manually specified. This extra bit of logic allows us to disable global processors under fleet, which previously wasn't possible.

While I have tested this, there are a few conditions I haven't tested:
I'm not sure how to reliably test these two cases, so if someone wants to tell me, or test it themselves, go ahead.
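In sketch form, the selection logic described above looks something like the following. The types here are stand-ins, not the real libbeat signatures; only the `fleetDefaultProcessors` parameter and the precedence rule come from this PR's description.

```go
package management

// Processor and Config stand in for the real libbeat types.
type Processor interface{ String() string }

type Config struct {
	Processors []Processor // global processors defined in the agent policy
}

// DefaultProcessorsFactory builds the beat's default global processors
// once, instead of once per input.
type DefaultProcessorsFactory func() []Processor

// globalProcessors applies the precedence this PR introduces:
// policy-defined processors win; otherwise fall back to the beat's
// defaults. Passing a nil factory disables global processors under
// fleet entirely, which previously wasn't possible.
func globalProcessors(cfg *Config, fleetDefaultProcessors DefaultProcessorsFactory) []Processor {
	if len(cfg.Processors) > 0 {
		return cfg.Processors
	}
	if fleetDefaultProcessors == nil {
		return nil
	}
	return fleetDefaultProcessors()
}
```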
Why is it important?
This is a major performance issue.
Checklist
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.