
Defer queue creation / activate proxy queue #35118

Merged · 14 commits · Apr 21, 2023

Conversation

@faec (Contributor) commented Apr 18, 2023

Defer creation of the Beats queue until the full output configuration is received; activate the proxy queue when the Shipper output is in use.

The proxy queue is meant to track event acknowledgments without storing event data in memory, to avoid paying a doubled memory cost for events held in both Beats and the Shipper. However, activating it has required multiple refactoring passes, since previously the queue was created globally before any output configuration was received. Now that earlier work has moved queue logic into outputController, this PR completes the change by activating the queue-related hooks, so that queue creation is deferred, and queue clients are blocked, until the output is active.

The important changes are:

  • outputs.Group now has a new field, QueueSettings, with which an output can specify the settings to use when creating the queue.
  • outputController.Set, which gives the output controller its output worker group, is now responsible for creating the queue when a nonempty output is assigned (previously this was done on initialization).
  • outputController.queueProducer can now be called before the queue is created, in which case it accumulates requests that then block until queue creation (see the sketch after this list).
  • refactor: the queue's acknowledgment callback is now a parameter to queue creation rather than an element of the various Settings structs. The callback is used for bookkeeping in the pipeline and needs to be controlled by the queue creator, while everything else in {memqueue,diskqueue,proxyqueue}.Settings can be safely set by the output.
  • refactor: pipeline waitgroup / event tracking. Various event transitions involve calling into the pipeline observer or the pipeline's global event wait group. These are now passed down as creation parameters to the client and outputController so both components can update them in a consistent way instead of distributing the logic across various callbacks and explicit Pipeline pointers.
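
For readers unfamiliar with the pipeline internals, here is a minimal, self-contained sketch of the deferral pattern described in the last three items above. It is not the PR's actual code; the names and types are simplified stand-ins for the real ones in libbeat/publisher/pipeline.

    package pipeline

    import "sync"

    // Simplified stand-ins for the real queue interfaces.
    type producer interface{ Publish(event any) bool }
    type queueLike interface{ Producer() producer }

    // producerRequest parks a client's request for a producer until the queue exists.
    type producerRequest struct {
        responseChan chan producer
    }

    type outputController struct {
        mutex           sync.Mutex
        queue           queueLike
        pendingRequests []producerRequest
    }

    // queueProducer returns a producer immediately if the queue has been created;
    // otherwise it records the request and blocks until Set creates the queue.
    func (c *outputController) queueProducer() producer {
        c.mutex.Lock()
        if c.queue != nil {
            defer c.mutex.Unlock()
            return c.queue.Producer()
        }
        req := producerRequest{responseChan: make(chan producer, 1)}
        c.pendingRequests = append(c.pendingRequests, req)
        c.mutex.Unlock()
        return <-req.responseChan
    }

    // Set receives the output worker group; on the first nonempty output it creates
    // the queue (using whatever settings/factory the output supplied) and unblocks
    // any parked producer requests.
    func (c *outputController) Set(hasOutput bool, createQueue func() queueLike) {
        c.mutex.Lock()
        defer c.mutex.Unlock()
        if c.queue == nil && hasOutput {
            c.queue = createQueue()
            for _, req := range c.pendingRequests {
                req.responseChan <- c.queue.Producer()
            }
            c.pendingRequests = nil
        }
    }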

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

There are two paths to testing this PR: one is to make sure that normal queue settings continue to apply, both standalone and under agent. This should just involve making sure that events reach the output without errors. The other is to test with the shipper output and verify that data can flow to the shipper, that event acknowledgments are received, and that the beat memory use remains low even when many events are queued by the shipper.

Related issues

@faec faec added enhancement Team:Elastic-Agent Label for the Agent team labels Apr 18, 2023
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Apr 18, 2023
@mergify mergify bot assigned faec Apr 18, 2023
mergify bot commented Apr 18, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

elasticmachine (Collaborator) commented Apr 18, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-04-21T17:53:34.003+0000

  • Duration: 67 min 8 sec

Test stats 🧪

Test Results: 0 failed, 26205 passed, 1979 skipped, 28184 total

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@faec faec marked this pull request as ready for review April 19, 2023 19:45
@faec faec requested a review from a team as a code owner April 19, 2023 19:45
@faec faec requested review from belimawr and leehinman and removed request for a team April 19, 2023 19:45
elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@faec (Contributor, Author) commented Apr 19, 2023

I've fixed the most relevant tests and added unit tests for the new behavior. I'm still investigating some CI failures that I can't yet reproduce locally but I think it's ready for review.

@faec (Contributor, Author) commented Apr 20, 2023

There is one remaining failing test, test_seccomp_installed, which only runs on Linux. I set up a Linux environment to try to reproduce it, but for me it still passes. The only information the CI gives is that it timed out, so I'm not sure what to make of it at this point...

Edit: after working more with the python tests and various dependencies I can now reproduce the test_seccomp failure. Hopefully this will lead to a resolution tomorrow...

@fearful-symmetry (Contributor) commented:

Oddly enough, looking at the log files generated by the beat, I can see the log line that the test is looking for:

 cat /home/alexk/go/src/github.com/elastic/beats/libbeat/build/system-tests/run/test_seccomp.Test.test_seccomp_installed199/mockbeat-20230420.ndjson | grep -i syscall
{"log.level":"debug","@timestamp":"2023-04-20T14:46:48.989-0700","log.logger":"seccomp","log.origin":{"file.name":"seccomp/seccomp.go","file.line":117},"message":"Loading syscall filter","service.name":"mockbeat","seccomp_filter":{"no_new_privs":true,"flag":"tsync","policy":
{"log.level":"info","@timestamp":"2023-04-20T14:46:48.989-0700","log.logger":"seccomp","log.origin":{"file.name":"seccomp/seccomp.go","file.line":124},"message":"Syscall filter successfully installed","service.name":"mockbeat","ecs.version":"1.6.0"}

// parsing errors: if there is an error in the queue config, we want it to
// show up as fatal during initialization, even if the queue itself isn't
// created until later.
func queueSettingsForUserConfig(queueType string, userConfig *conf.C, inQueueSize int) (interface{}, error) {
A reviewer (Contributor) commented on this code:

That bare interface really bugs me, even if I understand why we ended up with it. At the very least, I kinda wonder if we could have some kind of named interface or something, just for the sake of making the API itself look less opaque.

@faec (Contributor, Author) replied:

It bugs me too, and it wasn't my first choice. The constraint that led me here is that we need to be able to fail during initialization if we get an invalid queue config up front; therefore we need to call SettingsFromConfig for the appropriate queue (since they all have incompatible settings), but we also need the result to be overridable by something the output can create.

  • Alternative: We could panic later when we realize the config is invalid. I don't like this one.
  • Alternative: We could report an error when we see it but fall back on the default memory queue. (I'm open to this one but in the end I decided quitting on initialization was preferable.)

The settings alone aren't enough to create the queue: we also need the logger (which we could get from other sources in a pinch) and the queue's global acknowledgment callback (which we can't), so we can't have the output actually create the queue. On the other hand, just accepting a queue type / *conf.C is also awkward: the proxy queue isn't user-configurable, so a natural future choice like passing along a queue subtree from an output configuration would expose that internal type. (Besides which, to me an interface{} that is a placeholder for 3 concrete types is still a lot easier to work with than a *conf.C that could contain literally anything.)

The old way of handling this tension was to use factories, which gives things a nice name but makes the internals too opaque: a factory that just took the callback and created a queue would work, but there would be no way to know what type of queue we'd created, which we currently report.

Which... having talked myself into a corner, maybe we just need to add a QueueType() to the queue interface, and have both the pipeline initialization and the outputs give a queue factory for this field? At this point it could be a func(*logp.Logger, func(eventCount int)) (queue.Queue, error), which is more consistent than the old types, and we could still fall back on the memory queue on error (knowing that an honest queue creation can only fail for the disk queue). I dislike losing access to the settings objects, but right now we don't actually need them, so this might make sense. Thoughts?
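
For concreteness, a hedged, self-contained sketch of that factory-plus-fallback shape. The logger and queue types below are simplified stand-ins, not libbeat's real interfaces, and the default-memqueue factory is hypothetical:

    package pipeline

    // Stand-in interfaces, only for illustration.
    type logger interface{ Errorf(format string, args ...any) }
    type queueIface interface{ Close() error }

    // queueFactory mirrors the signature proposed above: it takes the logger and the
    // pipeline's global acknowledgment callback and returns a queue (or an error).
    type queueFactory func(log logger, ack func(eventCount int)) (queueIface, error)

    // createQueue runs the configured factory and, if it fails (in practice only the
    // disk queue can fail to initialize), falls back to a default memory queue so the
    // pipeline can still start.
    func createQueue(factory, defaultMemQueue queueFactory, log logger, ack func(int)) queueIface {
        q, err := factory(log, ack)
        if err != nil {
            log.Errorf("couldn't create configured queue, falling back to memqueue: %v", err)
            q, _ = defaultMemQueue(log, ack)
        }
        return q
    }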

@faec (Contributor, Author) replied:

Ok, I switched to a version using factories, see if you feel better about this one ^^

@fearful-symmetry (Contributor) commented:

So, the tests produce a huge number of these log lines:

{"log.level":"error","@timestamp":"2023-04-20T15:19:39.507-0700","log.logger":"publish","log.origin":{"file.name":"pipeline/controller.go","file.line":224},"message":"outputController received new output configuration when queue is already active","ecs.version":"1.6.0"}

Not sure how much of that is the tests, but it makes me worried we might produce a bit of log spam with it.

@leehinman (Contributor) left a review:

Also curious about the large number of log messages that Alex saw.

Review comment on libbeat/publisher/pipeline/controller.go (resolved)
@faec (Contributor, Author) commented Apr 21, 2023

So, the tests produce a huge number of these log lines:

{"log.level":"error","@timestamp":"2023-04-20T15:19:39.507-0700","log.logger":"publish","log.origin":{"file.name":"pipeline/controller.go","file.line":224},"message":"outputController received new output configuration when queue is already active","ecs.version":"1.6.0"}

Not sure how much of that is the tests, but it makes me worried we might produce a bit of log spam with it.

I know why that is, and I felt conflicted about how to handle it, but the good news is that it won't happen in production: the cause is TestOutputReload in test_controller.go, which manually reloads the pipeline ~5000 times while simultaneously sending events through it to make sure nothing is lost.

I'm not sure of the exact motivation, considering a real pipeline is only ever reloaded once. What it de facto tests is that the target of an eventConsumer can be changed during live event flow without dropping anything, and that's nice to know even though we don't rely on it, so I was reluctant to delete the test outright, but in a live beat it would be considered an error condition (and it would be extremely unusual for it to happen more than once, so I'm not worried about error spam).

So I'm not sure how to resolve the tension there -- the test is somewhat useful to keep currently-working asynchronous code from degrading, but it really is an error condition in current code so it necessarily spams the pipeline logger during tests. I could give the pipeline a nil logger for that test, but that would prevent logging of things we do want to see. Maybe it's still worth it considering the noise, and just leave an explanatory comment saying to reactivate the logger if the test ever fails? Open to suggestions :-)

@leehinman (Contributor) replied:

I know why that is, and I felt conflicted about how to handle it, but the good news is that it won't happen in production: the cause is TestOutputReload in test_controller.go, which manually reloads the pipeline ~5000 times while simultaneously sending events through it to make sure nothing is lost.

I'm not sure the exact motivation, considering a real pipeline is only ever reloaded once. What it de facto tests is that the target of an eventConsumer can be changed during live event flow without dropping anything, and that's nice to know even though we don't rely on it, so I was reluctant to delete the test outright, but in a live beat it would be considered an error condition (and it would be extremely unusual for it to happen more than once, so I'm not worried about error spam).

So I'm not sure how to resolve the tension there -- the test is somewhat useful to keep currently-working asynchronous code from degrading, but it really is an error condition in current code so it necessarily spams the pipeline logger during tests. I could give the pipeline a nil logger for that test, but that would prevent logging of things we do want to see. Maybe it's still worth it considering the noise, and just leave an explanatory comment saying to reactivate the logger if the test ever fails? Open to suggestions :-)

Could we maybe change the test so it constantly sends events but only reloads a small number of times? That should still give us the confidence that reloading doesn't drop anything while limiting the valid error log messages.
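
For illustration, a rough sketch of what that reshaped test could look like. The helpers newTestPipeline, newTestOutputGroup, testEvent, and requireAllPublished are hypothetical names, not the real test file's API:

    func TestOutputReloadLimited(t *testing.T) {
        const numEvents = 10000
        const numReloads = 10 // a handful of reloads instead of ~5000, to limit error-log noise

        controller, client := newTestPipeline(t) // hypothetical setup helper

        var publishers sync.WaitGroup
        publishers.Add(1)
        go func() {
            defer publishers.Done()
            for i := 0; i < numEvents; i++ {
                client.Publish(testEvent(i)) // keep events flowing the whole time
            }
        }()

        // Reload the output a few times while events are in flight.
        for i := 0; i < numReloads; i++ {
            controller.Set(newTestOutputGroup(t))
        }

        publishers.Wait()
        requireAllPublished(t, numEvents) // hypothetical assertion: nothing was dropped
    }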

@faec (Contributor, Author) commented Apr 21, 2023

Oddly enough, looking at the log files generated by the beat, I can see the log line that the test is looking for:
...

Yes indeed, my local tests confirm that the exact same log line (modulo timestamp) appears in both cases but only my PR branch considers it a failure. So it must be something in how the testing framework scans logs? Very mysterious, especially since it doesn't seem to affect any other tests. Hopefully getting close to pinning down the problem...

Edit: found it! Or, one step closer: it has nothing to do with the log lines, it's with shutdown. The proc.kill_and_wait call at the end of the test waits forever. Still not sure why this would manifest in exactly one test, but shutdown is something I can at least see my PR having an effect on, even though I tried to keep that logic unchanged...

@@ -72,6 +74,8 @@ type Queue interface {
Metrics() (Metrics, error)
}

type QueueFactory func(logger *logp.Logger, ack func(eventCount int)) (Queue, error)
A reviewer (Contributor) commented on this code:

Small nit: can we get a comment on the QueueFactory type? Other than that, LGTM.

@faec (Contributor, Author) commented Apr 21, 2023

Found the full issue with the failing test and it's somewhat ugly:

Beat shutdown blocks on the main Beater.Run routine returning. The beater is sent a Stop callback when there's a shutdown signal (ctrl-c, kill, etc), but until its main Run method returns the rest of the shutdown never happens. This is mostly fine except in cases where no output is ever set, which currently seems like it can only happen when the beat is passed -N on the command line (this sets publishDisabled, which discards any output configuration before it is applied). In this case, beats that blocked on an initial (*Pipeline).Connect call will never unblock, because the pipeline doesn't know it's shutting down and therefore will let the blocked requests for a queue.Producer sit there forever.

Complicating this is the fact that (*Pipeline).Close itself is seemingly never called at all in most cases, even when shutdown does proceed properly (otherwise I could refactor it to be triggered by the beater shutdown callback -- but as things stand I'm wary of adding entirely new code paths to the shutdown process without more thorough testing). This drives home something we already knew, which is that we should fix the Beats shutdown procedure, since it's known to be skipping important steps and at this point that's interfering with other pipeline features.

My proposal for a short-term fix, until we have time for a more thorough cleanup of pipeline shutdown: have queueProducer check the publishDisabled flag and, if set, return a no-op producer, since we know that otherwise any attempt to connect to the pipeline will block forever (unless the beater is extraordinarily responsible about how it handles shutdown signals, which doesn't seem to be the case with current beats). This solution feels not great to me, but it doesn't feel particularly worse than the preexisting special cases around publishDisabled. Thoughts / objections?
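
For illustration, a minimal, self-contained sketch of that guard, using simplified stand-in types rather than the real queue.Producer interface:

    package pipeline

    // Stand-in type, only for illustration.
    type producer interface{ Publish(event any) bool }

    // noopProducer accepts and silently drops events; used when publishing is disabled
    // and no queue will ever be created.
    type noopProducer struct{}

    func (noopProducer) Publish(event any) bool { return true }

    type outputController struct {
        publishDisabled bool          // set when the beat runs with -N
        queueReady      chan producer // delivers a real producer once the queue exists
    }

    // queueProducer returns a no-op producer when publishing is disabled, so callers
    // of (*Pipeline).Connect don't block forever waiting for a queue that will never
    // be created; otherwise it blocks until the queue is ready.
    func (c *outputController) queueProducer() producer {
        if c.publishDisabled {
            return noopProducer{}
        }
        return <-c.queueReady
    }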

@fearful-symmetry (Contributor) replied:

Didn't even know -N was a thing. Considering the flag explicitly disables the output, I think that returning a no-op publisher is fairly reasonable?

Labels
enhancement Team:Elastic-Agent Label for the Agent team
Development

Successfully merging this pull request may close these issues.

Beats should use the proxy queue when the shipper output is active
4 participants