
Defer queue creation / activate proxy queue #35118

Merged · 14 commits · Apr 21, 2023

Conversation

@faec (Contributor) commented Apr 18, 2023

Defer creation of the Beats queue until the full output configuration is received; activate the proxy queue when the Shipper output is in use.

The proxy queue is meant to track event acknowledgments without storing event data in memory, to avoid paying a doubled memory cost for events held in both Beats and the Shipper. However, activating it has required multiple refactoring passes, since previously the queue was created globally before any output configuration was received. Now that earlier work has moved queue logic into outputController, this PR completes the change by activating the queue-related hooks, so that queue creation is deferred, and queue clients are blocked, until the output is active.

The important changes are:

  • outputs.Group now has a new field, QueueSettings, with which an output can specify the settings to use when creating the queue.
  • outputController.Set, which gives the output controller its output worker group, is now responsible for creating the queue when a nonempty output is assigned (previously this was done on initialization).
  • outputController.queueProducer can now be called before the queue is created, in which case it accumulates requests that then block until queue creation (see the sketch after this list).
  • refactor: the queue's acknowledgment callback is now a parameter to queue creation rather than an element of the various Settings structs. The callback is used for bookkeeping in the pipeline and needs to be controlled by the queue creator, while everything else in {memqueue,diskqueue,proxyqueue}.Settings can be safely set by the output.
  • refactor: pipeline waitgroup / event tracking. Various event transitions involve calling into the pipeline observer or the pipeline's global event wait group. These are now passed down as creation parameters to the client and outputController so both components can update them in a consistent way instead of distributing the logic across various callbacks and explicit Pipeline pointers.
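
For readers unfamiliar with the pipeline internals, here is a minimal, self-contained sketch of the deferral pattern described in the last three items above. It is not the PR's actual code; the names and types are simplified stand-ins for the real ones in libbeat/publisher/pipeline.

    package pipeline

    import "sync"

    // Simplified stand-ins for the real queue interfaces.
    type producer interface{ Publish(event any) bool }
    type queueLike interface{ Producer() producer }

    // producerRequest parks a client's request for a producer until the queue exists.
    type producerRequest struct {
        responseChan chan producer
    }

    type outputController struct {
        mutex           sync.Mutex
        queue           queueLike
        pendingRequests []producerRequest
    }

    // queueProducer returns a producer immediately if the queue has been created;
    // otherwise it records the request and blocks until Set creates the queue.
    func (c *outputController) queueProducer() producer {
        c.mutex.Lock()
        if c.queue != nil {
            defer c.mutex.Unlock()
            return c.queue.Producer()
        }
        req := producerRequest{responseChan: make(chan producer, 1)}
        c.pendingRequests = append(c.pendingRequests, req)
        c.mutex.Unlock()
        return <-req.responseChan
    }

    // Set receives the output worker group; on the first nonempty output it creates
    // the queue (using whatever settings/factory the output supplied) and unblocks
    // any parked producer requests.
    func (c *outputController) Set(hasOutput bool, createQueue func() queueLike) {
        c.mutex.Lock()
        defer c.mutex.Unlock()
        if c.queue == nil && hasOutput {
            c.queue = createQueue()
            for _, req := range c.pendingRequests {
                req.responseChan <- c.queue.Producer()
            }
            c.pendingRequests = nil
        }
    }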

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

There are two paths to testing this PR: one is to make sure that normal queue settings continue to apply, both standalone and under agent. This should just involve making sure that events reach the output without errors. The other is to test with the shipper output and verify that data can flow to the shipper, that event acknowledgments are received, and that the beat memory use remains low even when many events are queued by the shipper.

Related issues

@faec faec added enhancement Team:Elastic-Agent Label for the Agent team labels Apr 18, 2023
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Apr 18, 2023
@mergify mergify bot assigned faec Apr 18, 2023
mergify bot commented Apr 18, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

elasticmachine (Collaborator) commented Apr 18, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-04-21T17:53:34.003+0000

  • Duration: 67 min 8 sec

Test stats 🧪

Test Results: 0 failed, 26205 passed, 1979 skipped, 28184 total

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@faec faec marked this pull request as ready for review April 19, 2023 19:45
@faec faec requested a review from a team as a code owner April 19, 2023 19:45
@faec faec requested review from belimawr and leehinman and removed request for a team April 19, 2023 19:45
elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@faec (Contributor, Author) commented Apr 19, 2023

I've fixed the most relevant tests and added unit tests for the new behavior. I'm still investigating some CI failures that I can't yet reproduce locally but I think it's ready for review.

@faec (Contributor, Author) commented Apr 20, 2023

There is one remaining failing test, test_seccomp_installed, which only runs on Linux. I set up a Linux environment to try to reproduce it, but for me it still passes. The only information the CI gives is that it timed out, so I'm not sure what to make of it at this point...

Edit: after working more with the python tests and various dependencies I can now reproduce the test_seccomp failure. Hopefully this will lead to a resolution tomorrow...

@fearful-symmetry (Contributor) commented:

Oddly enough, looking at the log files generated by the beat, I can see the log line that the test is looking for:

 cat /home/alexk/go/src/github.com/elastic/beats/libbeat/build/system-tests/run/test_seccomp.Test.test_seccomp_installed199/mockbeat-20230420.ndjson | grep -i syscall
{"log.level":"debug","@timestamp":"2023-04-20T14:46:48.989-0700","log.logger":"seccomp","log.origin":{"file.name":"seccomp/seccomp.go","file.line":117},"message":"Loading syscall filter","service.name":"mockbeat","seccomp_filter":{"no_new_privs":true,"flag":"tsync","policy":
{"log.level":"info","@timestamp":"2023-04-20T14:46:48.989-0700","log.logger":"seccomp","log.origin":{"file.name":"seccomp/seccomp.go","file.line":124},"message":"Syscall filter successfully installed","service.name":"mockbeat","ecs.version":"1.6.0"}

// parsing errors: if there is an error in the queue config, we want it to
// show up as fatal during initialization, even if the queue itself isn't
// created until later.
func queueSettingsForUserConfig(queueType string, userConfig *conf.C, inQueueSize int) (interface{}, error) {
A reviewer (Contributor) commented on this code:

That bare interface really bugs me, even if I understand why we ended up with it. At the very least, I kinda wonder if we could have some kind of named interface or something, just for the sake of making the API itself look less opaque.

@faec (Contributor, Author) replied:

It bugs me too, and it wasn't my first choice. The constraint that led me here is that we need to be able to fail during initialization if we get an invalid queue config up front; therefore we need to call SettingsFromConfig for the appropriate queue (since they all have incompatible settings), but we also need the result to be overridable by something the output can create.

  • Alternative: We could panic later when we realize the config is invalid. I don't like this one.
  • Alternative: We could report an error when we see it but fall back on the default memory queue. (I'm open to this one but in the end I decided quitting on initialization was preferable.)

The settings alone aren't enough to create the queue: we also need the logger (which we could get from other sources in a pinch) and the queue's global acknowledgment callback (which we can't), so we can't have the output actually create the queue. On the other hand, just accepting a queue type / *conf.C is also awkward: the proxy queue isn't user-configurable, so a natural future choice like passing along a queue subtree from an output configuration would expose that internal type. (Besides which, to me an interface{} that is a placeholder for 3 concrete types is still a lot easier to work with than a *conf.C that could contain literally anything.)

The old way of handling this tension was to use factories, which gives things a nice name but makes the internals too opaque: a factory that just took the callback and created a queue would work, but there would be no way to know what type of queue we'd created, which we currently report.

Which... having talked myself into a corner, maybe we just need to add a QueueType() to the queue interface, and have both the pipeline initialization and the outputs give a queue factory for this field? At this point it could be a func(*logp.Logger, func(eventCount int)) (queue.Queue, error), which is more consistent than the old types, and we could still fall back on the memory queue on error (knowing that an honest queue creation can only fail for the disk queue). I dislike losing access to the settings objects, but right now we don't actually need them, so this might make sense. Thoughts?
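
For concreteness, a hedged, self-contained sketch of that factory-plus-fallback shape. The logger and queue types below are simplified stand-ins, not libbeat's real interfaces, and the default-memqueue factory is hypothetical:

    package pipeline

    // Stand-in interfaces, only for illustration.
    type logger interface{ Errorf(format string, args ...any) }
    type queueIface interface{ Close() error }

    // queueFactory mirrors the signature proposed above: it takes the logger and the
    // pipeline's global acknowledgment callback and returns a queue (or an error).
    type queueFactory func(log logger, ack func(eventCount int)) (queueIface, error)

    // createQueue runs the configured factory and, if it fails (in practice only the
    // disk queue can fail to initialize), falls back to a default memory queue so the
    // pipeline can still start.
    func createQueue(factory, defaultMemQueue queueFactory, log logger, ack func(int)) queueIface {
        q, err := factory(log, ack)
        if err != nil {
            log.Errorf("couldn't create configured queue, falling back to memqueue: %v", err)
            q, _ = defaultMemQueue(log, ack)
        }
        return q
    }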

@faec (Contributor, Author) replied:

Ok, I switched to a version using factories, see if you feel better about this one ^^

@fearful-symmetry (Contributor) commented:

So, the tests produce a huge number of these log lines:

{"log.level":"error","@timestamp":"2023-04-20T15:19:39.507-0700","log.logger":"publish","log.origin":{"file.name":"pipeline/controller.go","file.line":224},"message":"outputController received new output configuration when queue is already active","ecs.version":"1.6.0"}

Not sure how much of that is the tests, but it makes me worried we might produce a bit of log spam with it.

@leehinman (Contributor) left a review:

Also curious about the large number of log messages that Alex saw.

Review comment on libbeat/publisher/pipeline/controller.go (resolved)
@faec (Contributor, Author) commented Apr 21, 2023

So, the tests produce a huge number of these log lines:

{"log.level":"error","@timestamp":"2023-04-20T15:19:39.507-0700","log.logger":"publish","log.origin":{"file.name":"pipeline/controller.go","file.line":224},"message":"outputController received new output configuration when queue is already active","ecs.version":"1.6.0"}

Not sure how much of that is the tests, but it makes me worried we might produce a bit of log spam with it.

I know why that is, and I felt conflicted about how to handle it, but the good news is that it won't happen in production: the cause is TestOutputReload in test_controller.go, which manually reloads the pipeline ~5000 times while simultaneously sending events through it to make sure nothing is lost.

I'm not sure of the exact motivation, considering a real pipeline is only ever reloaded once. What it de facto tests is that the target of an eventConsumer can be changed during live event flow without dropping anything, and that's nice to know even though we don't rely on it, so I was reluctant to delete the test outright, but in a live beat it would be considered an error condition (and it would be extremely unusual for it to happen more than once, so I'm not worried about error spam).

So I'm not sure how to resolve the tension there -- the test is somewhat useful to keep currently-working asynchronous code from degrading, but it really is an error condition in current code so it necessarily spams the pipeline logger during tests. I could give the pipeline a nil logger for that test, but that would prevent logging of things we do want to see. Maybe it's still worth it considering the noise, and just leave an explanatory comment saying to reactivate the logger if the test ever fails? Open to suggestions :-)

@leehinman (Contributor) replied:

I know why that is, and I felt conflicted about how to handle it, but the good news is that it won't happen in production: the cause is TestOutputReload in test_controller.go, which manually reloads the pipeline ~5000 times while simultaneously sending events through it to make sure nothing is lost.

I'm not sure the exact motivation, considering a real pipeline is only ever reloaded once. What it de facto tests is that the target of an eventConsumer can be changed during live event flow without dropping anything, and that's nice to know even though we don't rely on it, so I was reluctant to delete the test outright, but in a live beat it would be considered an error condition (and it would be extremely unusual for it to happen more than once, so I'm not worried about error spam).

So I'm not sure how to resolve the tension there -- the test is somewhat useful to keep currently-working asynchronous code from degrading, but it really is an error condition in current code so it necessarily spams the pipeline logger during tests. I could give the pipeline a nil logger for that test, but that would prevent logging of things we do want to see. Maybe it's still worth it considering the noise, and just leave an explanatory comment saying to reactivate the logger if the test ever fails? Open to suggestions :-)

Could we maybe change the test so it constantly sends events but only reloads a small number of times? That should still give us the confidence that reloading doesn't drop anything while limiting the valid error log messages.
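
For illustration, a rough sketch of what that reshaped test could look like. The helpers newTestPipeline, newTestOutputGroup, testEvent, and requireAllPublished are hypothetical names, not the real test file's API:

    func TestOutputReloadLimited(t *testing.T) {
        const numEvents = 10000
        const numReloads = 10 // a handful of reloads instead of ~5000, to limit error-log noise

        controller, client := newTestPipeline(t) // hypothetical setup helper

        var publishers sync.WaitGroup
        publishers.Add(1)
        go func() {
            defer publishers.Done()
            for i := 0; i < numEvents; i++ {
                client.Publish(testEvent(i)) // keep events flowing the whole time
            }
        }()

        // Reload the output a few times while events are in flight.
        for i := 0; i < numReloads; i++ {
            controller.Set(newTestOutputGroup(t))
        }

        publishers.Wait()
        requireAllPublished(t, numEvents) // hypothetical assertion: nothing was dropped
    }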

@faec (Contributor, Author) commented Apr 21, 2023

Oddly enough, looking at the log files generated by the beat, I can see the log line that the test is looking for:
...

Yes indeed, my local tests confirm that the exact same log line (modulo timestamp) appears in both cases but only my PR branch considers it a failure. So it must be something in how the testing framework scans logs? Very mysterious, especially since it doesn't seem to affect any other tests. Hopefully getting close to pinning down the problem...

Edit: found it! Or, one step closer: it has nothing to do with the log lines, it's with shutdown. The proc.kill_and_wait call at the end of the test waits forever. Still not sure why this would manifest in exactly one test, but shutdown is something I can at least see my PR having an effect on, even though I tried to keep that logic unchanged...

@@ -72,6 +74,8 @@ type Queue interface {
Metrics() (Metrics, error)
}

type QueueFactory func(logger *logp.Logger, ack func(eventCount int)) (Queue, error)
A reviewer (Contributor) commented on this code:

Small nit: can we get a comment on the QueueFactory type? Other than that, LGTM.

@faec (Contributor, Author) commented Apr 21, 2023

Found the full issue with the failing test and it's somewhat ugly:

Beat shutdown blocks on the main Beater.Run routine returning. The beater is sent a Stop callback when there's a shutdown signal (ctrl-c, kill, etc), but until its main Run method returns the rest of the shutdown never happens. This is mostly fine except in cases where no output is ever set, which currently seems like it can only happen when the beat is passed -N on the command line (this sets publishDisabled, which discards any output configuration before it is applied). In this case, beats that blocked on an initial (*Pipeline).Connect call will never unblock, because the pipeline doesn't know it's shutting down and therefore will let the blocked requests for a queue.Producer sit there forever.

Complicating this is the fact that (*Pipeline).Close itself is seemingly never called at all in most cases, even when shutdown does proceed properly (otherwise I could refactor it to be triggered by the beater shutdown callback -- but as things stand I'm wary of adding entirely new code paths to the shutdown process without more thorough testing). This drives home something we already knew, which is that we should fix the Beats shutdown procedure, since it's known to be skipping important steps and at this point that's interfering with other pipeline features.

My proposal for a short-term fix, until we have time for a more thorough cleanup of pipeline shutdown: have queueProducer check the publishDisabled flag and, if set, return a no-op producer, since we know that otherwise any attempt to connect to the pipeline will block forever (unless the beater is extraordinarily responsible about how it handles shutdown signals, which doesn't seem to be the case with current beats). This solution feels not great to me, but it doesn't feel particularly worse than the preexisting special cases around publishDisabled. Thoughts / objections?
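
For illustration, a minimal, self-contained sketch of that guard, using simplified stand-in types rather than the real queue.Producer interface:

    package pipeline

    // Stand-in type, only for illustration.
    type producer interface{ Publish(event any) bool }

    // noopProducer accepts and silently drops events; used when publishing is disabled
    // and no queue will ever be created.
    type noopProducer struct{}

    func (noopProducer) Publish(event any) bool { return true }

    type outputController struct {
        publishDisabled bool          // set when the beat runs with -N
        queueReady      chan producer // delivers a real producer once the queue exists
    }

    // queueProducer returns a no-op producer when publishing is disabled, so callers
    // of (*Pipeline).Connect don't block forever waiting for a queue that will never
    // be created; otherwise it blocks until the queue is ready.
    func (c *outputController) queueProducer() producer {
        if c.publishDisabled {
            return noopProducer{}
        }
        return <-c.queueReady
    }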

@fearful-symmetry (Contributor) replied:

Didn't even know -N was a thing. Considering the flag explicitly disables the output, I think that returning a no-op publisher is fairly reasonable?

Labels
enhancement Team:Elastic-Agent Label for the Agent team
Development

Successfully merging this pull request may close these issues.

Beats should use the proxy queue when the shipper output is active
4 participants