Filebeat monitoring enters infinite error loop for "closed processor" #34716
Comments
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
This error is new. It is coming from `beats/libbeat/processors/safe_processor.go` (lines 28 to 41 at 331f792).
This error was added last week in #34647. @rdner can you investigate what might be going on here?
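For context, the safeguard added in #34647 is essentially a wrapper that refuses to run a processor once it has been closed. Below is a minimal sketch of that idea (illustrative types only, not the actual code in safe_processor.go):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrClosed is what the safeguard reports when a processor is used after Close.
var ErrClosed = errors.New("attempt to use a closed processor")

// safeProcessor sketches the idea: wrap a processor and refuse to run it once
// it has been closed, instead of letting it fail on released resources.
type safeProcessor struct {
	run    func(event string) (string, error) // stand-in for the wrapped processor
	closed atomic.Bool
}

func (p *safeProcessor) Run(event string) (string, error) {
	if p.closed.Load() {
		return "", ErrClosed
	}
	return p.run(event)
}

func (p *safeProcessor) Close() error {
	p.closed.Store(true)
	return nil
}

func main() {
	p := &safeProcessor{run: func(e string) (string, error) { return e + " (processed)", nil }}
	fmt.Println(p.Run("event-1")) // works while open
	_ = p.Close()
	fmt.Println(p.Run("event-2")) // reports ErrClosed instead of panicking
}
```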
The filestream monitoring processors are defined in the agent, see https://github.com/elastic/elastic-agent/blob/98b8fe7ff2e5fc5db9a309ffd19c587f10ef8270/internal/pkg/agent/application/monitoring/v1_monitor.go#L320 for example. The monitoring Beats are configured in that file.
This only affects the unreleased v8.7 and v7.17. Likely this needs to be an 8.7 release blocker, but I'll give us some time to investigate and triage before adding it to the dev issue. @pierrehilbert FYI, added to the current sprint. I don't think we can release v8.7 with this.
[snip] Disregard this comment, I tested it against the wrong code 😅
Sorry, closed the wrong bug -- the additional closer error I described was spurious, it was a mismatched prototype.
@cmacknz this error is a safeguard which makes sure that none of the processors is used after being closed. We didn't introduce a new issue with #34647, we exposed an existing one. So, there is no regression here. This is some kind of race condition in the Beats internals; I looked through the code when working on #34647 but could not find a possible place where it could occur, which is how the safeguard came to be introduced. I'd like to ask a fresh pair of eyes to look at it.
Removing the safeguard would most likely lead to a panic, because a closed processor has already released its internal resources.
I believe @rdner that #34647 didn't cause this intentionally and has maybe uncovered an old problem, but given that the error points to this change it seems like an unintended side effect. @rdner you are the most familiar with this area of Beats, so you are best positioned to try to debug this. I'm not sure who else to give this to otherwise, maybe @pierrehilbert can help with this. If we are indeed failing to send monitoring logs to Fleet we can't release 8.7 with this bug, so addressing this needs to take priority over something else in the sprint.
Based on how I encountered it (in particular its inconsistency across repeated runs), I wonder if it would previously have produced a panic or similar fatal error which would have just restarted the beat. In that case this would be uncovering an error that was already there, but by spin-locking on the error we are now preventing a natural restart + recovery that masked it in the past. |
Agreed, but as @faec said, previously it was just panicking and restarting, which I don't consider the right behaviour either. The only difference is that we now have an actual log line that says we lost the event, which was not the case before. We definitely need to find the root cause, which has nothing to do with my change; we should keep the change for cases exactly like this, to expose the incorrect behaviour.
On the contrary, this is the intended side-effect. We didn't have any proof that we were using processors after closing them until now.
I'll try to get to the bottom of this in the 2 days I have before my PTO. @cmacknz #34647 was merged on the 22nd of February; if I understand correctly the BC for 8.7.0 was created earlier, so the change would not affect the current release and is only included in 8.7.1. So, we have plenty of time to look into this, don't we?
There are multiple build candidates scheduled to pick up fixes before release. The last 8.7.0 build candidate is scheduled for 2023-03-22, so we have until then to fix this (I can't link to the 8.7.0 release schedule here as it's private).
I would say the intended side effect is to report the error of run-after-close, and the unintended side effect is a spin lock that wedges the whole pipeline while filling the logs with 100MB of error messages per minute. We would prefer to keep the intended side effect, but we can't ship something that spinlocks on what would otherwise be a recoverable error.
Absolutely agree, but to be fair none of us knew that the use of processors was broken this much. I tested a simple processor configuration in my PR and it worked normally, so it's not like it was not tested at all. Just to be clear, in my previous comments I never meant "won't fix"; I tried to communicate that the fact that we caught this behaviour was a good thing, and I started investigating right away. Who would have thought that the problem was at this scale; it had previously been reported only for a single processor.
I've finally managed to find the cause. The fixing PR is here: #34761.

The Source of the Issue

I've tracked it down to where the multiple uses of already-closed processors come from.
In my tests (the policy in the description of this issue) it happened for the following processors:
Apparently, it does not really depend on a particular processor and happens only in `beats/filebeat/input/filestream/internal/input-logfile/harvester.go` (lines 193 to 201 at a8ab071).
Every time a client is closed, its processors are closed as well: `beats/libbeat/publisher/pipeline/client.go` (lines 165 to 172 at a8ab071).
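To make that chain concrete, here is a reduced sketch of the behaviour being described (hypothetical types, not the actual Beats code): the harvester closes its pipeline client when it stops, the client's Close also closes the processors attached to it, and any later use of those same processor instances then fails.

```go
package main

import "fmt"

// Illustrative stand-ins for the real Beats types, only to show the call chain.
type processor struct{ closed bool }

func (p *processor) Run(event string) error {
	if p.closed {
		return fmt.Errorf("attempt to use a closed processor")
	}
	fmt.Println("processed:", event)
	return nil
}

func (p *processor) Close() { p.closed = true }

// client sketches the pipeline client: closing it also closes its processors.
type client struct{ procs []*processor }

func (c *client) Close() {
	for _, p := range c.procs {
		p.Close()
	}
}

// harvester sketches the filestream harvester: when it stops, it closes its client.
type harvester struct{ cli *client }

func (h *harvester) stop() { h.cli.Close() }

func main() {
	procs := []*processor{{}}
	h := &harvester{cli: &client{procs: procs}}

	_ = procs[0].Run("event-1")          // fine while the harvester is running
	h.stop()                             // harvester stops -> client closed -> processors closed
	fmt.Println(procs[0].Run("event-2")) // any later use of the same instances now fails
}
```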
Closing the client is normal behaviour; it seems something is re-using the same processors across multiple clients, and that is what causes the issue.

The Root Cause

I found this function: `beats/filebeat/channel/runner.go` (lines 121 to 190 at a8ab071)
It is actually re-using all the user-space processors (the ones defined in the config), since they are created in the closure and then used by the returned factory function later. And that was it: once I moved the initialisation inside the returned function I could not reproduce the errors anymore (see the sketch at the end of this comment).

My Analysis

The pipeline creates a new set of processors here: `beats/libbeat/publisher/pipeline/pipeline.go` (line 247 at a8ab071)
This eventually reaches two functions. The first creates the user-space processors from the config: `beats/libbeat/processors/processor.go` (lines 73 to 129 at a8ab071)
The second then adds some internal processors on top of them: `beats/libbeat/publisher/processing/default.go` (lines 258 to 370 at cbd1a38)
Because of these assignments:
`beats/libbeat/cmd/instance/beat.go` (lines 810 to 814 at a8ab071)
`beats/libbeat/publisher/processing/default.go` (lines 78 to 80 at a8ab071)
`beats/libbeat/publisher/processing/default.go` (lines 96 to 117 at a8ab071)
The builder also does not re-use anything and protects the global processors from being closed: `beats/libbeat/publisher/processing/default.go` (lines 349 to 352 at cbd1a38)
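The idea being described there, as a hypothetical sketch rather than the actual default.go code, is that the group built for a client only closes the processors it owns, while processors registered as global are left untouched:

```go
package main

import "fmt"

// proc is an illustrative processor with a closed flag.
type proc struct {
	name   string
	closed bool
}

// group sketches the per-client processor group: Close releases only the
// processors it owns; global (shared) processors are intentionally skipped.
type group struct {
	owned  []*proc // per-client processors, released with the group
	global []*proc // shared across clients, never closed here
}

func (g *group) Close() {
	for _, p := range g.owned {
		p.closed = true
	}
	// g.global is deliberately left untouched
}

func main() {
	g := &group{
		owned:  []*proc{{name: "per-client-processor"}},
		global: []*proc{{name: "global-processor"}},
	}
	g.Close()
	for _, p := range append(g.owned, g.global...) {
		fmt.Printf("%s closed: %v\n", p.name, p.closed)
	}
}
```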
However, Filebeat is somewhat special in the way it handles its pipeline/input creation, and it uses this function: `beats/filebeat/beater/filebeat.go` (lines 326 to 330 at cbd1a38)
which is basically a factory for inputs; I suppose it also exists to handle input restarts at runtime. If we go to this function's definition, we can see that it calls: `beats/filebeat/channel/runner.go` (lines 109 to 119 at a8ab071)
And here we can see the function that caused it all: `beats/filebeat/channel/runner.go` (lines 121 to 190 at a8ab071)
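As mentioned above, moving the processor initialisation inside the returned function is what fixes it. Here is a reduced sketch of the pattern and of the fix in #34761, using hypothetical names rather than the actual runner.go code: the buggy factory builds the processors once in the closure, so every client it later produces shares the same instances; the fixed factory builds a fresh set per client.

```go
package main

import (
	"errors"
	"fmt"
)

type processor struct{ closed bool }

func (p *processor) Run(event string) error {
	if p.closed {
		return errors.New("attempt to use a closed processor")
	}
	return nil
}

func (p *processor) Close() { p.closed = true }

// newProcessors stands in for building the user-space processors from the config.
func newProcessors() []*processor { return []*processor{{}} }

// client sketches a pipeline client whose Close also closes its processors.
type client struct{ procs []*processor }

func (c *client) Close() {
	for _, p := range c.procs {
		p.Close()
	}
}

// buggyFactory mirrors the problem: the processors are created once in the
// closure, so every client produced by the factory shares the same instances.
func buggyFactory() func() *client {
	procs := newProcessors() // created once, captured by the closure
	return func() *client { return &client{procs: procs} }
}

// fixedFactory mirrors the fix: every call builds its own set of processors.
func fixedFactory() func() *client {
	return func() *client { return &client{procs: newProcessors()} }
}

func main() {
	cases := []struct {
		name    string
		connect func() *client
	}{
		{"buggy", buggyFactory()},
		{"fixed", fixedFactory()},
	}
	for _, c := range cases {
		first, second := c.connect(), c.connect()
		first.Close() // e.g. one harvester stops and closes its client
		fmt.Printf("%s: %v\n", c.name, second.procs[0].Run("event"))
		// buggy: attempt to use a closed processor
		// fixed: <nil>
	}
}
```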
Issue description:
Intermittently, the monitoring started by Agent enters a loop where it repeats this message ~10K times a second:
This error is inconsistent -- on some runs it begins soon after startup, on many runs it never happens at all. When it does happen, it severely degrades or blocks other ingestion. Subsequent runs using identical configurations with identical binaries are still inconsistent as to whether this bug occurs.
All the times I've encountered this so far have been while testing the shipper, but since most of my agent runs are testing the shipper I'm uncertain if that's a reliable correlation: the error is happening before events reach the shipper output, which is the first place event handling diverges significantly when the shipper is in use. (There could be some shipper-specific agent-injected configuration that would affect the processor list, though I'm not aware of anything like that.)
The most recent configuration I used to exhibit the error is: