-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[libbeat/management]: support filebeat inputs to report their status to elastic-agent #39209
[libbeat/management]: support filebeat inputs to report their status to elastic-agent #39209
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
💔 Build Failed
Expand to view the summary
Build stats
Pipeline error
❕ Flaky test reportNo test was executed to be analysed. 🤖 GitHub commentsExpand to view the GitHub comments
To re-run your PR in the CI, just comment with:
|
149a0c5
to
c3ef49b
Compare
This pull request is now in conflicts. Could you fix it? 🙏
|
…ing multiple streams
c3ef49b
to
c6294b7
Compare
Pinging @elastic/sec-deployment-and-devices (Team:Security-Deployment and Devices) |
@andrewkroh @cmacknz @belimawr there are some rough edges (checking if the issue I spotted is actually an issue with elastic-agent-client, making the integration test use the integration stack and not the one I spawned from elastic-package) but the logic is pretty much there |
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a filebeat or libbeat expert so I may have missed something, but I left a couple of comments about sections that can be simplified/refactored a bit a couple of questions
Looking at the video, instead of "All streams are degraded" you would be much better off picking the first error and setting it to the overall input state. This will immediately bring it to the user's attention and then they can solve errors one by one. You don't want to require forcing users to look directly at the contents of the .fleet-agent's document via the View agent JSON button. To allow displaying the state of the streams, you can create an issue in Kibana for Fleet to start supporting this. Alternatively you can change your integrations to stop using |
Having potentially more than one streams with either Degraded or Failed statuses but showing only one of them without saying to the user how many in total are having an issue seems weird to me. As a User flow, I would expect the following; An integration is reported as not Healthy, open up the logs and check for errors initially. When something similar to my answer below is implemented this will get finer granularity. Thus, to have a more wider consensus on the matter, I am gonna request the opinion of the reviewers that have already approved this PR; @andrewkroh, @efd6 , @belimawr what do you think?
I would go for supporting this in Kibana through the payload (with |
With the status of the streams being routed to Fleet now, I think it makes sense for the UI to be updated to take advantage of this information directly. It can make sure that the most important information is presented to user for them to investigate, and aid them in that investigation. I think that even the message that the Beat is creating about "Out of N streams, M are failed, X are degraded" should be the responsibility of the presentation layer to construct. If the Beat puts some input status message in there then the UI is basically obligated to use it even though it is something it could compute on its own (perhaps in a better form that is localized) because it won't be able to distinguish it from status data sourced from the input. |
I agree we should update Fleet to take advantage of the extra streams payload if it is present, but since the change in Fleet currently is not coordinated or aligned with this change I want the information this PRs adds to be as useful as possible without the UI change. I would actually rather us forbid using
I agree but there is currently a non-zero chance this lands in 8.15.0 and the UI change would land in 8.16.0+. We can mitigate this by ensuring the additional error context this change introduces is useful without the UI change. Making the overall input message something like "N of M streams failed, first error X" would accomplish this easily with little work and avoids temporarily requiring users to at the Fleet system indices to debug anything. |
UPDATE: @cmacknz this is now in 2e1e21f I don't see any strong concerns raised on having such a behaviour from the reviewers that have already approved the previous behaviour - still such a behaviour feels weird to me - but I am gonna change the code to do what you propose here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Proposed commit message
This PR introduces the following:
Healthy
this is immediately the status of the StatusUnit. This allows the existing runner allocation/deallocation logic to properly propagate any given status and highlight that something is changing, e.g. when a unit is modified the propagated status isconfiguring
which is what we want.Running
: everything is happy, no error or warning produced during the operation of an inputFailed
: when the input encountered an error that it can't continue fromDegraded
: when the input encountered something abnormal but, due to lack of a better expression, it hasn't given up yet 😄 CEL does that a lot, it denies to say bb.Configuring
,Stopping
,Stopped
,Starting
: These statuses are most suitable to be used by the input allocation/deallocation code and not directly from inside the input, as the former can and should override the status of the whole input.Noteworthy code changes:
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Author's Checklist
How to test this PR locally
Related issues
Use cases
N/A
Screenshots
output.mp4
Logs