[Security Solution][Detections] Adds Rule Execution Log table #124198
Conversation
Pinging @elastic/security-detections-response (Team:Detections and Resp)
Pinging @elastic/security-solution (Team: SecuritySolution)
"number_of_triggered_actions": {
  "type": "long"
},
Saw this was just added in #123567, shall we add it to the table as well, or is it getting too congested and better to wait till we allow expanding rows with additional data?
@@ -8,10 +8,13 @@
import * as t from 'io-ts';
import { ruleExecutionEvent } from '../common';

export const GetRuleExecutionEventsResponse = t.exact(
export const GetRuleExecutionEventsResponse = t.intersection([
If not deprecating `getLastFailures`, this should move to its own dedicated type, since we're now including additional information about total underlying executions.
Types diverged and ended up making a dedicated `GetAggregateRuleExecutionEventsResponse` type. Will still need to clean up this one if deprecating `getLastFailures`.
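For illustration, the split into a dedicated aggregate response type could look roughly like this (a sketch only: the field names beyond `events`/`maxEvents` and the plain type guard standing in for the io-ts codec are assumptions, not the actual Kibana types):

```typescript
// Hypothetical shape of the dedicated aggregate response type, which also
// reports the total number of underlying executions for the query window.
interface RuleExecutionEvent {
  executionId: string;
  status: string;
}

interface GetAggregateRuleExecutionEventsResponse {
  events: RuleExecutionEvent[];
  maxEvents: number; // total executions matching the query, may exceed events.length
}

// Minimal runtime guard standing in for the io-ts codec's decode step.
function isAggregateResponse(x: unknown): x is GetAggregateRuleExecutionEventsResponse {
  if (typeof x !== 'object' || x === null) return false;
  const r = x as Record<string, unknown>;
  return Array.isArray(r.events) && typeof r.maxEvents === 'number';
}
```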
...s/security_solution/public/detections/containers/detection_engine/rules/rules_table/utils.ts
field: 'kibana.alert.rule.execution.metrics.total_indexing_duration_ms',
name: i18n.COLUMN_INDEX_DURATION,
render: (value: number) => getOrEmptyTagFromValue(value),
sortable: true,
truncateText: false,
width: '7%',
},
{
field: 'kibana.alert.rule.execution.metrics.total_search_duration_ms',
name: i18n.COLUMN_SEARCH_DURATION,
render: (value: number) => getOrEmptyTagFromValue(value),
sortable: true,
truncateText: false,
width: '8%',
},
Do we want to normalize these duration fields to `seconds` to match `execution duration`, `gap`, and `scheduling delay`? I had normalized them, but remembered that when searching you must use `ms`, so there's a slight disconnect if we normalize. Do we optimize for readability or for consistency with search?
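One way to split the difference would be to keep the stored and searchable values in `ms` and normalize only at render time. A minimal sketch (the helper name is hypothetical, not part of this PR):

```typescript
// Hypothetical render helper: the underlying field stays in milliseconds
// (matching search), only the displayed string is normalized for readability.
function renderDurationMs(valueMs: number): string {
  if (!Number.isFinite(valueMs)) return '-'; // mirror the empty-tag fallback
  if (valueMs < 1000) return `${valueMs}ms`;
  return `${(valueMs / 1000).toFixed(2)}s`;
}
```

A column's `render` callback could call this instead of `getOrEmptyTagFromValue`, leaving the `_ms` field name (and search semantics) untouched.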
},
];

export const ExecutionLogSearchBar = React.memo<ExecutionLogTableSearchProps>(({ onSearch }) => {
Really wanted to expose suggestions for the above schema, but in checking with the EUI folks it seems you must use `EuiSuggest` for auto-complete, and that component does not support `FilterGroup`s out of the box, so we would need to wire that up separately (along with a query text override when applying filters).

In testing I found an issue with the validation feature of `EuiSearchBar` where it reports `>` and `<` as invalid operators. Even with specifying the type as `number` in the schema I'm still seeing this, so I will need to debug further, or just abandon it and use `EuiSuggest` with custom filters; that way we get auto-complete (or expose column-header-style filters like what was brought up previously).
},
options: {
  tags: ['access:securitySolution'],
},
},
async (context, request, response) => {
  const { ruleId } = request.params;
  const { start, end, filters = '' } = request.query;
`filters` can be defaulted as part of the validation schema, no?
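A sketch of the suggestion: apply the default once at the validation boundary so handlers never need `filters = ''`. This is illustrative only; the route's actual schema is io-ts, and the names here are hypothetical.

```typescript
// Hypothetical query types: the raw request may omit `filters`, but the
// validated query always carries it, so the handler can drop its local default.
interface RawExecutionEventsQuery {
  start: string;
  end: string;
  filters?: string;
}

interface ExecutionEventsQuery extends RawExecutionEventsQuery {
  filters: string; // guaranteed present after validation
}

function validateQuery(raw: RawExecutionEventsQuery): ExecutionEventsQuery {
  return { ...raw, filters: raw.filters ?? '' }; // default applied centrally
}
```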
"total_alerts": {
  "type": "long"
},
"total_hits": {
"hits" isn't a general alerting thing, so I think we'll need a comment somewhere describing what it is - perhaps in the `mappings.js` file?
If "hits" isn't a general alerting thing, I don't think it should be part of the event-log. How do the "hits" differ from the "alerts"?
Please see these two issues for details: #120678, #120668. The gist is that `total_hits` is the total number of candidate alerts found during a specific execution, and `total_alerts` is the total number of those candidate alerts that were actually created during that execution. The totals between candidates and created can differ as a result of reaching the configured `max_signals` threshold, or if duplicate alerts (alerts previously detected) were identified but never written (and I believe a case with exceptions as well, but would need to verify).

I totally agree about naming (perhaps `total_candidate_alerts` fits better?), but that aside, is there no representation of these concepts in general alerting? Is there no circuit breaker for the number of alerts that can be written per execution, or management of duplicates, where discerning between these two values is of use to the general alerting user (like for the security solution user outlined in the issues above)?
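The relationship described above can be sketched as follows. This is illustrative only: the function name and the exact dedup/`max_signals` ordering are assumptions, not the Detection Engine's implementation.

```typescript
// total_hits   = candidate alerts detected during an execution
// total_alerts = candidates actually written; can be lower due to the
//                max_signals circuit breaker and previously-written duplicates.
function countExecutionAlerts(
  candidateIds: string[],
  previouslyWritten: Set<string>,
  maxSignals: number
): { totalHits: number; totalAlerts: number } {
  const newAlerts = candidateIds.filter((id) => !previouslyWritten.has(id));
  return {
    totalHits: candidateIds.length,
    totalAlerts: Math.min(newAlerts.length, maxSignals),
  };
}
```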
Ah, interesting. We'll have an issue open soon-ish (from some previous research) about having a maximum number of actions and alerts (separate values) per rule execution. And we would want to track the total number as being different if we cap the number via the maximums.
So it isn't a thing now, but probably will be. So ... maybe just some word-smithing on this. Candidate doesn't seem quite right. Something like "total" as the number the rule produced, and "triggered" instead of "candidate", so "totalAlerts" and "triggeredAlerts"?
> If "hits" isn't a general alerting thing, I don't think it should be part of the event-log

We already have a few metrics stored in the event log which are not a general alerting thing (strictly speaking) but which make sense from the Security Solution and AAD perspective. These are `total_indexing_duration_ms`, `total_search_duration_ms` and `execution_gap_duration_s`. Imho the new metrics have the same relation to the overall logic of Detection Engine and AAD.

I agree that additional documentation and/or better naming would help, though.
Thanks for the explanation, @spong, that makes a lot of sense to me.

What about `total_alerts_created` and `total_alerts_detected`?
> What about `total_alerts_created` and `total_alerts_detected`?

Those two sound good to me @kobelb! 🙂 I'll wait a moment to see if anyone else has additional input and will then make the change.

In addition to adding documentation within `event-log` as @pmuellr recommended, are there any other supporting docs we should be updating with these fields as well? I know there's the AAD schema spreadsheet, but I don't believe the `event-log` schema is represented in there, no?
> In addition to adding documentation within event-log as @pmuellr recommended, are there any other supporting docs we should be updating with these fields as well? I know there's the AAD schema spreadsheet, but I don't believe the event-log is schema represented in there, no?
I'm not aware of any docs about the fields in the event-log or any docs that recommend that users search these indices. We've treated them as a traditional hidden index where users might find it, but they can't rely on the names of fields.
Pulled this feature out to a dedicated PR (#126210), will ping reviewers when ready for review.
…w-up (#124194)

**Related to:** #121644
**Addresses:** #86202

## Summary

Done in this PR:

- Removed the deprecated `warning` rule execution status ([comment](#121644 (comment))).
- Added a new `running` status ([ticket](#86202)).
- Simplified the internal implementation of the `rule_execution_log` folder. Hopefully naming of folders, files and interfaces is clearer now as well. ([comment](#121644 (comment)), [comment](#121644 (comment)))
- Added APM measurements with `withSecuritySpan`.
- Added rule id to the react-query key used for loading last rule failures ([comment](#124198 (comment)))
- Addressed most of the `// TODO: https://github.com/elastic/kibana/pull/121644` comments

In the next PR, which could be merged after the FF, I'd address the rest of the stuff:

- Add comments to all the interfaces and methods in the `rule_execution_log` folder. Write a readme for it.
- Address the remaining `// TODO: https://github.com/elastic/kibana/pull/121644` comments. All of them are related to tests.
- Fix for the gap column ([comment](#121644 (comment)))
I believe this is the status of "pending" from the alerting data side. A rule will be in this state after it's created, but before it's run. Which means if it's created in a disabled state, it would have that status until someone enabled it. AFAIK, SIEM was the only alerting client that was creating rules disabled by default (don't remember why), so if that is still a thing, I think there may be some value in leaving it. I certainly would expect that it shows a lot of empty rows, in the usual case :-) !!!
The screenshots of this pr are very different from the original figmas @yiyangliu9286 designed. We created a list of the items that need to be addressed before merging. Please sync with @yiyangliu9286 on the following:
- Duration column: there's a different format for displaying duration time in the design vs. the implementation (the design is aligned with the Stack Rules table, showing the 00:00:00.000 format).
- Duration column is missing tooltip
- In the design, we do not have “Total Alerts”, “Total Hits”, “Gap Duration (s)“, “Index Duration (ms)“, “Search Duration (ms)” “Scheduling Delay (s)” columns.
- Missing status filter number badge: in the design, there is a number badge showing how many statuses are available for filtering.
- Auto refresh vs. Manual refresh: in the design, we show for example, “Updated 3 minutes ago” on top right of this table, and in implementation, we have a manual refresh button, do we not support auto refresh for this table?
- Showing x number of rule executions by default: in the design, we show 10 rule executions by default with a footer pagination, but in implementation, we show “385 rule executions” by default.
- Suggested UI enhancements:
- Expand the width for EuiDatePicker component to its 400px, right now it looks narrow to be able to display details.
- For columns that show numbers (“Duration (s)“, “Total Alerts”, “Total Hits”, “Gap Duration (s)“, “Index Duration (ms)“, “Search Duration (ms)“, “Scheduling Delay (s)“), right-align the numbers.
Question:
- What’s the difference between showing “0” and a “-” in this table? What does this imply for users?
Thanks for the feedback @adamwdraper! 🙂 It should be noted that this feature is going to need a full design review as we had never finalized the designs after our initial review with the team on 9-NOV-2021, and as mentioned in the description there were limitations with regards to implementing the initial design as proposed. We're also now supporting the new Response-Ops team as stakeholders too, so they should participate as well. That said, you can look at this PR as a first-pass / pathfinding effort to determine design feasibility and any additional engineering efforts required to meet design's needs. I'll touch base with @yiyangliu9286 and review the feedback and get it integrated 👍
Like in the Alerts and other Security Solution tables
This is really cool @spong ! I was wondering about the in-memory aggregation and what impact this might have on the event loop. In addition, regarding support for aggregations in the Event Log - this is something we do want to do, but we lack concrete requirements here. Would you mind filing an issue detailing your exact use case and requirements? That way we can explore adding that support in 8.2 so that you can avoid an in-memory approach going forward. cc @XavierM as @elastic/response-ops-ram are looking into building an activity/execution log in Stack Management, and this will likely affect their efforts
Thanks for the feedback @gmmorris! 🙂
Yeah, that was the biggest concern with this impl/workaround. I touch on this a little in the description above, but out of concern for memory, results here are bound to 500 execution events (roughly 2500 rule execution logs, though more analysis is necessary to identify the max event-log docs per execution). At the moment we're returning the max results the query matches so the UI can inform the user that they should narrow their search further to better isolate and find possible issues. This seemed to work well in an effort to allow users to see a fair number of executions, and still make sorting on columns functional for identifying problem executions.
Awesome! Since this is pushed to
Brilliant, thanks!
Yeah, incremental progress here sounds absolutely reasonable to me 👍
💚 Build Succeeded
To update your PR or re-run it, just comment with: cc @spong
@banderror has created this issue for the Detection Rules Area (#125645) outlining the requirements for adding aggregations to the event-log client. As for this PR, I'm going to close this larger one and break it into two: one for adding Total Alerts Created/Detected to event-log, and one for the Rule Execution Log Table itself.
Notes
This PR will be broken out into two: one for adding Total Alerts Created/Detected to event-log (#126210), and one for the Rule Execution Log Table itself. (Will add links here...)
Summary
Resolves #119598, #119599, #101014, #120678, #120668
Adds a `Rule Execution Log` table to the Rule Details page, and introduces `Total Alerts` & `Total Hits` fields to the `siem-detection-engine-rule-execution-info` SO and adds them as columns on the `Rule Monitoring` and `Rule Execution Log` tables.

Implementation notes
The useful metrics within `event-log` for a given rule execution are spread between a few different platform (`execute-start`, `execute`) and security (`execution-metrics`, `status-change`) events. In an effort to provide consolidated metrics per rule execution (and avoid a lot of empty cells and mismatched statuses like in the image below), these rule execution events are aggregated by their `executionId`, and then fields are merged from each different event. Since the `event-log` client doesn't currently support aggregations, this aggregation happens in-memory on the Kibana server. See the specific merge logic in getAggregateExecutionEvents (there is still an optimization or two to make when filtering on certain fields, as data from other execution events can be filtered out and needs to be re-fetched).

Since there are also limitations on what fields the `event-log` client can sort, it's beneficial to enable the user to narrow their rule execution window with `daterange`, `status` and `field` filters, and then enable in-memory sorting of all fields via an `EuiInMemoryTable`. In discussions with the team, this implementation proved to be most beneficial to the user in terms of identifying long-running executions, scheduling backups, max_signals thresholds, and so forth.

Out of concern for memory when performing aggregations server-side, the execution window size is limited to 500 execution events (roughly 2500 rule execution logs, though more analysis of always-failing rules is necessary). This is implemented by first performing a query with the user's `daterange` filter and the field filter `event.action:execute-start` to find the total number of unique executions for that window. If it's greater than 500, we limit the number of composite execution events we return to the client to 500, and also return the total for the window as `maxEvents`. This allows the UI to inform the user that they should narrow their search further to better isolate and find possible issues.

This should be a reasonable constraint for almost all rules: for a rule executing every 5 minutes, 500 executions would cover roughly 41hrs of execution time (with 5 events per execution).
Todo:

- `Total Alerts` & `Total Hits` for each rule executor (event-log + logger)
- `Rule Execution Log` tab for deleted rules (RBAC on `event-log` can't fetch for deleted rules)
- `EuiSearchBar` & out of the box filter for `EuiSuggest` and custom `FilterGroup`
Checklist
Delete any items that are not applicable to this PR.