[Security Solution][Detections] Adds Rule Execution Log table #124198

spong · 2022-02-01T02:40:15Z

Notes

This PR will be broken out into two: one for adding Total Alerts Created/Detected to event-log (#126210), and one for the Rule Execution Log Table itself. (Will add links here...)

Summary

Resolves #119598, #119599, #101014, #120678, #120668

Adds Rule Execution Log table to Rule Details page, and introduces Total Alerts & Total Hits fields to the siem-detection-engine-rule-execution-info SO and adds them as columns on the Rule Monitoring and Rule Execution Log tables.

Implementation notes

The useful metrics within event-log for a given rule execution are spread between a few different platform (execute-start, execute) and security (execution-metrics, status-change) events. In effort to provide consolidated metrics per rule execution (and avoiding a lot of empty cells and mis-matched statuses like in the image below)

these rule execution events are aggregated by their executionId, and then fields are merged from each different event. Since the event-log client doesn't currently support aggregations, this aggregation happens in-memory on the kibana server. See the specific merge logic in getAggregateExecutionEvents (still an optimization or two when filtering on certain fields, as data from other execution events can be filtered out and needs re-fetched).

Since there are also limitations on what fields the event-log client can sort, it's beneficial to enable the user to narrow their rule execution window with daterange, status and field filters, and then enable in-memory sorting of all fields via an EuiInMemoryTable. In discussions with the team, this implementation proved to be most beneficial to the user in terms of identifying long-running executions, scheduling backups, max_signals thresholds, and so forth.

In concern of memory when performing aggregations server-side, the execution window size is limited to 500 execution events (roughly 2500 rule execution logs, though more analysis of always failing rules necessary). This is implemented by first performing a query with the user's daterange filter and the field filter event.action:execute-start to find the total number of unique executions for that window. If it's greater than 500, we limit the number of composite execution events we return to the client to 500, and also return the total for the window as maxEvents. This allows the UI to inform the user that they should narrow their search further to better isolate and find possible issues.

This should be a be a reasonable constraint for most all rules as a rule executing every 5 minutes, 500 executions would cover 41hrs of execution time (with 5 events per execution).

Todo:

FTR tests exercising Total Alerts & Total Hits for each rule executor (event-log + logger)
Disable Rule Execution Log tab for deleted rules (RBAC on event-log can't fetch for deleted rules)
Normalize duration columns?
Swap out EuiSearchBar & out of the box filter for EuiSuggest and custom FilterGroup
API Integration tests for execution event route
...
...
...

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
- TODO: Collaborate with @elastic/security-docs
Unit or functional tests were updated or added to match the most common scenarios
- Existing updated, additional tests required, including FTR suite for exercising total_alerts/total_hits changes
Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)
Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)

elasticmachine · 2022-02-01T02:40:17Z

Pinging @elastic/security-detections-response (Team:Detections and Resp)

elasticmachine · 2022-02-01T02:40:17Z

Pinging @elastic/security-solution (Team: SecuritySolution)

…log-table

spong · 2022-02-01T03:03:42Z

x-pack/plugins/event_log/generated/mappings.json

                                                "number_of_triggered_actions": {
                                                    "type": "long"
                                                },


Saw this was just added in #123567, shall we add it to the table as well, or is it getting too congested and better to wait till we allow expanding rows with additional data?

spong · 2022-02-01T03:13:21Z

...rity_solution/common/detection_engine/schemas/response/get_rule_execution_events_response.ts

@@ -8,10 +8,13 @@
 import * as t from 'io-ts';
 import { ruleExecutionEvent } from '../common';

-export const GetRuleExecutionEventsResponse = t.exact(
+export const GetRuleExecutionEventsResponse = t.intersection([


If not deprecating getLastFailures this should move to its own dedicated type since we're now including additional information about total underlying executions.

Types diverged and ended up making a dedicated GetAggregateRuleExecutionEventsResponse type. Will still need to clean up this one if deprecating getLastFailures.

...s/security_solution/public/detections/containers/detection_engine/rules/rules_table/utils.ts

spong · 2022-02-01T03:30:15Z

...etections/pages/detection_engine/rules/details/execution_log_table/execution_log_columns.tsx

+    field: 'kibana.alert.rule.execution.metrics.total_indexing_duration_ms',
+    name: i18n.COLUMN_INDEX_DURATION,
+    render: (value: number) => getOrEmptyTagFromValue(value),
+    sortable: true,
+    truncateText: false,
+    width: '7%',
+  },
+  {
+    field: 'kibana.alert.rule.execution.metrics.total_search_duration_ms',
+    name: i18n.COLUMN_SEARCH_DURATION,
+    render: (value: number) => getOrEmptyTagFromValue(value),
+    sortable: true,
+    truncateText: false,
+    width: '8%',
+  },


Do we want to normalize these duration fields to seconds to match execution duration, gap, & scheduling delay? I had normalized them but remembered when searching you must use ms, so a slight disconnect here if we normalize. Optimize for readability or consistency with search?

spong · 2022-02-01T04:01:21Z

...ctions/pages/detection_engine/rules/details/execution_log_table/execution_log_search_bar.tsx

+  },
+];
+
+export const ExecutionLogSearchBar = React.memo<ExecutionLogTableSearchProps>(({ onSearch }) => {


Really wanted to expose suggestions for the above schema but in checking with the EUI folks it seems you must use EuiSuggest for auto-complete, but that component does not support FilterGroups out of the box, so would need to wire that up separately (along with query text override when applying filters).

In testing I found an issue with the validation feature of EuiSearchBar where it reports > < as invalid operators. Even with specifying type as number in the schema I'm still seeing this, so will need to debug further, or just abandon to use EuiSuggest with custom filters that way we get auto-complete (or expose column header style filters like what was brought up previously).

spong · 2022-02-01T04:08:17Z

...ecurity_solution/server/lib/detection_engine/routes/rules/get_rule_execution_events_route.ts

      },
      options: {
        tags: ['access:securitySolution'],
      },
    },
    async (context, request, response) => {
      const { ruleId } = request.params;
+      const { start, end, filters = '' } = request.query;


Filters can be defaulted as part of the validation schema, no?

…log-table

pmuellr · 2022-02-01T14:10:26Z

x-pack/plugins/event_log/generated/mappings.json

+                                                "total_alerts": {
+                                                  "type": "long"
+                                                },
+                                                "total_hits": {


"hits" isn't a general alerting thing, so I think we'll need a comment somewhere describing what it is - perhaps in the mappings.js file?

If "hits" isn't a general alerting thing, I don't think it should be part of the event-log. How do the "hits" differ from the "alerts"?

Please see these two issues for details #120678 #120668, but gist here is total_hits is the total number of candidate alerts found during a specific execution, and total_alerts is the total number of those candidate alerts that were actually created during that execution. The totals between candidates and created can differ as a result of reaching the configured max_signals threshold, or if duplicate alerts (alerts previously detected) were identified, but never written (and I believe a case with exceptions as well, but would need to verify).

I totally agree about naming (perhaps total_candidate_alerts better fits?), but that aside, is there no representation of these concepts in general alerting? Is there no circuit breaker for the number alerts that can be written per execution, or management of duplicates where discerning between these two values is of use to the general alerting user (like for the security solution user outlined in the issues above)?

Ah, interesting. We'll have an issue open soon-ish (from some previous research) about having a maximum number of actions and alerts (separate values) per rule execution. And we would want to track the total number as being different if we cap the number via the maximums.

So it isn't a thing now, but probably will be. So ... maybe just some word-smithing on this. Candidate doesn't seem quite right. Something like "total" as the number the rule produced, and "triggered" instead of "candidate", so "totalAlerts" and "triggeredAlerts"?

If "hits" isn't a general alerting thing, I don't think it should be part of the event-log

We already have a few metrics stored in the event log which are not a general alerting thing (strictly speaking) but which make sense from the Security Solution and AAD perspective. These are total_indexing_duration_ms, total_search_duration_ms and execution_gap_duration_s. Imho the new metrics have the same relation to the overall logic of Detection Engine and AAD.

I agree that additional documentation and/or better naming would help though.

Thanks for the explanation, @spong, that makes a lot of sense to me.

What about total_alerts_created and total_alerts_detected?

What about total_alerts_created and total_alerts_detected?

Those two sound good to me @kobelb! 🙂 I'll wait a moment to see if anyone else has additional input and will then make the change.

In addition to adding documentation within event-log as @pmuellr recommended, are there any other supporting docs we should be updating with these fields as well? I know there's the AAD schema spreadsheet, but I don't believe the event-log is schema represented in there, no?

In addition to adding documentation within event-log as @pmuellr recommended, are there any other supporting docs we should be updating with these fields as well? I know there's the AAD schema spreadsheet, but I don't believe the event-log is schema represented in there, no?

I'm not aware of any docs about the fields in the event-log or any docs that recommend that users search these indices. We've treated them as a traditional hidden index where users might find it, but they can't rely on the names of fields.

Pulled this feature out to a dedicated PR (#126210), will ping reviewers when ready for review.

xcrzx · 2022-02-01T15:35:38Z

We don't write warning statuses anymore, I think it could be safely removed from the filter options:

…w-up (#124194) **Related to:** #121644 **Addresses:** #86202 ## Summary Done in this PR: - Removed the deprecated `warning` rule execution status ([comment](#121644 (comment))). - Added a new `running` status ([ticket](#86202)). - Simplified the internal implementation of the `rule_execution_log` folder. Hopefully naming of folders, files and interfaces is clearer now as well. ([comment](#121644 (comment)), [comment](#121644 (comment))) - Added APM measurements with `withSecuritySpan`. - Added rule id to the react-query key used for loading last rule failures ([comment](#124198 (comment))) - Addressed most of the `// TODO: https://github.com/elastic/kibana/pull/121644` comments In the next PR that could be merged after the FF I'd address the rest of the stuff: - Add comments to all the interfaces and methods in the `rule_execution_log` folder. Write a readme for it. - Address the remaining of the `// TODO: https://github.com/elastic/kibana/pull/121644` comments. All of them are related to tests. - Fix for the gap column ([comment](#121644 (comment)))

pmuellr · 2022-02-02T21:29:38Z

@banderror: We allow the user to select the going to run status in the dropdown. I don't think we should do this because this status is not final and so doesn't make much sense in this table where we show, in fact, rule execution results (or as we call it in the UI "responses"). When the user selects it, we show a lot of empty rows in the table.

I believe this is the status of "pending" from the alerting data side. A rule will be in this state after it's created, but before it's run. Which means if it's created in a disabled state, it would have that status until someone enabled it. AFAIK, SIEM was the only alerting client that was creating rules disabled by default (don't remember why), so if that is still a thing, I think there may be some value in leaving it.

I certainly would expect that it shows a lot of empty rows, in the usual case :-) !!!

…log-table

adamwdraper

The screenshots of this pr are very different from the original figmas @yiyangliu9286 designed. We created a list of the items that need to be addressed before merging. Please sync with @yiyangliu9286 on the following:

Duration column: There’s different format for displaying duration time in the design vs. implementation (the way we designed is aligned with Stack Rule’s table for showing 00:00:00.000 format).
Duration column is missing tooltip
In the design, we do not have “Total Alerts”, “Total Hits”, “Gap Duration (s)“, “Index Duration (ms)“, “Search Duration (ms)” “Scheduling Delay (s)” columns.
Missing status filter number badge: in the design, there is a number badge for showing how many statuses there’re available for filtering.
Auto refresh vs. Manual refresh: in the design, we show for example, “Updated 3 minutes ago” on top right of this table, and in implementation, we have a manual refresh button, do we not support auto refresh for this table?
Showing x number of rule executions by default: in the design, we show 10 rule executions by default with a footer pagination, but in implementation, we show “385 rule executions” by default.
Suggested UI enhancements:
Expand the width for EuiDatePicker component to its 400px, right now it looks narrow to be able to display details.
For columns that show numbers (“Duration (s)“, “Total Alerts”, “Total Hits”, “Gap Duration (s)“, “Index Duration (ms)“, “Search Duration (ms)“, “Scheduling Delar (s)“) right aligned the number.

Question:

What’s the difference between showing “0” and a “-” in this table? What does this imply for users?

spong · 2022-02-05T01:18:51Z

Thanks for the feedback @adamwdraper! 🙂 It should be noted that this feature is going to need a full design review as we had never finalized the designs after our initial review with the team on 9-NOV-2021, and as mentioned in the description there were limitations with regards to implementing the initial design as proposed. We're also now supporting the new Response-Ops team as stakeholders too, so they should participate as well. That said, you can look at this PR as a first-pass / pathfinding effort to determine design feasibility and any additional engineering efforts required to meet design's needs.

I'll touch base with @yiyangliu9286 and review the feedback and get it integrated 👍

Question:

What’s the difference between showing “0” and a “-” in this table? What does this imply for users?

Like in the Alerts and other Security Solution tables — is going to indicate the absence of data, whereas 0 is going to be an actual concrete value for the field. For instance, an Indexing Duration of 0 means the indexing operation occurred (just didn't take any time since there were no documents, or the duration was less than the precision being captured in the field type), whereas a — will indicate that the indexing operation did not occur (or potentially completed unsuccessfully based on the message field).

…log-table

gmmorris · 2022-02-07T15:57:47Z

these rule execution events are aggregated by their executionId, and then fields are merged from each different event. Since the event-log client doesn't currently support aggregations, this aggregation happens in-memory on the kibana server.

This is really cool @spong !

I was wondering about the in-memory aggregation and what impact this might have on the event loop.
Are we concerned about the potential scale limitations here? Is this unbounded or do we keep it limited in such a way that it isn't a big deal?

In addition, regarding support for aggregations in the Event Log - this is something we do want to do, but we lack concrete requirements here. Would you mind filing an issue detailing your exact use case and requirements? That way we can explore adding that support in 8.2 so that you can avoid an in-memory approach going forward.

cc @XavierM as @elastic/response-ops-ram are looking into building an activity/execution log in Stack Management, and this will likely affect their efforts

spong · 2022-02-07T22:02:23Z

Thanks for the feedback @gmmorris! 🙂

I was wondering about the in-memory aggregation and what impact this might have on the event loop.
Are we concerned about the potential scale limitations here? Is this unbounded or do we keep it limited in such a way that it isn't a big deal?

Yeah, that was the biggest concern with this impl/workaround. I touch on this a little in the description above, but in concern of memory here results are bound to 500 execution events (roughly 2500 rule execution logs, though more analysis is necessary to identify max event-log docs per execution). At the moment we're returning the max results the query matches so the can UI inform the user that they should narrow their search further to better isolate and find possible issues. This seemed to work well in effort to allows users to see a fair number of executions, and still make sorting on columns functional for identifying problem executions.

In addition, regarding #91731 - this is something we do want to do, but we lack concrete requirements here. Would you mind filing an issue detailing your exact use case and requirements? That way we can explore adding that support in 8.2 so that you can avoid an in-memory approach going forward.

Awesome! Since this is pushed to 8.2, I discussed next steps with the team and I'm going to determine what agg queries we'd need to support this table view (and a nice time series visualization for visualizing executions over time/coverage/gaps/etc) and then will distill those requirements into a dedicated issue for adding agg support to the event-log client. Was discussing with @peluja1012 today, and depending on scope this may be something I can assist with. We're flexible too, so if we need to ship the first version with limited in-memory-aggs and follow-up with true agg support that should be fine too.

…log-table

gmmorris · 2022-02-08T10:52:55Z

then will distill those requirements into a dedicated issue for adding agg support to the event-log client.

Brilliant, thanks!
@pmuellr can help out once that issue is filed 😄 🙏

Was discussing with @peluja1012 today, and depending on scope this may be something I can assist with. We're flexible too, so if we need to ship the first version with limited in-memory-aggs and follow-up with true agg support that should be fine too.

Yeah, incremental progress here sounds absolutely reasonable to me 👍

…log-table

kibana-ci · 2022-02-14T19:29:27Z

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`securitySolution`	2847	2849	+2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`securitySolution`	4.7MB	4.7MB	+7.1KB

Saved Objects .kibana field count

Every field in each saved object type adds overhead to Elasticsearch. Kibana needs to keep the total field count below Elasticsearch's default limit of 1000 fields. Only specify field mappings for the fields you wish to search on or query. See https://www.elastic.co/guide/en/kibana/master/development-plugin-saved-objects.html#_mappings

id	before	after	diff
`siem-detection-engine-rule-execution-info`	10	12	+2

Unknown metric groups

ESLint disabled line counts

id	before	after	diff
`securitySolution`	442	443	+1

Total ESLint disabled count

id	before	after	diff
`securitySolution`	509	510	+1

History

💚 Build #22309 succeeded 825b103
💚 Build #22000 succeeded 90c1ef6
💛 Build #21496 was flaky 5a6d237
💚 Build #20975 succeeded de5ec8d
💔 Build #20966 failed bd96ce4
💔 Build #20957 failed 1fec617

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @spong

spong · 2022-02-15T14:23:52Z

@gmmorris @pmuellr

@banderror has created this issue for the Detection Rules Area (#125645) outlining the requirements for adding aggregations to the event-log client.

As for this PR, I'm going to close this larger one and break it into two: one for adding Total Alerts Created/Detected to event-log, and one for the Rule Execution Log Table itself.

Adds Rule Execution Log table

d8ddc0b

spong requested a review from a team February 1, 2022 02:40

spong self-assigned this Feb 1, 2022

spong requested review from a team as code owners February 1, 2022 02:40

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

c336868

…log-table

spong commented Feb 1, 2022

View reviewed changes

...s/security_solution/public/detections/containers/detection_engine/rules/rules_table/utils.ts Outdated Show resolved Hide resolved

spong commented Feb 1, 2022

View reviewed changes

spong added 7 commits January 31, 2022 22:07

Fixes tests and cleanup

2187823

Adds support for maxEvents and cleans up types around execution event

9de216d

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

729b272

…log-table

Adds totalHits metric to remaining executors

defe453

Fixes test

1fec617

Unused import

bd96ce4

Fixed translations

de5ec8d

pmuellr reviewed Feb 1, 2022

View reviewed changes

banderror mentioned this pull request Feb 1, 2022

[Security Solution][Detections] Rule execution logging overhaul follow-up #124194

Merged

yiyangliu9286 self-requested a review February 2, 2022 21:44

spong added 2 commits February 2, 2022 18:29

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

5a6d237

…log-table

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

eb66e1a

…log-table

adamwdraper self-requested a review February 3, 2022 23:45

adamwdraper suggested changes Feb 4, 2022

View reviewed changes

spong added 2 commits February 4, 2022 18:45

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

90c1ef6

…log-table

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

5a736ab

…log-table

spong added 2 commits February 7, 2022 16:23

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

cdb317b

…log-table

Removes going-to-run event

825b103

gmmorris mentioned this pull request Feb 8, 2022

[Event Log] Added a new event log API, which allows to fetch events summary aggregations for saved objects by Ids. #91731

Closed

spong added 5 commits February 8, 2022 15:57

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

b5e50a5

…log-table

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

404b2bb

…log-table

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

0ae6f0d

…log-table

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

b492e77

…log-table

Merge branch 'main' of github.com:elastic/kibana into rule-execution-…

d703131

…log-table

spong closed this Feb 15, 2022

spong mentioned this pull request Feb 22, 2022

[Security Solution][Detections] Adds 'Alerts Detected' and 'Alerts Created' metrics to Rule Monitoring table #126210

Closed

4 tasks

spong deleted the rule-execution-log-table branch April 12, 2022 21:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Security Solution][Detections] Adds Rule Execution Log table #124198

[Security Solution][Detections] Adds Rule Execution Log table #124198

spong commented Feb 1, 2022 •

edited

Loading

elasticmachine commented Feb 1, 2022

elasticmachine commented Feb 1, 2022

spong Feb 1, 2022

spong Feb 1, 2022

spong Feb 1, 2022 •

edited

Loading

spong Feb 1, 2022

spong Feb 1, 2022

spong Feb 1, 2022

pmuellr Feb 1, 2022

kobelb Feb 1, 2022

spong Feb 1, 2022

pmuellr Feb 1, 2022

banderror Feb 1, 2022

kobelb Feb 1, 2022

spong Feb 1, 2022

kobelb Feb 1, 2022

spong Feb 22, 2022

xcrzx commented Feb 1, 2022

pmuellr commented Feb 2, 2022

adamwdraper left a comment

spong commented Feb 5, 2022

gmmorris commented Feb 7, 2022 •

edited

Loading

spong commented Feb 7, 2022

gmmorris commented Feb 8, 2022 •

edited

Loading

kibana-ci commented Feb 14, 2022

ESLint disabled line counts

Total ESLint disabled count

spong commented Feb 15, 2022

[Security Solution][Detections] Adds Rule Execution Log table #124198

[Security Solution][Detections] Adds Rule Execution Log table #124198

Conversation

spong commented Feb 1, 2022 • edited Loading

Notes

Summary

Implementation notes

Todo:

Checklist

elasticmachine commented Feb 1, 2022

elasticmachine commented Feb 1, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

spong Feb 1, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xcrzx commented Feb 1, 2022

pmuellr commented Feb 2, 2022

adamwdraper left a comment

Choose a reason for hiding this comment

spong commented Feb 5, 2022

gmmorris commented Feb 7, 2022 • edited Loading

spong commented Feb 7, 2022

gmmorris commented Feb 8, 2022 • edited Loading

kibana-ci commented Feb 14, 2022

💚 Build Succeeded

Metrics [docs]

Module Count

Async chunks

Saved Objects .kibana field count

ESLint disabled line counts

Total ESLint disabled count

History

spong commented Feb 15, 2022

spong commented Feb 1, 2022 •

edited

Loading

spong Feb 1, 2022 •

edited

Loading

gmmorris commented Feb 7, 2022 •

edited

Loading

gmmorris commented Feb 8, 2022 •

edited

Loading