Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security Solution][Detections] Adds Rule Execution Log table #124198

Closed
wants to merge 21 commits into from

Conversation

spong
Copy link
Member

@spong spong commented Feb 1, 2022

Notes

This PR will be broken out into two: one for adding Total Alerts Created/Detected to event-log (#126210), and one for the Rule Execution Log Table itself. (Will add links here...)

Summary

Resolves #119598, #119599, #101014, #120678, #120668

Adds Rule Execution Log table to Rule Details page, and introduces Total Alerts & Total Hits fields to the siem-detection-engine-rule-execution-info SO and adds them as columns on the Rule Monitoring and Rule Execution Log tables.

Implementation notes

The useful metrics within event-log for a given rule execution are spread between a few different platform (execute-start, execute) and security (execution-metrics, status-change) events. In effort to provide consolidated metrics per rule execution (and avoiding a lot of empty cells and mis-matched statuses like in the image below)

Discover_-_Elastic

these rule execution events are aggregated by their executionId, and then fields are merged from each different event. Since the event-log client doesn't currently support aggregations, this aggregation happens in-memory on the kibana server. See the specific merge logic in getAggregateExecutionEvents (still an optimization or two when filtering on certain fields, as data from other execution events can be filtered out and needs re-fetched).

Since there are also limitations on what fields the event-log client can sort, it's beneficial to enable the user to narrow their rule execution window with daterange, status and field filters, and then enable in-memory sorting of all fields via an EuiInMemoryTable. In discussions with the team, this implementation proved to be most beneficial to the user in terms of identifying long-running executions, scheduling backups, max_signals thresholds, and so forth.

In concern of memory when performing aggregations server-side, the execution window size is limited to 500 execution events (roughly 2500 rule execution logs, though more analysis of always failing rules necessary). This is implemented by first performing a query with the user's daterange filter and the field filter event.action:execute-start to find the total number of unique executions for that window. If it's greater than 500, we limit the number of composite execution events we return to the client to 500, and also return the total for the window as maxEvents. This allows the UI to inform the user that they should narrow their search further to better isolate and find possible issues.

This should be a be a reasonable constraint for most all rules as a rule executing every 5 minutes, 500 executions would cover 41hrs of execution time (with 5 events per execution).

Todo:

  • FTR tests exercising Total Alerts & Total Hits for each rule executor (event-log + logger)
  • Disable Rule Execution Log tab for deleted rules (RBAC on event-log can't fetch for deleted rules)
  • Normalize duration columns?
  • Swap out EuiSearchBar & out of the box filter for EuiSuggest and custom FilterGroup
  • API Integration tests for execution event route
  • ...
  • ...
  • ...

Checklist

Delete any items that are not applicable to this PR.

  • Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
  • Documentation was added for features that require explanation or tutorials
    • TODO: Collaborate with @elastic/security-docs
  • Unit or functional tests were updated or added to match the most common scenarios
    • Existing updated, additional tests required, including FTR suite for exercising total_alerts/total_hits changes
  • Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)
  • Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)

@spong spong added release_note:enhancement enhancement New value added to drive a business result Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.1.0 Feature:Rule Monitoring Security Solution Detection Rule Monitoring Team:Detection Rule Management Security Detection Rule Management Team labels Feb 1, 2022
@spong spong requested a review from a team February 1, 2022 02:40
@spong spong self-assigned this Feb 1, 2022
@spong spong requested review from a team as code owners February 1, 2022 02:40
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

Comment on lines 292 to 294
"number_of_triggered_actions": {
"type": "long"
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saw this was just added in #123567, shall we add it to the table as well, or is it getting too congested and better to wait till we allow expanding rows with additional data?

@@ -8,10 +8,13 @@
import * as t from 'io-ts';
import { ruleExecutionEvent } from '../common';

export const GetRuleExecutionEventsResponse = t.exact(
export const GetRuleExecutionEventsResponse = t.intersection([
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not deprecating getLastFailures this should move to its own dedicated type since we're now including additional information about total underlying executions.

Copy link
Member Author

@spong spong Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Types diverged and ended up making a dedicated GetAggregateRuleExecutionEventsResponse type. Will still need to clean up this one if deprecating getLastFailures.

Comment on lines +83 to +97
field: 'kibana.alert.rule.execution.metrics.total_indexing_duration_ms',
name: i18n.COLUMN_INDEX_DURATION,
render: (value: number) => getOrEmptyTagFromValue(value),
sortable: true,
truncateText: false,
width: '7%',
},
{
field: 'kibana.alert.rule.execution.metrics.total_search_duration_ms',
name: i18n.COLUMN_SEARCH_DURATION,
render: (value: number) => getOrEmptyTagFromValue(value),
sortable: true,
truncateText: false,
width: '8%',
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to normalize these duration fields to seconds to match execution duration, gap, & scheduling delay? I had normalized them but remembered when searching you must use ms, so a slight disconnect here if we normalize. Optimize for readability or consistency with search?

},
];

export const ExecutionLogSearchBar = React.memo<ExecutionLogTableSearchProps>(({ onSearch }) => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really wanted to expose suggestions for the above schema but in checking with the EUI folks it seems you must use EuiSuggest for auto-complete, but that component does not support FilterGroups out of the box, so would need to wire that up separately (along with query text override when applying filters).

In testing I found an issue with the validation feature of EuiSearchBar where it reports > < as invalid operators. Even with specifying type as number in the schema I'm still seeing this, so will need to debug further, or just abandon to use EuiSuggest with custom filters that way we get auto-complete (or expose column header style filters like what was brought up previously).

},
options: {
tags: ['access:securitySolution'],
},
},
async (context, request, response) => {
const { ruleId } = request.params;
const { start, end, filters = '' } = request.query;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filters can be defaulted as part of the validation schema, no?

"total_alerts": {
"type": "long"
},
"total_hits": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"hits" isn't a general alerting thing, so I think we'll need a comment somewhere describing what it is - perhaps in the mappings.js file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "hits" isn't a general alerting thing, I don't think it should be part of the event-log. How do the "hits" differ from the "alerts"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see these two issues for details #120678 #120668, but gist here is total_hits is the total number of candidate alerts found during a specific execution, and total_alerts is the total number of those candidate alerts that were actually created during that execution. The totals between candidates and created can differ as a result of reaching the configured max_signals threshold, or if duplicate alerts (alerts previously detected) were identified, but never written (and I believe a case with exceptions as well, but would need to verify).

I totally agree about naming (perhaps total_candidate_alerts better fits?), but that aside, is there no representation of these concepts in general alerting? Is there no circuit breaker for the number alerts that can be written per execution, or management of duplicates where discerning between these two values is of use to the general alerting user (like for the security solution user outlined in the issues above)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, interesting. We'll have an issue open soon-ish (from some previous research) about having a maximum number of actions and alerts (separate values) per rule execution. And we would want to track the total number as being different if we cap the number via the maximums.

So it isn't a thing now, but probably will be. So ... maybe just some word-smithing on this. Candidate doesn't seem quite right. Something like "total" as the number the rule produced, and "triggered" instead of "candidate", so "totalAlerts" and "triggeredAlerts"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "hits" isn't a general alerting thing, I don't think it should be part of the event-log

We already have a few metrics stored in the event log which are not a general alerting thing (strictly speaking) but which make sense from the Security Solution and AAD perspective. These are total_indexing_duration_ms, total_search_duration_ms and execution_gap_duration_s. Imho the new metrics have the same relation to the overall logic of Detection Engine and AAD.

I agree that additional documentation and/or better naming would help though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, @spong, that makes a lot of sense to me.

What about total_alerts_created and total_alerts_detected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about total_alerts_created and total_alerts_detected?

Those two sound good to me @kobelb! 🙂 I'll wait a moment to see if anyone else has additional input and will then make the change.

In addition to adding documentation within event-log as @pmuellr recommended, are there any other supporting docs we should be updating with these fields as well? I know there's the AAD schema spreadsheet, but I don't believe the event-log is schema represented in there, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to adding documentation within event-log as @pmuellr recommended, are there any other supporting docs we should be updating with these fields as well? I know there's the AAD schema spreadsheet, but I don't believe the event-log is schema represented in there, no?

I'm not aware of any docs about the fields in the event-log or any docs that recommend that users search these indices. We've treated them as a traditional hidden index where users might find it, but they can't rely on the names of fields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulled this feature out to a dedicated PR (#126210), will ping reviewers when ready for review.

@xcrzx
Copy link
Contributor

xcrzx commented Feb 1, 2022

We don't write warning statuses anymore, I think it could be safely removed from the filter options:

spong pushed a commit that referenced this pull request Feb 2, 2022
…w-up (#124194)

**Related to:** #121644
**Addresses:** #86202

## Summary

Done in this PR:

- Removed the deprecated `warning` rule execution status ([comment](#121644 (comment))).
- Added a new `running` status ([ticket](#86202)).
- Simplified the internal implementation of the `rule_execution_log` folder. Hopefully naming of folders, files and interfaces is clearer now as well. ([comment](#121644 (comment)), [comment](#121644 (comment)))
- Added APM measurements with `withSecuritySpan`.
- Added rule id to the react-query key used for loading last rule failures ([comment](#124198 (comment)))
- Addressed most of the `// TODO: https://github.com/elastic/kibana/pull/121644` comments

In the next PR that could be merged after the FF I'd address the rest of the stuff:

- Add comments to all the interfaces and methods in the `rule_execution_log` folder. Write a readme for it.
- Address the remaining of the `// TODO: https://github.com/elastic/kibana/pull/121644` comments. All of them are related to tests.
- Fix for the gap column ([comment](#121644 (comment)))
@pmuellr
Copy link
Member

pmuellr commented Feb 2, 2022

@banderror: We allow the user to select the going to run status in the dropdown. I don't think we should do this because this status is not final and so doesn't make much sense in this table where we show, in fact, rule execution results (or as we call it in the UI "responses"). When the user selects it, we show a lot of empty rows in the table.

I believe this is the status of "pending" from the alerting data side. A rule will be in this state after it's created, but before it's run. Which means if it's created in a disabled state, it would have that status until someone enabled it. AFAIK, SIEM was the only alerting client that was creating rules disabled by default (don't remember why), so if that is still a thing, I think there may be some value in leaving it.

I certainly would expect that it shows a lot of empty rows, in the usual case :-) !!!

@yiyangliu9286 yiyangliu9286 self-requested a review February 2, 2022 21:44
@adamwdraper adamwdraper self-requested a review February 3, 2022 23:45
Copy link

@adamwdraper adamwdraper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The screenshots of this pr are very different from the original figmas @yiyangliu9286 designed. We created a list of the items that need to be addressed before merging. Please sync with @yiyangliu9286 on the following:

  • Duration column: There’s different format for displaying duration time in the design vs. implementation (the way we designed is aligned with Stack Rule’s table for showing 00:00:00.000 format).
  • Duration column is missing tooltip
  • In the design, we do not have “Total Alerts”, “Total Hits”, “Gap Duration (s)“, “Index Duration (ms)“, “Search Duration (ms)” “Scheduling Delay (s)” columns.
  • Missing status filter number badge: in the design, there is a number badge for showing how many statuses there’re available for filtering.
  • Auto refresh vs. Manual refresh: in the design, we show for example, “Updated 3 minutes ago” on top right of this table, and in implementation, we have a manual refresh button, do we not support auto refresh for this table?
  • Showing x number of rule executions by default: in the design, we show 10 rule executions by default with a footer pagination, but in implementation, we show “385 rule executions” by default.
  • Suggested UI enhancements:
  • Expand the width for EuiDatePicker component to its 400px, right now it looks narrow to be able to display details.
  • For columns that show numbers (“Duration (s)“, “Total Alerts”, “Total Hits”, “Gap Duration (s)“, “Index Duration (ms)“, “Search Duration (ms)“, “Scheduling Delar (s)“) right aligned the number.

Question:

  • What’s the difference between showing “0” and a “-” in this table? What does this imply for users?

@spong
Copy link
Member Author

spong commented Feb 5, 2022

Thanks for the feedback @adamwdraper! 🙂 It should be noted that this feature is going to need a full design review as we had never finalized the designs after our initial review with the team on 9-NOV-2021, and as mentioned in the description there were limitations with regards to implementing the initial design as proposed. We're also now supporting the new Response-Ops team as stakeholders too, so they should participate as well. That said, you can look at this PR as a first-pass / pathfinding effort to determine design feasibility and any additional engineering efforts required to meet design's needs.

I'll touch base with @yiyangliu9286 and review the feedback and get it integrated 👍

Question:

  • What’s the difference between showing “0” and a “-” in this table? What does this imply for users?

Like in the Alerts and other Security Solution tables is going to indicate the absence of data, whereas 0 is going to be an actual concrete value for the field. For instance, an Indexing Duration of 0 means the indexing operation occurred (just didn't take any time since there were no documents, or the duration was less than the precision being captured in the field type), whereas a will indicate that the indexing operation did not occur (or potentially completed unsuccessfully based on the message field).

@gmmorris
Copy link
Contributor

gmmorris commented Feb 7, 2022

these rule execution events are aggregated by their executionId, and then fields are merged from each different event. Since the event-log client doesn't currently support aggregations, this aggregation happens in-memory on the kibana server.

This is really cool @spong !

I was wondering about the in-memory aggregation and what impact this might have on the event loop.
Are we concerned about the potential scale limitations here? Is this unbounded or do we keep it limited in such a way that it isn't a big deal?

In addition, regarding support for aggregations in the Event Log - this is something we do want to do, but we lack concrete requirements here. Would you mind filing an issue detailing your exact use case and requirements? That way we can explore adding that support in 8.2 so that you can avoid an in-memory approach going forward.

cc @XavierM as @elastic/response-ops-ram are looking into building an activity/execution log in Stack Management, and this will likely affect their efforts

@spong
Copy link
Member Author

spong commented Feb 7, 2022

Thanks for the feedback @gmmorris! 🙂

I was wondering about the in-memory aggregation and what impact this might have on the event loop.
Are we concerned about the potential scale limitations here? Is this unbounded or do we keep it limited in such a way that it isn't a big deal?

Yeah, that was the biggest concern with this impl/workaround. I touch on this a little in the description above, but in concern of memory here results are bound to 500 execution events (roughly 2500 rule execution logs, though more analysis is necessary to identify max event-log docs per execution). At the moment we're returning the max results the query matches so the can UI inform the user that they should narrow their search further to better isolate and find possible issues. This seemed to work well in effort to allows users to see a fair number of executions, and still make sorting on columns functional for identifying problem executions.

In addition, regarding #91731 - this is something we do want to do, but we lack concrete requirements here. Would you mind filing an issue detailing your exact use case and requirements? That way we can explore adding that support in 8.2 so that you can avoid an in-memory approach going forward.

Awesome! Since this is pushed to 8.2, I discussed next steps with the team and I'm going to determine what agg queries we'd need to support this table view (and a nice time series visualization for visualizing executions over time/coverage/gaps/etc) and then will distill those requirements into a dedicated issue for adding agg support to the event-log client. Was discussing with @peluja1012 today, and depending on scope this may be something I can assist with. We're flexible too, so if we need to ship the first version with limited in-memory-aggs and follow-up with true agg support that should be fine too.

@gmmorris
Copy link
Contributor

gmmorris commented Feb 8, 2022

then will distill those requirements into a dedicated issue for adding agg support to the event-log client.

Brilliant, thanks!
@pmuellr can help out once that issue is filed 😄 🙏

Was discussing with @peluja1012 today, and depending on scope this may be something I can assist with. We're flexible too, so if we need to ship the first version with limited in-memory-aggs and follow-up with true agg support that should be fine too.

Yeah, incremental progress here sounds absolutely reasonable to me 👍

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
securitySolution 2847 2849 +2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
securitySolution 4.7MB 4.7MB +7.1KB

Saved Objects .kibana field count

Every field in each saved object type adds overhead to Elasticsearch. Kibana needs to keep the total field count below Elasticsearch's default limit of 1000 fields. Only specify field mappings for the fields you wish to search on or query. See https://www.elastic.co/guide/en/kibana/master/development-plugin-saved-objects.html#_mappings

id before after diff
siem-detection-engine-rule-execution-info 10 12 +2
Unknown metric groups

ESLint disabled line counts

id before after diff
securitySolution 442 443 +1

Total ESLint disabled count

id before after diff
securitySolution 509 510 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @spong

@spong
Copy link
Member Author

spong commented Feb 15, 2022

@gmmorris @pmuellr

@banderror has created this issue for the Detection Rules Area (#125645) outlining the requirements for adding aggregations to the event-log client.

As for this PR, I'm going to close this larger one and break it into two: one for adding Total Alerts Created/Detected to event-log, and one for the Rule Execution Log Table itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New value added to drive a business result Feature:Rule Monitoring Security Solution Detection Rule Monitoring release_note:enhancement Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Security Solution][Detections] New endpoint for Rule Execution Log
9 participants