
[SIEM] [Detections] Gap detection mitigation and remediation summary #63290

Closed
dhurley14 opened this issue Apr 10, 2020 · 4 comments
Labels: discuss, enhancement, Team: SecuritySolution, Team:SIEM
dhurley14 commented Apr 10, 2020

Gap detection and remediation workflow

As it currently stands, rules look backwards in time for the events they generate signals from. Each rule queries events from "now" back to some duration in the past, determined by the interval it runs at plus an optional additional look-back. The look-back exists to re-capture events the rule may have already analyzed, in case the rule does not start at a consistent interval. This design gives analysts a consistent view of the "newest" events, but events that should have triggered signal generation can slip by whenever a rule fails to start at a reliable interval.
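The interval/look-back arithmetic above can be sketched as follows. This is a minimal illustration, not the actual rule-executor code; the timestamps and the 5-minute/1-minute schedule are hypothetical.

```python
from datetime import datetime, timedelta

def query_window(now, interval, look_back):
    """Window a rule scans on each run: [now - interval - look_back, now]."""
    return (now - interval - look_back, now)

def detect_gap(previous_run_end, window_start):
    """A gap exists when the new window starts after the last one ended."""
    gap = window_start - previous_run_end
    return gap if gap > timedelta(0) else None

# Hypothetical rule scheduled every 5 minutes with a 1-minute look-back.
interval, look_back = timedelta(minutes=5), timedelta(minutes=1)
t0 = datetime(2020, 4, 10, 12, 0)
_, first_end = query_window(t0, interval, look_back)

# This run fires 3 minutes late; the 1-minute look-back absorbs
# only part of the delay, leaving a 2-minute unscanned gap.
late = t0 + interval + timedelta(minutes=3)
second_start, _ = query_window(late, interval, look_back)
gap = detect_gap(first_end, second_start)
```

On schedule, consecutive windows overlap by the look-back and `detect_gap` returns `None`; the gap only appears when a run starts later than `interval + look_back` after the previous one.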

Given this, there are a few proposals for solutions from a query perspective and from a user experience perspective that would help document when these situations occur and possibly remediate / mitigate.

From our discussion there are ways to mix some of these solutions together, but for now I will just list them as-is and we can determine how best to mix them in further conversations.

  1. Create signals on gaps.
  2. Schedule rules ad hoc (future).
  3. Try to resolve the gap first by searching with the additional look-back plus the gap duration; if that returns too many documents or a circuit breaker trips, open an error state (whether that is a signal or just setting the rule's status to failed) and/or open an ad-hoc rule run (not currently possible).
  4. Switch the order in which we process events so that we always start at the last event processed by the previous rule run and move forward by interval, or until we hit max signals. With this we can be certain there are no gaps from a historical perspective, but the rule is then continually trying to "catch up" to new events and may stay behind forever.

edit: Adding a fifth option we discussed: some form of sampling, with the acknowledgement that there will be "gaps" that we generally control.
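Option 3 above (widen the search by the gap, with a circuit breaker) might look roughly like this. Everything here is hypothetical: the threshold, the per-second document rate, and the shape of the return value are made up for illustration.

```python
from datetime import timedelta

MAX_CATCH_UP_DOCS = 10_000  # hypothetical circuit-breaker threshold

def plan_catch_up(gap, look_back, count_docs_in):
    """Option 3 sketch: widen the search window by the detected gap;
    if the widened window holds too many documents, record a failure
    (or, in the future, schedule an ad-hoc rule run) instead."""
    widened = look_back + gap
    if count_docs_in(widened) > MAX_CATCH_UP_DOCS:
        return {"status": "failed", "reason": "gap too large to backfill"}
    return {"status": "ok", "look_back": widened}

# Pretend the index receives ~50 documents per second of window.
count = lambda window: int(window.total_seconds() * 50)

small = plan_catch_up(timedelta(minutes=2), timedelta(minutes=1), count)
big = plan_catch_up(timedelta(hours=6), timedelta(minutes=1), count)
```

A 2-minute gap widens the look-back to 3 minutes and proceeds; a 6-hour gap trips the breaker and falls through to the error state.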

@dhurley14 dhurley14 added discuss enhancement New value added to drive a business result Team:SIEM labels Apr 10, 2020
@dhurley14 dhurley14 self-assigned this Apr 10, 2020
@elasticmachine

Pinging @elastic/siem (Team:SIEM)


NerdSec commented Jun 24, 2020

@spong @dhurley14
I think all the options discussed have their trade-offs. In a sense the issue lies in the implementation of the current SIEM, where data is indexed first and then queried periodically. Delays in log ingestion cannot be accounted for in the queries without accepting a global delay of some sort. In effect we are running smaller chunks of a report over a shorter timeframe, not really implementing a real-time correlation platform.

In many organizations, data often arrives with a delay. In some severe cases it can be as large as a couple of days, but mostly it is limited to an hour or two. One approach would be to break the detection process into two parts:

  1. Batch detection
  2. Near Real-Time detection

Real-time detection is only possible if the source system facilitates it. In that scenario we face no challenge, so I have ignored it for this discussion.

Batch Detection

  • An analyst or content creator could identify the indices whose data arrives with a delay and preferably mark them with a delay tag.
  • These indices would be queried once every few hours or days, depending on the use case, with an aggregated query; the output would probably resemble a report.
  • We might have to implement partitions in order to fetch all incoming data, preferably dynamically at runtime, e.g. a model where we increase num_partitions by a factor until the error count is low enough.
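The partitioning idea in the last bullet can use the Elasticsearch terms aggregation's `include.partition` / `include.num_partitions` options, which split the term space so each request returns only one slice. A sketch that just builds the request bodies (the field name `host.name` and the aggregation name are hypothetical; no query is actually sent):

```python
def partitioned_terms_agg(field, partition, num_partitions, size=1000):
    """One slice of a partitioned terms aggregation: Elasticsearch
    hashes each term into one of `num_partitions` buckets and this
    request returns only the terms in bucket `partition`."""
    return {
        "size": 0,
        "aggs": {
            "delayed_sources": {
                "terms": {
                    "field": field,
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                    "size": size,
                }
            }
        },
    }

# Fan out one batch run across 4 partitions of the term space.
bodies = [partitioned_terms_agg("host.name", p, 4) for p in range(4)]
```

The dynamic part would then raise `num_partitions` (e.g. doubling it) whenever the response's error bounds or `sum_other_doc_count` indicate terms were dropped, and re-issue the slices.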

Near Real-Time detection

  • We configure a globally acceptable delay of some sort. Each scheduled query then runs with this delay in mind.
  • This method avoids missing events that arrive with a delay while still providing accurate results.
  • Every other method discussed in the original comment could result in the rule skipping a few events during its run cycle.
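The global-delay idea in the first bullet amounts to shifting the whole query window into the past, so events indexed up to the configured delay late still land inside some run's window. A minimal sketch, with hypothetical times and a 1-hour delay:

```python
from datetime import datetime, timedelta

def delayed_window(now, interval, acceptable_delay):
    """Query window shifted back by a global acceptable delay:
    [now - delay - interval, now - delay]. Events indexed up to
    `acceptable_delay` late are still seen by the run covering them."""
    end = now - acceptable_delay
    return (end - interval, end)

now = datetime(2020, 6, 24, 10, 0)
start, end = delayed_window(now, timedelta(minutes=5),
                            timedelta(hours=1))
# this run scans 08:55-09:00 instead of 09:55-10:00
```

The trade-off is exactly the one named in the comment: detection latency grows by the configured delay for every rule, in exchange for not skipping late-arriving events.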

@dhurley14

#68339 closed this.

@MindyRS MindyRS added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Sep 23, 2021
@elasticmachine

Pinging @elastic/security-solution (Team: SecuritySolution)
