
[SIEM] [Detections] Gap detection mitigation and remediation summary #63290

Closed
dhurley14 opened this issue Apr 10, 2020 · 4 comments
Labels: discuss, enhancement, Team: SecuritySolution, Team:SIEM
dhurley14 commented Apr 10, 2020

Gap detection and remediation workflow

As it currently stands, rules look backwards in time for the events they generate signals from. Each rule queries events from "now" back to some duration in the past, determined by the interval it runs at plus an optional additional look-back. The look-back exists to re-capture events the rule may have already analyzed, in case the rule does not start at a consistent interval. This design gives analysts a consistent view of the "newest" events, but events that should have triggered signal generation can slip by whenever a rule fails to start at a reliable interval.
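The interval/look-back arithmetic above can be sketched as follows. This is a minimal illustration, not the actual rule-executor code; the timestamps and the 5-minute/1-minute schedule are hypothetical.

```python
from datetime import datetime, timedelta

def query_window(now, interval, look_back):
    """Window a rule scans on each run: [now - interval - look_back, now]."""
    return (now - interval - look_back, now)

def detect_gap(previous_run_end, window_start):
    """A gap exists when the new window starts after the last one ended."""
    gap = window_start - previous_run_end
    return gap if gap > timedelta(0) else None

# Hypothetical rule scheduled every 5 minutes with a 1-minute look-back.
interval, look_back = timedelta(minutes=5), timedelta(minutes=1)
t0 = datetime(2020, 4, 10, 12, 0)
_, first_end = query_window(t0, interval, look_back)

# This run fires 3 minutes late; the 1-minute look-back absorbs
# only part of the delay, leaving a 2-minute unscanned gap.
late = t0 + interval + timedelta(minutes=3)
second_start, _ = query_window(late, interval, look_back)
gap = detect_gap(first_end, second_start)
```

On schedule, consecutive windows overlap by the look-back and `detect_gap` returns `None`; the gap only appears when a run starts later than `interval + look_back` after the previous one.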

Given this, there are a few proposals for solutions from a query perspective and from a user experience perspective that would help document when these situations occur and possibly remediate / mitigate.

From our discussion there are ways to mix some of these solutions together, but for now I will just list them as-is and we can determine how best to mix them in further conversations.

  1. Create signals on gaps.
  2. Schedule rules ad hoc (future).
  3. Try to resolve the gap first by searching with the additional look-back plus the gap duration; if that returns too many documents or a circuit breaker trips, open an error state (whether that is a signal or just setting the rule's status to failed) and/or open an ad-hoc rule run (not currently possible).
  4. Switch the order in which we process events so that we always start at the last event processed by the previous rule run and move forward by interval, or until we hit max signals. With this we can be certain there are no gaps from a historical perspective, but the rule is then continually trying to "catch up" to new events and may stay behind forever.

edit: Adding a fifth option we discussed: some form of sampling, with the acknowledgement that there will be "gaps" that we generally control.
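Option 3 above (widen the search by the gap, with a circuit breaker) might look roughly like this. Everything here is hypothetical: the threshold, the per-second document rate, and the shape of the return value are made up for illustration.

```python
from datetime import timedelta

MAX_CATCH_UP_DOCS = 10_000  # hypothetical circuit-breaker threshold

def plan_catch_up(gap, look_back, count_docs_in):
    """Option 3 sketch: widen the search window by the detected gap;
    if the widened window holds too many documents, record a failure
    (or, in the future, schedule an ad-hoc rule run) instead."""
    widened = look_back + gap
    if count_docs_in(widened) > MAX_CATCH_UP_DOCS:
        return {"status": "failed", "reason": "gap too large to backfill"}
    return {"status": "ok", "look_back": widened}

# Pretend the index receives ~50 documents per second of window.
count = lambda window: int(window.total_seconds() * 50)

small = plan_catch_up(timedelta(minutes=2), timedelta(minutes=1), count)
big = plan_catch_up(timedelta(hours=6), timedelta(minutes=1), count)
```

A 2-minute gap widens the look-back to 3 minutes and proceeds; a 6-hour gap trips the breaker and falls through to the error state.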

@dhurley14 dhurley14 added discuss enhancement New value added to drive a business result Team:SIEM labels Apr 10, 2020
@dhurley14 dhurley14 self-assigned this Apr 10, 2020
@elasticmachine

Pinging @elastic/siem (Team:SIEM)


NerdSec commented Jun 24, 2020

@spong @dhurley14
I think all the options discussed have their trade-offs. In a sense the issue lies in the implementation of the current SIEM, where data is indexed first and then queried periodically. Delays in log ingestion cannot be accounted for in the queries without accepting a global delay of some sort. In effect we are running smaller chunks of a report over a shorter timeframe, not really implementing a real-time correlation platform.

In many organizations, data often arrives with a delay. In some severe cases it can be as large as a couple of days, but mostly it is limited to an hour or two. One approach would be to break the detection process into two parts:

  1. Batch detection
  2. Near Real-Time detection

Real-time detection is only possible if the source system facilitates it. In that scenario we face no challenge, so I have ignored it for this discussion.

Batch Detection

  • An analyst or content creator could identify the indices whose data arrives with a delay and preferably mark them with a delay tag.
  • These indices would be queried once every few hours or days, depending on the use case, with an aggregated query; the output would probably resemble a report.
  • We might have to implement partitions in order to fetch all incoming data, preferably dynamically at runtime, e.g. a model where we increase num_partitions by a factor until the error count is low enough.
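The partitioning idea in the last bullet can use the Elasticsearch terms aggregation's `include.partition` / `include.num_partitions` options, which split the term space so each request returns only one slice. A sketch that just builds the request bodies (the field name `host.name` and the aggregation name are hypothetical; no query is actually sent):

```python
def partitioned_terms_agg(field, partition, num_partitions, size=1000):
    """One slice of a partitioned terms aggregation: Elasticsearch
    hashes each term into one of `num_partitions` buckets and this
    request returns only the terms in bucket `partition`."""
    return {
        "size": 0,
        "aggs": {
            "delayed_sources": {
                "terms": {
                    "field": field,
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                    "size": size,
                }
            }
        },
    }

# Fan out one batch run across 4 partitions of the term space.
bodies = [partitioned_terms_agg("host.name", p, 4) for p in range(4)]
```

The dynamic part would then raise `num_partitions` (e.g. doubling it) whenever the response's error bounds or `sum_other_doc_count` indicate terms were dropped, and re-issue the slices.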

Near Real-Time detection

  • We configure a globally acceptable delay of some sort. Each scheduled query then runs with this delay in mind.
  • This method avoids missing events that arrive with a delay while still providing accurate results.
  • Every other method discussed in the original comment could result in the rule skipping a few events during its run cycle.
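The global-delay idea in the first bullet amounts to shifting the whole query window into the past, so events indexed up to the configured delay late still land inside some run's window. A minimal sketch, with hypothetical times and a 1-hour delay:

```python
from datetime import datetime, timedelta

def delayed_window(now, interval, acceptable_delay):
    """Query window shifted back by a global acceptable delay:
    [now - delay - interval, now - delay]. Events indexed up to
    `acceptable_delay` late are still seen by the run covering them."""
    end = now - acceptable_delay
    return (end - interval, end)

now = datetime(2020, 6, 24, 10, 0)
start, end = delayed_window(now, timedelta(minutes=5),
                            timedelta(hours=1))
# this run scans 08:55-09:00 instead of 09:55-10:00
```

The trade-off is exactly the one named in the comment: detection latency grows by the configured delay for every rule, in exchange for not skipping late-arriving events.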

@dhurley14

#68339 closed this.

@MindyRS MindyRS added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Sep 23, 2021
@elasticmachine

Pinging @elastic/security-solution (Team: SecuritySolution)
