Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SECURITY SOLUTION] [Detections] Increase lookback when gap is detected #68339

Merged
merged 26 commits into from
Jun 30, 2020

Conversation

dhurley14
Copy link
Contributor

@dhurley14 dhurley14 commented Jun 5, 2020

Summary

Gap detection remediation strategy option 1 from this discussion

When gaps occur in detections we need to make sure that we generate signals alerts during those gaps, if such alerts exist. In order to do this I have added a function to determine how many discrete time periods (not exceeding the rule interval) to search, given a gap duration. We then iterate over these time period 'tuples' to do an exclusive search and bulk create over each gapped rule interval. My hope is this will resolve the issue of not creating alerts during a gap period even if we search over it.

This implementation also caps the number of full rule interval gaps at 4. If a gap equivalent to more than 4 missed rule executions is present, the rule executor will not look at events that occurred more than 4 rule intervals back in search.

For example if a rule runs every 5 minutes, but has not run in the past 25 minutes, we will only look at the last 20 minutes so as to cap how much data we are looking back so we attempt to prevent search timeouts from occurring. Obviously this method is still open to discussion if there are proposals / addendums people think would be good.

Because we are still preventing the executor from searching a gap greater than 4 full rule interval runs, it is still possible even more alerts will be missed. So we are keeping the alert in the "failed" state when a gap is detected in order to alert the analyst of the existence of the gap, despite our attempts at resolving the gap. From there the analyst could determine if task manager is overextended or if they just need to manually adjust their cluster settings.

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@MikePaquette
Copy link

@dhurley Great to see this!

Just to confirm, this change is not performing any "chunking" of queries during the "catch-up" period, correct?

Thus the adjustment to max_signals is calculated as an average across the catch-up period. In your example, a 1 minute gap on a 5 minute interval will result in a 6-minute query with a max_signals that is 1.2x the original value.

Likewise, a 15-minute gap on a 5-minute interval would result in a 20-minute query with a max_signals that is 4x its original value.

However, when the 20-minute query runs, the max_signals is not spread across the 4 intervals, but is applied based on the query results from newest to oldest, and if the 4x value of max_signals is reached before the end of, say, the first interval, then the user might see 4x max_signals signals in one time interval, correct?

While this is not ideal, it is OK, since our first priority is to eliminate gaps that could cause missed detections, and in the case where there are large number of detections, we have accomplished that.

Are there plans to take any specific action if the catch-up query times out or otherwise fails?

@dhurley14
Copy link
Contributor Author

dhurley14 commented Jun 9, 2020

@MikePaquette currently we do “chunk” by 100 events per search, but you are correct in that the implementation in this PR could hit the new calculated max signals before reaching the events which occurred within the “gap” period, despite the changes made to perform a search over the gapped period. So with that case in mind, a new time-based “chunking” could be written to modify not only the from parameter in the search_after but the to as well, to ensure a search is explicitly performed on the missed gap(s).

An example of this new time-based chunking (for this example assume rule interval every 5 minutes and 1 minute gap occurs) would perform two searches:

  1. one 'from': '5m-6m' and 'to': 'now-5m' (with a max_signals of 100 for this search) which covers the 1 minute "gap".

  2. and then the normal interval search for the example rule of 'from': 'now-5m' and 'to': 'now'.

Juxtapose this with the current implementation in this PR which would yield search from and to parameters of 'from': 'now-6m' and 'to':'now'.

I think this would solve the case you mention above but let me know if I'm misunderstanding.

As far as timeouts related to the search taking too long, I don't think there is much in the way I can do about that situation. I think the only thing we have right now is setting the rule status state to ‘failed’ with the error message that the search timed out.

@MikePaquette
Copy link

@dhurley14 yes, what you proposed above would address the concern. If it is not too difficult or risky, then I'd vote to go with the updated approach. Thanks!

@dhurley14 dhurley14 force-pushed the gap-inc-look-back branch 2 times, most recently from db56b9a to 19c9f35 Compare June 24, 2020 01:49
@dhurley14 dhurley14 self-assigned this Jun 24, 2020
@dhurley14 dhurley14 marked this pull request as ready for review June 24, 2020 13:47
@dhurley14 dhurley14 requested review from a team as code owners June 24, 2020 13:47
@elasticmachine
Copy link
Contributor

Pinging @elastic/siem (Team:SIEM)

@dhurley14 dhurley14 force-pushed the gap-inc-look-back branch 2 times, most recently from 00fd85a to 56a933b Compare June 25, 2020 16:44
@dhurley14
Copy link
Contributor Author

jenkins test this

Copy link
Contributor

@rylnd rylnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhurley14 I checked out behavior by changing newFrom as advised in the comments. I did not test intricacies involved with gap detection other than to verify rules continue to run and generate signals in the case of a gap. Unit tests look good though so 👍 ; let me know if there are interesting cases you’d like specific review on.

Regarding rule status in the case of a gap error, I was reminded of #62383 which explains our current behavior there 😉

try {
logger.debug(`sortIds: ${sortId}`);
const {
// @ts-ignore https://github.com/microsoft/TypeScript/issues/35546
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not getting an error if I remove this line; what are you seeing?

searchDuration,
}: { searchResult: SignalSearchResponse; searchDuration: string } = await singleSearchAfter(
{
// @ts-ignore we are using sortId before being assigned but that's ok.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here; I'm not seeing an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mm yeah this might have been leftover from when I was testing things. It was complaining I was using sortId before assignment but that is not the case anymore. thanks!

@@ -28,7 +28,9 @@ export const filterEventsAgainstList = async ({
eventSearchResult,
}: FilterEventsAgainstList): Promise<SignalSearchResponse> => {
try {
logger.debug(`exceptionsList: ${JSON.stringify(exceptionsList, null, 4)}`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: two spaces for indentation

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Build metrics

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@dhurley14 dhurley14 merged commit 432f93a into elastic:master Jun 30, 2020
@dhurley14 dhurley14 deleted the gap-inc-look-back branch June 30, 2020 20:43
Bamieh pushed a commit to Bamieh/kibana that referenced this pull request Jul 1, 2020
…ed (elastic#68339)

* add POC logic to modify the 'from' param in the search

* fixes formatting for appending gap diff to from

* computes new max signals based on how many intervals of rule runs were missed when gap in consecutive rule runs is detected

* adds logging, fixes bug where we could end up with negative values for diff, adds calculatedFrom to the search after query

* remove console.log and for some reason two eslint disables were added so i removed one of them

* rename variables, add test based on log message - need to figure out a better way to test this

* remove unused import

* fully re-worked the algorithm for searching discrete time periods, still need search_after because a user could submit a rule with a custom maxSignals so that would still serve a purpose. This needs heavy refactoring though, and tests.

* updated loop to include maxSignals per time interval tuple, this way we guarantee maxSignals per full rule interval. Needs some refactoring though.

* move logic into utils function, utils function still needs refactoring

* adds unit tests and cleans up new util function for determining time intervals for searching to occur

* more code cleanup

* remove more logging statements

* fix type errors

* updates unit tests and fixes bug where search result would return 0 hits but we were accessing property on non-existent hit item

* fix rebase conflict

* fixes a bug where a negative gap could exist if a rule ran before the lookback time, also fixes a bug where the search and bulk loop would return false when successful.

* gap is a duration, not a number.

* remove logging variable

* remove logging function from test

* fix type import from rebase with master

* updates missed test when rebased with master, removes unused import

* modify log statements to include meta information for logged rule events, adds tests

* remove unnecessary ts-ignores

* indentation on stringify

* adds a test to ensure we are parsing the elapsed time correctly
dhurley14 added a commit that referenced this pull request Jul 1, 2020
…detected (#68339) (#70371)

* add POC logic to modify the 'from' param in the search

* fixes formatting for appending gap diff to from

* computes new max signals based on how many intervals of rule runs were missed when gap in consecutive rule runs is detected

* adds logging, fixes bug where we could end up with negative values for diff, adds calculatedFrom to the search after query

* remove console.log and for some reason two eslint disables were added so i removed one of them

* rename variables, add test based on log message - need to figure out a better way to test this

* remove unused import

* fully re-worked the algorithm for searching discrete time periods, still need search_after because a user could submit a rule with a custom maxSignals so that would still serve a purpose. This needs heavy refactoring though, and tests.

* updated loop to include maxSignals per time interval tuple, this way we guarantee maxSignals per full rule interval. Needs some refactoring though.

* move logic into utils function, utils function still needs refactoring

* adds unit tests and cleans up new util function for determining time intervals for searching to occur

* more code cleanup

* remove more logging statements

* fix type errors

* updates unit tests and fixes bug where search result would return 0 hits but we were accessing property on non-existent hit item

* fix rebase conflict

* fixes a bug where a negative gap could exist if a rule ran before the lookback time, also fixes a bug where the search and bulk loop would return false when successful.

* gap is a duration, not a number.

* remove logging variable

* remove logging function from test

* fix type import from rebase with master

* updates missed test when rebased with master, removes unused import

* modify log statements to include meta information for logged rule events, adds tests

* remove unnecessary ts-ignores

* indentation on stringify

* adds a test to ensure we are parsing the elapsed time correctly
@MindyRS MindyRS added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Sep 23, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_note:enhancement review Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:SIEM v7.9.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants