Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIEM] [Detections] Fixes faulty circuit breaker #71999

Merged
merged 7 commits into from
Jul 20, 2020

Conversation

dhurley14
Copy link
Contributor

@dhurley14 dhurley14 commented Jul 16, 2020

Summary

One of the circuit breakers used for exiting out of the search after and bulk create function revealed itself as broken after introducing #71768 (commit 56de45d in master, reverted here aee2a12 with this pr #71956). After changing how we count the number of signals created to ensure we don't go beyond maxSignals in #71768, the circuit breaker started to get triggered when a gap appeared when it never was previously during the development of gap detection mitigation (#68339) because of how we were counting up to maxSignals. This resulted in a never ending loop of searching and bulk creating the same signals when there was a gap. The boolean variable (useSortIds) was leftover during development of the gap detection mitigation code and should not have been a part of the boolean logic for the circuit breaker. I missed deleting it when I opened the gap detection mitigation pr.

This PR removes that boolean, adds additional logic to ensure we are only calling bulk create when we have candidate signals, and adds more comments and some log statements.

TL;DR - faulty circuit breaker created in #68339 started to show up after merging #71768.

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@@ -118,7 +107,6 @@ export const searchAfterAndBulkCreate = async ({
interval,
buildRuleMessage,
});
const useSortIds = totalToFromTuples.length <= 1;
Copy link
Contributor Author

@dhurley14 dhurley14 Jul 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line was the culprit of the bug that caused this revert #71956 .

useSortIds was only > 1 when there was a gap, which would mean boolean useSortIds was false. This bug showed up now because I modified the count we used to determine when we hit max signals from the number of search results to the number of bulk create results (signals indexed) with this pr #71768. When I did that, one of the circuit breakers for jumping out of the loop relied on checking the next sort id would fail because useSortIds was false. I’ve removed the variable and simplified the logic for breaking out of the while loop. That variable was leftover from a different implementation I attempted while doing the gap detection fixes and I missed deleting it and updating the logic.

Again, this bug stayed hidden for so long after all the gap detection code was added because I never changed the count used to determine when we hit max signals until this commit went in 56de45d right before FF.

@dhurley14 dhurley14 self-assigned this Jul 16, 2020
@dhurley14 dhurley14 added release_note:skip Skip the PR/issue when compiling release notes review v7.10.0 v7.9.0 v8.0.0 Team:SIEM labels Jul 16, 2020
@dhurley14 dhurley14 marked this pull request as ready for review July 16, 2020 21:09
@dhurley14 dhurley14 requested review from a team as code owners July 16, 2020 21:09
@dhurley14 dhurley14 force-pushed the never-ending-loop branch 2 times, most recently from 91af03c to 01b450c Compare July 20, 2020 14:22
…ementing gap detection mitigation code. This only showed up because I modified the count variable used to determine when we hit maxSignals from utilizing the searchResult hits length to using the count of bulk created items (signals indexed) in this commit 56de45d
@dhurley14 dhurley14 added the bug Fixes for quality problems that affect the customer experience label Jul 20, 2020
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Build metrics

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Contributor

@FrankHassanabad FrankHassanabad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for all the details in the conversation section and the tests here. Very very helpful

@dhurley14 dhurley14 merged commit b9413cf into elastic:master Jul 20, 2020
@dhurley14 dhurley14 deleted the never-ending-loop branch July 20, 2020 19:55
dhurley14 added a commit to dhurley14/kibana that referenced this pull request Jul 20, 2020
* removes useSortIds which was leftover from a previous attempt at implementing gap detection mitigation code. This only showed up because I modified the count variable used to determine when we hit maxSignals from utilizing the searchResult hits length to using the count of bulk created items (signals indexed) in this commit 56de45d

* removes logs and fixes if statement ordering

* adds tests, increases code coverage for search after and bulk create function, updates log statements

* update tests after rebase onto master

* clean up if statements

* fix test data

* merge conflicts are hard
dhurley14 added a commit to dhurley14/kibana that referenced this pull request Jul 20, 2020
* removes useSortIds which was leftover from a previous attempt at implementing gap detection mitigation code. This only showed up because I modified the count variable used to determine when we hit maxSignals from utilizing the searchResult hits length to using the count of bulk created items (signals indexed) in this commit 56de45d

* removes logs and fixes if statement ordering

* adds tests, increases code coverage for search after and bulk create function, updates log statements

* update tests after rebase onto master

* clean up if statements

* fix test data

* merge conflicts are hard
dhurley14 added a commit that referenced this pull request Jul 20, 2020
* removes useSortIds which was leftover from a previous attempt at implementing gap detection mitigation code. This only showed up because I modified the count variable used to determine when we hit maxSignals from utilizing the searchResult hits length to using the count of bulk created items (signals indexed) in this commit 56de45d

* removes logs and fixes if statement ordering

* adds tests, increases code coverage for search after and bulk create function, updates log statements

* update tests after rebase onto master

* clean up if statements

* fix test data

* merge conflicts are hard
dhurley14 added a commit that referenced this pull request Jul 20, 2020
* removes useSortIds which was leftover from a previous attempt at implementing gap detection mitigation code. This only showed up because I modified the count variable used to determine when we hit maxSignals from utilizing the searchResult hits length to using the count of bulk created items (signals indexed) in this commit 56de45d

* removes logs and fixes if statement ordering

* adds tests, increases code coverage for search after and bulk create function, updates log statements

* update tests after rebase onto master

* clean up if statements

* fix test data

* merge conflicts are hard
@MindyRS MindyRS added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Sep 23, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience release_note:skip Skip the PR/issue when compiling release notes review Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:SIEM v7.9.0 v7.10.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants