[Response Ops][Alerting] Investigate ES query rule firing unexpectedly #175980
Pinging @elastic/response-ops (Team:ResponseOps)
I think we should start with a code review. One possibility is that the date we start the search from is somehow set further back than we want (or not set at all), so we find older documents. I already did this once and didn't see any obvious way this could happen, but it would be good to have another set of eyes. Note that we store the date to start querying from based on the dates of the documents returned; it's stored in the rule task state. I think we could also add some kind of diagnostic: have the rule peek at the documents returned and determine whether they are in the search range. If not, log an error message with plenty of info (search JSON, search options, results) and a unique tag.
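A minimal sketch of the date tracking described above, with invented names (the real logic lives in the ES query rule executor, not shown in this thread):

```ts
// Sketch only: invented names, not the actual Kibana rule executor code.
// The rule stores the timestamp of the newest document it has seen in its
// task state, and the next run searches from that point forward.
interface EsQueryRuleState {
  latestTimestamp?: string; // ISO-8601 date of the newest hit seen so far
}

function getSearchWindow(state: EsQueryRuleState, windowMs: number, now = new Date()) {
  const dateEnd = now.toISOString();
  // If the stored state is missing or unparsable, fall back to a full window;
  // a bug here (e.g. an undefined start date) would make old documents match.
  const fromState = state.latestTimestamp ? Date.parse(state.latestTimestamp) : NaN;
  const dateStart = isNaN(fromState)
    ? new Date(now.getTime() - windowMs).toISOString()
    : new Date(fromState).toISOString();
  return { dateStart, dateEnd };
}
```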
I tried out an internal search facility to see if Elasticsearch is known to return documents not matching the specified filter:

question:

answer:

We are currently placing the time window for the search in a … Seems unlikely to be 4, either, since we only see this transiently; presumably we'd see it more consistently if there were a mapping issue, for instance the same old doc showing up in the search hits on subsequent rule runs, and we don't. We'd have to check whether there is an analyzer on the time fields, but it seems hard to imagine that's the problem (5); again, we'd see the same old docs in subsequent runs, and we haven't in practice. AFAIK we do not have multiple versions of ES in the mix here, ruling out 6. These aren't nested documents, ruling out 7. That leaves 2, 3, and 8. I have NO IDEA how accurate this answer is; it's somewhat AI-generated :-). The simplest answer, and perhaps the easiest to check, is whether we had a shard failure (3). I guess look in the ES logs around the time of the rule run?
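A complementary check from the Kibana side: every search response carries shard statistics, so the rule could inspect those directly. A sketch using the Elasticsearch JS client (index name and query are placeholders):

```ts
import { Client } from '@elastic/elasticsearch';

// Sketch: a quick way to check the shard-failure hypothesis is to look at
// the _shards metadata returned with every search response. A non-zero
// `failed` count means some shards did not contribute results.
async function searchWithShardCheck(client: Client, index: string) {
  const resp = await client.search({ index, query: { match_all: {} } });
  if (resp._shards.failed > 0) {
    console.warn(`search had ${resp._shards.failed} failed shard(s)`, resp._shards.failures);
  }
  return resp.hits.hits;
}
```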
I've looked into this a few times and come up with nothing. One thing from https://github.com/elastic/sdh-kibana/issues/4177 is that it appears old documents, outside the range we were searching, were returned as hits. That could explain the other referenced issues as well. So here's one thing we can do to try to "catch" that issue: add some code after the query runs to check that the documents' time fields match the range we searched for. If we find documents out of range (basically, old documents that should not have been in the search), dump a bunch of info to the logger: the query, the document IDs and times, maybe some of the search result metadata to see if there are any clues there. Perhaps we should even return an error from the rule run, so that we do NOT create alerts, and make the failure more obvious than just being logged. These appear to be transient issues, so the customer very likely wouldn't notice an error in the rule run since the next one is likely to succeed, so I'm a little torn on throwing the error.
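A minimal sketch of that check, with invented names for the hit shape and logger (the real change landed later in #186332):

```ts
// Illustrative sketch of the proposed "catch" check: after the query runs,
// verify every hit's time field falls inside [dateStart, dateEnd], and dump
// diagnostics if any hit is out of range. Real code would also handle
// epoch-millis time fields, not just ISO strings.
interface Hit {
  _id: string;
  _source: Record<string, unknown>;
}

function checkHitsInRange(
  hits: Hit[],
  timeField: string,
  dateStart: string,
  dateEnd: string,
  query: unknown,
  logger: { error: (msg: string) => void }
) {
  const start = Date.parse(dateStart);
  const end = Date.parse(dateEnd);
  for (const hit of hits) {
    const raw = hit._source[timeField];
    const epoch = typeof raw === 'string' ? Date.parse(raw) : NaN;
    if (isNaN(epoch) || epoch < start || epoch > end) {
      logger.error(
        `hit ${hit._id} with ${timeField}=${String(raw)} is outside ` +
          `[${dateStart}, ${dateEnd}]; query: ${JSON.stringify(query)}; ` +
          `document: ${JSON.stringify(hit._source)}`
      );
    }
  }
}
```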
…ge (elastic#186332)

resolves elastic#175980

## Summary

Adds a check with logging if an ES query rule returns hits which fall outside the time range it's searching. This shouldn't ever happen, but it seems to be happening on rare occasions, so we wanted to add some diagnostics to help narrow down the problem.

Note that the ES|QL flavor of the rule does not use this diagnostic, just search source (KQL) and query DSL.

We check three things:

- ensure the `dateStart` sent to fetch was valid
- ensure the `dateEnd` sent to fetch was valid
- ensure the relevant time fields in hits are within the dateStart/dateEnd range

These produce three different error messages:

`For rule '<rule-id>', hits were returned with invalid time range start date '<date>' from field '<field>' using query <query>`

`For rule '<rule-id>', hits were returned with invalid time range end date '<date>' from field '<field>' using query <query>`

`For rule '<rule-id>', the hit with date '<date>' from field '<field>' is outside the query time range. Query: <query>. Document: <document>`

Each message has one tag on it: `query-result-out-of-time-range`

## To Verify

To test invalid dateStart/dateEnd, hack the Kibana code to set the values to NaNs:

https://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L263-L264

For instance, change that to:

    const epochStart = new Date('x').getTime();
    const epochEnd = new Date('y').getTime();

To test the individual document hits, first back out the change you made above; when those error, the checks we're testing below do not run. Hack the Kibana code to make the time out of range:

https://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L294

For instance, change that to:

    const epochDate = epochStart - 100;

For both tests, create an ES query rule (KQL or DSL), make the relevant changes, and arrange for the rule to get hits each time. The relevant messages should be logged in the Kibana console when the rule runs.

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: Elastic Machine <[email protected]>

(cherry picked from commit e12e449)

# Conflicts:
#	x-pack/plugins/stack_alerts/server/rule_types/es_query/lib/fetch_search_source_query.ts
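For reference, a condensed sketch of the first two checks; the function and message text approximate what's described above rather than quoting the actual Kibana source:

```ts
// Approximation of the dateStart/dateEnd validity checks described in the
// PR; not the actual Kibana implementation.
function validateTimeRange(
  ruleId: string,
  dateStart: string,
  dateEnd: string,
  logger: { error: (msg: string, meta?: { tags: string[] }) => void }
): boolean {
  const tags = ['query-result-out-of-time-range'];
  if (isNaN(Date.parse(dateStart))) {
    logger.error(`For rule '${ruleId}', invalid time range start date '${dateStart}'`, { tags });
    return false;
  }
  if (isNaN(Date.parse(dateEnd))) {
    logger.error(`For rule '${ruleId}', invalid time range end date '${dateEnd}'`, { tags });
    return false;
  }
  return true;
}
```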
…ime range (#186332) (#189019)

# Backport

This will backport the following commits from `main` to `8.15`:

- [[ResponseOps] log error when ES Query rules find docs out of time range (#186332)](#186332)

### Questions?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)
Whoops, didn't really mean to close this; we still haven't figured out the problem, but we do have some new diagnostics from #186332 if we see this happen again ...
In both SDH 5049 and 5056, it appears there was ILM activity before/during the rule run. Thinking this may have something to do with it ...
Regarding the new logging we added in #186332: it turns out it's producing a lot of false positives and is difficult to analyze by sight, so issue #200023 has been opened to improve that. I did write a Node.js script to analyze the existing messages, though: https://gist.github.com/pmuellr/fe30b8e28261e2d22996e6ba573945b3 - it expects to be passed a file containing the results of an ES query, custom-tailored to our overview clusters. It should weed through the false positives and log any true positives (or potential positives, anyway). I ran it last night on a large set of the messages we're seeing and found no true positives.
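For flavor only (this is not the linked gist, and the `_source` field names are guesses), the core of such an analysis script might look like:

```ts
import { readFileSync } from 'node:fs';

// Not the linked gist; the _source field names below are guesses for
// illustration. Reads an ES query result file and prints only the hits
// whose logged date actually falls outside the logged query range.
const body = JSON.parse(readFileSync(process.argv[2], 'utf8'));
for (const hit of body.hits?.hits ?? []) {
  const src = hit._source ?? {};
  const t = Date.parse(src.hitDate);
  const start = Date.parse(src.dateStart);
  const end = Date.parse(src.dateEnd);
  if (!isNaN(t) && !isNaN(start) && !isNaN(end) && (t < start || t > end)) {
    console.log('potential true positive:', hit._id, src.hitDate, src.dateStart, src.dateEnd);
  }
}
```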
Discussing this issue with the team, it was suggested we could try to repro this with ILM, to see if we can catch it ourselves. Here's the basic idea: create an index managed by ILM that rolls over frequently, and point an ES query rule at it.
This puts us in a place where we have a rule querying over an index subject to ILM. What I'm not sure about is what the rule should alert on. I guess we can depend on the existing logging and script (mentioned in the comment above ^^^) and have the rule just match on all the event log docs.

Note: I just realized we no longer have a separate ILM policy for the event log; we now use a setting in the index template that handles this:

kibana/x-pack/plugins/event_log/server/es/documents.ts (lines 12 to 37 in 190430b)
You can use the data stream lifecycle API to modify the default 90d setting; the following seems to work:
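The actual command didn't survive the copy here; based on the documented data stream lifecycle endpoint, it was presumably something along these lines (the event log data stream name is an assumption):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// PUT /_data_stream/<name>/_lifecycle with a shorter data_retention.
// The data stream name below is an assumption, not confirmed by the thread.
await client.transport.request({
  method: 'PUT',
  path: '/_data_stream/.kibana-event-log-ds/_lifecycle',
  body: { data_retention: '1d' },
});
```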
I didn't see a way to manually roll over, and it didn't seem to happen automagically with the `data_retention` setting either.
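For what it's worth, the rollover API is documented to accept a data stream name, so forcing a manual rollover might still be worth a try (sketch, same assumed data stream name as above):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// POST /<data-stream>/_rollover is documented to force the write index of a
// data stream to roll over; the data stream name here is an assumption.
await client.transport.request({
  method: 'POST',
  path: '/.kibana-event-log-ds/_rollover',
});
```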
Ah, I don't think using a plain old data stream with the lifecycle `data_retention` is going to work: it doesn't actually roll over indices and retain them, it only rolls over and deletes the old one. Still ... maybe? I'll let this go for a while, but I suspect we will have to create an ILM policy, an index template, and an initial index. And then? I guess have the rule read from that index, and add an index connector to also write to that index; the rule will be reading and writing to the same index :-).
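A sketch of that setup, with invented policy name and thresholds: roll over quickly in the hot phase and delete backing indices only after a delay, so rolled-over indices linger while the rule runs.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Invented policy: roll over after 10 docs or 1 minute, and keep rolled-over
// indices around for an hour before deleting them.
await client.ilm.putLifecycle({
  name: 'repro-quick-rollover',
  policy: {
    phases: {
      hot: { actions: { rollover: { max_docs: 10, max_age: '1m' } } },
      delete: { min_age: '1h', actions: { delete: {} } },
    },
  },
});

// Invented template: indices matching repro-* use the policy and a shared
// rollover alias, which the rule reads from and the index connector writes to.
await client.indices.putIndexTemplate({
  name: 'repro-quick-rollover-template',
  index_patterns: ['repro-*'],
  template: {
    settings: {
      'index.lifecycle.name': 'repro-quick-rollover',
      'index.lifecycle.rollover_alias': 'repro-alias',
    },
  },
});
```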
We have gotten several reports from users who received alert notifications from the ES query rule but were unable to trace back to the underlying documents that may have generated the alert. We need to investigate how and why that may be happening.
There seem to be several commonalities between the rule definitions that spawn the zombie alerts:
- `excludeMatchesFromPreviousRuns` set to `true`