[Alerting] Stack Rules on Rule Registry POC #96966

ymao1 · 2021-04-13T13:38:36Z

Resolves #98319

Summary (WIP)

Use rule registry to write alerts-as-data for Index Threshold and ES Query rule types. Using terminology from the Alerts as Data Schema Definition issue, I tried to determine what, if anything, to write out as alert (signal) data and metric (evaluation) data for these two rule types.

Initially, I used the existing CreateLifecycleRuleType to write lifecycle data for these two rule types. Index threshold and ES query are similar in that they both specify a threshold condition. During each execution cycle, the condition is evaluated, which can generate a metric document. When the condition is met, the rule becomes active and will stay active if the condition continues to be met in subsequent rule executions. When the condition is not met, the rule is considered recovered. Each active or recovered alert can generate an alert document. Grouping the active alerts with a UUID as in the lifecycle rule makes sense as well. I extended the functionality of the CreateLifecycleRuleType with a CreateThresholdRuleType that is very similar, but allows different status and action constants (active/recovered vs open/closed) and two additional TAdditionalRuleExecutorServices for writing out metric and event documents.
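The active/recovered tracking described above can be sketched roughly as follows. This is a minimal illustration, not the rule registry's implementation: `trackAlerts`, `TrackedAlert`, `LifecycleDoc`, and the `newUuid` callback are hypothetical names, and the state is assumed to be carried between executions by the caller.

```typescript
// Hypothetical shapes for illustration; not the rule registry's actual types.
interface TrackedAlert {
  uuid: string; // shared by every document written during one active period
  startedMs: number; // epoch milliseconds when the alert first became active
}

interface LifecycleDoc {
  alertId: string;
  uuid: string;
  status: 'active' | 'recovered';
  start: string;
  end?: string;
  durationUs: number;
}

type TrackedState = Record<string, TrackedAlert>;

const hasOwn = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

function trackAlerts(
  previous: TrackedState,
  activeIds: string[],
  nowMs: number,
  newUuid: () => string
): { docs: LifecycleDoc[]; next: TrackedState } {
  const docs: LifecycleDoc[] = [];
  const next: TrackedState = {};

  // Condition met: reuse the uuid if the alert was already active,
  // otherwise start a new active period with a fresh uuid.
  for (const alertId of activeIds) {
    const tracked = hasOwn(previous, alertId)
      ? previous[alertId]
      : { uuid: newUuid(), startedMs: nowMs };
    next[alertId] = tracked;
    docs.push({
      alertId,
      uuid: tracked.uuid,
      status: 'active',
      start: new Date(tracked.startedMs).toISOString(),
      durationUs: (nowMs - tracked.startedMs) * 1000,
    });
  }

  // Active in the previous execution but not in this one:
  // write a single recovery document and stop tracking the alert.
  for (const alertId of Object.keys(previous)) {
    if (!hasOwn(next, alertId)) {
      const tracked = previous[alertId];
      docs.push({
        alertId,
        uuid: tracked.uuid,
        status: 'recovered',
        start: new Date(tracked.startedMs).toISOString(),
        end: new Date(nowMs).toISOString(),
        durationUs: (nowMs - tracked.startedMs) * 1000,
      });
    }
  }

  return { docs, next };
}
```

Passing the returned `next` state into the following execution is what keeps the same UUID attached to an alert for as long as its condition continues to be met.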

Index Threshold

Desired data

event.kind: metric - A metric document is written during each rule execution, for each alert id. It contains the numeric value that is evaluated against the condition for this rule. Open questions: should this include the threshold and comparator from the rule params? It could also include more information, such as a description of the evaluated field (for example, "avg of cpu.pct").

Example metric document

{
    "@timestamp" : "2021-04-14T15:17:00.113Z",
    "event.kind" : "metric",
    "kibana.rac.alert.id" : "host-1",
    "kibana.rac.alert.threshold" : 0.6, <-- copied from rule params, maybe not needed?
    "kibana.rac.alert.value" : 0.4730000009139379, <-- value evaluated during rule execution
    "kibana.rac.producer" : "stackAlerts",
    "rule.category" : "Index threshold",
    "rule.name" : "test rule",
    "rule.id" : ".index-threshold",
    "rule.uuid" : "d9805ed0-9d2d-11eb-8c96-c9b25c6f1379",
    "tags" : [ ]
}
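As a rough sketch of how a metric document like the one above could be assembled on each execution: the field names are taken from the example, while `conditionMet`, `buildMetricDoc`, `RuleInfo`, and the comparator subset are assumptions made for illustration, not the rule type's actual code.

```typescript
// Assumed subset of comparators, for illustration only.
type Comparator = '>' | '>=' | '<' | '<=';

// Evaluate the value computed during rule execution against the threshold.
function conditionMet(value: number, comparator: Comparator, threshold: number): boolean {
  switch (comparator) {
    case '>':
      return value > threshold;
    case '>=':
      return value >= threshold;
    case '<':
      return value < threshold;
    case '<=':
      return value <= threshold;
  }
}

// Hypothetical container for the rule fields shown in the example document.
interface RuleInfo {
  category: string;
  name: string;
  id: string;
  uuid: string;
}

// Build one metric document per alert id per execution,
// regardless of whether the condition was met.
function buildMetricDoc(
  rule: RuleInfo,
  alertId: string,
  value: number,
  threshold: number,
  timestamp: string
): Record<string, unknown> {
  return {
    '@timestamp': timestamp,
    'event.kind': 'metric',
    'kibana.rac.alert.id': alertId,
    'kibana.rac.alert.threshold': threshold, // copied from rule params
    'kibana.rac.alert.value': value, // value evaluated during rule execution
    'kibana.rac.producer': 'stackAlerts',
    'rule.category': rule.category,
    'rule.name': rule.name,
    'rule.id': rule.id,
    'rule.uuid': rule.uuid,
    tags: [],
  };
}
```

In this sketch the metric document records the evaluated value whether or not the alert is active, which is what distinguishes it from the alert documents described next.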

event.kind: alert - An alert document is written each time the condition evaluated during rule execution is true. A single recovery alert document is written when the condition evaluation changes from true to false. This is the "mutable" doc if we are making docs mutable. So in the future, instead of a series of alert documents that share the same kibana.rac.alert.uuid and carry the statuses active, active, active, active, recovered, might this be a single document with a kibana.rac.alert.uuid, a start and end date, and a duration?

Example active alert document

{
    "@timestamp" : "2021-04-14T15:15:54.094Z",
    "event.action" : "active",
    "event.kind" : "alert",
    "kibana.rac.alert.duration.us" : 132042000,
    "kibana.rac.alert.id" : "host-2",
    "kibana.rac.alert.start" : "2021-04-14T15:13:42.052Z",
    "kibana.rac.alert.status" : "active",
    "kibana.rac.alert.threshold" : 0.6, <-- copied from rule params, maybe not needed?
    "kibana.rac.alert.uuid" : "cf58a558-ea2a-4f1c-8276-03bfea640abf",
    "kibana.rac.alert.value" : 0.8730000009139379, <-- value evaluated during rule execution
    "kibana.rac.producer" : "stackAlerts",
    "rule.category" : "Index threshold",
    "rule.name" : "test rule",
    "rule.id" : ".index-threshold",
    "rule.uuid" : "d9805ed0-9d2d-11eb-8c96-c9b25c6f1379",
    "tags" : [ ]
}

Example recovery alert document

{
    "@timestamp" : "2021-04-14T15:16:27.222Z",
    "event.action" : "recovered",
    "event.kind" : "alert",
    "kibana.rac.alert.duration.us" : 165170000,
    "kibana.rac.alert.end" : "2021-04-14T15:16:27.222Z",
    "kibana.rac.alert.id" : "host-2",
    "kibana.rac.alert.start" : "2021-04-14T15:13:42.052Z",
    "kibana.rac.alert.status" : "recovered",
    "kibana.rac.alert.threshold" : 0.6, <-- copied from rule params, maybe not needed?
    "kibana.rac.alert.value" : 0.8730000009139379, <-- default lifecycle rule behavior, this is copied from the last active alert instance for this alert uuid grouping
    "kibana.rac.alert.uuid" : "cf58a558-ea2a-4f1c-8276-03bfea640abf",
    "kibana.rac.producer" : "stackAlerts",
    "rule.category" : "Index threshold",
    "rule.name" : "test rule",
    "rule.id" : ".index-threshold",
    "rule.uuid" : "d9805ed0-9d2d-11eb-8c96-c9b25c6f1379",
    "tags" : [ ]
}
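To make the "single mutable document" question concrete, here is a minimal sketch that collapses a series of alert documents into one document per kibana.rac.alert.uuid by keeping the most recent one. The `AlertEventDoc` type and `collapseByUuid` function are hypothetical; nothing here reflects how mutable documents would actually be implemented.

```typescript
// Only the fields needed for this sketch; names follow the example documents.
interface AlertEventDoc {
  '@timestamp': string;
  'kibana.rac.alert.uuid': string;
  'kibana.rac.alert.status': 'active' | 'recovered';
  'kibana.rac.alert.start': string;
  'kibana.rac.alert.end'?: string;
  'kibana.rac.alert.duration.us': number;
}

// Keep only the latest document for each uuid. Because every document already
// carries the start time and the running duration, the latest one summarizes
// the whole active period (and includes the end time once recovered).
function collapseByUuid(docs: AlertEventDoc[]): Record<string, AlertEventDoc> {
  const latest: Record<string, AlertEventDoc> = {};
  for (const doc of docs) {
    const uuid = doc['kibana.rac.alert.uuid'];
    const current = latest[uuid];
    if (!current || Date.parse(doc['@timestamp']) >= Date.parse(current['@timestamp'])) {
      latest[uuid] = doc;
    }
  }
  return latest;
}
```

Applied to the active and recovery examples above, this would leave one document for cf58a558-ea2a-4f1c-8276-03bfea640abf with status recovered, the original start, an end, and the final duration.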

ES Query (WIP)

Desired data

event.kind: alert - Works the same as described for Index Threshold
event.kind: metric - Works the same as described for Index Threshold
event.kind: event - It may be useful to copy the source of the matching docs into this index, but how useful is unclear. We have no requirement that the rule queries ECS-compliant documents, so these source documents could be very large and bloat the index for little benefit. We could add a rule param to opt into copying the source; however, since a single index will be used for all stack rules, the copied documents could come from wildly divergent source indices, which would cause mapping conflicts and mapping explosions. In addition, strict mapping is enabled for the alerts-as-data indices, so any field that is not ECS-compliant currently causes an error and the data is not indexed.
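
The strict-mapping concern can be illustrated with a simplified simulation. With a strict mapping, Elasticsearch rejects a document that contains a field absent from the mapping; the sketch below only mimics that check locally. `flattenKeys`, `findUnmappedFields`, and the sample field list are hypothetical and are not part of the rule registry.

```typescript
// Flatten a nested source document into dotted field paths,
// e.g. { host: { name: 'a' } } -> ['host.name'].
function flattenKeys(obj: Record<string, unknown>, prefix = ''): string[] {
  const keys: string[] = [];
  for (const key of Object.keys(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    const value = obj[key];
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      keys.push(...flattenKeys(value as Record<string, unknown>, path));
    } else {
      keys.push(path);
    }
  }
  return keys;
}

// Return the fields of a source document that the (assumed) mapping does not
// define. Under a strict mapping, a non-empty result means the whole document
// would be rejected rather than partially indexed.
function findUnmappedFields(
  mappedFields: string[],
  source: Record<string, unknown>
): string[] {
  return flattenKeys(source).filter((field) => mappedFields.indexOf(field) === -1);
}
```

A source document from an arbitrary index, for example one with a custom `cpu.pct` field, would produce a non-empty list here, which is the case where copying the source fails outright.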

What could an alerts-as-data view look like