Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] Stack Rules on Rule Registry POC #96966

Closed

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Apr 13, 2021

Resolves #98319

Summary (WIP)

Use rule registry to write alerts-as-data for Index Threshold and ES Query rule types. Using terminology from the Alerts as Data Schema Definition issue, I tried to determine what, if anything, to write out as alert (signal) data and metric (evaluation) data for these two rule types.

Initially, I used the existing CreateLifecycleRuleType to write lifecycle data for these two rule types. Index threshold and ES query are similar in that they both specify a threshold condition. During each execution cycle, the condition is evaluated, which can generate a metric document. When the condition is met, the rule becomes active and will stay active if the condition continues to be met in subsequent rule executions. When the condition is not met, the rule is considered recovered. Each active or recovered alert can generate an alert document. Grouping the active alerts with a UUID as in the lifecycle rule makes sense as well. I extended the functionality of the CreateLifecycleRuleType with a CreateThresholdRuleType that is very similar, but allows different status and action constants (active/recovered vs open/closed) and two additional TAdditionalRuleExecutorServices for writing out metric and event documents.

Index Threshold

Desired data

  • event.kind: metric - A metric document is written during each rule execution, for each alert id. Contains the numeric value that is evaluated against the condition for this rule. *Should this include the threshold and comparator from the rule params? Could also include more information like description of field this is "avg of cpu.pct" *

    Example metric document
    {
        "@timestamp" : "2021-04-14T15:17:00.113Z",
        "event.kind" : "metric",
        "kibana.rac.alert.id" : "host-1",
        "kibana.rac.alert.threshold" : 0.6, <-- copied from rule params, maybe not needed?
        "kibana.rac.alert.value" : 0.4730000009139379, <-- value evaluated during rule execution
        "kibana.rac.producer" : "stackAlerts",
        "rule.category" : "Index threshold",
        "rule.name" : "test rule",
        "rule.id" : ".index-threshold",
        "rule.uuid" : "d9805ed0-9d2d-11eb-8c96-c9b25c6f1379",
        "tags" : [ ]
     }
    
  • event.kind: alert - An alert document is written out each time the condition being evaluated during rule execution is true. A single recovery alert document is written out when the condition evaluation changes from true to false. This is the "mutable" doc if we are making docs mutable, so in the future, instead of a series of alert documents with the same kibana.rac.alert.uuid and a series of statuses: active, active, active, active, recovered, this might be a single document with a kibana.rac.alert.uuid, a start and end date and a duration?

    Example active alert document
    {
        "@timestamp" : "2021-04-14T15:15:54.094Z",
        "event.action" : "active",
        "event.kind" : "alert",
        "kibana.rac.alert.duration.us" : 132042000,
        "kibana.rac.alert.id" : "host-2",
        "kibana.rac.alert.start" : "2021-04-14T15:13:42.052Z",
        "kibana.rac.alert.status" : "active",
        "kibana.rac.alert.threshold" : 0.6, <-- copied from rule params, maybe not needed?
        "kibana.rac.alert.uuid" : "cf58a558-ea2a-4f1c-8276-03bfea640abf",
        "kibana.rac.alert.value" : 0.8730000009139379, <-- value evaluated during rule execution
        "kibana.rac.producer" : "stackAlerts",
        "rule.category" : "Index threshold",
        "rule.name" : "test rule",
        "rule.id" : ".index-threshold",
        "rule.uuid" : "d9805ed0-9d2d-11eb-8c96-c9b25c6f1379",
        "tags" : [ ]
    }
    
    Example recovery alert document
    {
        "@timestamp" : "2021-04-14T15:16:27.222Z",
        "event.action" : "recovered",
        "event.kind" : "alert",
        "kibana.rac.alert.duration.us" : 165170000,
        "kibana.rac.alert.end" : "2021-04-14T15:16:27.222Z",
        "kibana.rac.alert.id" : "host-2",
        "kibana.rac.alert.start" : "2021-04-14T15:13:42.052Z",
        "kibana.rac.alert.status" : "recovered",
        "kibana.rac.alert.threshold" : 0.6, <-- copied from rule params, maybe not needed?
        "kibana.rac.alert.value" : 0.8730000009139379, <-- default lifecycle rule behavior, this is copied from the last active alert instance for this alert uuid grouping
        "kibana.rac.alert.uuid" : "cf58a558-ea2a-4f1c-8276-03bfea640abf",
        "kibana.rac.producer" : "stackAlerts",
        "rule.category" : "Index threshold",
        "rule.name" : "test rule",
        "rule.id" : ".index-threshold",
        "rule.uuid" : "d9805ed0-9d2d-11eb-8c96-c9b25c6f1379",
        "tags" : [ ]
    }
    

ES Query (WIP)

Desired data

  • event.kind: alert - Works same as described for Index Threshold
  • event.kind: metric - Works same as described for Index Threshold
  • event.kind: event - It may be useful to copy the source of the matching docs into this index. How useful though? We have no requirement that the rule queries an ECS compliant document so these source documents could be very large and bloat the index for no reason. We could make it a rule param to opt into copy the source, however since a single index will be used for all stack rules, we could have the copied documents from wildly divergent source indices, which would cause mapping conflicts and explosions. In addition, strict mapping is set to true for the alerts-as-data indices so any field that is non-ECS compliant will currently error and cause the data not to be indexed.

What could an alerts-as-data view look like

Screen Shot 2021-04-22 at 3 03 01 PM

This shows when the alert was triggered and resolved for each alert id in the rule.

Questions/Thoughts

  1. Are rule registry helper factories meant to be generic, framework level factories that can be reused by different rule types or is it generally expected that each rule type will create their own "factories" that is specific to their use case? If it is the former, it would be helpful for the factories to allow overriding the values being written out (event.action and kibana.rac.alert.status in the CreateLifecycleRuleType, for example).
  2. When writing out alert and metric data, I created factories with service functions for each of these types that the rule executor could call to write specific data. Is this the right. For example, createThresholdRuleTypeFactory has an alertWithThreshold and a metricWithThreshold callback. If these are meant to be generic, framework level helpers, it might be useful to be able to have "alert data factories" and "metric data factories" and be able to compose them in different ways in order to reuse them.
  3. Index threshold and ES query rules, while stack rules, are likely not as uniform as some of the solution specific rule types, so their signal and metric documents may look very different from each other and be used in different ways. I have registered a single stack-alerts type of rule registry, which bootstraps a .kibana-alerts-stack-alerts* index but it might make sense to have separate indices for .kibana-alerts-index-threshold* and .kibana-alerts-es-query data indices as well. Rather than creating a IndexThresholdRuleRegistry & an EsQueryRuleRegistry, it might be nice to create a StackAlertsRuleRegistry but bootstrap multiple data indices.
  4. A decent amount of logic within CreateLifecycleRuleType resembles what is happening in the alerting TaskRunner executeAlertInstances function wrt to figuring out which alerts are new/active/recovered. New things that might be nice to have in the framework. Not sure if this will be true for all rule registry types.
    • getting more context information for recovery alert (CreateLifecycleRuleType copies the latest active document for a recovered alert
    • having alert.uuid to track a grouping of active alerts
    • calculating current duration of active alert grouping

ymao1 added 20 commits April 13, 2021 09:34
…ause strict mapping is enabled. Extracted lifecycle calculation into helper function for reuse
@kibanamachine
Copy link
Contributor

kibanamachine commented Apr 22, 2021

💔 Build Failed

Failed CI Steps


Test Failures

Kibana Pipeline / jest / Jest Tests.x-pack/plugins/apm/server/lib/alerts.Transaction duration anomaly alert doesn't send alert ml is not defined

Link to Jenkins

Standard Out

Failed Tests Reporter:
  - Test has not failed recently on tracked branches


Stack Trace

TypeError: Cannot read property 'uuid' of undefined
    at executor (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/x-pack/plugins/rule_registry/server/rule_registry/rule_type_helpers/create_lifecycle_rule_type_factory.ts:98:26)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)

Kibana Pipeline / jest / Jest Tests.x-pack/plugins/apm/server/lib/alerts.Transaction duration anomaly alert doesn't send alert ml jobs are not available

Link to Jenkins

Standard Out

Failed Tests Reporter:
  - Test has not failed recently on tracked branches


Stack Trace

TypeError: Cannot read property 'uuid' of undefined
    at executor (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/x-pack/plugins/rule_registry/server/rule_registry/rule_type_helpers/create_lifecycle_rule_type_factory.ts:98:26)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at Object.<anonymous> (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/x-pack/plugins/apm/server/lib/alerts/register_transaction_duration_anomaly_alert_type.test.ts:56:7)
    at _callCircusTest (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:212:5)
    at _runTest (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:149:3)
    at _runTestsForDescribeBlock (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:63:9)
    at _runTestsForDescribeBlock (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:57:9)
    at _runTestsForDescribeBlock (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:57:9)
    at run (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:25:3)
    at runAndTransformResultsToJestFormat (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapterInit.js:176:21)
    at jestAdapter (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapter.js:109:19)
    at runTestInternal (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-runner/build/runTest.js:380:16)
    at runTest (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-runner/build/runTest.js:472:34)
    at Object.worker (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-runner/build/testWorker.js:133:12)

Kibana Pipeline / jest / Jest Tests.x-pack/plugins/apm/server/lib/alerts.Transaction duration anomaly alert doesn't send alert anomaly is less than threshold

Link to Jenkins

Standard Out

Failed Tests Reporter:
  - Test has not failed recently on tracked branches


Stack Trace

TypeError: Cannot read property 'uuid' of undefined
    at executor (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/x-pack/plugins/rule_registry/server/rule_registry/rule_type_helpers/create_lifecycle_rule_type_factory.ts:98:26)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at Object.<anonymous> (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/x-pack/plugins/apm/server/lib/alerts/register_transaction_duration_anomaly_alert_type.test.ts:107:7)
    at _callCircusTest (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:212:5)
    at _runTest (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:149:3)
    at _runTestsForDescribeBlock (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:63:9)
    at _runTestsForDescribeBlock (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:57:9)
    at _runTestsForDescribeBlock (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:57:9)
    at run (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/run.js:25:3)
    at runAndTransformResultsToJestFormat (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapterInit.js:176:21)
    at jestAdapter (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapter.js:109:19)
    at runTestInternal (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-runner/build/runTest.js:380:16)
    at runTest (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-runner/build/runTest.js:472:34)
    at Object.worker (/var/lib/jenkins/workspace/elastic+kibana+pipeline-pull-request/kibana/node_modules/jest-runner/build/testWorker.js:133:12)

and 11 more failures, only showing the first 3.

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
stackAlerts 51 54 +3
triggersActionsUi 365 369 +4
total +7

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
triggersActionsUi 1.5MB 1.5MB +11.3KB

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
stackAlerts 17.7KB 19.1KB +1.4KB
triggersActionsUi 106.1KB 106.3KB +274.0B
total +1.7KB
Unknown metric groups

API count

id before after diff
ruleRegistry 52 54 +2
stackAlerts 4 8 +4
triggersActionsUi 221 222 +1
total +7

API count missing comments

id before after diff
ruleRegistry 52 54 +2
stackAlerts 4 8 +4
triggersActionsUi 212 213 +1
total +7

API count with any type

id before after diff
stackAlerts 0 1 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@gmmorris
Copy link
Contributor

Rather than creating a IndexThresholdRuleRegistry & an EsQueryRuleRegistry, it might be nice to create a StackAlertsRuleRegistry but bootstrap multiple data indices.

Have we identified specific fields that need to be different?
My gut is that, unless we need to keep them the same, it's better to split them apart, because it allows us to introduce divergence later.

The downside being that we're introducing another index, but we can use aliases to query across them when needed, so the benefit might outweigh the drawback here. 🤔

@ymao1
Copy link
Contributor Author

ymao1 commented Apr 23, 2021

Have we identified specific fields that need to be different?

Index threshold and ES query would actually play nicely together in the same index because both are threshold dependent. The tracking containment alert is the odd one out, where it lives inside stack rules so would presumably be written to the StackRulesRegistry but would contain geo specific information. If we assume that we might have more stack rules like that, we could end up with a stack rules field mapping that is a superset of a bunch of field mappings that are only used by one rule type. Maybe that's not such a big deal though if they are all ECS compliant? I know there's been a discussion about having all ECS fields in the base field mapping so if that's the direction we're veering, then this is not such a problem since the field mapping will be filled with fields that are unused.

@gmmorris
Copy link
Contributor

Maybe that's not such a big deal though if they are all ECS compliant? I know there's been a discussion about having all ECS fields in the base field mapping so if that's the direction we're veering, then this is not such a problem since the field mapping will be filled with fields that are unused.

That's a fair question, but I don't think we want to box ourselves into that corner just yet.
It might be worth talking to @elastic/kibana-gis and see if they can give a hand as part of this POC?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Alerting] POC for stack rules to use rule registry
3 participants