[Alerting] POC for stack rules to use rule registry #98319

Closed
ymao1 opened this issue Apr 26, 2021 · 5 comments
Labels: Feature:Alerting, Team:ResponseOps, Theme: rac

Comments

@ymao1
Contributor

ymao1 commented Apr 26, 2021

With the merging of the rules registry plugin, Alerting would like to explore how stack rules might work with the rules registry.

@botelastic botelastic bot added the needs-team label Apr 26, 2021
@ymao1
Contributor Author

ymao1 commented Apr 26, 2021

General thoughts after creating a POC for writing out alerts as data for stack rules using the rule registry:

Solution level vs rule type level
The rule registry seems to assume that alerts-as-data will be written out at the solution level. Each solution (o11y, security) creates its own rule registry during plugin setup, which bootstraps an alerts-as-data index for that solution (.alerts-observability*, .alerts-security*). This mostly works because the types of rules and the types of alerts written out are fairly consistent across these two solutions. The assumption works less well if we consider "Stack Rules" to be a solution and register a single rule registry for all stack rules. There is no guarantee or requirement that stack rules are consistent, share similar workflows, or write similar types of data. (Tracking containment within stack rules is a good example of an outlier.) With the current rule registry, we could work around this by creating multiple rule registries (one for each stack rule?) during plugin setup, but that raises the question of whether this framework-level assumption of grouping by solution is necessary.

Cross-solution access to alerts
As discussed above, when rule registries are created, the indices that hold the alerts are bootstrapped. This means there is a separate alerts-as-data index for observability, security and (with this POC) stack rules. Each ruleRegistryClient then scopes all of its requests to the specific index it bootstrapped. This works well when only security alerts are shown within security and only observability alerts within observability, but it gets more complicated when we think about using a stack rule within security or observability (currently not done, but possible with Alerting's producer/consumer access model). A user might be able to create an ES query stack rule within security, but security's scoped rule registry would never retrieve the resulting alerts because they live in .alerts-stack-rules*, which it does not have access to.
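
To make the scoping problem concrete, here is a minimal sketch. It is not rule registry code: `createScopedClient`, the concrete index names, the rule type ids, and the `siem` consumer value are all assumptions made for illustration, and an in-memory array stands in for Elasticsearch.

```typescript
// Illustrative only: index names, rule type ids and the consumer value are
// assumptions, and this in-memory store stands in for Elasticsearch.
interface AlertDoc {
  index: string;
  ruleTypeId: string;
  consumer: string;
}

// Supports exact names and simple trailing-wildcard patterns such as ".alerts-security*".
function matchesPattern(index: string, pattern: string): boolean {
  return pattern.endsWith("*")
    ? index.startsWith(pattern.slice(0, -1))
    : index === pattern;
}

// Hypothetical stand-in for a client that scopes every request to one index pattern.
function createScopedClient(store: readonly AlertDoc[], indexPattern: string) {
  return {
    findByConsumer(consumer: string): AlertDoc[] {
      return store.filter(
        (doc) => matchesPattern(doc.index, indexPattern) && doc.consumer === consumer
      );
    },
  };
}

const store: AlertDoc[] = [
  { index: ".alerts-security-default", ruleTypeId: "security.example", consumer: "siem" },
  // A stack rule created from within security: the consumer is security,
  // but the alert document lands in the stack rules index.
  { index: ".alerts-stack-rules-default", ruleTypeId: "stack.es-query", consumer: "siem" },
];

const securityClient = createScopedClient(store, ".alerts-security*");
const crossSolutionClient = createScopedClient(store, ".alerts-*");

const securityVisible = securityClient.findByConsumer("siem");
const allVisible = crossSolutionClient.findByConsumer("siem");
```

In this sketch `securityVisible` contains only the security alert, even though both documents belong to the same consumer. Only a client scoped to a broader pattern sees the stack rule alert as well.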

Rule type factories
This is preliminary until we can determine how much of these "rule type factories" rules can actually reuse, but if each solution or rule type ends up creating its own "ruleTypeFactory", it might make sense to move some of the factory functionality down to the solution instead of maintaining it at the framework level.

Framework vs library
I believe there's already been some chatter in this area, but I like the idea of creating a library of well-tested helper functions that the alerting framework can then migrate to incrementally, rather than a full-fledged framework.

@ymao1
Contributor Author

ymao1 commented Apr 26, 2021

Alerting framework possible improvements

The lifecycleRuleTypeFactory repeats a lot of the logic in the alerting task runner for determining whether an alert is active, new, or recovered, with the nice addition of grouping a series of consecutive active alerts under a UUID (and determining their duration). Security is also creating some ruleTypeFactories for its rule types in a separate POC, and security rules have a different lifecycle than observability rules, so it will be interesting to see how many rule registry executors reuse logic from the alerting task runner versus implementing their own (different) logic. It's possible that the lifecycle determinations we're doing in alerting are too specific to a single type of rule (one with a distinct lifecycle) and that we should make them more generic. A complicating factor is that the alerting task runner functionality is heavily coupled with the event log.
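
For reference, the kind of lifecycle determination described above can be sketched roughly as follows. This is a simplified illustration and not the actual lifecycleRuleTypeFactory or task runner code: `trackLifecycle`, the type names, and the injected `newUuid` generator are hypothetical.

```typescript
// Hypothetical types: state carried between rule executions, keyed by alert id.
interface TrackedAlert {
  uuid: string;
  startedAt: number; // epoch millis of the first consecutive active execution
}

type AlertStatus = "new" | "active" | "recovered";

interface AlertEvent {
  alertId: string;
  uuid: string;
  status: AlertStatus;
  durationMs: number;
}

// Compares the alerts active in this execution with those tracked from the
// previous one, and returns the events to write plus the state to carry forward.
function trackLifecycle(
  previous: Map<string, TrackedAlert>,
  activeIds: readonly string[],
  now: number,
  newUuid: () => string
): { events: AlertEvent[]; next: Map<string, TrackedAlert> } {
  const events: AlertEvent[] = [];
  const next = new Map<string, TrackedAlert>();

  for (const alertId of activeIds) {
    const prior = previous.get(alertId);
    // A consecutive run of active executions shares one UUID and start time.
    const tracked = prior ?? { uuid: newUuid(), startedAt: now };
    next.set(alertId, tracked);
    events.push({
      alertId,
      uuid: tracked.uuid,
      status: prior ? "active" : "new",
      durationMs: now - tracked.startedAt,
    });
  }

  // Anything tracked before but no longer active has recovered and is dropped from state.
  previous.forEach((tracked, alertId) => {
    if (!next.has(alertId)) {
      events.push({
        alertId,
        uuid: tracked.uuid,
        status: "recovered",
        durationMs: now - tracked.startedAt,
      });
    }
  });

  return { events, next };
}
```

A rule type with a different lifecycle (for example, one that never recovers alerts) would need different logic here, which is the concern about how generic this determination can be.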

@ymao1 ymao1 added the Feature:Alerting, Team:ResponseOps, and Theme: rac labels Apr 26, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@botelastic botelastic bot removed the needs-team label Apr 26, 2021
@ymao1 ymao1 self-assigned this Apr 26, 2021
@dgieselaar
Member

> General thoughts after creating a POC for writing out alerts as data for stack rules using the rule registry:
>
> Solution level vs rule type level
>
> The rule registry seems to assume that alerts-as-data will be written out at the solution level. Each solution (o11y, security) creates its own rule registry during plugin setup, which bootstraps an alerts-as-data index for that solution (.alerts-observability*, .alerts-security*). This mostly works because the types of rules and the types of alerts written out are fairly consistent across these two solutions. The assumption works less well if we consider "Stack Rules" to be a solution and register a single rule registry for all stack rules. There is no guarantee or requirement that stack rules are consistent, share similar workflows, or write similar types of data. (Tracking containment within stack rules is a good example of an outlier.) With the current rule registry, we could work around this by creating multiple rule registries (one for each stack rule?) during plugin setup, but that raises the question of whether this framework-level assumption of grouping by solution is necessary.
>
> Cross-solution access to alerts
> As discussed above, when rule registries are created, the indices that hold the alerts are bootstrapped. This means there is a separate alerts-as-data index for observability, security and (with this POC) stack rules. Each ruleRegistryClient then scopes all of its requests to the specific index it bootstrapped. This works well when only security alerts are shown within security and only observability alerts within observability, but it gets more complicated when we think about using a stack rule within security or observability (currently not done, but possible with Alerting's producer/consumer access model). A user might be able to create an ES query stack rule within security, but security's scoped rule registry would never retrieve the resulting alerts because they live in .alerts-stack-rules*, which it does not have access to.

The way I see it, technical fields (e.g. alert id, uuid, rule id, duration, threshold, value, building block, etc) should be in the common schema. In some cases, solutions know the shape of the source data, so they can add mappings where they think it'll be useful. Mapped fields are easier to work with than runtime fields. For stack rules, but also for some solution rule types, we don't know upfront what the shape of the data is. I don't see another solution there currently but to use runtime fields.
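
As a rough sketch of that split, the helper below builds a search request body in which the common technical fields are assumed to be mapped, while rule-specific source fields are declared as runtime fields at query time. The field names and the `buildAlertSearchBody` helper are assumptions for illustration. The sketch relies on the Elasticsearch behaviour that a runtime field declared without a script reads the value of the same-named field from `_source`.

```typescript
// Assumed names for the shared "technical" fields; the real common schema
// is not spelled out in this thread.
const COMMON_TECHNICAL_FIELDS = ["alert.id", "alert.uuid", "rule.id", "alert.duration"] as const;

// A subset of the types Elasticsearch accepts for runtime fields.
type RuntimeFieldType = "keyword" | "long" | "double" | "date" | "boolean" | "ip";

interface AlertSearchBody {
  runtime_mappings: Record<string, { type: RuntimeFieldType }>;
  query: { term: Record<string, string> };
  fields: string[];
}

// Hypothetical helper: builds a search body for one rule's alerts, declaring
// fields whose shape was not known upfront as query-time runtime fields.
function buildAlertSearchBody(
  ruleId: string,
  unmapped: Record<string, RuntimeFieldType>
): AlertSearchBody {
  const runtime_mappings: Record<string, { type: RuntimeFieldType }> = {};
  for (const name of Object.keys(unmapped)) {
    const type = unmapped[name];
    // No script: the value is read from the same-named field in _source.
    if (type !== undefined) runtime_mappings[name] = { type };
  }
  return {
    runtime_mappings,
    query: { term: { "rule.id": ruleId } },
    fields: [...COMMON_TECHNICAL_FIELDS, ...Object.keys(unmapped)],
  };
}

// Example: an ES query stack rule whose matched documents carry a host name and a byte count.
const exampleBody = buildAlertSearchBody("rule-1", {
  "host.name": "keyword",
  "network.bytes": "long",
});
```

The trade-off mentioned above still applies: mapped fields are resolved at index time, while every runtime field adds work to each query.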

Ideally users can configure write targets at some point (e.g. write alert data from this rule into the security solution index, or into my own index), but that is not something we can easily do (RBAC, potential mapping conflicts).

> Rule type factories
> This is preliminary until we can determine how much of these "rule type factories" rules can actually reuse, but if each solution or rule type ends up creating its own "ruleTypeFactory", it might make sense to move some of the factory functionality down to the solution instead of maintaining it at the framework level.
>
> Framework vs library
> I believe there's already been some chatter in this area, but I like the idea of creating a library of well-tested helper functions that the alerting framework can then migrate to incrementally, rather than a full-fledged framework.

Totally agree, hope we can figure out over the next few weeks what utilities should be shared and what is better off being handled by specific teams.

@ymao1
Contributor Author

ymao1 commented Jul 1, 2021

Closing as POC for rules registry V1 is complete. Will open new issue for actually migrating stack rules to the rule data service.

@ymao1 ymao1 closed this as completed Jul 1, 2021
@kobelb kobelb added the needs-team label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022

4 participants