
[Alerting] Research alert summaries / bulk-able actions #68828

Closed
andrewvc opened this issue Jun 10, 2020 · 14 comments
Labels
  • enhancement: New value added to drive a business result
  • estimate:needs-research: Estimated as too large and requires research to break down into workable issues
  • Feature:Actions
  • Feature:Alerting/RuleActions: Issues related to the Actions attached to Rules on the Alerting Framework
  • Feature:Alerting
  • NeededFor:Maps
  • NeededFor:Uptime
  • Project:AlertingNotifyEfficiently: Alerting team project for reducing the noise created by the alerting framework
  • R&D: Research and development ticket (not meant to produce code, but to make a decision)
  • research
  • Team:ResponseOps: Label for the ResponseOps team (formerly the Cases and Alerting teams)
  • Team:Uptime - DEPRECATED: Synthetics & RUM sub-team of Application Observability

Comments

@andrewvc
Contributor

andrewvc commented Jun 10, 2020

Describe the feature:

It would be nice for alerting to support collapsing multiple events into a single one on a per-action-type basis. For instance, imagine using Uptime and creating an alert to monitor 1000+ hosts. You may want email actions to send one email with a summary of everything being down (which is what we do now), but send each outage as a separate event to PagerDuty.

A complication here is that the alert message would likely need to be on a per-alert-type basis. There would also need to be a checkbox enabling batching / un-batching, and a high level notion of discrete items to batch by.
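
To make the shape of this concrete, here is a minimal sketch in TypeScript. Every name here (RuleActionConfig, batch, batchBy) is hypothetical and does not exist in the alerting framework; it only illustrates a per-action batching toggle plus a key naming the discrete items to batch by.

// Hypothetical sketch only: `batch` and `batchBy` are illustrative fields,
// not real alerting framework configuration.
interface RuleActionConfig {
  actionTypeId: string;                 // e.g. '.email', '.pagerduty'
  group: string;                        // action group, e.g. 'down'
  params: Record<string, unknown>;
  batch: boolean;                       // true: one action run for all alert instances
  batchBy?: string;                     // optional key naming the discrete items to batch by
}

// One rule monitoring 1000+ hosts, with different delivery behavior per action:
const actions: RuleActionConfig[] = [
  {
    actionTypeId: '.email',
    group: 'down',
    params: { subject: '{{count}} hosts are down' },
    batch: true,                        // single summary email
    batchBy: 'host.name',
  },
  {
    actionTypeId: '.pagerduty',
    group: 'down',
    params: { severity: 'critical' },
    batch: false,                       // one PagerDuty event per outage
  },
];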

Describe a specific use case for the feature:

This use case has been described by a few users, including internally from @jarpy and @Crazybus. It would be nice for services like PagerDuty to receive alerts separately. This still doesn't make sense for emails; no one wants to receive 1000+ emails.

CC @drewpost @pmuellr

@andrewvc andrewvc added enhancement New value added to drive a business result Feature:Alerting Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability :Alerting labels Jun 10, 2020
@elasticmachine
Contributor

Pinging @elastic/uptime (Team:uptime)

@pmuellr pmuellr added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed :Alerting labels Jun 10, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Jun 10, 2020

a high level notion of discrete items to batch by

My original thought on "batching" was to run a single action for all the instances that had actions scheduled for an action group, rather than the current behavior where we run an action for each of those instances. E.g., 1 email with a list of 100 hosts vs. 100 emails with a single host each.

I may be misinterpreting, but the "items to batch by" sounds like it could be "sub-batching", where you might want to have multiple actions invoked, partitioning the instances somehow.

Action groups already handle this, although they may not be a good fit for it from a practical standpoint (and could be confusing to customers - lots more knobs and dials!).

So curious if you're suggesting explicit sub-batching or not.
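
A rough sketch of the difference, assuming a hypothetical executeAction helper and ScheduledInstance shape (neither is real framework API); the only point is the two execution shapes:

// Hypothetical helper for illustration only.
declare function executeAction(actionTypeId: string, params: Record<string, unknown>): Promise<void>;

interface ScheduledInstance {
  instanceId: string;                   // e.g. a host name
  context: Record<string, unknown>;
}

// Current behavior: one action execution per scheduled instance (100 emails).
async function runPerInstance(actionTypeId: string, instances: ScheduledInstance[]) {
  for (const instance of instances) {
    await executeAction(actionTypeId, { host: instance.instanceId, ...instance.context });
  }
}

// Batched behavior: a single execution that receives the whole list (1 email, 100 hosts).
async function runBatched(actionTypeId: string, instances: ScheduledInstance[]) {
  await executeAction(actionTypeId, {
    count: instances.length,
    hosts: instances.map((i) => i.instanceId),
  });
}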

@pmuellr
Member

pmuellr commented Jun 10, 2020

Perhaps the batching choice should be defaulted by the action type itself. For email you probably want batching; for PagerDuty you probably want single events.
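
One way to picture that (purely hypothetical; connector types declare no such flag today): each action type ships a preferred default that the rule form pre-selects, and the user can still override it per action.

// Hypothetical per-action-type batching defaults; not part of the actions plugin.
const defaultBatching: Record<string, boolean> = {
  '.email': true,        // summary by default
  '.slack': true,
  '.pagerduty': false,   // one event per alert instance by default
};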

@Crazybus

It would be nice for services like PagerDuty to receive alerts separately. This still doesn't make sense for emails, no one wants to receive 1000+ emails.

I would really like to see this feature for all alert output types. I'd also argue that the default should be to not batch them. While 1000 emails when 1000 hosts are down isn't ideal, I would still prefer it over getting one email with 1000 hosts in it.

Here are some more of my thoughts on why I can see users not wanting batched alerts for email and other similar output types:

Email may be used as input for another alerting system

Some alerting systems take email as the input for events. At a previous job our alerting system (think a home-grown PagerDuty) received its input only via email and had no other API. We would have all monitoring systems send alerts to it, because being able to send an email was something that all of our tooling and environments could do.

Funnily enough, PagerDuty also offers an Email Integration for this exact use case.

Users running in air-gapped environments might use similar notification strategies. Their internal systems are unlikely to be allowed to talk directly to an external service like PagerDuty, but they might be allowed to talk to a local email server.

Clear ownership and discussions

Imagine a team is using email as their sole alerting mechanism. In the event that multiple hosts are down, having them in separate emails is ideal. It will mean:

  • Discussions around a single host go into a dedicated email thread. E.g. "I'll take a look at this one!". If 10 hosts go down at the same time and you have 5 people all replying to the same email it gets messy real fast.
  • The history from previous downtime and discussions are in the same thread
  • Recovery alerts can be sent for each host separately

This is something we still do for our Slack (and email) based alerting that isn't yet hooked up to PagerDuty. Each Slack alert will have a thread created to discuss the alert, normally ending with a link to a GitHub issue where we can assign ownership and add more details.

Throttle periods

With batched emails you will only be sending them every X amount of time. Let's use 60 minutes as an example throttle period. In the situation that 1000 hosts go down in the span of 5 minutes, things get tricky. If the first host goes down and the alert fires, you won't know about the other 999 hosts for another 55 minutes.

Acknowledgements and muting

How do acknowledgments and muting of notifications work with batched alerts? If a single test environment host is down, can I mute just that host? Or will I only be able to mute all future alerts for all hosts? It's very normal for large environments to have muted hosts all the time. If someone were to mute that notification, they might accidentally disable alerts for important production hosts.

@mikecote
Contributor

This is a duplicate of #50258. Because there are more use cases described here, I will close the other one.

@andrewvc
Contributor Author

Thanks for the detailed writeup, @Crazybus. Good points about email; I agree with you that batching should not be the default.

@mikecote
Contributor

Linking this issue with alert digest / scheduled reports #50257. There may be some overlap.

@andrewvc andrewvc changed the title [Alerting] Bulk/Un-bulkable alerts [Alerting] Bulk/Un-bulkable alerts + standard fields Jul 15, 2020
@mikecote mikecote added the R&D Research and development ticket (not meant to produce code, but to make a decision) label Sep 9, 2020
@mikecote mikecote changed the title [Alerting] Bulk/Un-bulkable alerts + standard fields [Alerting] Alert summaries / bulk-able actions Dec 16, 2020
@kindsun
Contributor

kindsun commented Dec 17, 2020

This came up again as a potential solution for the large volume of alerts generated by geo containment alerts.

The problem

Unless throttled, geo containment alerts can produce a large number of alert instances for each interval. In local tests of buses moving in Manhattan, upwards of 3000 alert instances were generated for each 20-second interval; however, the task manager was only able to process ~10 actions per 4 seconds (roughly 50 actions per interval), creating a growing backlog with each batch.

The conversation

The ability to execute actions in bulk is also interesting to us for this scenario. These actions might include indexing for later display in Maps, standard logging, email, etc. The idea mentioned above, "Email may be used as input for another alerting system", is relevant here as geo alerts evolve to handle more IoT use cases. Think of tracking an item in a building, with the item pinging out its location as it moves from room to room (i.e. asset tracking).

Other solutions to this issue focused more on ways we might reduce the number of alert instances or index data more optimally. While this might work fine for some use cases, I still think a solution such as batched actions should be part of the picture.

@pmuellr
Member

pmuellr commented Jan 6, 2021

Depending on how badly this is needed, and on when alerting can deliver something, "bulk" processing could be done by the alert itself. It could be an option on the alert (a new param), so it could process alerts individually or in bulk.

I don't think any alerts have done this yet, so we'd want to design it pretty carefully. And once we do have the "bulk" feature in the framework, we'd hopefully evolve the alert to use it.

Another thought is the way the security alerts work: two levels. The first level generates data based on findings and writes it to an index. The second level reads that index to generate the actual alerts that have customer-facing actions (e.g., email, Slack).
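
A loose sketch of the "option on the alert itself" idea, with an invented bulk param; the executor shape below only mirrors the general pattern of scheduling actions on alert instances and should not be read as the real API:

// Sketch only: `bulk` is an invented rule param, and the services shape is simplified.
interface DownHost {
  host: string;
  lastSeen: string;
}

interface SketchServices {
  alertInstanceFactory: (id: string) => {
    scheduleActions: (group: string, context: Record<string, unknown>) => void;
  };
}

async function executor(params: { bulk: boolean }, downHosts: DownHost[], services: SketchServices) {
  if (params.bulk) {
    // One synthetic "summary" instance carrying the whole list, so any attached
    // action fires once with the aggregate context.
    services.alertInstanceFactory('summary').scheduleActions('down', {
      count: downHosts.length,
      hosts: downHosts.map((h) => h.host),
    });
  } else {
    // One instance (and therefore one set of scheduled actions) per host.
    for (const h of downHosts) {
      services.alertInstanceFactory(h.host).scheduleActions('down', {
        host: h.host,
        lastSeen: h.lastSeen,
      });
    }
  }
}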

@mikecote
Contributor

mikecote commented Feb 5, 2021

Moving from 7.x - Candidates to 8.x - Candidates (Backlog) after the latest 7.x planning session.

We will bring this up with the alerting working group to see if it is necessary for the alerts-as-data initiative.

@gmmorris gmmorris added NeededFor:Maps NeededFor:Uptime Project:AlertingNotifyEfficiently Alerting team project for reducing the noise created by the alerting framework. Feature:Actions Feature:Alerting/RuleActions Issues related to the Actions attached to Rules on the Alerting Framework labels Jun 30, 2021
@gmmorris gmmorris added the loe:needs-research This issue requires some research before it can be worked on or estimated label Jul 14, 2021
@gmmorris gmmorris added the estimate:needs-research Estimated as too large and requires research to break down into workable issues label Aug 18, 2021
@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
@mikecote
Contributor

mikecote commented Jun 27, 2022

I am adding this issue to the current iteration. We should have a brainstorming session as a team to ensure we aggregate the problems and requirements that this issue intends to solve prior to creating an RFC (it's been around for a while 😁).

@mikecote mikecote changed the title [Alerting] Alert summaries / bulk-able actions [Alerting] Research alert summaries / bulk-able actions Jun 27, 2022
@aarju

aarju commented Jun 28, 2022

We built our own 12h summary report of the low-severity alerts, automated with our SOAR system. A scheduled script runs every 12h with the following agg query. We then format the results and send them to our 'threat hunting' Slack channel so we have a summary of the low-severity alerts and can keep an eye on them for anything strange. We display each alert with the total number of times it triggered, broken down by the list of host.name values with the number of times the alert triggered for each host.

It would be nice to have the ability to do this natively within Kibana and have the output sent to a connector such as Slack or email.

{
  "size": "1",
  "query": {
    "bool": {
      "filter": [
        {
          "match_phrase": {
            "signal.rule.severity": "low"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-12h"
            }
          }
        }
      ],
      "must_not": [
        {
          "match_phrase": {
            "signal.rule.building_block_type": "default"
          }
        }
      ]
    }
  },
  "aggs": {
    "rulename_agg": {
      "terms": {
        "field": "signal.rule.name",
        "order": {
          "_count": "desc"
        },
        "size": 500
      },
      "aggs": {
        "ruleid_agg": {
          "terms": {
            "field": "signal.rule.id",
            "order": {
              "_count": "desc"
            },
            "size": 500
          },
          "aggs": {
            "hostname_agg": {
              "terms": {
                "field": "host.name",
                "order": {
                  "_count": "desc"
                },
                "size": 500
              }
            }
          }
        }
      }
    }
  }
}
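
For completeness, here is a minimal sketch of the formatting-and-posting step described above. The bucket paths follow the query exactly; the message layout and the Slack incoming-webhook URL are placeholders, and this is a plain script rather than a Kibana connector:

// Formatting sketch for the aggregation response above; assumes the standard
// terms-aggregation bucket shape returned by Elasticsearch.
interface TermsBucket {
  key: string;
  doc_count: number;
  ruleid_agg?: { buckets: TermsBucket[] };
  hostname_agg?: { buckets: TermsBucket[] };
}

function formatSummary(aggs: { rulename_agg: { buckets: TermsBucket[] } }): string {
  const lines: string[] = [];
  for (const rule of aggs.rulename_agg.buckets) {
    lines.push(`*${rule.key}*: ${rule.doc_count} alerts in the last 12h`);
    for (const ruleId of rule.ruleid_agg?.buckets ?? []) {
      for (const host of ruleId.hostname_agg?.buckets ?? []) {
        lines.push(`  - ${host.key}: ${host.doc_count}`);
      }
    }
  }
  return lines.join('\n');
}

// Post the summary to a Slack incoming webhook (placeholder URL).
async function postToSlack(text: string): Promise<void> {
  await fetch('https://hooks.slack.com/services/XXX/YYY/ZZZ', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  });
}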

@ersin-erdal
Contributor

Since the research for this is done and the follow-up issues have been created, closing this in favor of #143200.
