
[Security Solution][Detections] Proposal: building Rule Execution Log on top of Event Log and ECS #94143

Closed

Conversation

banderror
Contributor

@banderror banderror commented Mar 9, 2021

Ticket: #91265

Summary

This is a proposal for building the detection engine Rule Execution Log on top of Event Log and ECS. It is related to the rules management part of the RAC workstream and follows up on our previous discussions here. Rebuilding the Rule Execution Log on top of Event Log would allow us to achieve 3 goals: pay down tech debt, optimize the performance of rules-related endpoints, and unblock an in-memory rules table while we can’t filter/sort/search by rule params on the server side.

You can read the proposal in the PR description below, but I also added a proposal.md file so you can comment on it line by line (I encourage you to do that!). There are also other files in this PR worth checking; all the details are below.

Proposal: building Rule Execution Log on top of Event Log and ECS

Overview

We're going to stop storing rule execution statuses and additional data in custom "sidecar" saved objects. Those are objects stored in the .kibana index with type = siem-detection-engine-rule-status. The corresponding SO attributes are:

export interface IRuleStatusSOAttributes extends Record<string, any> {
  alertId: string; // created alert id.
  statusDate: StatusDate;
  lastFailureAt: LastFailureAt | null | undefined;
  lastFailureMessage: LastFailureMessage | null | undefined;
  lastSuccessAt: LastSuccessAt | null | undefined;
  lastSuccessMessage: LastSuccessMessage | null | undefined;
  status: JobStatus | null | undefined;
  lastLookBackDate: string | null | undefined;
  gap: string | null | undefined;
  bulkCreateTimeDurations: string[] | null | undefined;
  searchAfterTimeDurations: string[] | null | undefined;
}

We're going to start using the Event Log (event_log plugin built by the Alerting team).

For more context, please read #91265 (comment)

Regarding software design:

  • We will store rule status updates and all the additional metrics data (search durations, gaps, etc) in the event log during rule execution.
  • We will query the event log on the Rule Details page to fetch the current execution status and the log itself. Eventually the idea is to have an "Execution Log" tab instead of "Failure History" and fetch, say, the last 50 events of different types. We will be able to log generic info messages, generic errors/exceptions, and add special types of events.
  • We will query the event log on the Rules Management page to render our rule management and monitoring tables. We will need to be able to aggregate log events to fetch M*N metrics for N rules in a single query or a few queries (where M is the number of metrics we show, e.g. "Last gap", "Indexing time", etc).
  • We will have 3 data models representing the "rule execution log":
    • an underlying ECS model for events stored in Event Log
    • a write model where we will have our rule execution events in terms of Detection Engine, mapping of these events to their ECS form, services to write them (something like Rule Execution Logger)
    • a read model where we will have our rule execution events in terms of app user (Rule Details page), aggregation results (Rules Monitoring table), services to read them
  • Services will wrap the underlying Event Log.
  • In the beginning, the Rule Execution Logger (the writer) will be encapsulated in the RuleStatusService, and the only execution event we will have is a Status Changed event. That means the Detection Engine will stay mostly untouched; the idea is to reduce the amount of refactoring. The disadvantage is that events might not be written to the log at exactly the moment they happen. We will probably address that later, when implementing enhancements in the UI ("Execution Log" tab), so that all the logged events can have precise timestamps showing exactly when each of them happened. Also, we'll be able to split a single fat Status Changed event into separate dedicated events, and use the Status Changed event only for cases when the status actually needs to be changed.
  • Each Status Changed event will be mapped to a series of ECS events (a sketch of this mapping follows this list).
  • In our API route handlers we will replace requests to siem-detection-engine-rule-status saved objects with requests to the new rule execution log service (read model).
  • Our API schema will not change; later, in order to show the full "Execution Log" instead of "Failure History", we might want to introduce a non-breaking change.
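Here is a minimal sketch of what the write model could look like. All names (StatusChangedEvent, IEcsEvent, mapStatusChangedEventToEcsEvents) and the severity values other than warning = 20 are illustrative assumptions, not the final API. It only shows how a single Status Changed event could fan out into a status-changed ECS event plus one ECS event per metric, with event.sequence providing deterministic ordering of events that share the same @timestamp:

interface StatusChangedEvent {
  ruleId: string;
  date: Date;
  newStatus: 'going to run' | 'succeeded' | 'warning' | 'failed';
  message?: string;
  metrics?: {
    executionGapMs?: number;
    searchDurationMaxMs?: number;
    indexingDurationMaxMs?: number;
  };
}

interface IEcsEvent {
  '@timestamp'?: string;
  message?: string;
  event?: { action?: string; sequence?: number; duration?: number };
  rule?: { id?: string };
  kibana?: {
    detection_engine?: { rule_status?: string; rule_status_severity?: number };
  };
}

// Illustrative severity values; only 'warning' = 20 appears in the example documents.
const statusSeverity: Record<StatusChangedEvent['newStatus'], number> = {
  'going to run': 10,
  succeeded: 10,
  warning: 20,
  failed: 30,
};

const mapStatusChangedEventToEcsEvents = (e: StatusChangedEvent): IEcsEvent[] => {
  const timestamp = e.date.toISOString();
  let sequence = 0;

  // All generated events share the same @timestamp, so event.sequence gives deterministic ordering.
  const base = (action: string): IEcsEvent => ({
    '@timestamp': timestamp,
    event: { action, sequence: sequence++ },
    rule: { id: e.ruleId },
  });

  const events: IEcsEvent[] = [
    {
      ...base('status-changed'),
      message: e.message,
      kibana: {
        detection_engine: {
          rule_status: e.newStatus,
          rule_status_severity: statusSeverity[e.newStatus],
        },
      },
    },
  ];

  if (e.metrics?.executionGapMs != null) {
    const gap = base('metric-execution-gap');
    gap.event!.duration = e.metrics.executionGapMs * 1e6; // ECS event.duration is in nanoseconds
    events.push(gap);
  }
  if (e.metrics?.searchDurationMaxMs != null) {
    const search = base('metric-search-duration-max');
    search.event!.duration = e.metrics.searchDurationMaxMs * 1e6;
    events.push(search);
  }
  if (e.metrics?.indexingDurationMaxMs != null) {
    const indexing = base('metric-indexing-duration-max');
    indexing.event!.duration = e.metrics.indexingDurationMaxMs * 1e6;
    events.push(indexing);
  }
  return events;
};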

What to review

Please take a look at the code submitted in this PR:

  • rule_execution_log/common_model contains the ECS model of events for Event Log, common types and constants, a builder for creating ECS events for this particular log
  • rule_execution_log/write_model contains the definition of rule execution events, StatusChangedEvent in particular, and mapping of it to a series of ECS events

Please check the ECS events we're going to store and the queries we will need to execute. You can play with them in Kibana Dev Tools:

  1. Use dev_tools_index.txt to create a test index. This file contains Elasticsearch index mappings for ECS version 1.9.0 (https://github.com/elastic/ecs/blob/1.9/generated/elasticsearch/7/template.json) adjusted with a custom mapping for the kibana.detection_engine field.
  2. Index events using dev_tools_events.txt. It contains ECS events as text that you can copy-paste into Kibana Dev Tools.
  3. Run queries from dev_tools_queries.txt. It contains the queries against the event log index that we will need to be able to execute.
  4. Optionally, play with data in generate_events.ts. It transforms StatusChangedEvent to ECS events and writes them to dev_tools_events.txt. Run it with node x-pack/plugins/security_solution/server/lib/detection_engine/rule_execution_log/generate_events.js.

What's missing in the current Event Log API

In short:

  • extended support for ECS - more fields and field sets, in particular:
    • standard event.*, log.*, rule.*
    • custom kibana.detection_engine.*
  • aggregation queries
  • sorting by multiple fields
  • nice-to-have: custom ES DSL filters (term, terms) instead of string KQL filter
  • limiting source fields to return: "_source": ["@timestamp", "message", "log.level", "event.action"]

Extended support for ECS

In this proposal, I'm designing our rule execution log events using a few additional fields. These fields are not currently supported by Event Log.

interface IEcsAdditionalFields {
  // https://www.elastic.co/guide/en/ecs/1.9/ecs-event.html
  event?: {
    dataset?: string;
    created?: string;
    kind?: string;
    type?: string[];
    severity?: number;
    sequence?: number;
  };

  // https://www.elastic.co/guide/en/ecs/1.9/ecs-log.html
  log?: {
    logger?: string;
    level?: string;
  };

  // https://www.elastic.co/guide/en/ecs/1.9/ecs-rule.html
  rule?: {
    id?: string;
  };

  // custom fields
  kibana?: {
    detection_engine?: {
      rule_status?: string;
      rule_status_severity?: number;
    };
  };
}

My suggestion would be to add support for all the standard event.*, log.*, rule.* fields (at least these field sets), as well as for our custom kibana.detection_engine.* fields. If it's easy to add support for the whole ECS, I'd say let's do it.

It's not super clear to me how exactly we're going to specify kibana.detection_engine, both in terms of the TypeScript API and the ES mapping. Should it be delegated to Event Log's clients via some registration API rather than hardcoded in Event Log itself?
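For illustration only, here's one possible shape such a registration API could take. Nothing in this sketch exists in the current event_log plugin; registerProviderFields, FieldDefinition and EventLogRegistrar are hypothetical names:

// Purely illustrative: a client-side registration of custom fields, sketched in TypeScript.
// The event_log plugin would merge these into its index mappings and runtime schema.
interface FieldDefinition {
  type: 'keyword' | 'integer' | 'long' | 'date';
}

interface EventLogRegistrar {
  registerProviderFields(params: {
    provider: string;
    fields: Record<string, FieldDefinition>;
  }): void;
}

// Hypothetical usage from the security_solution plugin setup phase:
declare const eventLog: EventLogRegistrar;

eventLog.registerProviderFields({
  provider: 'detection-engine',
  fields: {
    'kibana.detection_engine.rule_status': { type: 'keyword' },
    'kibana.detection_engine.rule_status_severity': { type: 'integer' },
  },
});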

Aggregation queries

We need to be able to execute aggregation queries with arbitrary aggs, like the ones in this example query:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "kibana.saved_objects.type": "alert" } },
        { "term": { "kibana.saved_objects.namespace": "some-space" } },
        {
          "terms": {
            "kibana.saved_objects.id": [
              "1234-56789-dfgdfhgfgh-122346567",
              "1234-56789-dfgdfhgfgh-122346568"
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "rules": {
      "terms": {
        "field": "kibana.saved_objects.id",
        "size": 2
      },
      "aggs": {
        "events_status_changed": {
          "filter": {
            "term": { "event.action": "status-changed" }
          },
          "aggs": {
            "last_item": {
              "top_hits": {
                "size": 1,
                "sort": [
                  { "@timestamp": { "order": "desc" } },
                  { "event.sequence": { "order": "desc" } }
                ],
                "_source": ["@timestamp", "kibana.detection_engine"]
              }
            }
          }
        },
        "metrics_execution_gap": {
          "filter": {
            "term": { "event.action": "metric-execution-gap" }
          },
          "aggs": {
            "last_item": {
              "top_hits": {
                "size": 1,
                "sort": [
                  { "@timestamp": { "order": "desc" } },
                  { "event.sequence": { "order": "desc" } }
                ],
                "_source": ["event.duration"]
              }
            }
          }
        },
        "metrics_search_duration_max": {
          "filter": {
            "term": { "event.action": "metric-search-duration-max" }
          },
          "aggs": {
            "last_item": {
              "top_hits": {
                "size": 1,
                "sort": [
                  { "@timestamp": { "order": "desc" } },
                  { "event.sequence": { "order": "desc" } }
                ],
                "_source": ["event.duration"]
              }
            }
          }
        },
        "metrics_indexing_duration_max": {
          "filter": {
            "term": { "event.action": "metric-indexing-duration-max" }
          },
          "aggs": {
            "last_item": {
              "top_hits": {
                "size": 1,
                "sort": [
                  { "@timestamp": { "order": "desc" } },
                  { "event.sequence": { "order": "desc" } }
                ],
                "_source": ["event.duration"]
              }
            }
          }
        },
        "metrics_indexing_lookback": {
          "filter": {
            "term": { "event.action": "metric-indexing-lookback" }
          },
          "aggs": {
            "last_item": {
              "top_hits": {
                "size": 1,
                "sort": [
                  { "@timestamp": { "order": "desc" } },
                  { "event.sequence": { "order": "desc" } }
                ],
                "_source": ["event.end"]
              }
            }
          }
        }
      }
    }
  }
}

We need the freedom to pack multiple aggs into a single query or to split them into several queries.

We'd also like to be able to combine aggs with top-level filters, pagination, sorting and other options if we need to. See the pre-filtering with "term": { "event.action": "status-changed" }:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "kibana.saved_objects.type": "alert" } },
        { "term": { "kibana.saved_objects.namespace": "some-space" } },
        {
          "terms": {
            "kibana.saved_objects.id": [
              "1234-56789-dfgdfhgfgh-122346567",
              "1234-56789-dfgdfhgfgh-122346568"
            ]
          }
        },
        {
          "term": { "event.action": "status-changed" }
        }
      ]
    }
  },
  "aggs": {
    "rules": {
      "terms": {
        "field": "kibana.saved_objects.id",
        "size": 2
      },
      "aggs": {
        "current_status": {
          "top_hits": {
            "size": 1,
            "sort": [
              { "@timestamp": { "order": "desc" } },
              { "event.sequence": { "order": "desc" } }
            ],
            "_source": ["@timestamp", "kibana.detection_engine"]
          }
        }
      }
    }
  }
}

Sorting by multiple fields

The current API allows specifying only a single sort field and restricts which fields can be used to sort events.

  sort_field: schema.oneOf(
    [
      schema.literal('@timestamp'),
      schema.literal('event.start'),
      schema.literal('event.end'),
      schema.literal('event.provider'),
      schema.literal('event.duration'),
      schema.literal('event.action'),
      schema.literal('message'),
    ],
    {
      defaultValue: '@timestamp',
    }
  ),
  sort_order: schema.oneOf([schema.literal('asc'), schema.literal('desc')], {
    defaultValue: 'asc',
  }),

We need to be able to sort by multiple fields both in normal queries and within an aggregation scope.

"sort": [
    { "@timestamp": { "order": "desc" } },
    { "event.sequence": { "order": "desc" } }
  ]

In the future we will need to sort by arbitrary fields in order to implement, for example, sorting by the current rule execution status (kibana.detection_engine.rule_status_severity) in our rules monitoring table.
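As a rough illustration (not a final query), bucket ordering in the rules terms aggregation could lean on a metric sub-aggregation over that field; the real query would likely need to restrict the metric to each rule's latest status-changed event:

// Sketch only: orders the per-rule buckets by the maximum observed status severity.
const sortRulesByStatusSeverity = {
  aggs: {
    rules: {
      terms: {
        field: 'kibana.saved_objects.id',
        size: 50,
        // Order the rule buckets by the sub-aggregation defined below.
        order: { current_severity: 'desc' },
      },
      aggs: {
        current_severity: {
          max: { field: 'kibana.detection_engine.rule_status_severity' },
        },
      },
    },
  },
};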

Custom ES DSL filters

It would be nice to have an option to specify custom filters using the ES query DSL, for example:

"query": {
  "bool": {
    "filter": [
      { "term": { "kibana.saved_objects.type": "alert" } },
      { "term": { "kibana.saved_objects.namespace": "some-space" } },
      {
        "terms": {
          "kibana.saved_objects.id": [
            "1234-56789-dfgdfhgfgh-122346567",
            "1234-56789-dfgdfhgfgh-122346568"
          ]
        }
      },
      // This one would be a custom term filter:
      {
        "term": { "event.action": "status-changed" }
      }
    ]
  }
},

With the current API it's possible to specify a string KQL filter. It's definitely a great option to have, but in cases like this one, where we know exactly which query to build, we'd prefer to save some server-side CPU cycles by not parsing KQL.

Limiting source fields to return

Sometimes we don't need to fetch the full event document with all its ECS fields. We need to be able to restrict the source fields as we like, both within aggs and the query scope:

"aggs": {
  "last_item": {
    "top_hits": {
      "size": 1,
      "sort": [
        { "@timestamp": { "order": "desc" } },
        { "event.sequence": { "order": "desc" } }
      ],
      "_source": ["event.duration"]
    }
  }
}
{
  "query": {...},
  "sort": [...],
  "_source": ["@timestamp", "message", "log.level", "event.severity", "event.action"]
}

More questions about using Event Log

ECS version:

  • Which version of ECS does the Event Log support? Is it strictly 1.6.0, or can several versions be supported at the same time for different clients?
  • Who will upgrade the version(s) supported by Event Log, and when and how?
  • Should (or can) clients specify ecs.version when logging events?

ECS event.provider, event.dataset, log.logger:

  • With the RAC project going on, our team will be extracting the Detection Engine into a separate plugin. There's a feeling that this plugin might grow over time accumulating all the common functionality.
  • I felt like it would be nice to be able to specify event.provider: 'detection-engine', but also have a way to keep multiple logs within that provider.
  • In this proposal I'm using:
    • event.dataset: 'detection-engine.rule-execution-log'
    • log.logger: 'detection-engine.rule-execution-log'
  • Similarly, we'd need to be able to query only detection-engine.rule-execution-log where needed, and not the full log.
  • In this model, each logger/dataset would specify its own set of event.actions.
  • Who should set event.provider, event.dataset, log.logger fields for an event - a client of IEventLogger or the logger itself?

Could we theoretically use an instance of ElasticsearchClient to query the event log index, rather than using IEventLogClient? My concern is that while writing to this index and managing it should be done via dedicated APIs, adding read APIs can become a leaky abstraction, which reminds me of the issues we currently have with saved objects APIs, such as the lack of aggs support. On the one hand, for us application developers it's hard to predict what read API we will need in the future; it would be nice to have the freedom to use anything Elasticsearch provides. On the other hand, adding support for all the APIs can be difficult and lead to maintenance issues. Maybe there's a way to provide a decorator on top of ElasticsearchClient that would preserve its API, or a thin adapter that would just slightly change it?
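To make the trade-off concrete, here is a rough sketch of the "thin adapter" idea under stated assumptions: all names (EsSearchClient, EventLogScope, createScopedEventLogSearchClient) are hypothetical and nothing here is an existing event_log API. The adapter keeps the full search DSL available to callers, but always targets the event log index and always injects the saved-objects scoping filters, preserving the existing isolation model:

// Minimal stand-in for the search part of ElasticsearchClient, to keep the sketch self-contained.
interface EsSearchClient {
  search(params: { index: string; body: Record<string, unknown> }): Promise<unknown>;
}

interface EventLogScope {
  index: string;            // e.g. '.kibana-event-log-*'
  namespace: string;        // Kibana space
  savedObjectType: string;  // e.g. 'alert'
  savedObjectIds: string[]; // rule ids the caller is authorized to read
}

const createScopedEventLogSearchClient = (es: EsSearchClient, scope: EventLogScope) => ({
  search(body: Record<string, unknown>) {
    // These filters are always applied, so callers cannot escape the isolation model.
    const scopingFilters = [
      { term: { 'kibana.saved_objects.type': scope.savedObjectType } },
      { term: { 'kibana.saved_objects.namespace': scope.namespace } },
      { terms: { 'kibana.saved_objects.id': scope.savedObjectIds } },
    ];
    return es.search({
      index: scope.index,
      body: {
        ...body, // size, sort, _source, aggs, etc. pass through untouched
        query: { bool: { filter: [...scopingFilters, ...(body.query ? [body.query] : [])] } },
      },
    });
  },
});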

@banderror banderror added Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detections and Resp Security Detection Response Team labels Mar 9, 2021
@banderror banderror self-assigned this Mar 9, 2021
@banderror banderror requested a review from a team March 9, 2021 22:15
@banderror banderror marked this pull request as ready for review March 9, 2021 22:15
@banderror banderror requested a review from a team as a code owner March 9, 2021 22:15
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@banderror banderror added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Mar 9, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@banderror banderror changed the title [Security Solution][Detections] Proposal: building Rule Execution Log on top of Event Log and ECS (WIP) [Security Solution][Detections] Proposal: building Rule Execution Log on top of Event Log and ECS Mar 9, 2021
Member

@pmuellr pmuellr left a comment

left some comments regarding eventLog usage

private _result: IEcsEvent = {};

constructor() {
// TODO: Which version does event_log use? Should it be specified here or inside the event log itself?
Member

The event log plugin provides this value; you don't need to provide it, and probably shouldn't, since we could end up with different versions in different documents. Though I don't know of anything that makes use of this field right now.


interface IEcsAdditionalFields {
// https://www.elastic.co/guide/en/ecs/1.9/ecs-event.html
event?: {
Member

We already use a bunch of the event fields for fairly "generic" purposes, so I'm a little concerned about having application-specific data in here as well. Not a problem now since there's no overlap, but I'm guessing there could be some day?

event: schema.maybe(
schema.object({
action: ecsString(),
provider: ecsString(),
start: ecsDate(),
duration: ecsNumber(),
end: ecsDate(),
outcome: ecsString(),
reason: ecsString(),
})
),

Contributor Author

So, IEcsAdditionalFields just represents the fields we need Event Log to support that are currently missing from its schema. This PR is just a proposal: this interface is not going to be used in the working code as is, and the missing fields will need to be moved to Event Log itself.

event fields here are all standard ECS fields (spec), and the only custom field I propose is kibana.detection_engine :)

Not sure if I addressed your comment... Pls let me know :)


export type IEcsEvent = IEventLogEvent & IEcsAdditionalFields;

interface IEcsAdditionalFields {
Member

I don't think these will actually be written out today until they're added to our schema - we use dynamic: false at the top of the mappings, so I believe they will not get indexed, at the very least.

{
"dynamic": "false",
"properties": {

If this PR is just an experiment, you might want to extend the schema, following the directions here: https://github.com/elastic/kibana/blob/master/x-pack/plugins/event_log/generated/README.md - and let us know if any of it is confusing so we can correct it.

I think we'd want to update the event log separately though, if we intend to merge this, just to keep the PRs a little cleaner.

Contributor Author

@banderror banderror Mar 11, 2021

Yeah, this PR is just an experiment, a proposal. #94143 (comment)
I'll try to explain it better in a separate comment below :)

Contributor

@dhurley14 dhurley14 left a comment

Looking through this PR, it is awesome to see how this will not only replace our statuses but also provide an extensible schema that allows future "events" to be indexed throughout the phases of the rule executor. I played around with the documents posted in the text files, used the queries posted there too, and can really start to see this coming together! I have a few clarifying questions and am specifically interested in more information on the detection_engine field. Looks good so far!

INFO: 'info',
WARNING: 'warning',
ERROR: 'error',
} as const;
Contributor

I'm assuming part of this refactor is to also write our log statements to the event log? Is that correct?

Contributor Author

@banderror banderror Mar 11, 2021

That's a great question. TL;DR: I think we will be able to write events both to the execution log and to normal logs. Normal logs are for Kibana sysadmins and debugging; execution logs are for Detections users. What to log where is TBD, I'd say, but it would be nice to have some flexibility here.


You mean whatever we log through a Kibana logger to a normal log? I didn't imagine it exactly like that, because I'm not sure it makes sense to show all the logs in the UI from the user's perspective: there can be too many, they can be too technical, or they can include info that should not be shown to the user. On the other hand, we definitely would be able to log more data, so I added these standard log levels because they're a UI concept familiar to all users.

In Kibana we already have a few views where the user can scroll through logs, e.g. in Fleet there are logs from agents.

Comment on lines +339 to +342
"detection_engine": {
"rule_status": "warning",
"rule_status_severity": 20
}
Contributor

I'm curious whether we need this new field or whether we could repurpose some other fields, maybe message for the rule status? Or keep message as it is but utilize event.kind and set it to 'warning' or something.

Contributor Author

@banderror banderror Mar 11, 2021

Yeah, that's a good point... I was hoping that I would be able to map everything to native ECS fields.

message is mapped as text, not keyword. And we might need to have both status and message in a single event, e.g. if we decide to show status updates in the log. I mean, it would be simple and nice to just be able to have an additional message field alongside the status.

event.kind has a strict set of allowed values, as long as we decide to conform to ECS, of course. Also, you can see in the queries that it's used to filter out metric events when fetching "event" events for showing in the Execution Log tab.

labels could be used for that, but this seems to me like the same thing when it comes to the event log. We'd either need to specify a labels mapping in the event log (the same solution as introducing a custom field set) or turn on dynamic mapping for labels. I'm not sure about the latter, since the event log is designed as a single multitenant index. If it could support separate indices per tenant/provider, that would be a different story. @pmuellr what do you think about these tradeoffs?

As for other native ECS fields, I just didn't find any suitable.

Contributor

Ah okay, I didn't realize event.kind had limited values. Maybe we could utilize event.code? That seems to be a flexible field? I'm just trying to think of ways to remove the detection-engine-specific fields.

Contributor

@dplumlee dplumlee Mar 11, 2021

I don't know if it would cause any issues here, but I know we use the event.code field for a lot of endpoint exception stuff. It's one of our prepopulated fields.

@@ -0,0 +1,230 @@
# -----------------------------------------------------------------------------------------------------------
Contributor

Thanks for providing these little query scripts to test this out. It helped me visualize how these would be integrated in the executor!

Contributor Author

My pleasure ✋ I can imagine how confusing it would be to review this as a pile of text without examples or some code.

Comment on lines +13 to +16
export const RuleExecutionEventType = {
GENERIC: 'generic',
STATUS_CHANGED: 'status-changed',
} as const;
Contributor

I'm wondering if we could diversify this field beyond 'generic' and 'status-changed'. Maybe event type could be 'gap', 'warning / partial failure', 'failed', 'succeeded', 'missing privileges'? But maybe we need these specifically for the queries we are planning against the event log. What do you think?

Contributor Author

Yeah, exactly, I share the same kind of vision. We definitely will be able to add more RuleExecutionEventTypes if we need to. That doesn't mean we will have to. Cases where it would make sense to do so later, imho:

  • Extracting existing stuff like 'gap' into separate RuleExecutionEvents if we want to log them earlier, i.e. exactly when they are captured/created in the code, and decoupling them from status updates. Meaning that if we found a gap, we could log it as a separate event with a warning level, but delay or omit changing the status.
  • New types of events with their own payload.

As for breaking StatusChangedEvent down into separate StatusGoingToRunEvent and StatusSucceededEvent types - I'm not sure; it would only make sense if they had completely separate payloads and nothing in common (in this case, they share the status itself and its severity). Imho, for now we could have a few factory methods for creating status events, but keep them as a single type.
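For illustration, the "factory methods over a single event type" idea could look roughly like this (all names are hypothetical, not the final API):

type RuleStatus = 'going to run' | 'succeeded' | 'warning' | 'failed';

interface StatusChangedEvent {
  ruleId: string;
  date: Date;
  newStatus: RuleStatus;
  message?: string;
}

// One event type, several factories - callers don't need to know the payload shape.
const statusEvents = {
  goingToRun: (ruleId: string): StatusChangedEvent =>
    ({ ruleId, date: new Date(), newStatus: 'going to run' }),
  succeeded: (ruleId: string, message?: string): StatusChangedEvent =>
    ({ ruleId, date: new Date(), newStatus: 'succeeded', message }),
  failed: (ruleId: string, message: string): StatusChangedEvent =>
    ({ ruleId, date: new Date(), newStatus: 'failed', message }),
};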

const ecsEvents = mapStatusChangedEventToEcsEvents(event);

// TODO: move to a service implementing the rule execution log
// All generated ecsEvents have the same @timestamp, so we need an additional field for deterministic ordering.
Contributor

👍

@banderror
Contributor Author

@dhurley14 @pmuellr thank you for your reviews, super appreciate that you switched your context and checked this out.

@pmuellr let me try to provide more context. This PR is not a normal PR with code; it's more like an RFC to demonstrate the idea of storing our rule execution events in Event Log - in theory. My goal was to understand whether it's possible, whether it makes sense, whether it would be applicable to our problem and would solve it, etc.

I think at this point I can say that in theory, yes, we could use the event log. The only substantial risk I personally see is conforming to ECS itself: ECS seems too limited to allow storing arbitrary structured application logs. To avoid that, we will probably need full flexibility in defining custom fields. Other than that, the event log seems to me a good way to go.

However, at this point the event log doesn't have all the APIs we need, and we need your comments on that as the event log's creators and maintainers. I wrote up all the queries against the event log index we'd need to be able to run to fetch everything we need in our app. Would you mind going through "What's missing in the current Event Log API" and "More questions about using Event Log" in the proposal, and also dev_tools_events.txt and especially dev_tools_queries.txt? You can comment on proposal.md directly.

At the end of the day, we need to know whether this idea looks OK from your side, whether all the proposed changes in the event_log plugin are feasible, what your team's capacity is right now, and who would do what and when. Sorry, I know this is probably a lot, especially considering all the RAC stuff going on and the confusion around it.

@dgieselaar
Member

I think some of these concerns can be addressed by instrumenting the rule execution. Not all of them, so I wouldn't suggest it as a replacement, but APM might give you higher-fidelity performance data. Happy to pair next week if somebody wants to try it out. Here are some examples of what you'd get, from a local branch I have running:

[screenshots: APM trace waterfalls]

APM data wouldn't be available to end users, but we do hope to make it available in cloud. It won't solve all your problems, but I think the waterfall is very useful for identifying optimisation opportunities.

I'm not familiar enough with the requirements here to comment on the proposal, but as @pmuellr knows, I think a log-based approach is very powerful, not just for rule executions, but also alerts as data, so I'm excited this strategy is being explored.

@dgieselaar
Member

dgieselaar commented Mar 13, 2021

Again, I'm not sure I fully understand the requirements here, but it might be worth considering just enriching the current execute events the event log already has today. The metrics you need could be stored on each execute event. I think that would make querying easier, because you'd have one execution event that is a summary of all the data you need. You can then use a top_metrics aggregation to get the values you need (top_metrics is a lot faster than top_hits in many cases), e.g.:

{
  "aggs": {
    "rules": {
      "terms": {
        "field": "rule.id",
        "size": 50,
        "order": {
          "last_execution": "desc"
        }
      },
      "aggs": {
        "last_execution": {
          "max": {
            "field": "@timestamp"
          }
        },
        "latest": {
          "top_metrics": {
            "metrics": [
              { "field": "rule.execution.gap.ms" },
              { "field": "rule.execution.search_response_time.ms" },
              { "field": "rule.execution.index_response_time.ms" }
            ],
            "sort": {
              "@timestamp": "desc"
            }
          }
        }
      }
    }
  }
}

If you need to pre-aggregate certain metrics (e.g. because there are multiple searches, or multiple index operations), you can consider a histogram field or an aggregate metric double field.
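For illustration, a pre-aggregated search-duration metric could be mapped roughly like this; the field name search_duration_ms and the chosen metric set are assumptions, not part of any existing schema:

// Sketch of an aggregate_metric_double mapping for a pre-aggregated metric.
const searchDurationMapping = {
  properties: {
    search_duration_ms: {
      type: 'aggregate_metric_double',
      metrics: ['min', 'max', 'sum', 'value_count'],
      default_metric: 'max',
    },
  },
};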

If you are worried about mapping conflicts or ECS compatibility, I think the best option here would just be a scoped event log, which adheres to the common schema of the alerting framework's event log and matches the index target wildcard that is being used. E.g., you'd end up with .kibana-event-log-siem-7.12.0 (or maybe even .kibana-event-log-execution-siem-7.12.0).

@banderror
Contributor Author

@dgieselaar Thank you for your feedback and suggestions, this is very-very interesting. It's kind of a lot to comment, let me do it piece-by-piece :)

The rule execution log needs to be both writeable and queryable from the Security Solution app. Besides metrics, we'd like to be able to log events and simple generic log messages. I'm unfortunately not familiar with Elastic APM (it's a shame), but it seems like it requires a separate APM server to send data to, and is also designed to consume metrics, traces and that kind of stuff as opposed to structured logs. Is that correct?

That said, I think the idea of dogfooding APM in Kibana in general and in our detection engine in particular is very interesting. Indeed, we have little data about detection rule execution metrics (only some metrics from task_manager) and no data about other Security Solution code outside of detection engine. And we'd like to start optimizing performance of our endpoints and detection engine. So as a tool for us, developers, integrating APM into our codebase sounds very important to me. Do you know about any existing integrations of APM in Kibana that we could use as a reference in the future?

it might be worth considering to just enrich the current execute events the event log already has today. The metrics you need could be stored on each execute event. I think that would make querying easier, because you have one execution event that is a summary of all the data that you need.

I'm not sure what you mean by current execute events - is it the events that the alerting plugin logs into the event log? This might be easier from the querying standpoint, but we (the Detections and Response team) don't own alerting's log and imho shouldn't couple these things together, at least at this point. Also, alerting doesn't have a notion of "gaps", "searching", "indexing" and other things specific to the detection engine. This might change in the future with the RAC initiative, but it's not guaranteed. All of that is still to be designed; @spong and @tsg might have some vision on that. At this point, in my opinion we should have separate logs (and thus, events) for two separate mechanisms: the alerting framework as it is now and the detection engine.

You can then use a top_metrics aggregation to get the values you need (top_metrics is a lot faster than top_hits in many cases)

Thank you for this suggestion. In this particular case I need to be able to specify 2 sort fields, and it seems like top_metrics doesn't support that: "It only supports a single sort value so which document wins ties is not specified". But this is an interesting aggregation; I didn't know about it and will definitely play with it 👍

If you are worried about mapping conflicts or ECS compatibility I think the best option here would to just be a scoped event log, which adheres to the common schema of the alerting framework's event log, and matches the index target wildcard that is being used. E.g., you'd end up with .kibana-event-log-siem-7.12.0 (or maybe even .kibana-event-log-execution-siem-7.12.0).

Yes, I was thinking about that: in order to "own" and control custom mappings (as well as potential RBAC requirements), clients of the event log might want to keep their logs in a separate index (that index would have all the default ECS fields, common event log fields like kibana.saved_objects, plus custom fields if a client specifies them). This is an open question: currently the event log uses a single index, and it's multitenant by original design. Maybe I'm missing something here; feedback needed :)

@dgieselaar
Member

Rule execution log needs to be both writeable and queryable from the Security Solution app. Besides metrics, we'd like to be able to log events and simple generic log messages. I'm not familiar with Elastic APM unfortunately (it's a shame), but seems like it requires a separate APM server to send data to, and also is designed to consume metrics, traces and that kind of stuff as opposed to structured logs. Is that correct?

Yes - although the concept of logging has many use cases. In APM terminology, a transaction event could be considered equivalent to a log event, depending on what you need. Theoretically, one could send APM data to the same cluster, similar to (an option of) stack monitoring, but it's more valuable to us on the engineering side than to end users.

That said, I think the idea of dogfooding APM in Kibana in general and in our detection engine in particular is very interesting. Indeed, we have little data about detection rule execution metrics (only some metrics from task_manager) and no data about other Security Solution code outside of detection engine. And we'd like to start optimizing performance of our endpoints and detection engine. So as a tool for us, developers, integrating APM into our codebase sounds very important to me. Do you know about any existing integrations of APM in Kibana that we could use as a reference in the future?

For the APM app, we use withApmSpan:

return withApmSpan('get_transaction_duration_chart_preview', async () => {

Fleet, Task Manager and Reporting also have custom instrumentation.

Thank you for this suggestion. In this particular case I need to be able to specify 2 sort fields, and seems like top_metrics doesn't support that: "It only supports a single sort value so which document wins ties is not specified". But this is an interesting aggregation, I didn't know about it, will definitely play with it 👍

I'm not sure if that is a conceptual limitation of the top_metrics aggregation; it might also be possible to add it. But maybe you could just add a field that is seq + timestamp and sort on that?
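For example, such a composite sort key could be as simple as this (purely illustrative):

// Shift the timestamp to leave room for the per-execution sequence number,
// producing a single numeric field that sorts by (timestamp, sequence).
const compositeSortKey = (timestampMs: number, sequence: number): number =>
  timestampMs * 1000 + sequence;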

Yes, I was thinking about that: in order to "own" and control custom mappings (as well as potentially RBAC requirements), clients of event log might want to keep their logs in a separate index (that index would have all the default ECS fields, common event log fields like kibana.saved_objects, plus custom fields if a client specifies them). This is an open question, currently event log uses a single index, and it's multitenant by original design. Maybe I'm missing something here, feedback needed :)

AFAIK there seems to be consensus that this is necessary for alerts at least. To me it seems like a small step to make that available for execute events as well.

@dgieselaar
Member

Btw, you can already enable APM today, by running Kibana with ELASTIC_APM_ACTIVE=true. It will then send APM data to the cluster behind https://ela.st/kibana-ops-ci-stats.

@banderror
Contributor Author

OK, after several discussions we identified the following steps. I'll be working on them and adding these changes to the event_log plugin:

  • Schema: Add support for missing ECS fields, both standard and custom (or even the whole of ECS); consider bumping the ECS version
  • Querying side: Create a scoped Elasticsearch client respecting the existing security and isolation model of the event log. Make sure it supports aggregations, sorting by multiple fields, limiting _source, custom Elasticsearch query DSL filters -- and actually everything that the Elasticsearch client provides via its API
  • Querying side: Support scoping (filtering) by log.logger and maybe arbitrary fields from the schema
  • Logging side: Improve registration by registering a logger and specifying any additional fields to write, as it would be in a typical structured logging implementation
  • Remove restrictions on which fields can be used for sorting

Other ideas are out of scope for now. Maybe we will revisit them later, when we get more clarity on RAC's alerts-as-data implementation led by @spong:

  • Unify event log and alerts-as-data under a single abstraction (or build alerts-as-data and event log on top of a common abstraction). The problem to abstract is writing, querying and managing indices that keep ECS-like objects in them.
  • Add support for multiple event log indices (e.g. one for each plugin/provider, in other words client)
  • Change security model to make it based on index name and/or provider name
  • Implement document migration

Thanks everybody for your feedback - @dgieselaar @dhurley14 @dplumlee @peluja1012 @pmuellr @spong

@kibanamachine
Contributor

kibanamachine commented Mar 22, 2021

💔 Build Failed

Failed CI Steps


Test Failures

Kibana Pipeline / general / "before all" hook for "should contain the right query".Timeline query tab Query tab "before all" hook for "should contain the right query"

Link to Jenkins

Stack Trace

Failed Tests Reporter:
  - Test has not failed recently on tracked branches

AssertionError: Timed out retrying after 60000ms: Expected to find element: `[data-test-subj="title-0f2fb180-8b3a-11eb-b88c-ad3fd394cb5d"]`, but never found it.

Because this error occurred during a `before all` hook we are skipping the remaining tests in the current suite: `Timeline query tab`

Although you have test retries enabled, we do not retry tests when `before all` or `after all` hooks fail
    at Object.openTimelineById (http://localhost:6121/__cypress/tests?p=cypress/integration/timelines/query_tab.spec.ts:16074:15)
    at Context.eval (http://localhost:6121/__cypress/tests?p=cypress/integration/timelines/query_tab.spec.ts:15045:28)

Kibana Pipeline / general / "before all" hook for "should open a modal".Open timeline Open timeline modal "before all" hook for "should open a modal"

Link to Jenkins

Stack Trace

Failed Tests Reporter:
  - Test has not failed recently on tracked branches

AssertionError: Timed out retrying after 60000ms: Expected to find element: `[data-test-subj="title-6aa4a4a0-8b33-11eb-b965-913a5ba19dfe"]`, but never found it.

Because this error occurred during a `before all` hook we are skipping the remaining tests in the current suite: `Open timeline`

Although you have test retries enabled, we do not retry tests when `before all` or `after all` hooks fail
    at Object.openTimelineById (http://localhost:6121/__cypress/tests?p=cypress/integration/timelines/open_timeline.spec.ts:16088:15)
    at Context.eval (http://localhost:6121/__cypress/tests?p=cypress/integration/timelines/open_timeline.spec.ts:15046:28)

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @banderror

Comment on lines +372 to +376
ECS version:

- Which version of ECS does the Event Log support? Is it strictly `1.6.0` or several versions can be supported at the same time for different clients?
- Who/when/how will upgrade the version(s) supported by Event Log?
- Should (or can) clients specify `ecs.version` when logging events?
Contributor Author

This

banderror added a commit that referenced this pull request Mar 29, 2021
…Engine (#95067)

**Related to:** #94143

## Summary

This PR adds new fields to the schema (`EventSchema`, `IEvent`):

- standard ECS fields: `error.*`, `event.*`, `log.level`, `log.logger`, `rule.*`
- custom field set `kibana.detection_engine`

We need these fields on the Detections side to implement detection rule execution log. See the related proposal (#94143) for more details.

Also, this PR bumps the ECS version used in Event Log from `1.6.0` to the current `1.8.0`. The two versions are 100% the same in terms of the fields used in Event Log, so this version increment caused no changes in the schema.
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Mar 29, 2021
…Engine (elastic#95067)
kibanamachine added a commit that referenced this pull request Mar 29, 2021
…Engine (#95067) (#95654)

Co-authored-by: Georgii Gorbachev <[email protected]>
@spong spong mentioned this pull request Apr 1, 2021
banderror added a commit that referenced this pull request May 27, 2021
**Needed for:** rule execution log for Security #94143
**Related to:**

- alerts-as-data: #93728, #93729, #93730
- RFC for index naming #98912

## Summary

This PR adds a mechanism for writing to, reading from, and bootstrapping indices for the RAC project into the `rule_registry` plugin - particularly, indices for alerts-as-data and rule execution events. This implementation is similar to existing ones like the `event_log` plugin (see #98353 (comment) for the historical perspective), but we're going to converge all of them into 1 or 2 implementations. At the very least we should have a single one in `rule_registry` itself.

In this PR I tried to incorporate most of the feedback received in the RFC (#98912), but if you notice I missed/forgot something, please let me know in the comments.

Done in this PR:

- [x] Schema-agnostic APIs for working with Elasticsearch.
- [x] Schema-aware log definition and bootstrapping API (creating hierarchical logs).
- [x] Schema-aware write API (logging events).
- [x] Schema-aware read API (searching logs, filtering, sorting, pagination, aggregation).
- [x] Support for Kibana spaces, space-aware index bootstrapping (either at rule creation or rule execution time).

As for reviewing this PR, perhaps it might be easier to start with:

- checking description of #98912
- checking usage examples https://github.com/elastic/kibana/pull/98353/files#diff-c049ff2198cc69bd50a69e92d29e88da7e10b9a152bdaceaf3d41826e712c12b
- checking public api https://github.com/elastic/kibana/pull/98353/files#diff-8e9ef0dbcbc60b1861d492a03865b2ae76a56ec38ada61898c991d3a74bd6268

## Next steps

Next steps towards rule execution log in Security (#94143):

- define actual schema for rule execution events
- inject instance of rule execution log into Security rule executors and route handlers
- implement actual execution logging in rule executors
- update route handlers to start fetching execution events and metrics from the log instead of custom saved objects

Next steps in the context of RAC and unified implementation:

- converge this implementation with `RuleDataService` implementation
  - implement robust index bootstrapping
  - reconsider using FieldMap as a generic type parameter
  - implement validation for documents being indexed
- cover the final implementation with tests
- write comprehensive docs: update plugin README, add JSDoc comments to all public interfaces
banderror added a commit to banderror/kibana that referenced this pull request May 27, 2021
banderror added a commit that referenced this pull request May 27, 2021
@banderror
Contributor Author

We've partially implemented rule execution logging to Event Log (v7.16) and started reading the last 5 failures from it on the Rule Details page (v8.0) as part of #101013. I'm going to close this one, and we will continue the remaining work in follow-ups.

@banderror banderror closed this Nov 11, 2021
@banderror banderror deleted the rule-execution-log-proposal branch February 16, 2022 14:43