Debug around ML rule execution #189307

Closed
wants to merge 15 commits

Conversation

@rylnd (Contributor) commented Jul 26, 2024

Let's see if we can't find why these rules aren't generating alerts.

Summary

Summarize your PR. If it involves visual changes, include a screenshot or gif.

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
| --- | --- | --- | --- |
| Multiple Spaces: unexpected behavior in a non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in a non-default Kibana Space and when the user switches between spaces. |
| Multiple nodes: Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical errors, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document a manual testing procedure. |
| Code should gracefully handle cases when feature X or plugin Y is disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still leaves our service operational. |
See more potential risk examples

For maintainers

rylnd added 15 commits July 26, 2024 12:00
Let's see if we can't find why these rules aren't generating alerts.
None of these are showing up in the build. It's not yet clear whether
this is a log level / file descriptor issue, or whether our code just
isn't being executed.
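
Purely as an illustration of how one might separate those two possibilities (this is not code from the PR): write a marker straight to a file, bypassing the logger and its level/file-descriptor handling. If markers appear in the file but not in the build output, the code is executing and the logging path is the problem. The file path is an arbitrary assumption.

```ts
import { appendFileSync } from 'fs';

// Append a timestamped marker to a fixed file, bypassing the Kibana logger.
// The path is a placeholder chosen for illustration only.
export function debugMarker(message: string): void {
  appendFileSync('/tmp/ml-rule-debug.log', `${new Date().toISOString()} ${message}\n`);
}
```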
I'm not seeing these on CI.
Maybe we can see this?
Understanding what happens during the first rule execution (if there is
one) might help us to understand why we're not generating alerts that
first time.
If the rule is eventually going to succeed, we should see this resolve over time, assuming the problem lies in the rule executor's handling of the failure.
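
For illustration, "seeing this resolve" could look like polling the rule through the detection engine API until its latest execution reports success. This is a sketch under assumptions: the Kibana URL, the lack of auth handling, and the `execution_summary.last_execution.status` field path are assumed here, not confirmed by the PR.

```ts
// Sketch only: poll a detection rule until its most recent execution succeeds.
const KIBANA_URL = 'http://localhost:5601';

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function waitForRuleSuccess(id: string, timeoutMs = 300_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`${KIBANA_URL}/api/detection_engine/rules?id=${id}`, {
      headers: { 'kbn-xsrf': 'true' },
    });
    const rule = await res.json();
    // Field path is an assumption about the rule response shape.
    const status = rule.execution_summary?.last_execution?.status;
    console.log(`last execution status: ${status}`);
    if (status === 'succeeded') return;
    await sleep(5000);
  }
  throw new Error(`rule ${id} did not report a successful execution within ${timeoutMs}ms`);
}
```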
This should give better granularity on the following (a rough polling sketch follows the list):

* How long it takes for the ML job to become "started"
* How long it takes for the metrics to become available
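
Here is a minimal sketch, not the PR's actual code, of how those two milestones could be measured by polling the standard Elasticsearch ML stats APIs. The client setup, IDs, and poll interval are assumptions, and "metrics become available" is interpreted here as the job reporting processed records, which is also an assumption.

```ts
// Sketch only: log how long the datafeed takes to start and how long until
// the job has processed records, using the ML _stats APIs.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function logMlReadinessTimings(jobId: string, datafeedId: string): Promise<void> {
  const start = Date.now();

  // Milestone 1: datafeed reports state "started"
  for (;;) {
    const { datafeeds } = await client.ml.getDatafeedStats({ datafeed_id: datafeedId });
    if (datafeeds[0]?.state === 'started') break;
    await sleep(1000);
  }
  console.log(`datafeed started after ${Date.now() - start}ms`);

  // Milestone 2: job stats show records have actually been processed
  for (;;) {
    const { jobs } = await client.ml.getJobStats({ job_id: jobId });
    if ((jobs[0]?.data_counts?.processed_record_count ?? 0) > 0) break;
    await sleep(1000);
  }
  console.log(`records processed after ${Date.now() - start}ms`);
}
```

Logging elapsed time at each milestone separates "the datafeed never started" from "it started but produced results too late for the first rule execution".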
There's still a chance that the datafeed/job will _no longer_ be ready by the time we hit the failing MKI tests (or maybe the timing issue pops up years from now 😉), but if this makes our tests more consistent, we can start to focus on this: better ML integration.
Let's see how long these are pausing; that might indicate an issue.
Despite our job being started, we're now receiving _no_ alerts, whereas before we had some. I think this is because the job has started but no anomalies are ready yet. This should validate that hypothesis.
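
As a rough way to check that hypothesis, one could count anomaly records for the job right after it starts; zero records at that point would be consistent with "job started, anomalies not yet available". The getRecords call is the standard ML results API, but the client setup and usage are assumptions, not the PR's code.

```ts
// Sketch only: count anomaly records for an ML job.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function countAnomalyRecords(jobId: string): Promise<number> {
  const { count } = await client.ml.getRecords({ job_id: jobId });
  return count;
}
```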
This is less restrictive than the ML helper, which seems to wait for the
job to report as having processed records. Let's see if this
implementation works for us.
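
A minimal sketch of that less restrictive wait, assuming the contrast is "job reports state opened" versus "job reports processed records". The function name, timeout, and poll interval are hypothetical, not the PR's implementation.

```ts
// Sketch only: wait for the ML job to reach state "opened" rather than
// waiting for processed records.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function waitForJobOpened(jobId: string, timeoutMs = 120_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const { jobs } = await client.ml.getJobStats({ job_id: jobId });
    if (jobs[0]?.state === 'opened') return;
    await sleep(2000);
  }
  throw new Error(`ML job ${jobId} did not reach "opened" within ${timeoutMs}ms`);
}
```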
@rylnd closed this Sep 28, 2024