Debug around ML rule execution #189307

Closed
wants to merge 15 commits

Conversation

@rylnd (Contributor) commented Jul 26, 2024

Let's see if we can't find why these rules aren't generating alerts.

Summary

Summarize your PR. If it involves visual changes, include a screenshot or gif.

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
| --- | --- | --- | --- |
| Multiple Spaces: unexpected behavior in a non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in a non-default Kibana Space and when the user switches between spaces. |
| Multiple nodes: Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical errors, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document a manual testing procedure. |
| Code should gracefully handle cases when feature X or plugin Y is disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still leaves our service operational. |
See more potential risk examples

For maintainers

rylnd added 15 commits July 26, 2024 12:00
Let's see if we can't find why these rules aren't generating alerts.
None of these are showing up in the build. It's not yet clear whether
this is a log level / file descriptor issue, or whether our code just
isn't being executed.
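
Purely as an illustration of how one might separate those two possibilities (this is not code from the PR): write a marker straight to a file, bypassing the logger and its level/file-descriptor handling. If markers appear in the file but not in the build output, the code is executing and the logging path is the problem. The file path is an arbitrary assumption.

```ts
import { appendFileSync } from 'fs';

// Append a timestamped marker to a fixed file, bypassing the Kibana logger.
// The path is a placeholder chosen for illustration only.
export function debugMarker(message: string): void {
  appendFileSync('/tmp/ml-rule-debug.log', `${new Date().toISOString()} ${message}\n`);
}
```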
I'm not seeing these on CI.
Maybe we can see this?
Understanding what happens during the first rule execution (if there is
one) might help us to understand why we're not generating alerts that
first time.
If the rule is eventually going to succeed, we should see this resolve over time, assuming the problem lies in the rule executor's handling of the failure.
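
For illustration, "seeing this resolve" could look like polling the rule through the detection engine API until its latest execution reports success. This is a sketch under assumptions: the Kibana URL, the lack of auth handling, and the `execution_summary.last_execution.status` field path are assumed here, not confirmed by the PR.

```ts
// Sketch only: poll a detection rule until its most recent execution succeeds.
const KIBANA_URL = 'http://localhost:5601';

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function waitForRuleSuccess(id: string, timeoutMs = 300_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`${KIBANA_URL}/api/detection_engine/rules?id=${id}`, {
      headers: { 'kbn-xsrf': 'true' },
    });
    const rule = await res.json();
    // Field path is an assumption about the rule response shape.
    const status = rule.execution_summary?.last_execution?.status;
    console.log(`last execution status: ${status}`);
    if (status === 'succeeded') return;
    await sleep(5000);
  }
  throw new Error(`rule ${id} did not report a successful execution within ${timeoutMs}ms`);
}
```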
This should give better granularity on the following (a rough polling sketch follows the list):

* How long it takes for the ML job to become "started"
* How long it takes for the metrics to become available
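
Here is a minimal sketch, not the PR's actual code, of how those two milestones could be measured by polling the standard Elasticsearch ML stats APIs. The client setup, IDs, and poll interval are assumptions, and "metrics become available" is interpreted here as the job reporting processed records, which is also an assumption.

```ts
// Sketch only: log how long the datafeed takes to start and how long until
// the job has processed records, using the ML _stats APIs.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function logMlReadinessTimings(jobId: string, datafeedId: string): Promise<void> {
  const start = Date.now();

  // Milestone 1: datafeed reports state "started"
  for (;;) {
    const { datafeeds } = await client.ml.getDatafeedStats({ datafeed_id: datafeedId });
    if (datafeeds[0]?.state === 'started') break;
    await sleep(1000);
  }
  console.log(`datafeed started after ${Date.now() - start}ms`);

  // Milestone 2: job stats show records have actually been processed
  for (;;) {
    const { jobs } = await client.ml.getJobStats({ job_id: jobId });
    if ((jobs[0]?.data_counts?.processed_record_count ?? 0) > 0) break;
    await sleep(1000);
  }
  console.log(`records processed after ${Date.now() - start}ms`);
}
```

Logging elapsed time at each milestone separates "the datafeed never started" from "it started but produced results too late for the first rule execution".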
There's still a chance that the datafeed/job will _no longer_ be ready by the time we hit the failing MKI tests (or maybe the timing issue pops up years from now 😉), but if this makes our tests more consistent, we can start to focus on this: better ML integration.
Let's see how long these are pausing; that might indicate an issue.
Despite our job being started, we're now receiving _no_ alerts, whereas before we had some. I think this is because the job has started but no anomalies are ready yet. This should validate that hypothesis.
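
As a rough way to check that hypothesis, one could count anomaly records for the job right after it starts; zero records at that point would be consistent with "job started, anomalies not yet available". The getRecords call is the standard ML results API, but the client setup and usage are assumptions, not the PR's code.

```ts
// Sketch only: count anomaly records for an ML job.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function countAnomalyRecords(jobId: string): Promise<number> {
  const { count } = await client.ml.getRecords({ job_id: jobId });
  return count;
}
```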
This is less restrictive than the ML helper, which seems to wait for the
job to report as having processed records. Let's see if this
implementation works for us.
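
A minimal sketch of that less restrictive wait, assuming the contrast is "job reports state opened" versus "job reports processed records". The function name, timeout, and poll interval are hypothetical, not the PR's implementation.

```ts
// Sketch only: wait for the ML job to reach state "opened" rather than
// waiting for processed records.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function waitForJobOpened(jobId: string, timeoutMs = 120_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const { jobs } = await client.ml.getJobStats({ job_id: jobId });
    if (jobs[0]?.state === 'opened') return;
    await sleep(2000);
  }
  throw new Error(`ML job ${jobId} did not reach "opened" within ${timeoutMs}ms`);
}
```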
@rylnd closed this Sep 28, 2024