Debug around ML rule execution #189307

Closed
wants to merge 15 commits into from

Commits on Jul 26, 2024

  1. Debug around ML rule execution

    Let's see if we can't find why these rules aren't generating alerts.
    rylnd committed Jul 26, 2024
    4c0b463
  2. More debugging

    None of these are showing up in the build. It's not yet clear whether
    this is a log level / file descriptor issue, or whether our code just
    isn't being executed.
    rylnd committed Jul 26, 2024
    5ff34e8
  3. 5924a05
  4. Try debug-logging at a higher level

    I'm not seeing these on CI.
    rylnd committed Jul 26, 2024
    8c6df4e
  5. Log only alerts from ML rules.

    rylnd committed Jul 26, 2024
    13d0256
  6. Log with our execution logger

    Maybe we can see this?
    rylnd committed Jul 26, 2024
    7bf8523
  7. Debug our rule statuses as we wait for success

    Understanding what happens during the first rule execution (if there is
    one) might help us to understand why we're not generating alerts that
    first time.
    rylnd committed Jul 26, 2024
    a30c4f9

Commits on Jul 27, 2024

  1. Retry metrics for six minutes

    If the rule is eventually going to succeed, and the problem lies in
    the rule executor's handling of the failure, we should see this resolve
    within that window.
    rylnd committed Jul 27, 2024
    2027fdb
  2. Reduce testing rule interval to 30s

    This should give better granularity on the following:
    
    * How long it takes for the ML job to become "started"
    * How long it takes for the metrics to become available
    rylnd committed Jul 27, 2024
    87b6e5f
  3. Use the minimum interval of 1m for our test ML rules

    See previous commit for context.
    rylnd committed Jul 27, 2024
    4d075dc
  4. Wait for the ML API to report our job as "ready" in FTR

    There's still a chance that the datafeed/job will _no longer_ be ready
    by the time we hit the failing MKI tests (or maybe the timing issue pops
    up years from now 😉), but if this makes our tests more consistent
    we can start to focus on this: better ML integration.
    rylnd committed Jul 27, 2024
    3440144
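
The Jul 27 commits describe retrying for up to six minutes and waiting for the ML job to report as ready before the tests proceed. A minimal sketch of that polling pattern follows, not the code from this PR. `waitFor`, `isJobReady`, `JobStatus`, and `WaitOptions` are hypothetical names, and treating "ready" as an `opened` job with a `started` datafeed is an assumption.

```typescript
// Hypothetical status shape; a real ML API response carries more fields.
interface JobStatus {
  jobState: string; // e.g. 'opening', 'opened', 'closed'
  datafeedState: string; // e.g. 'starting', 'started', 'stopped'
}

interface WaitOptions {
  timeoutMs: number; // total budget, e.g. 6 * 60 * 1000 for six minutes
  intervalMs: number; // pause between probes
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real waiting
  now?: () => number; // injectable clock
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `probe` until it yields a value other than undefined, pausing
// `intervalMs` between calls. Gives up once the next probe would start
// after the deadline.
async function waitFor<T>(
  description: string,
  probe: () => Promise<T | undefined>,
  opts: WaitOptions
): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  const now = opts.now ?? (() => Date.now());
  const start = now();
  let attempts = 0;

  while (true) {
    attempts++;
    const result = await probe();
    if (result !== undefined) {
      return result;
    }
    const elapsed = now() - start;
    if (elapsed + opts.intervalMs > opts.timeoutMs) {
      throw new Error(
        `Timed out waiting for ${description} after ${attempts} attempts (${elapsed}ms)`
      );
    }
    await sleep(opts.intervalMs);
  }
}

// Assumption: "ready" means the job is opened AND its datafeed is started.
function isJobReady(status: JobStatus): boolean {
  return status.jobState === 'opened' && status.datafeedState === 'started';
}
```

A caller would wrap whatever fetches the job status in a probe that returns the status when `isJobReady` holds and `undefined` otherwise, passing `timeoutMs: 6 * 60 * 1000`. Logging `attempts` and the elapsed time on success would also show how long the job takes to start, which is the granularity the 30s interval commit is after.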

Commits on Jul 29, 2024

  1. Log around our ML waiting utilities

    Let's see how long these are pausing; that might indicate an issue.
    rylnd committed Jul 29, 2024
    b4159d5
  2. Wait for anomalies before beginning our test

    Despite our job being started, we're now receiving _no_ alerts, whereas
    before we had some. I think this is because the job has started, but no
    anomalies are ready yet. This should validate that hypothesis.
    rylnd committed Jul 29, 2024
    4853b10
  3. Wait for anomalies to be searchable

    This is less restrictive than the ML helper, which seems to wait for the
    job to report as having processed records. Let's see if this
    implementation works for us.
    rylnd committed Jul 29, 2024
    02a5ad1
  4. Fix missing import

    rylnd committed Jul 29, 2024
    2af7fb5
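
The Jul 29 commits contrast two readiness signals: the job reporting processed records, and anomalies actually being searchable. The sketch below illustrates the difference and is not the implementation from this PR. All function and interface names are hypothetical, the response shapes are trimmed-down assumptions, and the query assumes anomaly records live in `.ml-anomalies-*` indices with `job_id` and `result_type` fields.

```typescript
// Assumed subset of an Elasticsearch search response; only the fields used here.
interface SearchResponseLike {
  hits: { total: number | { value: number; relation: string } };
}

// Assumed subset of ML job stats.
interface JobStatsLike {
  data_counts: { processed_record_count: number };
}

// Builds a count-only search for anomaly records belonging to one job.
function buildAnomalyRecordSearch(jobId: string) {
  return {
    index: '.ml-anomalies-*',
    size: 0, // only the hit count matters
    track_total_hits: true,
    query: {
      bool: {
        filter: [
          { term: { job_id: jobId } },
          { term: { result_type: 'record' } },
        ],
      },
    },
  };
}

// hits.total may be a bare number or an object, depending on client and version.
function totalHits(response: SearchResponseLike): number {
  const total = response.hits.total;
  return typeof total === 'number' ? total : total.value;
}

// Signal 1: the job says it has processed input records.
function hasProcessedRecords(stats: JobStatsLike): boolean {
  return stats.data_counts.processed_record_count > 0;
}

// Signal 2: a search for anomaly records returns at least one hit.
function hasSearchableAnomalies(response: SearchResponseLike): boolean {
  return totalHits(response) > 0;
}
```

The two signals can disagree: a job may report processed records while a search for its anomalies still returns nothing. Since a rule that queries anomalies depends on the second signal, checking it directly before the first rule execution is one way to test the hypothesis in the "Wait for anomalies" commits.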