Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Inhibitor.MutesAll() #3933

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

Nuckal777
Copy link

For the context please refer to #3932.

I added benchmarks, which consider more than one target alert.
The following benchstat output compares:

  • invoking Inhibitor.Mutes() for each target alert (old.txt)
  • with the Inhibitor.MutesAll() implementation (new.txt).
    Inhibitor.MutesAll() increase the speed in certain cases at the cost of some memory.
goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/inhibit
                                                                        │    old.txt    │               new.txt                 │
                                                                        │    sec/op     │    sec/op     vs base                 │
Mutes/1_inhibition_rule,_1_inhibiting_alert,_1_target_alert-10              768.5n ± 3%    842.5n ± 5%    +9.62% (p=0.001 n=10)
Mutes/10_inhibition_rules,_1_inhibiting_alert,_1_target_alert-10            769.5n ± 2%    834.4n ± 0%    +8.44% (p=0.000 n=10)
Mutes/100_inhibition_rules,_1_inhibiting_alert,_1_target_alert-10           779.3n ± 1%    841.4n ± 1%    +7.96% (p=0.000 n=10)
Mutes/1000_inhibition_rules,_1_inhibiting_alert,_1_target_alert-10          839.5n ± 1%    914.7n ± 4%    +8.96% (p=0.000 n=10)
Mutes/1000_inhibition_rules,_100_inhibiting_alert,_100_target_alert-10     157.39µ ± 1%    54.41µ ± 1%   -65.43% (p=0.000 n=10)
Mutes/10000_inhibition_rules,_1_inhibiting_alert,_1_target_alert-10         897.5n ± 3%    962.6n ± 2%    +7.25% (p=0.000 n=10)
Mutes/10000_inhibition_rules,_100_inhibiting_alert,_100_target_alert-10     63.07µ ± 6%    57.59µ ± 3%    -8.68% (p=0.000 n=10)
Mutes/1_inhibition_rule,_10_inhibiting_alerts,_1_target_alert-10            886.9n ± 1%   1087.0n ± 1%   +22.57% (p=0.000 n=10)
Mutes/1_inhibition_rule,_100_inhibiting_alerts,_1_target_alert-10           1.929µ ± 2%    3.083µ ± 1%   +59.84% (p=0.000 n=10)
Mutes/1_inhibition_rule,_100_inhibiting_alerts,_100_target_alert-10        174.18µ ± 1%    60.44µ ± 1%   -65.30% (p=0.000 n=10)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts,_1_target_alert-10          12.67µ ± 1%    25.82µ ± 1%  +103.87% (p=0.000 n=10)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts,_1000_target_alert-10     12425.7µ ± 1%    582.3µ ± 1%   -95.31% (p=0.000 n=10)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts,_1_target_alert-10         104.4µ ± 1%    278.8µ ± 1%  +167.13% (p=0.000 n=10)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts,_1_target_alert-10       11.89µ ± 2%    24.69µ ± 0%  +107.65% (p=0.000 n=10)
Mutes/10_inhibition_rules,_last_rule_matches,_1_target_alert-10             1.012µ ± 3%    1.125µ ± 1%   +11.22% (p=0.000 n=10)
Mutes/100_inhibition_rules,_last_rule_matches,_1_target_alert-10            3.178µ ± 1%    3.872µ ± 1%   +21.86% (p=0.000 n=10)
Mutes/100_inhibition_rules,_last_rule_matches,_100_target_alerts-10         299.9µ ± 1%    227.9µ ± 1%   -24.00% (p=0.000 n=10)
Mutes/1000_inhibition_rules,_last_rule_matches,_1_target_alert-10           24.62µ ± 3%    31.57µ ± 1%   +28.24% (p=0.000 n=10)
Mutes/1000_inhibition_rules,_last_rule_matches,_1000_target_alert-10        24.29m ± 0%    18.27m ± 0%   -24.77% (p=0.000 n=10)
Mutes/10000_inhibition_rules,_last_rule_matches,_1_target_alert-10          239.6µ ± 6%    306.5µ ± 0%   +27.94% (p=0.000 n=10)
geomean                                                                     18.71µ         17.34µ         -7.36%

In cases where a low count of alerts is involved Inhibitor.MutesAll() is 2-3 times slower, due to the additional memory allocations.
This should still be sufficiently fast given the small scale.
When the count of target alerts is increased Inhibitor.MutesAll() becomes faster.
Invoking it as part of GET /alerts still needs to be done.

Copy link
Contributor

@grobinson-grafana grobinson-grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question about how you intend to use MutesAll. I see you have updated the tests and the benchmarks to use the new MutesAll function, but you don't call it anywhere other than tests? Do you intend to call it from MuteStage as well as APIv2? The PR seems incomplete without it?

@Nuckal777
Copy link
Author

Do you intend to call it from MuteStage as well as APIv2?

Yes.

The PR seems incomplete without it?

I thought I check the general interest for the topic before implementing everything. I will add it. Might take a few days.

@Nuckal777
Copy link
Author

I added MutesAll to the api and MuteStage code paths. The CI needs a re-run, network foo during setup.

Copy link
Contributor

@grobinson-grafana grobinson-grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! 👋 Thanks for making these changes! However, I'm afraid it confirms what I suspected. This change has some weird effects on existing code that I don't think should be accepted. For example:

GroupFunc now returns a list of booleans because alertFilter no longer operates on individual alerts, but slices of alerts:

GroupFunc func(func(*dispatch.Route) bool, func([]*types.Alert, time.Time) []bool) (dispatch.AlertGroups, map[model.Fingerprint][]string)

setAlertStatus now accepts a list of label sets, which doesn't make sense as the function sets the status for an individual alert.

setAlertStatusFn func(...prometheus_model.LabelSet)

You then have to build slices of label sets and index into the returned slices in notify/notify.go.

Is it possible to optimize inhibition rules without changing the Mutes interface?

types/types.go Outdated
@@ -472,6 +472,7 @@ func (a *Alert) Merge(o *Alert) *Alert {
// Mutes.
type Muter interface {
Mutes(model.LabelSet) bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need Mutes(model.LabelSet) bool? Can it be deleted?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MuteStage.Exec() currently depends on Inhibitor implementing Muter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof! Yeah, OK. I would recommend exploring other optimizations (only if you want and have the time to do so) which don't involve changing this interface, as changing the interface is quite an intrusive change and causes a number of problems elsewhere in the code.

@Nuckal777
Copy link
Author

Is it possible to optimize inhibition rules without changing the Mutes interface?

I implemented that. In the current state the algorithmic improvement can only be realised when all alerts are passed to MutesAll(). By moving setAlertStatus out of alertFilter the different API handlers can choose how to update it. For getAlertsHandler I used MutesAll() here.

@grobinson-grafana
Copy link
Contributor

Is it possible to optimize inhibition rules without changing the Mutes interface?

I implemented that. In the current state the algorithmic improvement can only be realised when all alerts are passed to MutesAll().

I looked again at the code, but I'm afraid I don't see where the optimization is implemented for Mutes()? Like you said, the performance optimization only works if you pass all alerts to MutesAll(), and there is no performance improvement if you only call Mutes()? Did I misunderstand?

My suggestion was to see if we can optimize the performance of Mutes without having to pass all alerts to MutesAll(). For example, what if the cache is persistent, like matcherCache for silences?

@Nuckal777
Copy link
Author

Nuckal777 commented Sep 6, 2024

there is no performance improvement if you only call Mutes()?

Exactly. The whole speed improvement comes from caching the source cache evaluation of InhibitRules across multiple alerts. In the context of inhibitions, that's the whole set of potential target alerts. For MutesAll() it roughly boils down to:

for _, r := range ih.rules {
    var alerts []*types.Alert // <- reused in the inner loop
    var scacheEval []bool // <- reused in the inner loop
    for i, lset := range lsets {
        // ...
        if inhibitedByFP, eq := r.hasEqualCached(lset, r.SourceMatchers.Matches(lset), alerts, scacheEval); eq {
            ih.marker.SetInhibited(fingerprints[i], inhibitedByFP.String())
        }
        // ...
    }
}

Just using Mutes() flips the two loops. That looks roughly like:

func someOtherFunc() {
    // ...
    for _, a := range alerts {
        inhibitor.Mutes(a.Labels) // <- evaluates the source cache for each a
    }
}

func (ih *Inhibitor) Mutes(lset model.LabelSet) bool {
    // ...
    for _, r := range ih.rules {
        // ...
        if inhibitedByFP, eq := r.hasEqual(lset, r.SourceMatchers.Matches(lset)); eq { // <- always evaluates the source cache matches
            ih.marker.SetInhibited(fp, inhibitedByFP.String())
            return true
        }
    }
}

Caching the source cache evaluations "requires" to pass a list of alerts. If the signature change/addition is undesirable, we can close this PR. It doesn't work without. Technically, the state MutesAll() caches can be externalized into a struct, which implements the Muter interface, e.g.:

type CachingInhibitor {
    base Inhibitor
    var alerts [][]*types.Alert // one cache for each inhibition rule
    var scacheEval [][]bool // one cache for each inhibition rule
}

Externalizing the state would cost more memory, likely requires locking and requires explicitly dropping these caches at some point (likely after calling Mutes() a few times).

My suggestion was to see if we can optimize the performance of Mutes without having to pass all alerts to MutesAll(). For example, what if the cache is persistent, like matcherCache for silences?

I expect something in that direction is possible as well, but it's something for different PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants