
Minimize missed rule group evaluations #6129

Merged (6 commits, Sep 4, 2024)

Conversation

rajagopalanand
Contributor

@rajagopalanand rajagopalanand commented Jul 30, 2024

What this PR does:

Currently, once a Ruler instance loads a rule group, it evaluates that group continuously. If the instance evaluating the rule group becomes unavailable, there is a high chance of missed evaluations before the instance recovers or another instance loads and evaluates the rule group. Ruler instances can become unavailable for a variety of reasons, including bad underlying nodes and OOM kills. The issue is exacerbated when a Ruler instance appears healthy in the cluster ring but is actually in an unhealthy state.

This PR addresses the problem by introducing a check to ensure that the primary Ruler is alive and in a running state. Here’s how it works:

  1. Liveness Check: Non-primary Rulers will perform a liveness check on the primary Ruler for each rule group when syncing rule groups from external storage.
  2. Fallback Mechanism: If the primary Ruler is unresponsive or not in a running state, the non-primary Ruler will assume ownership of the rule group and take over its evaluation.
  3. Relinquish Ownership: If the primary Ruler is alive and running and a non-primary Ruler currently owns the rule group, the non-primary Ruler relinquishes ownership by not claiming the group and unloading it from the Prometheus rule manager.

With this change, the maximum duration of missed evaluations will be limited to the sync interval of the rule groups, reducing the impact of primary Ruler unavailability.
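
To make the intended behavior concrete, here is a minimal, self-contained sketch of the ownership decision described above. The names are illustrative only and are not the actual code in pkg/ruler; in the PR the liveness check is a gRPC call between rulers.

package main

import (
	"fmt"
	"time"
)

// livenessProbe stands in for the gRPC liveness check a non-primary ruler
// performs against another ruler (illustrative, not the real client).
type livenessProbe func(addr string, timeout time.Duration) bool

// shouldEvaluate decides whether this instance should evaluate a rule group.
// replicas is the ring's replica set for the group, ordered by preference:
// replicas[0] is the primary. A non-primary replica takes over only when
// every replica ahead of it fails the liveness check.
func shouldEvaluate(replicas []string, self string, probe livenessProbe, timeout time.Duration) bool {
	for i, addr := range replicas {
		if addr != self {
			continue
		}
		if i == 0 {
			return true // primary always evaluates
		}
		for _, prev := range replicas[:i] {
			if probe(prev, timeout) {
				return false // a higher-priority replica is alive and owns the group
			}
		}
		return true // all higher-priority replicas are down: take over
	}
	return false // this instance is not in the replica set for the group
}

func main() {
	down := map[string]bool{"ruler-1": true}
	probe := func(addr string, _ time.Duration) bool { return !down[addr] }
	replicas := []string{"ruler-1", "ruler-2", "ruler-3"}
	// ruler-2 takes over because the primary (ruler-1) fails the liveness check.
	fmt.Println(shouldEvaluate(replicas, "ruler-2", probe, 100*time.Millisecond))
}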

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@rajagopalanand rajagopalanand force-pushed the ruler-ha-sync branch 5 times, most recently from 278226a to 6c2871b on July 30, 2024 at 20:30
@rajagopalanand rajagopalanand marked this pull request as ready for review July 30, 2024 20:34
@rajagopalanand rajagopalanand force-pushed the ruler-ha-sync branch 3 times, most recently from 927b0f8 to f4a71a8 on July 31, 2024 at 00:30

# Enable high availability
# CLI flag: -ruler.enable-ha
[enable_ha: <boolean> | default = false]
Contributor

enable_ha feels like a bad name to me. Can we just replace this with checking the replication factor?
If we can't, can we be more specific about which HA behavior we want to enable? For example, enable_ha_evaluation.

Contributor Author

API HA only checks the replication factor. My thought was to use a separate flag in case someone wants to use one or the other and not both. I can change this to enable_ha_evaluation.


# Timeout for liveness checks performed during rule sync
# CLI flag: -ruler.liveness-check-timeout
[liveness_check_timeout: <duration> | default = 1s]
Contributor

Is 1s a good default here?

Contributor Author

This is a simple/fast check, so I thought 1s would be a large enough default. With RF=2 it checks only one Ruler; with RF=3 it checks two Rulers. Do you think it needs to be higher/lower?

Member

Do we need this config at all? Can't we just use a sensible hardcoded timeout?

Contributor Author

I removed the config and set the default to 100ms, since the gRPC calls are pretty lightweight.

@@ -5,6 +5,8 @@ import (
"net"
"testing"

"github.com/cortexproject/cortex/pkg/util/services"
Contributor

Please group the imports.

@@ -273,6 +281,13 @@ type MultiTenantManager interface {
// | +-----------------+ |
// | |
// +---------------------------------------------------------------+

type ruler interface {
Contributor

Is this necessary?

Contributor Author

Please see #6129 (comment)

@@ -488,11 +517,11 @@ func tokenForGroup(g *rulespb.RuleGroupDesc) uint32 {
return ringHasher.Sum32()
}

func instanceOwnsRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string, forBackup bool) (bool, error) {
func instanceOwnsRuleGroup(r ruler, rr ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string, forBackup bool) (bool, error) {
Contributor

Should we just use *Ruler as the method receiver here?

Contributor Author

The reason for introducing an interface is to narrow down what is available for the functions to consume. Existing functions such as filterRuleGroups are not methods, to avoid accidentally using the Ruler's ring directly. If I pass in the Ruler struct as-is, it will expose the ring to those functions, which could lead to accidental use of it. That is why I introduced an interface: to narrow the scope of what is available to these functions. Of course, someone in the future could just add a method to the interface that exposes the ring. Having said all that, I'm open to suggestions here.
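
For context, the pattern being described is the usual interface-narrowing trick; below is a self-contained sketch of the idea (the names here are illustrative and are not the PR's actual types).

package main

import "fmt"

// wideRuler has more capabilities than the ownership helpers should touch,
// including direct access to the ring (represented here by a plain slice).
type wideRuler struct {
	ring []string
	addr string
}

// narrowRuler exposes only what the helpers need, so they cannot reach the
// ring by accident.
type narrowRuler interface {
	Addr() string
}

func (r *wideRuler) Addr() string { return r.addr }

// ownsGroup can only use the narrow view; the ring stays out of reach unless
// someone deliberately widens the interface later.
func ownsGroup(r narrowRuler, group string) bool {
	return r.Addr() != "" && group != ""
}

func main() {
	r := &wideRuler{ring: []string{"a", "b"}, addr: "ruler-1"}
	fmt.Println(ownsGroup(r, "group-1"))
}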

Contributor

Existing functions such as filterRuleGroups are not methods, to avoid accidentally using the Ruler's ring directly. If I pass in the Ruler struct as-is, it will expose the ring to those functions, which could lead to accidental use of it.

I am not sure I get the concern here. What's the worst that could go wrong if the Ruler's ring were accessed directly? instanceOwnsRuleGroup only seems to perform read operations on the ring, so it shouldn't be a big issue.

Contributor Author

I changed filterRuleGroups and other functions to methods on the Ruler and removed the interface

if r.Config().EnableHA {
for i, ruler := range rlrs.Instances {
if ruler.Addr == instanceAddr && i == 0 {
level.Debug(r.Logger()).Log("msg", "primary taking ownership", "user", g.User, "group", g.Name, "namespace", g.Namespace, "token", hash, "ruler", instanceAddr)
Contributor

I wonder if it really helps to log the token here. Does it help with debugging HA issues?

Contributor Author

It helps with troubleshooting evaluation issues. Users can filter logs by token and see what happened with a particular rule group.

Contributor

Users can filter logs by token and see what happened with a particular rule group

Are we going to document this clearly? Otherwise I don't think it is easy for a user to debug. Tbh, I don't know how I would map a particular rule group to a hash just by looking at the log.
Isn't the group and namespace already enough to locate a particular rule group?

Contributor Author

Removed

return &LivenessCheckResponse{State: int32(r.State())}, nil
}

func nonPrimaryInstanceOwnsRuleGroup(r ruler, g *rulespb.RuleGroupDesc, replicas []string, selfAddress string) bool {
Contributor

It would be great to have some comments on this function for better readability

@@ -1164,6 +1285,8 @@ func (r *Ruler) getShardedRules(ctx context.Context, userID string, rulesRequest
}
}
// Concurrently fetch rules from all rulers.
ctx, cancel := context.WithTimeout(ctx, r.cfg.ListRulesFanoutTimeout)
Contributor

This will be a behavior change for all Cortex users. Can we just use the API request's own timeout?

Contributor Author

If the client sets the timeout, then this timeout/deadline will be the same, and the overall API request will fail for the client. The desired behavior is not to fail the whole API request, but to be fault tolerant when a subset of Rulers does not respond within a certain amount of time. Ideally, r.cfg.ListRulesFanoutTimeout should be less than the API request's timeout. If we do not want to change existing behavior, we can set the default timeout to a much bigger number (max int), gate this behind the enable_ha_evaluation flag, or find a better option. Open to recommendations.

Member

+1 to Ben's comment... I think the solution here is not to set a timeout; instead, we should return as soon as we have a complete answer (quorum?).

Contributor Author

Working on this

Contributor

I'm curious: is this line change related to minimizing missed evaluations, or is it unrelated? If it's unrelated and you are solving another problem, I would rather have this in a different PR, just because it can introduce regressions.

Contributor Author

With Ruler HA, we can return as soon as we have a complete answer, but a new integration test I wrote was failing. The test fails because the IP of the ruler that gets killed is not removed from the Docker network fast enough during the test, which causes the gRPC call to that ruler to hang and the test to time out. This scenario can happen in non-test environments as well.

Contributor Author

+1 to Ben's comment... I think the solution here is not to set a timeout; instead, we should return as soon as we have a complete answer (quorum?)

When Ruler API HA is enabled and rules are backed up, the backed-up rules do not contain any state. In other words, they do not contain any alerts. Getting a complete answer means we have to get rules from 100% of the Rulers.

As an example: 3 AZs, RF=3, 3 rulers per AZ (9 total). Assume 1 ruler in AZ 2 is unhealthy. What we want is results from the remaining 8 rulers. Similarly, if we have 2 bad rulers, we need results from 7. In this scenario we can tolerate 2 AZs being down (RF=3), but if we bail as soon as results from 1 AZ are retrieved, the results will contain a lot of backup rules, which do not contain alerts.

I moved the timeout inside concurrency.ForEach so that each Rules() fan-out call gets its own timeout, rather than the timeout applying to the whole getShardedRules call.

Let me know if this is acceptable, or if there are other ideas to handle this scenario.
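
Roughly, the per-call timeout shape being described looks like the following. This is a self-contained sketch, not the PR's actual getShardedRules code; fetchRules stands in for the per-ruler Rules() gRPC call.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// fetchRules stands in for the Rules() call to one ruler (illustrative).
func fetchRules(ctx context.Context, addr string) error {
	select {
	case <-time.After(50 * time.Millisecond): // simulated response time
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	addrs := []string{"ruler-1", "ruler-2", "ruler-3"}
	perCallTimeout := 2 * time.Minute

	var wg sync.WaitGroup
	errs := make(chan error, len(addrs))
	for _, addr := range addrs {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			// Each fan-out call gets its own deadline, so one unreachable
			// ruler can only delay the overall response by perCallTimeout
			// instead of hanging the whole API request.
			callCtx, cancel := context.WithTimeout(context.Background(), perCallTimeout)
			defer cancel()
			if err := fetchRules(callCtx, addr); err != nil {
				errs <- fmt.Errorf("%s: %w", addr, err)
				return
			}
			errs <- nil
		}(addr)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			fmt.Println("tolerating failure:", err) // partial results are acceptable
		}
	}
	fmt.Println("merged rules from the rulers that responded")
}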

@@ -518,6 +563,68 @@ func instanceOwnsRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRu
return ownsRuleGroup, nil
}

func (r *Ruler) LivenessCheck(_ context.Context, request *LivenessCheckRequest) (*LivenessCheckResponse, error) {
level.Debug(r.logger).Log("msg", "liveness check request", "request received from", request.RulerAddress)
Contributor

It seems RulerAddress is only used here for debug logging purposes. Can we remove it?

Contributor Author

The whole line is a debug log, and it is not useful without the ruler address. If someone wants to troubleshoot, having the address of the ruler that performed the liveness check would be useful, I think.

Member

Hmm, can't we log it on the client side? Wouldn't that have the same effect?

Contributor Author

Removed

func (r *Ruler) LivenessCheck(_ context.Context, request *LivenessCheckRequest) (*LivenessCheckResponse, error) {
level.Debug(r.logger).Log("msg", "liveness check request", "request received from", request.RulerAddress)
if r.lifecycler.ServiceContext().Err() != nil || r.subservices.IsStopped() {
return nil, errors.New("ruler's context is canceled and might be stopping soon")
Contributor

I am unsure about this, tbh. When the r.lifecycler.ServiceContext().Err() != nil || r.subservices.IsStopped() condition is met, is the gRPC server still working correctly enough to handle the liveness check?

Contributor Author

Depending on how the Ruler is configured, r.lifecycler.ServiceContext() can be canceled while the gRPC server stays active for a long time. When r.lifecycler.ServiceContext() is canceled, Prometheus will stop evaluating rule groups as soon as possible, but if the Ruler instance has not shut down yet, it will respond to the secondary as though it is active, and that causes missed evaluations. This check is necessary.

r.subservices.IsStopped() is debatable. I only saw it in my testing in some cases during hard restarts (kill -9), when the gRPC server was already running but the ring was not fully operational.
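
As a self-contained illustration of the point above (stand-in types, not the actual Ruler code): once the context that drives evaluation is canceled, the liveness endpoint should stop reporting the replica as alive, even though the process and its gRPC server may linger.

package main

import (
	"context"
	"errors"
	"fmt"
)

// livenessHandler mimics a server whose RPC endpoint can outlive the service
// context that drives rule evaluation.
type livenessHandler struct {
	serviceCtx context.Context // canceled when evaluation is being torn down
}

// Check refuses to report "alive" once evaluation is shutting down, so a
// secondary ruler can take over instead of waiting for the process to exit.
func (h *livenessHandler) Check() error {
	if h.serviceCtx.Err() != nil {
		return errors.New("ruler's context is canceled and might be stopping soon")
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	h := &livenessHandler{serviceCtx: ctx}
	fmt.Println("before shutdown:", h.Check()) // <nil>: treated as alive
	cancel()                                   // evaluation stopping; process may still be up
	fmt.Println("during shutdown:", h.Check()) // error: secondary should take over
}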

@rajagopalanand rajagopalanand force-pushed the ruler-ha-sync branch 2 times, most recently from d7b5962 to 06d61ca on August 5, 2024 at 17:31
Comment on lines 591 to 604
err := concurrency.ForEach(ctx, jobs, len(jobs), func(ctx context.Context, job interface{}) error {
addr := job.(string)
rulerClient, err := r.GetClientFor(addr)
if err != nil {
errorChan <- err
level.Debug(r.Logger()).Log("msg", "unable to get client for ruler", "ruler addr", addr, "token", rgToken)
return nil
}
level.Debug(r.Logger()).Log("msg", "performing liveness check against", "addr", addr, "for", g.Name, "token", rgToken, "instance addr", selfAddress)

resp, err := rulerClient.LivenessCheck(ctx, &LivenessCheckRequest{
RulerAddress: selfAddress,
})
if err != nil {
errorChan <- err
level.Debug(r.Logger()).Log("msg", "liveness check failed", "addr", addr, "for", g.Name, "err", err.Error(), "token", rgToken)
return nil
}
level.Debug(r.Logger()).Log("msg", "liveness check succeeded ", "addr", addr, "for", g.Name, "token", rgToken, "ruler state", services.State(resp.GetState()))
responseChan <- resp
return nil
})

close(errorChan)
close(responseChan)
Member

If one ruler is taking a long time to reply to the liveness check, wouldn't that make this whole call hang?

Member

Discussed offline... I just think the timeout could be hardcoded rather than a config option.

Contributor Author

Set the timeout to 100ms and removed the config

@rajagopalanand rajagopalanand force-pushed the ruler-ha-sync branch 2 times, most recently from 3f8f286 to 9305f6b on August 7, 2024 at 18:23
cfg.RingCheckPeriod = 5 * time.Second
cfg.LivenessCheckTimeout = 100 * time.Millisecond
Member

Can this just be a const? It doesn't need to be in the config struct, I guess.

@@ -220,7 +225,12 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
f.BoolVar(&cfg.EnableQueryStats, "ruler.query-stats-enabled", false, "Report the wall time for ruler queries to complete as a per user metric and as an info level log message.")
f.BoolVar(&cfg.DisableRuleGroupLabel, "ruler.disable-rule-group-label", false, "Disable the rule_group label on exported metrics")

f.BoolVar(&cfg.EnableHAEvaluation, "ruler.enable-ha-evaluation", false, "Enable high availability")
f.DurationVar(&cfg.ListRulesFanoutTimeout, "ruler.list-rules-fanout-timeout", 2*time.Minute, "Timeout for fanout calls to other rulers")
f.DurationVar(&cfg.LivenessCheckTimeout, "ruler.liveness-check-timeout", 1*time.Second, "Timeout for liveness checks performed during rule sync")
Member

and this should be removed as well?

expectedNames[i] = ruleName

if num%2 == 0 {
alertCount++
Contributor

is this variable used anywhere?

Contributor Author

Removed

c, err := e2ecortex.NewClient("", "", "", ruler1.HTTPEndpoint(), "user-1")
require.NoError(t, err)
namespaceNames := []string{"test1", "test2", "test3", "test4", "test5"}
namespaceNameCount := make([]int, 5)
Contributor

Total nit: in case you want to add more users in the tests.

Suggested change
namespaceNameCount := make([]int, 5)
namespaceNameCount := make([]int, len(namespaceNames))


results, err := c.GetPrometheusRules(e2ecortex.RuleFilter{})
require.NoError(t, err)
require.Equal(t, numRulesGroups, len(results))
Contributor

I'm a bit confused about how this test validates HA. Should we validate that the total number of evaluations across all rules is what is expected?

Contributor Author

I updated the test to assert that the rules are evaluated by the remaining rulers

level.Debug(r.Logger()).Log("msg", "performing liveness check against", "addr", addr, "for", g.Name, "instance addr", selfAddress)

resp, err := rulerClient.LivenessCheck(ctx, &LivenessCheckRequest{
RulerAddress: selfAddress,
Contributor

Any reason why we are passing the address of the current ruler as a parameter here? I checked, and it is not being used.

Contributor Author

It's not being used now, but I was using it for troubleshooting/logging on the server side. It could still be useful for troubleshooting.

Contributor

If it is supposed to be used for troubleshooting, we can add a debug log statement in the handler, no? Leaving it in this state makes me think of YAGNI.

Member

I also think we can remove it for now and add it back when needed.

} else {
// Even if the replication factor is set to a number bigger than 1, only the first ruler evaluates the rule group
ownsRuleGroup = rlrs.Instances[0].Addr == instanceAddr
}
}

if ownsRuleGroup && ruleGroupDisabled(g, disabledRuleGroups) {
Contributor

nit: we could create a function that produces the return value for the ownsRuleGroup = true case.

That way we can return directly with early returns, instead of this nesting, which is confusing and hard to reason about.

func (r *Ruler) instanceOwnsRuleGroup(rr ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string, forBackup bool) (bool, error) {
	hash := tokenForGroup(g)

	rlrs, err := rr.Get(hash, RingOp, nil, nil, nil)
	if err != nil {
		return false, errors.Wrap(err, "error reading ring to verify rule group ownership")
	}

	if forBackup {
		// Only the second up to the last replica are used as backup.
		for i := 1; i < len(rlrs.Instances); i++ {
			if rlrs.Instances[i].Addr == instanceAddr {
				return ownsRuleGroupOrDisable(g, disabledRuleGroups)
			}
		}
	}
	if rlrs.Instances[0].Addr == instanceAddr {
		// Regardless of whether ruler HA is enabled, in this case this ruler is the primary.
		return ownsRuleGroupOrDisable(g, disabledRuleGroups)
	}
	if r.Config().EnableHAEvaluation {
		for i, ruler := range rlrs.Instances {
			if i == 0 {
				// We already checked the primary above.
				continue
			}
			if ruler.Addr == instanceAddr && r.nonPrimaryInstanceOwnsRuleGroup(g, rlrs.GetAddresses()[:i], instanceAddr) {
				level.Info(r.Logger()).Log("msg", "non-primary ruler taking ownership", "user", g.User, "group", g.Name, "namespace", g.Namespace, "ruler", instanceAddr)
				return ownsRuleGroupOrDisable(g, disabledRuleGroups)
			}
		}
	}
	// Means for sure that this instance is not the owner.
	return false, nil
}

func ownsRuleGroupOrDisable(g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups) (bool, error) {
	if ruleGroupDisabled(g, disabledRuleGroups) {
		return false, &DisabledRuleGroupErr{Message: fmt.Sprintf("rule group %s, namespace %s, user %s is disabled", g.Name, g.Namespace, g.User)}
	}
	return true, nil
}


Contributor Author

done

@yeya24 yeya24 mentioned this pull request Aug 9, 2024
@rajagopalanand rajagopalanand force-pushed the ruler-ha-sync branch 3 times, most recently from db7034d to 20b6783 on August 12, 2024 at 17:17
@rajagopalanand rajagopalanand marked this pull request as draft August 12, 2024 19:32
rulerClient, err := r.GetClientFor(addr)
if err != nil {
errorChan <- err
level.Debug(r.Logger()).Log("msg", "unable to get client for ruler", "ruler addr", addr)
Member

Should the level here be higher than debug?

Comment on lines +695 to +694
if ctx.Err() != nil {
level.Info(r.logger).Log("msg", "context is canceled. not syncing rules")
return
}
Member

Why did we need to introduce this?

Contributor Author

When the ruler is shutting down (pod terminating), syncs still happen, because the manager waits for rule groups to finish their in-flight evaluations. I think it's better not to sync rule groups while the ruler is shutting down.

Member

It just seems weird that you check it here, since this context can be cancelled anywhere in the code after this line as well, so what you're trying to prevent can still happen anyway?

// but only ring passed as parameter.
func filterBackupRuleGroups(userID string, ruleGroups []*rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, ring ring.ReadRing, instanceAddr string, log log.Logger, ringCheckErrors prometheus.Counter) []*rulespb.RuleGroupDesc {
// This method must not use r.ring, but only ring passed as parameter
func (r *Ruler) filterBackupRuleGroups(userID string, ruleGroups []*rulespb.RuleGroupDesc, owned []*rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, ring ring.ReadRing, instanceAddr string, log log.Logger, ringCheckErrors prometheus.Counter) []*rulespb.RuleGroupDesc {
Member

It seems like lots of parameters can be removed from this function now that it is a method on the Ruler struct?

@@ -1189,6 +1300,8 @@ func (r *Ruler) getShardedRules(ctx context.Context, userID string, rulesRequest
return errors.Wrapf(err, "unable to get client for ruler %s", addr)
}

ctx, cancel := context.WithTimeout(ctx, r.cfg.ListRulesFanoutTimeout)
Member

I think this change is not related to this PR?

Can we remove it from here and create another PR where we can discuss possible solutions for it?

Contributor Author

This is related, as we discussed offline. I changed it to remote_timeout under the ruler_client section.

@alanprot
Member

Approved... other than changing the ruler timeout config to be under the ruler_client section, it LGTM

@rajagopalanand
Contributor Author

Approved... other than changing the ruler timeout config to be under the ruler_client section, it LGTM

Pushed up a new commit. Please take a look

@@ -4142,6 +4142,10 @@ ruler_client:
# CLI flag: -ruler.client.tls-insecure-skip-verify
[tls_insecure_skip_verify: <boolean> | default = false]

# Timeout for downstream rulers.
Contributor

I think we need a doc describing Ruler HA: how to enable it and what the various configs/flags do.
We added 3 new flags in this PR, but I am not sure whether I need to tune/change them or whether they work out of the box.

Contributor Author

I agree. I will submit another PR for the doc

@yeya24
Contributor

yeya24 commented Sep 3, 2024

Note that #5862 needs to be merged before this PR. Without enough approvals on the proposal itself, I don't think we can merge the actual implementation.

Contributor

@yeya24 yeya24 left a comment

Thanks

@yeya24 yeya24 merged commit 1c9c53b into cortexproject:master Sep 4, 2024
16 checks passed
Contributor

@yeya24 yeya24 left a comment

Umm, I think we forgot to mark this as experimental. We can do it in the next PR.


5 participants