
Minimize missed rule group evaluations #6129

Merged (6 commits, Sep 4, 2024)

Conversation

rajagopalanand
Contributor

@rajagopalanand rajagopalanand commented Jul 30, 2024

What this PR does:

Currently, once a Ruler instance loads a rule group, it evaluates that group continuously. If the instance evaluating the rule group becomes unavailable, there is a high chance of missed evaluations before the instance recovers or another instance loads and evaluates the rule group. Ruler instances can become unavailable for a variety of reasons, including bad underlying nodes and OOM kills. The issue is exacerbated when a Ruler instance appears healthy in the cluster ring but is actually in an unhealthy state.

This PR addresses the problem by introducing a check to ensure that the primary Ruler is alive and in a running state. Here’s how it works:

  1. Liveness Check: Non-primary Rulers will perform a liveness check on the primary Ruler for each rule group when syncing rule groups from external storage.
  2. Fallback Mechanism: If the primary Ruler is unresponsive or not in a running state, the non-primary Ruler will assume ownership of the rule group and take over its evaluation.
  3. Relinquish Ownership: If the primary Ruler is alive and running and a non-primary Ruler currently owns the rule group, the non-primary Ruler relinquishes ownership by not claiming the group and unloading it from the Prometheus rule manager.

With this change, the maximum duration of missed evaluations will be limited to the sync interval of the rule groups, reducing the impact of primary Ruler unavailability.
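
To make the intended behavior concrete, here is a minimal, self-contained sketch of the ownership decision described above. The names are illustrative only and are not the actual code in pkg/ruler; in the PR the liveness check is a gRPC call between rulers.

package main

import (
	"fmt"
	"time"
)

// livenessProbe stands in for the gRPC liveness check a non-primary ruler
// performs against another ruler (illustrative, not the real client).
type livenessProbe func(addr string, timeout time.Duration) bool

// shouldEvaluate decides whether this instance should evaluate a rule group.
// replicas is the ring's replica set for the group, ordered by preference:
// replicas[0] is the primary. A non-primary replica takes over only when
// every replica ahead of it fails the liveness check.
func shouldEvaluate(replicas []string, self string, probe livenessProbe, timeout time.Duration) bool {
	for i, addr := range replicas {
		if addr != self {
			continue
		}
		if i == 0 {
			return true // primary always evaluates
		}
		for _, prev := range replicas[:i] {
			if probe(prev, timeout) {
				return false // a higher-priority replica is alive and owns the group
			}
		}
		return true // all higher-priority replicas are down: take over
	}
	return false // this instance is not in the replica set for the group
}

func main() {
	down := map[string]bool{"ruler-1": true}
	probe := func(addr string, _ time.Duration) bool { return !down[addr] }
	replicas := []string{"ruler-1", "ruler-2", "ruler-3"}
	// ruler-2 takes over because the primary (ruler-1) fails the liveness check.
	fmt.Println(shouldEvaluate(replicas, "ruler-2", probe, 100*time.Millisecond))
}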

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@rajagopalanand rajagopalanand force-pushed the ruler-ha-sync branch 5 times, most recently from 278226a to 6c2871b on July 30, 2024 at 20:30
@rajagopalanand rajagopalanand marked this pull request as ready for review July 30, 2024 20:34
@rajagopalanand rajagopalanand force-pushed the ruler-ha-sync branch 3 times, most recently from 927b0f8 to f4a71a8 on July 31, 2024 at 00:30

# Enable high availability
# CLI flag: -ruler.enable-ha
[enable_ha: <boolean> | default = false]
Contributor

enable_ha feels like a bad name to me. Can we just replace this with checking the replication factor?
If we can't, can we be more specific about which HA behavior we want to enable? For example, enable_ha_evaluation.

Contributor Author

API HA only checks the replication factor. My thought was to use a separate flag in case someone wants to use one or the other and not both. I can change this to enable_ha_evaluation.


# Timeout for liveness checks performed during rule sync
# CLI flag: -ruler.liveness-check-timeout
[liveness_check_timeout: <duration> | default = 1s]
Contributor

Is 1s a good default here?

Contributor Author

This is a simple/fast check, so I thought 1s would be a large enough default. With RF=2 it checks only one Ruler; with RF=3 it checks two Rulers. Do you think it needs to be higher/lower?

Member

Do we need this config at all? Can't we just use a sensible hardcoded timeout?

Contributor Author

I removed the config and set the default to 100ms, since the gRPC calls are pretty lightweight.

@@ -5,6 +5,8 @@ import (
"net"
"testing"

"github.com/cortexproject/cortex/pkg/util/services"
Contributor

Please group the imports.

@@ -273,6 +281,13 @@ type MultiTenantManager interface {
// | +-----------------+ |
// | |
// +---------------------------------------------------------------+

type ruler interface {
Contributor

Is this necessary?

Contributor Author

Please see #6129 (comment)

@@ -488,11 +517,11 @@ func tokenForGroup(g *rulespb.RuleGroupDesc) uint32 {
return ringHasher.Sum32()
}

func instanceOwnsRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string, forBackup bool) (bool, error) {
func instanceOwnsRuleGroup(r ruler, rr ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string, forBackup bool) (bool, error) {
Contributor

Should we just use *Ruler as the method receiver here?

Contributor Author

The reason for introducing an interface is to narrow down what is available for the functions to consume. Existing functions such as filterRuleGroups are not methods, to avoid accidentally using the Ruler's ring directly. If I pass in the Ruler struct as-is, it will expose the ring to those functions, which could lead to accidental use of it. That is why I introduced an interface: to narrow the scope of what is available to these functions. Of course, someone in the future could just add a method to the interface that exposes the ring. Having said all that, I'm open to suggestions here.
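
For context, the pattern being described is the usual interface-narrowing trick; below is a self-contained sketch of the idea (the names here are illustrative and are not the PR's actual types).

package main

import "fmt"

// wideRuler has more capabilities than the ownership helpers should touch,
// including direct access to the ring (represented here by a plain slice).
type wideRuler struct {
	ring []string
	addr string
}

// narrowRuler exposes only what the helpers need, so they cannot reach the
// ring by accident.
type narrowRuler interface {
	Addr() string
}

func (r *wideRuler) Addr() string { return r.addr }

// ownsGroup can only use the narrow view; the ring stays out of reach unless
// someone deliberately widens the interface later.
func ownsGroup(r narrowRuler, group string) bool {
	return r.Addr() != "" && group != ""
}

func main() {
	r := &wideRuler{ring: []string{"a", "b"}, addr: "ruler-1"}
	fmt.Println(ownsGroup(r, "group-1"))
}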

Contributor

Existing functions such as filterRuleGroups are not methods, to avoid accidentally using the Ruler's ring directly. If I pass in the Ruler struct as-is, it will expose the ring to those functions, which could lead to accidental use of it.

I am not sure I get the concern here. What's the worst that could go wrong if the Ruler's ring were accessed directly? instanceOwnsRuleGroup only seems to perform read operations on the ring, so it shouldn't be a big issue.

Contributor Author

I changed filterRuleGroups and other functions to methods on the Ruler and removed the interface

if r.Config().EnableHA {
for i, ruler := range rlrs.Instances {
if ruler.Addr == instanceAddr && i == 0 {
level.Debug(r.Logger()).Log("msg", "primary taking ownership", "user", g.User, "group", g.Name, "namespace", g.Namespace, "token", hash, "ruler", instanceAddr)
Contributor

I wonder if it really helps to log the token here. Does it help with debugging HA issues?

Contributor Author

It helps with troubleshooting evaluation issues. Users can filter logs by token and see what happened with a particular rule group.

Contributor

Users can filter logs by token and see what happened with a particular rule group

Are we going to document this clearly? Otherwise I don't think it is easy for a user to debug. Tbh, I don't know how I would map a particular rule group to a hash just by looking at the log.
Isn't the group and namespace already enough to locate a particular rule group?

Contributor Author

Removed

return &LivenessCheckResponse{State: int32(r.State())}, nil
}

func nonPrimaryInstanceOwnsRuleGroup(r ruler, g *rulespb.RuleGroupDesc, replicas []string, selfAddress string) bool {
Contributor

It would be great to have some comments on this function for better readability

@@ -1164,6 +1285,8 @@ func (r *Ruler) getShardedRules(ctx context.Context, userID string, rulesRequest
}
}
// Concurrently fetch rules from all rulers.
ctx, cancel := context.WithTimeout(ctx, r.cfg.ListRulesFanoutTimeout)
Contributor

This will be a behavior change for all Cortex users. Can we just use the API request's own timeout?

Contributor Author

If the client sets the timeout, then this timeout/deadline will be the same, and the overall API request will fail for the client. The desired behavior is not to fail the whole API request, but to be fault tolerant when a subset of Rulers does not respond within a certain amount of time. Ideally, r.cfg.ListRulesFanoutTimeout should be less than the API request's timeout. If we do not want to change existing behavior, we can set the default timeout to a much bigger number (max int), gate this behind the enable_ha_evaluation flag, or find a better option. Open to recommendations.

Member

+1 to Ben's comment... I think the solution here is not to set a timeout; instead, we should return as soon as we have a complete answer (quorum?).

Contributor Author

Working on this

Contributor

I'm curious: is this line change related to minimizing missed evaluations, or is it unrelated? If it's unrelated and you are solving another problem, I would rather have this in a different PR, just because it can introduce regressions.

Contributor Author

With Ruler HA, we can return as soon as we have a complete answer, but a new integration test I wrote was failing. The test fails because the IP of the ruler that gets killed is not removed from the Docker network fast enough during the test, which causes the gRPC call to that ruler to hang and the test to time out. This scenario can happen in non-test environments as well.

Contributor Author

+1 to Ben's comment... I think the solution here is not to set a timeout; instead, we should return as soon as we have a complete answer (quorum?)

When Ruler API HA is enabled and rules are backed up, the backed-up rules do not contain any state. In other words, they do not contain any alerts. Getting a complete answer means we have to get rules from 100% of the Rulers.

As an example: 3 AZs, RF=3, 3 rulers per AZ (9 total). Assume 1 ruler in AZ 2 is unhealthy. What we want is results from the remaining 8 rulers. Similarly, if we have 2 bad rulers, we need results from 7. In this scenario we can tolerate 2 AZs being down (RF=3), but if we bail as soon as results from 1 AZ are retrieved, the results will contain a lot of backup rules, which do not contain alerts.

I moved the timeout inside concurrency.ForEach so that each Rules() fan-out call gets its own timeout, rather than the timeout applying to the whole getShardedRules call.

Let me know if this is acceptable, or if there are other ideas to handle this scenario.
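
Roughly, the per-call timeout shape being described looks like the following. This is a self-contained sketch, not the PR's actual getShardedRules code; fetchRules stands in for the per-ruler Rules() gRPC call.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// fetchRules stands in for the Rules() call to one ruler (illustrative).
func fetchRules(ctx context.Context, addr string) error {
	select {
	case <-time.After(50 * time.Millisecond): // simulated response time
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	addrs := []string{"ruler-1", "ruler-2", "ruler-3"}
	perCallTimeout := 2 * time.Minute

	var wg sync.WaitGroup
	errs := make(chan error, len(addrs))
	for _, addr := range addrs {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			// Each fan-out call gets its own deadline, so one unreachable
			// ruler can only delay the overall response by perCallTimeout
			// instead of hanging the whole API request.
			callCtx, cancel := context.WithTimeout(context.Background(), perCallTimeout)
			defer cancel()
			if err := fetchRules(callCtx, addr); err != nil {
				errs <- fmt.Errorf("%s: %w", addr, err)
				return
			}
			errs <- nil
		}(addr)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			fmt.Println("tolerating failure:", err) // partial results are acceptable
		}
	}
	fmt.Println("merged rules from the rulers that responded")
}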

@@ -518,6 +563,68 @@ func instanceOwnsRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRu
return ownsRuleGroup, nil
}

func (r *Ruler) LivenessCheck(_ context.Context, request *LivenessCheckRequest) (*LivenessCheckResponse, error) {
level.Debug(r.logger).Log("msg", "liveness check request", "request received from", request.RulerAddress)
Contributor

It seems RulerAddress is only used here for debug logging purposes. Can we remove it?

Contributor Author

The whole line is a debug log, and it is not useful without the ruler address. If someone wants to troubleshoot, having the address of the ruler that performed the liveness check would be useful, I think.

Member

Hmm, can't we log it on the client side? Wouldn't that have the same effect?

Contributor Author

Removed

func (r *Ruler) LivenessCheck(_ context.Context, request *LivenessCheckRequest) (*LivenessCheckResponse, error) {
level.Debug(r.logger).Log("msg", "liveness check request", "request received from", request.RulerAddress)
if r.lifecycler.ServiceContext().Err() != nil || r.subservices.IsStopped() {
return nil, errors.New("ruler's context is canceled and might be stopping soon")
Contributor

I am unsure about this, tbh. When the r.lifecycler.ServiceContext().Err() != nil || r.subservices.IsStopped() condition is met, is the gRPC server still working correctly enough to handle the liveness check?

Contributor Author

Depending on how the Ruler is configured, r.lifecycler.ServiceContext() can be canceled while the gRPC server stays active for a long time. When r.lifecycler.ServiceContext() is canceled, Prometheus will stop evaluating rule groups as soon as possible, but if the Ruler instance has not shut down yet, it will respond to the secondary as though it is active, and that causes missed evaluations. This check is necessary.

r.subservices.IsStopped() is debatable. I only saw it in my testing in some cases during hard restarts (kill -9), when the gRPC server was already running but the ring was not fully operational.
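
As a self-contained illustration of the point above (stand-in types, not the actual Ruler code): once the context that drives evaluation is canceled, the liveness endpoint should stop reporting the replica as alive, even though the process and its gRPC server may linger.

package main

import (
	"context"
	"errors"
	"fmt"
)

// livenessHandler mimics a server whose RPC endpoint can outlive the service
// context that drives rule evaluation.
type livenessHandler struct {
	serviceCtx context.Context // canceled when evaluation is being torn down
}

// Check refuses to report "alive" once evaluation is shutting down, so a
// secondary ruler can take over instead of waiting for the process to exit.
func (h *livenessHandler) Check() error {
	if h.serviceCtx.Err() != nil {
		return errors.New("ruler's context is canceled and might be stopping soon")
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	h := &livenessHandler{serviceCtx: ctx}
	fmt.Println("before shutdown:", h.Check()) // <nil>: treated as alive
	cancel()                                   // evaluation stopping; process may still be up
	fmt.Println("during shutdown:", h.Check()) // error: secondary should take over
}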

@rajagopalanand rajagopalanand force-pushed the ruler-ha-sync branch 2 times, most recently from d7b5962 to 06d61ca on August 5, 2024 at 17:31
Comment on lines 591 to 604
err := concurrency.ForEach(ctx, jobs, len(jobs), func(ctx context.Context, job interface{}) error {
addr := job.(string)
rulerClient, err := r.GetClientFor(addr)
if err != nil {
errorChan <- err
level.Debug(r.Logger()).Log("msg", "unable to get client for ruler", "ruler addr", addr, "token", rgToken)
return nil
}
level.Debug(r.Logger()).Log("msg", "performing liveness check against", "addr", addr, "for", g.Name, "token", rgToken, "instance addr", selfAddress)

resp, err := rulerClient.LivenessCheck(ctx, &LivenessCheckRequest{
RulerAddress: selfAddress,
})
if err != nil {
errorChan <- err
level.Debug(r.Logger()).Log("msg", "liveness check failed", "addr", addr, "for", g.Name, "err", err.Error(), "token", rgToken)
return nil
}
level.Debug(r.Logger()).Log("msg", "liveness check succeeded ", "addr", addr, "for", g.Name, "token", rgToken, "ruler state", services.State(resp.GetState()))
responseChan <- resp
return nil
})

close(errorChan)
close(responseChan)
Member

If one ruler is taking a long time to reply to the liveness check, wouldn't that make this whole call hang?

Member

Discussed offline... I just think the timeout could be hardcoded rather than a config option.

Contributor Author

Set the timeout to 100ms and removed the config

@rajagopalanand rajagopalanand force-pushed the ruler-ha-sync branch 2 times, most recently from 3f8f286 to 9305f6b on August 7, 2024 at 18:23
cfg.RingCheckPeriod = 5 * time.Second
cfg.LivenessCheckTimeout = 100 * time.Millisecond
Member

Can this just be a const? It doesn't need to be in the config struct, I guess.

@@ -220,7 +225,12 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
f.BoolVar(&cfg.EnableQueryStats, "ruler.query-stats-enabled", false, "Report the wall time for ruler queries to complete as a per user metric and as an info level log message.")
f.BoolVar(&cfg.DisableRuleGroupLabel, "ruler.disable-rule-group-label", false, "Disable the rule_group label on exported metrics")

f.BoolVar(&cfg.EnableHAEvaluation, "ruler.enable-ha-evaluation", false, "Enable high availability")
f.DurationVar(&cfg.ListRulesFanoutTimeout, "ruler.list-rules-fanout-timeout", 2*time.Minute, "Timeout for fanout calls to other rulers")
f.DurationVar(&cfg.LivenessCheckTimeout, "ruler.liveness-check-timeout", 1*time.Second, "Timeout for liveness checks performed during rule sync")
Member

and this should be removed as well?

expectedNames[i] = ruleName

if num%2 == 0 {
alertCount++
Contributor

is this variable used anywhere?

Contributor Author

Removed

c, err := e2ecortex.NewClient("", "", "", ruler1.HTTPEndpoint(), "user-1")
require.NoError(t, err)
namespaceNames := []string{"test1", "test2", "test3", "test4", "test5"}
namespaceNameCount := make([]int, 5)
Contributor

Total nit: in case you want to add more users in the tests.

Suggested change
namespaceNameCount := make([]int, 5)
namespaceNameCount := make([]int, len(namespaceNames))


results, err := c.GetPrometheusRules(e2ecortex.RuleFilter{})
require.NoError(t, err)
require.Equal(t, numRulesGroups, len(results))
Contributor

I'm a bit confused about how this test validates HA. Should we validate that the total number of evaluations across all rules is what is expected?

Contributor Author

I updated the test to assert that the rules are evaluated by the remaining rulers

level.Debug(r.Logger()).Log("msg", "performing liveness check against", "addr", addr, "for", g.Name, "instance addr", selfAddress)

resp, err := rulerClient.LivenessCheck(ctx, &LivenessCheckRequest{
RulerAddress: selfAddress,
Contributor

Any reason why we are passing the address of the current ruler as a parameter here? I checked, and it is not being used.

Contributor Author

It's not being used now, but I was using it for troubleshooting/logging on the server side. It could still be useful for troubleshooting.

Contributor

If it is supposed to be used for troubleshooting, we can add a debug log statement in the handler, no? Leaving it in this state makes me think of YAGNI.

Member

I also think we can remove it for now and add it back when needed.

} else {
// Even if the replication factor is set to a number bigger than 1, only the first ruler evaluates the rule group
ownsRuleGroup = rlrs.Instances[0].Addr == instanceAddr
}
}

if ownsRuleGroup && ruleGroupDisabled(g, disabledRuleGroups) {
Contributor

nit: we could create a function that produces the return value for the ownsRuleGroup = true case.

That way we can return directly with early returns, instead of this nesting, which is confusing and hard to reason about.

func (r *Ruler) instanceOwnsRuleGroup(rr ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string, forBackup bool) (bool, error) {
	hash := tokenForGroup(g)

	rlrs, err := rr.Get(hash, RingOp, nil, nil, nil)
	if err != nil {
		return false, errors.Wrap(err, "error reading ring to verify rule group ownership")
	}

	if forBackup {
		// Only the second up to the last replica are used as backup.
		for i := 1; i < len(rlrs.Instances); i++ {
			if rlrs.Instances[i].Addr == instanceAddr {
				return ownsRuleGroupOrDisable(g, disabledRuleGroups)
			}
		}
	}
	if rlrs.Instances[0].Addr == instanceAddr {
		// Regardless of whether ruler HA is enabled, in this case this ruler is the primary.
		return ownsRuleGroupOrDisable(g, disabledRuleGroups)
	}
	if r.Config().EnableHAEvaluation {
		for i, ruler := range rlrs.Instances {
			if i == 0 {
				// We already checked the primary above.
				continue
			}
			if ruler.Addr == instanceAddr && r.nonPrimaryInstanceOwnsRuleGroup(g, rlrs.GetAddresses()[:i], instanceAddr) {
				level.Info(r.Logger()).Log("msg", "non-primary ruler taking ownership", "user", g.User, "group", g.Name, "namespace", g.Namespace, "ruler", instanceAddr)
				return ownsRuleGroupOrDisable(g, disabledRuleGroups)
			}
		}
	}
	// Means for sure that this instance is not the owner.
	return false, nil
}

func ownsRuleGroupOrDisable(g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups) (bool, error) {
	if ruleGroupDisabled(g, disabledRuleGroups) {
		return false, &DisabledRuleGroupErr{Message: fmt.Sprintf("rule group %s, namespace %s, user %s is disabled", g.Name, g.Namespace, g.User)}
	}
	return true, nil
}


Contributor Author

done

@yeya24 yeya24 mentioned this pull request Aug 9, 2024
@rajagopalanand rajagopalanand force-pushed the ruler-ha-sync branch 3 times, most recently from db7034d to 20b6783 on August 12, 2024 at 17:17
@rajagopalanand rajagopalanand marked this pull request as draft August 12, 2024 19:32
rulerClient, err := r.GetClientFor(addr)
if err != nil {
errorChan <- err
level.Debug(r.Logger()).Log("msg", "unable to get client for ruler", "ruler addr", addr)
Member

Should the level here be higher than debug?

Comment on lines +695 to +694
if ctx.Err() != nil {
level.Info(r.logger).Log("msg", "context is canceled. not syncing rules")
return
}
Member

Why did we need to introduce this?

Contributor Author

When the ruler is shutting down (pod terminating), syncs still happen, because the manager waits for rule groups to finish their in-flight evaluations. I think it's better not to sync rule groups while the ruler is shutting down.

Member

It just seems weird that you check it here, since this context can be cancelled anywhere in the code after this line as well, so what you're trying to prevent can still happen anyway?

// but only ring passed as parameter.
func filterBackupRuleGroups(userID string, ruleGroups []*rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, ring ring.ReadRing, instanceAddr string, log log.Logger, ringCheckErrors prometheus.Counter) []*rulespb.RuleGroupDesc {
// This method must not use r.ring, but only ring passed as parameter
func (r *Ruler) filterBackupRuleGroups(userID string, ruleGroups []*rulespb.RuleGroupDesc, owned []*rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, ring ring.ReadRing, instanceAddr string, log log.Logger, ringCheckErrors prometheus.Counter) []*rulespb.RuleGroupDesc {
Member

It seems like lots of parameters can be removed from this function now that it is a method on the Ruler struct?

@@ -1189,6 +1300,8 @@ func (r *Ruler) getShardedRules(ctx context.Context, userID string, rulesRequest
return errors.Wrapf(err, "unable to get client for ruler %s", addr)
}

ctx, cancel := context.WithTimeout(ctx, r.cfg.ListRulesFanoutTimeout)
Member

I think this change is not related to this PR?

Can we remove it from here and create another PR where we can discuss possible solutions for it?

Contributor Author

This is related, as we discussed offline. I changed it to remote_timeout under the ruler_client section.

@alanprot
Member

Approved... other than changing the ruler timeout config to be under the ruler_client section, it LGTM

@rajagopalanand
Contributor Author

Approved... other than changing the ruler timeout config to be under the ruler_client section, it LGTM

Pushed up a new commit. Please take a look

@@ -4142,6 +4142,10 @@ ruler_client:
# CLI flag: -ruler.client.tls-insecure-skip-verify
[tls_insecure_skip_verify: <boolean> | default = false]

# Timeout for downstream rulers.
Contributor

I think we need a doc describing Ruler HA: how to enable it and what the various configs/flags do.
We added 3 new flags in this PR, but I am not sure whether I need to tune/change them or whether they work out of the box.

Contributor Author

I agree. I will submit another PR for the doc

@yeya24
Contributor

yeya24 commented Sep 3, 2024

Note that #5862 needs to be merged before this PR. Without enough approvals on the proposal itself, I don't think we can merge the actual implementation.

Contributor

@yeya24 yeya24 left a comment

Thanks

@yeya24 yeya24 merged commit 1c9c53b into cortexproject:master Sep 4, 2024
16 checks passed
Contributor

@yeya24 yeya24 left a comment

Umm, I think we forgot to mark this as experimental. We can do it in the next PR.


5 participants