Ruler: In stateless mode, ALERTS and ALERTS_FOR_STATE metrics are not being written #5431

Closed
matej-g opened this issue Jun 21, 2022 · 12 comments

Comments

@matej-g
Collaborator

matej-g commented Jun 21, 2022

Thanos, Prometheus and Golang version used:
Thanos - latest main (93e7ced)

What happened:
As part of testing Thanos for Prometheus Alert Generator Compliance (PR #5315), I realized that the test suite is failing when running the ruler in stateless mode. I traced this to a particular rule which contains the ALERTS metric in its expression. According to the rules web UI, this rule never seemed to get into the pending / firing state, although it was supposed to (in contrast, with the TSDB-based ruler this works as expected). This is despite the fact that when I queried for the same expression in the querier, I could see correct results, and the ALERTS metric was available for querying. I assume the rule is never correctly evaluated.

What you expected to happen:
I expect all alerts to be evaluated correctly even in stateless mode.

How to reproduce it (as minimally and precisely as possible):
You can fire up the compliance test locally (see PR #5315, or if it has been merged, look for the test in the codebase).

The setup consists of:

  • an alert compliance tester (you can think of it basically as the Alertmanager in this setup)
  • an ingesting receiver to which metrics are being written
  • a querier talking to the receiver and the stateless ruler
  • a stateless ruler talking to the querier and remote writing to the receiver (remote write config sketched below)

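The stateless ruler is expected to ship its evaluation results (including the ALERTS and ALERTS_FOR_STATE series) to the receiver through an ordinary Prometheus remote_write configuration. A minimal sketch, with a hypothetical receiver hostname and port (the actual endpoints are defined by the compliance test setup):

```yaml
# Passed to `thanos rule` via --remote-write.config / --remote-write.config-file.
remote_write:
  - url: "http://receive:19291/api/v1/receive"  # Thanos Receive remote write endpoint (hypothetical host)
```
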
Full logs of relevant components:
Even with debug log level, I did not see any relevant output in the ruler.

Anything else we need to know:
A few more details are included in this comment: #5315 (comment)

@matej-g
Collaborator Author

matej-g commented Jun 21, 2022

I think a (potentially) important detail I left out is that the mentioned rule using the ALERTS metric depends on alert series produced by the rules above it in the same group.

@michael-burt

ALERTS and ALERTS_FOR_STATE series are not generated by the stateless Ruler. I am curious how you are able to query the ALERTS series via the Querier if that series is not being produced by the stateless Ruler.

I think this PR may be related: #5230

@yeya24 do you know if PR #5230 will cause the stateless Ruler to produce the ALERTS and ALERTS_FOR_STATE series?

@yeya24
Contributor

yeya24 commented Jun 22, 2022

ALERTS and ALERTS_FOR_STATE series are not generated by the stateless Ruler. I am curious how you are able to query the ALERTS series via the Querier if that series is not being produced by the stateless Ruler.

ALERTS and ALERTS_FOR_STATE are remote written to the Thanos receivers by the stateless ruler. Since the querier points to the receivers directly, I think it is able to get the series, but some delay might apply due to the stateless ruler's remote write queue settings.
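
For reference, that delay is governed by the standard Prometheus remote write queue parameters, which the stateless ruler accepts as part of its remote write config. A hedged sketch with illustrative values (building on the hypothetical endpoint above):

```yaml
remote_write:
  - url: "http://receive:19291/api/v1/receive"
    queue_config:
      capacity: 2500              # samples buffered per shard
      max_samples_per_send: 500   # batch size per remote write request
      batch_send_deadline: 5s     # max wait before flushing an incomplete batch
```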

@yeya24 do you know if PR #5230 will cause the stateless Ruler to produce the ALERTS and ALERTS_FOR_STATE series?

I don't think it is related, but it may depend on the alert compliance test framework. #5230 will help the stateless ruler restore the ALERTS and ALERTS_FOR_STATE information from remote storage when it starts. Since the compliance test begins from a fresh state, there is nothing for the ruler to restore.
I haven't checked the details of the compliance test though; maybe it does something I'm not aware of.

@michael-burt

Thank you for the clarification @yeya24

@matej-g
Collaborator Author

matej-g commented Jun 23, 2022

I think even if the alert compliance tester is taken out of the equation, the fact remains that I cannot see those alerts being correctly evaluated in the ruler, even though the metrics are there. So unless this is user error and I'm doing something wrong during the test, I do believe something is not working properly in stateless mode. That would also mean this should not be a queuing issue (that would make sense to me if the alerts had the correct state in the ruler but were not reaching the alert compliance tester - unless I'm missing the point and we're talking about some other queue than the alert notification queue).

@matej-g
Collaborator Author

matej-g commented Jun 24, 2022

My apologies, I was mistaken: after running and validating a few more tests, I found that I cannot query the ALERTS metric even in the querier. The problem is with the fanout queryable. I finally noticed this in the ruler logs (for each rule group):

level=error ts=2022-06-24T17:33:13.030034627Z caller=manager.go:721 component=rules group=PendingAndFiringAndResolved msg="Failed to get Querier" err="unsupported operation with WAL-only storage"

Because we're specifying the agent DB as the primary queryable in the fanout storage, calling Querier() on it always results in an error, since querying is not supported in agent mode. This in turn means the alerts are never restored, so ALERTS and ALERTS_FOR_STATE are never appended, because this condition is never true.
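
For reference, this is roughly the code path in question (a paraphrased sketch of the Prometheus rules manager around v2.36, not verbatim code; names may differ between versions):

```go
// Paraphrased sketch of (*Group).RestoreForState in Prometheus' rules/manager.go.
// g.opts.Queryable is the fanout storage the ruler passes in; in Thanos'
// stateless mode its primary is the WAL-only agent DB.
func (g *Group) RestoreForState(ts time.Time) {
	mint := ts.Add(-g.opts.OutageTolerance)

	q, err := g.opts.Queryable.Querier(g.opts.Context, timestamp.FromTime(mint), timestamp.FromTime(ts))
	if err != nil {
		// With the agent DB as the fanout's primary this always fails with
		// "unsupported operation with WAL-only storage", so the whole group
		// returns here and none of its alerting rules are ever marked restored.
		level.Error(g.logger).Log("msg", "Failed to get Querier", "err", err)
		return
	}
	defer q.Close()

	// ... on success: query ALERTS_FOR_STATE to recover the alerts' ActiveAt
	// timestamps, then call SetRestored(true) on each alerting rule.
}
```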

I assume we could then remove the agent DB from the fanout queryable, but it looks like @yeya24 has other plans in #5230 which also seem to fix this (I'd need to review it more closely). Either way, it would be nice to add some test cases for ALERTS.

@yeya24
Contributor

yeya24 commented Jun 24, 2022

Because we're specifying agent DB as the primary queryable in the fanout storage, calling Querier() on it always results in error because it's not supported for agent. This in turn causes the alerts never to be restored, so the ALERTS and ALERTS_FOR_STATE are actually never appended, because this condition is never true.

I still think that's a separate thing, as this error happens only in the restore phase. It only happens at the first evaluation and has nothing to do with the compliance test unless the test requires alert state restoration.
For the query part during alert evaluation, we are not using the agent DB at all; we are using a custom query function that queries Thanos queriers.
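
For anyone following along, that custom query function has the same shape as Prometheus' rules.QueryFunc: evaluate an expression at a timestamp and return an instant vector. A minimal sketch of the idea using the plain client_golang API against a hypothetical querier address (the real implementation lives in cmd/thanos/rule.go, returns promql.Vector rather than model.Vector, and additionally handles partial responses and fail-over between query endpoints):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// newQueryFunc builds a query function that evaluates rule expressions against
// a remote Prometheus-compatible query API (e.g. a Thanos querier) instead of
// local storage.
func newQueryFunc(queryAddress string) (func(ctx context.Context, q string, t time.Time) (model.Vector, error), error) {
	client, err := api.NewClient(api.Config{Address: queryAddress})
	if err != nil {
		return nil, err
	}
	promAPI := v1.NewAPI(client)

	return func(ctx context.Context, q string, t time.Time) (model.Vector, error) {
		res, _, err := promAPI.Query(ctx, q, t)
		if err != nil {
			return nil, err
		}
		vec, ok := res.(model.Vector)
		if !ok {
			return nil, fmt.Errorf("rule expression %q did not evaluate to an instant vector", q)
		}
		return vec, nil
	}, nil
}

func main() {
	qf, err := newQueryFunc("http://querier:10902") // hypothetical querier address
	if err != nil {
		panic(err)
	}
	vec, err := qf(context.Background(), `ALERTS{alertstate="firing"}`, time.Now())
	fmt.Println(vec, err)
}
```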

Writing the ALERTS and ALERTS_FOR_STATE metrics to the receivers should be fine, as remote write will take care of it. If we find that these metrics are not written to the receivers, then it might be a bug related to agent mode or remote write, but still not related to the restoration.

@matej-g
Collaborator Author

matej-g commented Jun 27, 2022

We can leave aside the compliance test and focus on why the ALERTS and ALERTS_FOR_STATE metrics are not being written (which causes the test to fail, since one of the alerting rules in the test depends on the ALERTS time series).

If I'm reading the Prometheus rules manager code correctly:
Even though this error happens in the restore phase, it prevents the alert metrics from being written. The custom query function is used for rule evaluation, but RestoreForState for the rule group uses the fanout storage as its queryable. Because we hit the error when getting the querier (the agent DB does not support querying), we never reach the code that marks the rule as restored. So during evaluation the rule never has restored == true, meaning the alert metrics are never written, because that condition can never be true.

Does that make sense?
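
For reference, the gating described above looks roughly like this in the Prometheus rules package (paraphrased from rules/alerting.go around v2.36; exact names may differ between versions):

```go
// Paraphrased sketch from AlertingRule.Eval: the ALERTS / ALERTS_FOR_STATE
// samples are only added to the evaluation result once the rule has been
// marked as restored.
for _, a := range r.active {
	// ... pending/firing bookkeeping elided ...

	if r.restored {
		// r.sample produces the ALERTS{alertstate="..."} series and
		// r.forStateSample produces ALERTS_FOR_STATE.
		vec = append(vec, r.sample(a, ts))
		vec = append(vec, r.forStateSample(a, ts, float64(a.ActiveAt.Unix())))
	}
}
// If RestoreForState bailed out earlier, SetRestored(true) was never called,
// r.restored stays false, and neither series is ever appended or remote written.
```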

@yeya24
Contributor

yeya24 commented Jun 27, 2022

Thanks for the investigation, you are right. I didn't notice that an unrestored rule prevents appending new alert and alert state samples.
But I am also wondering why it is designed this way; the write path shouldn't depend on how alerts are restored.
Anyway, I'll try to finish that PR ASAP.

@matej-g matej-g changed the title Ruler: In stateless mode, alerting rules with ALERTS in the expression are not evaluated correctly Ruler: In stateless mode, ALERTS and ALERTS_FOR_STATE metrics are not being written Jun 27, 2022
@ahurtaud
Contributor

ahurtaud commented Aug 24, 2022

Duplicate of #5219? Or at least the same topic / same fix needed :)

@stale

stale bot commented Nov 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@yeya24
Contributor

yeya24 commented Nov 13, 2022

I will close this one as the feature was merged into main already

@yeya24 yeya24 closed this as completed Nov 13, 2022