Stateless Rule: Rollout - Alert For State is lost across restart - Blinking alerts #5219

Closed
ahurtaud opened this issue Mar 7, 2022 · 8 comments

Comments

@ahurtaud
Contributor

ahurtaud commented Mar 7, 2022

Thanos, Prometheus and Golang version used:
thanos v0.25

What happened:

I implemented the stateless mode for the rule component.
When my rulers (Kubernetes Deployments) roll out, every alert that was already firing before the rollout and has a long for clause (let's say 30 minutes) goes back to Pending for that amount of time and only fires again after 30 minutes.

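And because of our Alertmanager config, alerts are resolved after 5 minutes, so every rollout makes them blink: resolved first, then firing again once the for clause has elapsed.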
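
To make the timeline concrete, here is a minimal sketch using the numbers from this report (a 30-minute for clause, resolution after 5 minutes). The function name is hypothetical, and it assumes the last notification was sent right at the restart, which is a simplification.

```python
from datetime import datetime, timedelta

FOR_DURATION = timedelta(minutes=30)    # the rule's `for` clause (from the report)
RESOLVE_AFTER = timedelta(minutes=5)    # Alertmanager resolves after this long (from the report)


def blink_window(restart: datetime):
    """Return (resolved_at, refiring_at) for an alert that was firing before `restart`.

    Assumes the ruler loses its `for` state on restart, so the alert drops back
    to Pending and is not sent to Alertmanager again until `for` has elapsed.
    """
    resolved_at = restart + RESOLVE_AFTER
    refiring_at = restart + FOR_DURATION
    return resolved_at, refiring_at


restart = datetime(2022, 3, 7, 12, 0)
resolved_at, refiring_at = blink_window(restart)
# Gap during which a still-valid alert is shown as resolved.
print(resolved_at.time(), refiring_at.time(), refiring_at - resolved_at)
# prints: 12:05:00 12:30:00 0:25:00
```

Under these assumptions, each rollout leaves a 25-minute window in which a still-valid alert appears resolved.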

What you expected to happen:

I would like the alerts to keep their state for the for clause, maybe with the ALERTS_FOR_STATE metrics sent to the receivers?

Is this a valid issue? Is there anything I misunderstood? Do others have the same issue?

Thanks

@yeya24
Contributor

yeya24 commented Mar 7, 2022

If I understand correctly, Prometheus persists state across restarts by checking the ALERTS_FOR_STATE time series, as described in https://ganeshvernekar.com/blog/google-summer-of-code-2018-with-prometheus#1-persist-for-state-of-alerts-1.
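
As a rough illustration of that mechanism, here is a simplified sketch. It relies on the fact that the value of an ALERTS_FOR_STATE sample is the Unix timestamp at which the alert became active. The function name is hypothetical, and the sketch ignores the outage-tolerance and grace-period handling that Prometheus applies on top of this.

```python
from typing import Optional, Tuple


def restore(for_state_value: Optional[float], now: float, for_seconds: float) -> Tuple[float, str]:
    """Return (active_at, state) for an alert whose expression is true at `now`.

    `for_state_value` is the last ALERTS_FOR_STATE sample value for this alert
    (a Unix timestamp), or None when no stored state could be found.
    """
    # Without stored state, the `for` timer starts over from the restart.
    active_at = now if for_state_value is None else for_state_value
    state = "firing" if now - active_at >= for_seconds else "pending"
    return active_at, state


# Alert became active at t=1000, ruler restarts at t=4000, `for` is 30 minutes.
print(restore(1000.0, 4000.0, 1800.0))  # state restored
print(restore(None, 4000.0, 1800.0))    # state lost
# prints: (1000.0, 'firing')
# prints: (4000.0, 'pending')
```

The second call corresponds to the behavior reported in this issue: with no readable ALERTS_FOR_STATE series, the timer restarts.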

I need to double-check whether this mode supports this through the upstream rule manager. We also need more E2E tests to verify it.

@yeya24
Contributor

yeya24 commented Mar 8, 2022

Yeah, I can confirm this is a bug, or rather a missing feature, I would say. I am working on the fix and tests now. We need to implement a Queryable and a Querier for HTTP-based Thanos queries. I am not sure I can fix it this week.

Another option would be to just implement remote read for Thanos Querier. Then we could simply connect to it using the remote storage client.

@ahurtaud
Contributor Author

ahurtaud commented Mar 8, 2022

Yeah, I knew this would not be as easy as it looks, hence creating a GitHub issue directly.
I think it is fair to agree that ALERTS_FOR_STATE must be queryable through the same query endpoint
(meaning the stateless ruler MUST remote_write to Thanos Receive (or another backend), and that backend must be registered in the query path of the ruler).

From what I read from Ganesh, when the ruler restarts, it queries ALERTS_FOR_STATE for each alertname and applies the same state if found.
Something like this:
(Two screenshots attached: "Screen Shot 2022-03-08 at 09 56 24" and "Screen Shot 2022-03-08 at 09 56 18")

WDYT?

@yeya24
Contributor

yeya24 commented Mar 8, 2022

Yeah, I think that's correct. Right now we just need to implement the Querier interface for Thanos Querier so that we can fetch that time series from it.
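
To illustrate what fetching that series over HTTP could look like, here is a minimal sketch against the Prometheus-compatible instant-query API (/api/v1/query). It is not the actual Thanos implementation, which is written in Go. The base URL, alert name, labels, and function names are all hypothetical, and the response is canned so the sketch runs offline.

```python
import json
from urllib.parse import urlencode


def for_state_query_url(base_url: str, alertname: str) -> str:
    """Build an instant-query URL for the ALERTS_FOR_STATE series of one alert."""
    selector = f'ALERTS_FOR_STATE{{alertname="{alertname}"}}'
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': selector})}"


def parse_active_at(body: str) -> list:
    """Extract (labels, active_at_unix) pairs from an instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


# Canned response in the Prometheus HTTP API format (hypothetical alert and labels).
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{
            "metric": {"__name__": "ALERTS_FOR_STATE", "alertname": "HighLatency"},
            "value": [1646733600, "1646730000"],
        }],
    },
})

print(for_state_query_url("http://thanos-query.example:10902", "HighLatency"))
print(parse_active_at(sample_body))
```

The parsed value is the timestamp at which the alert became active, which is what a restarted ruler would need to resume the for clause instead of starting it over.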

@yeya24
Contributor

yeya24 commented Mar 13, 2022

Created prometheus/prometheus#10443 in upstream Prometheus to address one of the issues. Once that one is merged, I can open a new PR to fix this.

@stale

stale bot commented Jun 12, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 12, 2022
@ahurtaud
Contributor Author

Referencing #5230 for the implementation of the fix.

@yeya24
Contributor

yeya24 commented Nov 13, 2022

I will close this one as the feature has already been merged into main. Let us know how the feature works for you.

@yeya24 yeya24 closed this as completed Nov 13, 2022