test_seldon_alert_rules test case is failing, potential race condition #244

Closed
DnPlas opened this issue Feb 23, 2024 · 2 comments · Fixed by #243
Labels
bug Something isn't working

Comments

@DnPlas (Contributor) commented Feb 23, 2024

Bug Description

The test case is failing with the following message:

   File "/home/runner/work/seldon-core-operator/seldon-core-operator/tests/integration/test_charm.py", line 204, in test_seldon_alert_rules
    assert up_query_response["data"]["result"][0]["value"][1] == "1"
IndexError: list index out of range

which means that `up_query_response` either contains no results (the `result` list is empty) or is missing data/values entirely.
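
For reference, a successful instant query against Prometheus' /api/v1/query endpoint returns a vector result shaped roughly like the sketch below (the label values are illustrative), which is the structure the assertion indexes into:

```python
# Illustrative shape of a successful /api/v1/query response for the `up` metric;
# the label values below are made up, but the nesting is what the test indexes.
up_query_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "juju_application": "seldon-controller-manager"},
                "value": [1708646400.0, "1"],  # [<unix timestamp>, <sample value as string>]
            }
        ],
    },
}

# The failing assertion walks this structure; if prometheus has not scraped the
# target yet, "result" is an empty list and result[0] raises IndexError.
assert up_query_response["data"]["result"][0]["value"][1] == "1"
```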

This issue started happening after 1d1a6f5 introduced a new assertion to ensure the up metric is not firing any alerts.

This issue affects both main and track/1.17.

To Reproduce

I was only able to reproduce it in the CI

Environment

on_push CI

Relevant Log Output

Latest CI run

DnPlas added the bug (Something isn't working) label Feb 23, 2024
Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5374.

This message was autogenerated

@DnPlas (Contributor, Author) commented Feb 23, 2024

There is a potential race condition between the moment the test case runs the assertion and the moment the prometheus charm has actually scraped metrics from seldon-controller-manager. In other charms we have added retry logic to give prometheus time to scrape the metrics and make them available at its query endpoint. #243 attempts to fix this issue by adding such a retry, as sketched below.
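
For illustration, a minimal sketch of that retry, assuming the tenacity library and a hypothetical fetch_up_query helper standing in for however the test queries prometheus (the actual change is in #243):

```python
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(stop=stop_after_attempt(10), wait=wait_exponential(multiplier=1, max=30))
def assert_up_metric_is_one():
    # fetch_up_query() is a hypothetical placeholder for however the test queries
    # prometheus for the `up` metric of the seldon-controller-manager unit.
    up_query_response = fetch_up_query()
    # Until prometheus has scraped the unit, "result" is empty and indexing it
    # raises; tenacity then waits and retries instead of failing the test outright.
    assert up_query_response["data"]["result"][0]["value"][1] == "1"
```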

DnPlas added a commit that referenced this issue Feb 23, 2024
* tests: add a retry when asserting the up metric

Adding a retry when checking the state of an alert gives prometheus-k8s time to scrape
the necessary metrics for a unit; without it we may run into a race condition where the
assertion on the metric runs before prometheus has even been able to scrape.
This commit adds retry logic to avoid this.

Fixes #244