-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: correct metrics path for MetricsEndpointProvider #236
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
DnPlas
changed the title
fix: correct metrics path
fix: correct metrics path for MetricsEndpointProvider
Feb 4, 2024
DnPlas
force-pushed
the
KF-1647-fix-alerts-firing
branch
4 times, most recently
from
February 9, 2024 12:46
631aa3c
to
322c188
Compare
orfeas-k
reviewed
Feb 13, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job Daniela, left a comment or two regarding configurability of metrics. I 'm soon deploying and confirming the fix is there!
orfeas-k
approved these changes
Feb 13, 2024
The metrics endpoint configuration had two scrape jobs, one for the regular metrics endpoint, and a second one based on a dynamic list of targets. The latter was causing the prometheus scraper to try and scrape metrics from *:80/metrics, which is not a valid endpoint. This was causing the UnitsUnavailable alert to fire constantly because that job was reporting back that the endpoint was not available. This new job was introduced by #94 with no apparent justification. Because the seldon charm has changed since that PR, and the endpoint it is configuring is not valid, this commit will remove the extra job. This commit also refactors the MetricsEndpointProvider instantiation and removes the metrics-port config option as this value should not change. Finally, this commit changes the alert rule interval from 0m to 5m, as this interval is more appropriate for production environments. Part of canonical/bundle-kubeflow#564
The test_prometheus_grafana_integration test case was doing queries to prometheus and checking the request returned successfully and that the application name and model was listed correctly. To make this test case more accurately, we can add an assertion that also checks that the unit is available, this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564 skip: fix test
DnPlas
force-pushed
the
KF-1647-fix-alerts-firing
branch
from
February 13, 2024 14:44
a192502
to
d8041a5
Compare
DnPlas
added a commit
that referenced
this pull request
Feb 13, 2024
* fix: correctly configure one scrape job to avoid firig alerts The metrics endpoint configuration had two scrape jobs, one for the regular metrics endpoint, and a second one based on a dynamic list of targets. The latter was causing the prometheus scraper to try and scrape metrics from *:80/metrics, which is not a valid endpoint. This was causing the UnitsUnavailable alert to fire constantly because that job was reporting back that the endpoint was not available. This new job was introduced by #94 with no apparent justification. Because the seldon charm has changed since that PR, and the endpoint it is configuring is not valid, this commit will remove the extra job. This commit also refactors the MetricsEndpointProvider instantiation and removes the metrics-port config option as this value should not change. Finally, this commit changes the alert rule interval from 0m to 5m, as this interval is more appropriate for production environments. Part of canonical/bundle-kubeflow#564 * tests: add an assertion for checking unit is available The test_prometheus_grafana_integration test case was doing queries to prometheus and checking the request returned successfully and that the application name and model was listed correctly. To make this test case more accurately, we can add an assertion that also checks that the unit is available, this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
DnPlas
added a commit
that referenced
this pull request
Feb 14, 2024
* fix: correctly configure one scrape job to avoid firig alerts The metrics endpoint configuration had two scrape jobs, one for the regular metrics endpoint, and a second one based on a dynamic list of targets. The latter was causing the prometheus scraper to try and scrape metrics from *:80/metrics, which is not a valid endpoint. This was causing the UnitsUnavailable alert to fire constantly because that job was reporting back that the endpoint was not available. This new job was introduced by #94 with no apparent justification. Because the seldon charm has changed since that PR, and the endpoint it is configuring is not valid, this commit will remove the extra job. This commit also refactors the MetricsEndpointProvider instantiation and removes the metrics-port config option as this value should not change. Finally, this commit changes the alert rule interval from 0m to 5m, as this interval is more appropriate for production environments. Part of canonical/bundle-kubeflow#564 * tests: add an assertion for checking unit is available The test_prometheus_grafana_integration test case was doing queries to prometheus and checking the request returned successfully and that the application name and model was listed correctly. To make this test case more accurately, we can add an assertion that also checks that the unit is available, this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
DnPlas
added a commit
that referenced
this pull request
Feb 14, 2024
* fix: correctly configure one scrape job to avoid firig alerts The metrics endpoint configuration had two scrape jobs, one for the regular metrics endpoint, and a second one based on a dynamic list of targets. The latter was causing the prometheus scraper to try and scrape metrics from *:80/metrics, which is not a valid endpoint. This was causing the UnitsUnavailable alert to fire constantly because that job was reporting back that the endpoint was not available. This new job was introduced by #94 with no apparent justification. Because the seldon charm has changed since that PR, and the endpoint it is configuring is not valid, this commit will remove the extra job. This commit also refactors the MetricsEndpointProvider instantiation and removes the metrics-port config option as this value should not change. Finally, this commit changes the alert rule interval from 0m to 5m, as this interval is more appropriate for production environments. Part of canonical/bundle-kubeflow#564 * tests: add an assertion for checking unit is available The test_prometheus_grafana_integration test case was doing queries to prometheus and checking the request returned successfully and that the application name and model was listed correctly. To make this test case more accurately, we can add an assertion that also checks that the unit is available, this way we avoid issues like the one described in canonical/bundle-kubeflow#564. Part of canonical/bundle-kubeflow#564
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
fix: correctly configure one scrape job to avoid firig alerts
The metrics endpoint configuration had two scrape jobs, one for the
regular metrics endpoint, and a second one based on a dynamic list of
targets. The latter was causing the prometheus scraper to try and scrape
metrics from *:80/metrics, which is not a valid endpoint. This was
causing the UnitsUnavailable alert to fire constantly because that job
was reporting back that the endpoint was not available.
This new job was introduced by #94
with no apparent justification. Because the seldon charm has changed
since that PR, and the endpoint it is configuring is not valid, this
commit will remove the extra job.
This commit also refactors the MetricsEndpointProvider instantiation and
removes the metrics-port config option as this value should not change.
Finally, this commit changes the alert rule interval from 0m to 5m, as
this interval is more appropriate for production environments.
Part of canonical/bundle-kubeflow#564
Testing
Please refer to the steps to reproduce in this comment, just deploying this app. After deploying this app and cos-lite, relations and dependencies, and waiting a couple minutes (10 min) none of the alerts should fire (for this app only).
TODO
test case test_prometheus_grafana_integration
track/ckf-1.8