fix: correct metrics path for MetricsEndpointProvider #236

DnPlas · 2024-02-04T03:13:11Z

fix: correctly configure one scrape job to avoid firing alerts

The metrics endpoint configuration had two scrape jobs: one for the
regular metrics endpoint and a second one based on a dynamic list of
targets. The latter caused the Prometheus scraper to try to scrape
metrics from *:80/metrics, which is not a valid endpoint. As a result,
the UnitsUnavailable alert fired constantly, because that job kept
reporting that the endpoint was not available.
The second job was introduced by #94 with no apparent justification.
Because the seldon charm has changed since that PR, and the endpoint the
job configures is not valid, this commit removes the extra job.

This commit also refactors the MetricsEndpointProvider instantiation and
removes the metrics-port config option as this value should not change.
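
A minimal sketch of what a single static scrape job could look like once the extra job is removed, using the job-dictionary shape of a Prometheus scrape config. The port value 8080, the constant names, and the `build_scrape_jobs` helper are assumptions for illustration and are not taken from this PR.

```python
# Hypothetical constants: the port is fixed in code instead of being read
# from a metrics-port config option. 8080 is an assumed value.
METRICS_PATH = "/metrics"
METRICS_PORT = 8080


def build_scrape_jobs(port: int = METRICS_PORT, path: str = METRICS_PATH) -> list:
    """Return exactly one scrape job pointing at the metrics endpoint."""
    return [
        {
            "metrics_path": path,
            # A single static target; no second job with a dynamic target list.
            "static_configs": [{"targets": [f"*:{port}"]}],
        }
    ]


jobs = build_scrape_jobs()
```

A list like this would presumably be passed as the `jobs` argument when MetricsEndpointProvider is instantiated in src/charm.py.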

Finally, this commit changes the alert rule interval from 0m to 5m, as
this interval is more appropriate for production environments.
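
To illustrate why 5m is less noisy than 0m, here is a simplified simulation, assuming the "interval" refers to the rule's `for` duration (how long the condition must hold before the alert fires). The `fires` function and the sample data are hypothetical, and the model ignores Prometheus evaluation-interval details.

```python
from typing import List, Tuple


def fires(samples: List[Tuple[int, float]], for_seconds: int) -> bool:
    """Return True if `up < 1` held continuously for at least `for_seconds`.

    `samples` is a time-ordered list of (timestamp_seconds, up_value).
    """
    pending_since = None
    for ts, up in samples:
        if up < 1:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= for_seconds:
                return True
        else:
            pending_since = None  # condition cleared, reset the timer
    return False


# A single failed scrape between healthy ones (hypothetical data).
blip = [(0, 1), (60, 0), (120, 1), (180, 1)]
# The unit stays down for five minutes (hypothetical data).
outage = [(t, 0) for t in range(0, 301, 60)]

blip_fires_at_0m = fires(blip, 0)
blip_fires_at_5m = fires(blip, 300)
outage_fires_at_5m = fires(outage, 300)
```

In this model a 0m duration fires on the first failed scrape, while 5m only fires when the unit stays unavailable for the whole window.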

Part of canonical/bundle-kubeflow#564

Testing

Please follow the steps to reproduce in this comment, deploying only this app. After deploying this app together with cos-lite and the required relations and dependencies, wait about 10 minutes; none of the alerts for this app should fire.

TODO

refactor integration test case test_prometheus_grafana_integration
create a backport PR for track/ckf-1.8

orfeas-k

Good job Daniela, I left a comment or two regarding the configurability of metrics. I'll deploy soon and confirm the fix is there!

config.yaml

src/charm.py

tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case queried Prometheus and checked that the request returned successfully and that the application name and model were listed correctly. To make this test case more accurate, we can add an assertion that also checks that the unit is available; this way we avoid issues like the one described in canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564

skip: fix test
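
The availability assertion described in the tests commit could look roughly like the sketch below. It assumes the test runs an instant query for the `up` metric and receives a response in the Prometheus HTTP API format; the `unit_is_available` helper, the label values, and the sample response are hypothetical and are not the code from this PR.

```python
def unit_is_available(response: dict) -> bool:
    """Check an instant-query response for `up`: the value "1" means the unit is up."""
    if response.get("status") != "success":
        return False
    results = response["data"]["result"]
    # Each sample is [timestamp, value]; Prometheus encodes the value as a string.
    return bool(results) and all(r["value"][1] == "1" for r in results)


# Hypothetical response shaped like the output of /api/v1/query.
sample_up = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"juju_application": "example-app", "juju_model": "example-model"},
                "value": [1707357912.349, "1"],
            }
        ],
    },
}

# The same response with the unit reported as unavailable.
sample_down = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"juju_application": "example-app"}, "value": [1707357912.349, "0"]}],
    },
}

assert unit_is_available(sample_up)
assert not unit_is_available(sample_down)
```

Checking the sample value, rather than only the request status and labels, is what would catch a scrape job that is registered but whose endpoint is unreachable.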

DnPlas requested a review from a team as a code owner February 4, 2024 03:13

DnPlas mentioned this pull request Feb 4, 2024

UnitsAvailable alerts are firing constantly canonical/bundle-kubeflow#564

Closed

github-actions bot added the Libraries: Out of sync label Feb 4, 2024

DnPlas changed the title ~~fix: correct metrics path~~ fix: correct metrics path for MetricsEndpointProvider Feb 4, 2024

DnPlas mentioned this pull request Feb 4, 2024

Maximise GH runner space step failing in some repositories CIs canonical/bundle-kubeflow#813

Closed

DnPlas force-pushed the KF-1647-fix-alerts-firing branch 4 times, most recently from 631aa3c to 322c188 Compare February 9, 2024 12:46

orfeas-k reviewed Feb 13, 2024

orfeas-k approved these changes Feb 13, 2024

DnPlas added 2 commits February 13, 2024 15:43

DnPlas force-pushed the KF-1647-fix-alerts-firing branch from a192502 to d8041a5 Compare February 13, 2024 14:44

DnPlas merged commit 1d1a6f5 into main Feb 13, 2024
7 checks passed

DnPlas deleted the KF-1647-fix-alerts-firing branch February 13, 2024 14:44
