fix: correct metrics path for MetricsEndpointProvider #236

Merged
merged 2 commits into from
Feb 13, 2024

@DnPlas DnPlas commented Feb 4, 2024

fix: correctly configure one scrape job to avoid firing alerts

The metrics endpoint configuration had two scrape jobs: one for the
regular metrics endpoint, and a second one based on a dynamic list of
targets. The latter caused the Prometheus scraper to try to scrape
metrics from *:80/metrics, which is not a valid endpoint. As a result,
the UnitsUnavailable alert fired constantly, because that job kept
reporting that the endpoint was not available.
The second job was introduced by #94 with no apparent justification.
Because the seldon charm has changed since that PR, and the endpoint the
job configures is not valid, this commit removes the extra job.

This commit also refactors the MetricsEndpointProvider instantiation and
removes the metrics-port config option as this value should not change.
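
As a rough illustration of the single-job setup described above, the sketch below builds a jobs list with one scrape job and a fixed port. The names `METRICS_PORT`, `METRICS_PATH` and `build_scrape_jobs` are hypothetical, and the port value is an assumption chosen for the example; the list has the general shape that is passed as the `jobs` argument of `MetricsEndpointProvider`, but this is not the code from the PR.

```python
from typing import Any, Dict, List

# Assumption: illustrative values only. The PR states that the port is fixed,
# but does not state its value in this conversation.
METRICS_PORT = "8080"
METRICS_PATH = "/metrics"


def build_scrape_jobs(
    port: str = METRICS_PORT, path: str = METRICS_PATH
) -> List[Dict[str, Any]]:
    """Return a jobs list containing exactly one scrape job.

    The "*" host is a wildcard target, so the job covers every unit of the
    application on the given port.
    """
    return [
        {
            "metrics_path": path,
            "static_configs": [{"targets": [f"*:{port}"]}],
        }
    ]


if __name__ == "__main__":
    print(build_scrape_jobs())
```

Keeping the port as a module-level constant rather than a config option means there is a single place to look when the scrape target is wrong.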

Finally, this commit changes the alert rule interval from 0m to 5m, as
this interval is more appropriate for production environments.
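
To show why a non-zero interval matters, here is a small sketch that assumes the interval is the rule's `for` duration, meaning the condition has to hold continuously for that long before the alert fires. The function `alert_fires` and the sample data are hypothetical and only model that behaviour; they are not part of the charm or of Prometheus.

```python
from typing import Optional, Sequence, Tuple


def alert_fires(samples: Sequence[Tuple[float, bool]], hold_seconds: float) -> bool:
    """Return True if the condition held continuously for at least hold_seconds.

    samples: (timestamp_seconds, condition_is_true) pairs in ascending order.
    """
    pending_since: Optional[float] = None
    for timestamp, condition in samples:
        if not condition:
            # The condition cleared, so the pending timer resets.
            pending_since = None
            continue
        if pending_since is None:
            pending_since = timestamp
        if timestamp - pending_since >= hold_seconds:
            return True
    return False


if __name__ == "__main__":
    # A short blip: the unit is unavailable for two samples, then recovers.
    blip = [(0, True), (60, True), (120, False)]
    print(alert_fires(blip, 0))    # fires on the first failing sample
    print(alert_fires(blip, 300))  # the blip is too short to fire
```

With a 0m interval a single failed scrape is enough to fire; with 5m, short interruptions such as a pod restart are ignored.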

Part of canonical/bundle-kubeflow#564

Testing

Please follow the steps to reproduce in this comment, deploying only this app. After deploying this app together with cos-lite and the required relations and dependencies, wait about 10 minutes: none of the alerts should fire (for this app only).

TODO

  • refactor integration test case test_prometheus_grafana_integration
  • create a backport PR for track/ckf-1.8

@DnPlas DnPlas requested a review from a team as a code owner February 4, 2024 03:13
@DnPlas DnPlas changed the title fix: correct metrics path fix: correct metrics path for MetricsEndpointProvider Feb 4, 2024
@DnPlas DnPlas force-pushed the KF-1647-fix-alerts-firing branch 4 times, most recently from 631aa3c to 322c188 Compare February 9, 2024 12:46

@orfeas-k orfeas-k left a comment


Good job Daniela, I left a comment or two regarding the configurability of metrics. I'll deploy soon and confirm the fix is there!

config.yaml (resolved)
src/charm.py (resolved)
tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case queried Prometheus and
checked that the request returned successfully and that the application name and model
were listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
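
A minimal sketch of what such an assertion could look like, assuming the test queries the Prometheus HTTP API for the `up` metric and receives the standard instant-query JSON. The helper name `assert_unit_available` and the sample values are hypothetical, and the `juju_application` and `juju_model` label names are assumed from the Juju topology labels; the actual test in the repository may differ.

```python
from typing import Any, Dict


def assert_unit_available(
    response: Dict[str, Any], app_name: str, model_name: str
) -> None:
    """Check an instant-query response for `up` and assert the unit is available."""
    assert response["status"] == "success"
    results = response["data"]["result"]
    assert results, "query returned no series"
    for series in results:
        metric = series["metric"]
        # Assumed label names attached to the scraped series.
        assert metric["juju_application"] == app_name
        assert metric["juju_model"] == model_name
        # Instant vectors carry [timestamp, "value"]; "1" means the target is up.
        _timestamp, value = series["value"]
        assert value == "1", f"unit reported as unavailable: {metric}"


if __name__ == "__main__":
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {
                    "metric": {
                        "__name__": "up",
                        "juju_application": "example-app",
                        "juju_model": "example-model",
                    },
                    "value": [1707000000.0, "1"],
                }
            ],
        },
    }
    assert_unit_available(sample, "example-app", "example-model")
```

Checking the value of `up` in addition to the labels is what catches a scrape job that is registered correctly but points at an endpoint that does not respond.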

skip: fix test
@DnPlas DnPlas merged commit 1d1a6f5 into main Feb 13, 2024
7 checks passed
@DnPlas DnPlas deleted the KF-1647-fix-alerts-firing branch February 13, 2024 14:44
DnPlas added a commit that referenced this pull request Feb 13, 2024
* fix: correctly configure one scrape job to avoid firing alerts

The metrics endpoint configuration had two scrape jobs: one for the
regular metrics endpoint, and a second one based on a dynamic list of
targets. The latter caused the Prometheus scraper to try to scrape
metrics from *:80/metrics, which is not a valid endpoint. As a result,
the UnitsUnavailable alert fired constantly, because that job kept
reporting that the endpoint was not available.
The second job was introduced by #94 with no apparent justification.
Because the seldon charm has changed since that PR, and the endpoint the
job configures is not valid, this commit removes the extra job.

This commit also refactors the MetricsEndpointProvider instantiation and
removes the metrics-port config option as this value should not change.

Finally, this commit changes the alert rule interval from 0m to 5m, as
this interval is more appropriate for production environments.

Part of canonical/bundle-kubeflow#564

* tests: add an assertion for checking unit is available

The test_prometheus_grafana_integration test case queried Prometheus and
checked that the request returned successfully and that the application name and model
were listed correctly. To make this test case more accurate, we can add an assertion that
also checks that the unit is available; this way we avoid issues like the one described in
canonical/bundle-kubeflow#564.

Part of canonical/bundle-kubeflow#564
DnPlas added a commit that referenced this pull request Feb 14, 2024
DnPlas added a commit that referenced this pull request Feb 14, 2024