Add hive_cluster_deployment_provision_underway_install_restarts metric #1275

Conversation

@abhinavdahiya (Contributor) commented Feb 3, 2021

provision_underway_collector.go: allow setting min duration for provision time, add metric for install restarts

The hive_cluster_deployment_provision_underway_seconds metric currently reports the elapsed time for all clusterdeployments that are provisioning. To reduce the number of metrics reported while still ensuring the metric can be used to alert on installs taking too long, the collector now allows setting a minDuration. The minDuration ensures the collector only reports provisioning times greater than minDuration.

Another metric that's important for tracking failing installs is the number of install restarts. This allows catching installs that fail very quickly, for example due to missing permissions, whereas with underway_seconds the install might have to restart many times before the elapsed time grows large enough to alert on.

So a new metric, hive_cluster_deployment_provision_underway_install_restarts, is added that tracks the number of restarts for a cluster that is still provisioning.
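
To make the two thresholds concrete, here is a minimal sketch of how a collector's Collect loop could apply both filters. This is a hedged illustration, not the merged code: the hivev1 import path, the label set, the use of CreationTimestamp as the provisioning start, the Status.InstallRestarts field, and the combined sketch-only constructor are assumptions based on the description above. For brevity the sketch folds both metrics into one collector, whereas the PR registers them as separate collectors (see the registration hunk below).

```go
package metrics

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/client"

	hivev1 "github.com/openshift/hive/apis/hive/v1" // assumed import path
)

// provisioningUnderwayCollector sketches the two thresholds described above.
type provisioningUnderwayCollector struct {
	client client.Client

	// minDuration, when non-zero, is the minimum provisioning duration after
	// which a cluster is included in the underway_seconds metric.
	minDuration time.Duration
	// minRestarts, when non-zero, is the minimum number of install restarts
	// after which a cluster is included in the install_restarts metric.
	minRestarts int

	metricUnderwaySeconds *prometheus.Desc
	metricInstallRestarts *prometheus.Desc
}

// newUnderwaySketchCollector is a hypothetical combined constructor used only
// for this sketch; the PR itself wires up separate collectors.
func newUnderwaySketchCollector(c client.Client, minDuration time.Duration, minRestarts int) prometheus.Collector {
	return provisioningUnderwayCollector{
		client:      c,
		minDuration: minDuration,
		minRestarts: minRestarts,
		metricUnderwaySeconds: prometheus.NewDesc(
			"hive_cluster_deployment_provision_underway_seconds",
			"Seconds a ClusterDeployment has been provisioning.",
			[]string{"cluster_deployment", "namespace"}, nil),
		metricInstallRestarts: prometheus.NewDesc(
			"hive_cluster_deployment_provision_underway_install_restarts",
			"Install restarts for a ClusterDeployment that is still provisioning.",
			[]string{"cluster_deployment", "namespace"}, nil),
	}
}

func (c provisioningUnderwayCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.metricUnderwaySeconds
	ch <- c.metricInstallRestarts
}

func (c provisioningUnderwayCollector) Collect(ch chan<- prometheus.Metric) {
	cdList := &hivev1.ClusterDeploymentList{}
	if err := c.client.List(context.Background(), cdList); err != nil {
		return
	}
	for _, cd := range cdList.Items {
		// Only clusters that are still provisioning are of interest.
		if cd.Spec.Installed || cd.DeletionTimestamp != nil {
			continue
		}
		// Assumption: elapsed provisioning time is measured from creation.
		elapsed := time.Since(cd.CreationTimestamp.Time)
		if elapsed >= c.minDuration { // zero minDuration reports everything
			ch <- prometheus.MustNewConstMetric(c.metricUnderwaySeconds,
				prometheus.GaugeValue, elapsed.Seconds(), cd.Name, cd.Namespace)
		}
		// Assumption: the restart count lives in Status.InstallRestarts.
		if restarts := cd.Status.InstallRestarts; restarts >= c.minRestarts {
			ch <- prometheus.MustNewConstMetric(c.metricInstallRestarts,
				prometheus.GaugeValue, float64(restarts), cd.Name, cd.Namespace)
		}
	}
}
```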

controller/metrics/metrics.go: report metrics only for at least 1 hour of elapsed time and 1 install restart

Track clusters in hive_cluster_deployment_provision_underway_seconds which have been provisioning for at least 1 hour.
Track clusters in hive_cluster_deployment_provision_underway_install_restarts which have at least 1 restart.

This keeps the number of metrics collected smaller while still providing most of the value for alerting on failed installs.

xref: https://issues.redhat.com/browse/HIVE-1341

/assign @dgoodwin
/cc @suhanime

@@ -151,7 +151,8 @@ func Add(mgr manager.Manager) error {
Client: mgr.GetClient(),
Interval: 2 * time.Minute,
}
metrics.Registry.MustRegister(newProvisioningUnderwayCollector(mgr.GetClient()))
metrics.Registry.MustRegister(newProvisioningUnderwayCollectorWithMinDuration(mgr.GetClient(), 1*time.Hour))
metrics.Registry.MustRegister(newProvisioningUnderwayInstallRestartsCollectorWithMinRestarts(mgr.GetClient(), 2))
Contributor:

My vote would be to start reporting after 1 restart, and this would more closely match when the min restarts start showing up. They're limiting at 3 in OSD so 1 feels a little better for the min.

Contributor Author:

sure, fixed to 1

nil,
)
)

func newProvisioningUnderwayCollector(client client.Client) prometheus.Collector {
Contributor:

Do we need to keep this old function? Seems like we could just transition fully to the new.

Contributor Author:

updated to use the previous function to set up the collector instead of adding a new one, rather than dropping it in favor of a new one

@@ -30,6 +30,10 @@ var (
type provisioningUnderwayCollector struct {
client client.Client

// minDuration is the minimum duration afer which clusters provisioning
// will start reporting the metric. Only used when non-zero.
Contributor:

"Only used when non-zero", could you clarify that if left zero the metric is always reported? When first read I thought it meant we would never report it if it was zero.

Contributor Author:

updated

client client.Client

// minRestarts is the minimum restarts after which clusters provisioning
// will start reporting the metric. Only used when non-zero.
Contributor:

Same here, if zero will always report.

Contributor Author:

updated

}
break
}
}
Contributor:

Probably worth pulling this out to a little function to share between the collectors. Might be subtle fixes in there in future we'd miss in both.

Contributor Author:

updated to keep the condition extraction in a separate function.
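
For illustration, the shared extraction helper could be as small as the sketch below; the function name, and the exact condition type it is used with (for example a ProvisionFailed-style condition whose reason labels the metric), are assumptions rather than the merged code, and hivev1 refers to the Hive v1 API types as in the earlier sketch.

```go
// findClusterDeploymentCondition is a hedged sketch of the shared helper: it
// returns the first condition matching condType, or nil when none is present,
// so both collectors reuse the same loop instead of duplicating it.
func findClusterDeploymentCondition(conditions []hivev1.ClusterDeploymentCondition,
	condType hivev1.ClusterDeploymentConditionType) *hivev1.ClusterDeploymentCondition {
	for i := range conditions {
		if conditions[i].Type == condType {
			return &conditions[i]
		}
	}
	return nil
}
```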

assert.Equal(t, test.expected, got)
})
}
}
Contributor:

Bonus! I've read in some places not to unit test your metrics but I think in this case, with a custom collector, it makes sense.

@abhinavdahiya (Contributor Author) commented Feb 4, 2021:

> I've read in some places not to unit test your metrics

it's code that can and will have assumptions, so unit tests make sense to make sure the assumptions don't change underneath you. So I don't get why we would not unit test collectors... unit tests also just help me understand how things work; they are better than documentation in my case :P

Contributor:

Yeah maybe their stance is more related to simpler instrumentation just reporting up a value. IIRC the reasoning was to not make it overly difficult to add metrics, so people would be more likely to do it.

Contributor:

Unit tests are always a plus in my book
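
As a rough illustration of what a unit test for such a custom collector can look like, prometheus's testutil package lets you collect from it directly. This is a hedged sketch, not the test added in this PR: the fake-client builder requires a controller-runtime version that provides it, hivev1.AddToScheme and the fixture values are assumptions, and whether the assertion holds depends on how the real collector decides what to report.

```go
package metrics

import (
	"testing"
	"time"

	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/stretchr/testify/assert"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"

	hivev1 "github.com/openshift/hive/apis/hive/v1" // assumed import path
)

func TestProvisioningUnderwaySecondsReported(t *testing.T) {
	scheme := runtime.NewScheme()
	assert.NoError(t, hivev1.AddToScheme(scheme))

	// A ClusterDeployment that has been provisioning for roughly two hours.
	cd := &hivev1.ClusterDeployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "slow-cluster",
			Namespace:         "hive",
			CreationTimestamp: metav1.NewTime(time.Now().Add(-2 * time.Hour)),
		},
	}
	c := fake.NewClientBuilder().WithScheme(scheme).WithObjects(cd).Build()

	// Constructor signature taken from the hunk below (client + minimum duration).
	collector := newProvisioningUnderwayCollector(c, 1*time.Hour)

	// With a 1h minimum, the 2h-old cluster is expected to surface as one series.
	got := testutil.CollectAndCount(collector,
		"hive_cluster_deployment_provision_underway_seconds")
	assert.Equal(t, 1, got)
}
```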

)
)

func newProvisioningUnderwayCollector(client client.Client, minimum time.Duration) prometheus.Collector {
Contributor:

Would it be better to rename this to newProvisioningUnderwaySecondsCollector?

Contributor:

👍

Otherwise LGTM unless @suhanime has any feedback.

Contributor:

+1 to this suggestion

@suhanime (Contributor) left a comment:

The changes look good - just some minor nits

@@ -30,6 +30,11 @@ var (
type provisioningUnderwayCollector struct {
client client.Client

// minDuration, when non-zero, is the minimum duration afer which clusters provisioning
// will start becomming part of thismetric. When set to zero, all clusters provisioning
Contributor:

nit - but we tend to ignore doc errors later on
will start becoming part of this metric*

metricClusterDeploymentProvisionUnderwayInstallRestarts *prometheus.Desc
}

// collects the metrics for provisioningUnderwayCollector
Contributor:

This doc is now confusing. We could even scrap it since the func definition is enough to understand.

@abhinavdahiya (Contributor Author):

@dgoodwin @suhanime fixed the last nits, ready for another review.

@suhanime (Contributor) commented Feb 8, 2021:

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Feb 8, 2021
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, suhanime

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 8, 2021
@openshift-bot:

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit c731660 into openshift:master on Feb 9, 2021