Add hive_cluster_deployment_provision_underway_install_restarts metric #1275

Conversation

@abhinavdahiya (Contributor) commented Feb 3, 2021

provision_underway_collector.go: allow setting min duration for provision time, add metric for install restarts

The hive_cluster_deployment_provision_underway_seconds metric currently reports the elapsed time for all clusterdeployments that are provisioning. To reduce the number of metrics reported while still ensuring the metric can be used to alert on installs taking too long, the collector now allows setting a minDuration. The minDuration ensures the collector only reports provisioning times greater than minDuration.

Another metric that's important for tracking failing installs is the number of install restarts. This allows catching installs that fail very quickly, for example due to missing permissions, whereas with underway_seconds the install might have to restart many times before the elapsed time grows large enough to alert on.

So a new metric, hive_cluster_deployment_provision_underway_install_restarts, is added that tracks the number of restarts for a cluster that is still provisioning.
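
To make the two thresholds concrete, here is a minimal sketch of how a collector's Collect loop could apply both filters. This is a hedged illustration, not the merged code: the hivev1 import path, the label set, the use of CreationTimestamp as the provisioning start, the Status.InstallRestarts field, and the combined sketch-only constructor are assumptions based on the description above. For brevity the sketch folds both metrics into one collector, whereas the PR registers them as separate collectors (see the registration hunk below).

```go
package metrics

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/client"

	hivev1 "github.com/openshift/hive/apis/hive/v1" // assumed import path
)

// provisioningUnderwayCollector sketches the two thresholds described above.
type provisioningUnderwayCollector struct {
	client client.Client

	// minDuration, when non-zero, is the minimum provisioning duration after
	// which a cluster is included in the underway_seconds metric.
	minDuration time.Duration
	// minRestarts, when non-zero, is the minimum number of install restarts
	// after which a cluster is included in the install_restarts metric.
	minRestarts int

	metricUnderwaySeconds *prometheus.Desc
	metricInstallRestarts *prometheus.Desc
}

// newUnderwaySketchCollector is a hypothetical combined constructor used only
// for this sketch; the PR itself wires up separate collectors.
func newUnderwaySketchCollector(c client.Client, minDuration time.Duration, minRestarts int) prometheus.Collector {
	return provisioningUnderwayCollector{
		client:      c,
		minDuration: minDuration,
		minRestarts: minRestarts,
		metricUnderwaySeconds: prometheus.NewDesc(
			"hive_cluster_deployment_provision_underway_seconds",
			"Seconds a ClusterDeployment has been provisioning.",
			[]string{"cluster_deployment", "namespace"}, nil),
		metricInstallRestarts: prometheus.NewDesc(
			"hive_cluster_deployment_provision_underway_install_restarts",
			"Install restarts for a ClusterDeployment that is still provisioning.",
			[]string{"cluster_deployment", "namespace"}, nil),
	}
}

func (c provisioningUnderwayCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.metricUnderwaySeconds
	ch <- c.metricInstallRestarts
}

func (c provisioningUnderwayCollector) Collect(ch chan<- prometheus.Metric) {
	cdList := &hivev1.ClusterDeploymentList{}
	if err := c.client.List(context.Background(), cdList); err != nil {
		return
	}
	for _, cd := range cdList.Items {
		// Only clusters that are still provisioning are of interest.
		if cd.Spec.Installed || cd.DeletionTimestamp != nil {
			continue
		}
		// Assumption: elapsed provisioning time is measured from creation.
		elapsed := time.Since(cd.CreationTimestamp.Time)
		if elapsed >= c.minDuration { // zero minDuration reports everything
			ch <- prometheus.MustNewConstMetric(c.metricUnderwaySeconds,
				prometheus.GaugeValue, elapsed.Seconds(), cd.Name, cd.Namespace)
		}
		// Assumption: the restart count lives in Status.InstallRestarts.
		if restarts := cd.Status.InstallRestarts; restarts >= c.minRestarts {
			ch <- prometheus.MustNewConstMetric(c.metricInstallRestarts,
				prometheus.GaugeValue, float64(restarts), cd.Name, cd.Namespace)
		}
	}
}
```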

controller/metrics/metrics.go: report metrics only for at least 1 hour of elapsed time and 1 install restart

Track clusters in hive_cluster_deployment_provision_underway_seconds which have been provisioning for at least 1 hour.
Track clusters in hive_cluster_deployment_provision_underway_install_restarts which have at least 1 restart.

This keeps the number of metrics collected smaller while still providing most of the value for alerting on failed installs.

xref: https://issues.redhat.com/browse/HIVE-1341

/assign @dgoodwin
/cc @suhanime

@@ -151,7 +151,8 @@ func Add(mgr manager.Manager) error {
Client: mgr.GetClient(),
Interval: 2 * time.Minute,
}
metrics.Registry.MustRegister(newProvisioningUnderwayCollector(mgr.GetClient()))
metrics.Registry.MustRegister(newProvisioningUnderwayCollectorWithMinDuration(mgr.GetClient(), 1*time.Hour))
metrics.Registry.MustRegister(newProvisioningUnderwayInstallRestartsCollectorWithMinRestarts(mgr.GetClient(), 2))
Contributor:

My vote would be to start reporting after 1 restart, and this would more closely match when the min restarts start showing up. They're limiting at 3 in OSD so 1 feels a little better for the min.

Contributor Author:

sure, fixed to 1

nil,
)
)

func newProvisioningUnderwayCollector(client client.Client) prometheus.Collector {
Contributor:

Do we need to keep this old function? Seems like we could just transition fully to the new.

Contributor Author:

updated to use the previous function to set up the collector instead of adding a new one, rather than dropping it in favor of a new one

@@ -30,6 +30,10 @@ var (
type provisioningUnderwayCollector struct {
client client.Client

// minDuration is the minimum duration afer which clusters provisioning
// will start reporting the metric. Only used when non-zero.
Contributor:

"Only used when non-zero", could you clarify that if left zero the metric is always reported? When first read I thought it meant we would never report it if it was zero.

Contributor Author:

updated

client client.Client

// minRestarts is the minimum restarts after which clusters provisioning
// will start reporting the metric. Only used when non-zero.
Contributor:

Same here, if zero will always report.

Contributor Author:

updated

}
break
}
}
Contributor:

Probably worth pulling this out to a little function to share between the collectors. Might be subtle fixes in there in future we'd miss in both.

Contributor Author:

updated to keep the condition extraction in a separate function.
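
For illustration, the shared extraction helper could be as small as the sketch below; the function name, and the exact condition type it is used with (for example a ProvisionFailed-style condition whose reason labels the metric), are assumptions rather than the merged code, and hivev1 refers to the Hive v1 API types as in the earlier sketch.

```go
// findClusterDeploymentCondition is a hedged sketch of the shared helper: it
// returns the first condition matching condType, or nil when none is present,
// so both collectors reuse the same loop instead of duplicating it.
func findClusterDeploymentCondition(conditions []hivev1.ClusterDeploymentCondition,
	condType hivev1.ClusterDeploymentConditionType) *hivev1.ClusterDeploymentCondition {
	for i := range conditions {
		if conditions[i].Type == condType {
			return &conditions[i]
		}
	}
	return nil
}
```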

assert.Equal(t, test.expected, got)
})
}
}
Contributor:

Bonus! I've read in some places not to unit test your metrics but I think in this case, with a custom collector, it makes sense.

@abhinavdahiya (Contributor Author) commented Feb 4, 2021:

> I've read in some places not to unit test your metrics

it's code that can and will have assumptions, so unit tests make sense to make sure the assumptions don't change underneath you. So I don't get why we would not unit test collectors... unit tests also just help me understand how things work; they are better than documentation in my case :P

Contributor:

Yeah maybe their stance is more related to simpler instrumentation just reporting up a value. IIRC the reasoning was to not make it overly difficult to add metrics, so people would be more likely to do it.

Contributor:

Unit tests are always a plus in my book
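
As a rough illustration of what a unit test for such a custom collector can look like, prometheus's testutil package lets you collect from it directly. This is a hedged sketch, not the test added in this PR: the fake-client builder requires a controller-runtime version that provides it, hivev1.AddToScheme and the fixture values are assumptions, and whether the assertion holds depends on how the real collector decides what to report.

```go
package metrics

import (
	"testing"
	"time"

	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/stretchr/testify/assert"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"

	hivev1 "github.com/openshift/hive/apis/hive/v1" // assumed import path
)

func TestProvisioningUnderwaySecondsReported(t *testing.T) {
	scheme := runtime.NewScheme()
	assert.NoError(t, hivev1.AddToScheme(scheme))

	// A ClusterDeployment that has been provisioning for roughly two hours.
	cd := &hivev1.ClusterDeployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "slow-cluster",
			Namespace:         "hive",
			CreationTimestamp: metav1.NewTime(time.Now().Add(-2 * time.Hour)),
		},
	}
	c := fake.NewClientBuilder().WithScheme(scheme).WithObjects(cd).Build()

	// Constructor signature taken from the hunk below (client + minimum duration).
	collector := newProvisioningUnderwayCollector(c, 1*time.Hour)

	// With a 1h minimum, the 2h-old cluster is expected to surface as one series.
	got := testutil.CollectAndCount(collector,
		"hive_cluster_deployment_provision_underway_seconds")
	assert.Equal(t, 1, got)
}
```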

)
)

func newProvisioningUnderwayCollector(client client.Client, minimum time.Duration) prometheus.Collector {
Contributor:

Would it be better to rename this to newProvisioningUnderwaySecondsCollector?

Contributor:

👍

Otherwise LGTM unless @suhanime has any feedback.

Contributor:

+1 to this suggestion

@suhanime (Contributor) left a comment:

The changes look good - just some minor nits

@@ -30,6 +30,11 @@ var (
type provisioningUnderwayCollector struct {
client client.Client

// minDuration, when non-zero, is the minimum duration afer which clusters provisioning
// will start becomming part of thismetric. When set to zero, all clusters provisioning
Contributor:

nit - but we tend to ignore doc errors later on
will start becoming part of this metric*

metricClusterDeploymentProvisionUnderwayInstallRestarts *prometheus.Desc
}

// collects the metrics for provisioningUnderwayCollector
Contributor:

This doc is now confusing. We could even scrap it since the func definition is enough to understand.

@abhinavdahiya (Contributor Author):

@dgoodwin @suhanime fixed the last nits, ready for another review.

@suhanime (Contributor) commented Feb 8, 2021:

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Feb 8, 2021
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, suhanime

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 8, 2021
@openshift-bot:

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit c731660 into openshift:master on Feb 9, 2021