
feat: use modifier for topk to improve dashboard performance #2590

Merged
merged 2 commits into main from rg2-modifier-dashboard on Jan 22, 2024

Conversation

rahulguptajss
Contributor

@rahulguptajss rahulguptajss commented Jan 19, 2024

Below is a summary of the changes:

  1. Updated topk to use a modifier, leaving some complex queries unchanged.
  2. Fixed variable mapping in Prometheus.
  3. Corrected topk text in several places.
  4. Decided not to change the custom "All" value to .* in this PR, after considering the impact on cases where a dropdown filter depends on a previous filter query. The dropdown may be empty while "All" is still passed, which could lead to incorrect results, especially for flexgroup. We will handle this in a separate PR.
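To illustrate item 1, the shift is from ranking through a hidden chained variable to ranking with a range-anchored modifier. Both queries below are hypothetical sketches, not copied from the dashboards; the new form follows the `$__range` / `@ end()` pattern that the test changes in this PR check for:

```promql
# Old style: rank via a hidden chained variable that itself runs a topk query
topk($TopResources, volume_read_data{volume=~"$TopVolumeAvgThroughput"})

# Modifier style (sketch): rank by the average over the dashboard time range,
# evaluated once at the end of the range instead of per step
topk($TopResources, avg_over_time(volume_read_data{volume=~"$Volume"}[$__range] @ end()))
```

Evaluating the ranking once at `end()` avoids re-running the inner query for every panel refresh step, which is where the dashboard performance win comes from.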

Below are the pending topk queries that still use hidden variables:

dashboard=cmode/cdot.json path=panels[1].targets[0] use old topk. expr=topk($TopResources, volume_read_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$TopVolumeAvgThroughput"})+topk($TopResources, volume_write_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$TopVolumeAvgThroughput"})
dashboard=cmode/cluster.json path=panels[11].targets[0] use old topk. expr=topk($TopResources, sum(node_volume_avg_latency{datacenter=~"$Datacenter",cluster=~"$Cluster", node=~"$TopVolumeAvgLatency"}) by (node))
dashboard=cmode/cluster.json path=panels[12].targets[0] use old topk. expr=topk($TopResources, sum(node_volume_read_data{datacenter=~"$Datacenter",cluster=~"$Cluster", node=~"$TopVolumeTotalData"}) by (node) + sum(node_volume_write_data{datacenter=~"$Datacenter",cluster=~"$Cluster", node=~"$TopVolumeTotalData"}) by(node))
dashboard=cmode/cluster.json path=panels[13].targets[0] use old topk. expr=topk($TopResources, sum by (node)(node_volume_total_ops{datacenter=~"$Datacenter",cluster=~"$Cluster", node=~"$TopVolumeTotalOps"}))
dashboard=cmode/external_service_op.json path=panels[2].targets[0] use old topk. expr=topk($TopResources, avg by (operation, service_name, svm, cluster) (external_service_op_request_latency{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopRequestLatency"}))
dashboard=cmode/external_service_op.json path=panels[4].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_not_found_responses{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopNotFoundResponse"}))
dashboard=cmode/external_service_op.json path=panels[5].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_request_failures{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopRequestFailed"}))
dashboard=cmode/external_service_op.json path=panels[6].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_requests_sent{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopRequestSent"}))
dashboard=cmode/external_service_op.json path=panels[7].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_responses_received{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopRequestReceived"}))
dashboard=cmode/external_service_op.json path=panels[8].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_successful_responses{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopSuccessResponse"}))
dashboard=cmode/external_service_op.json path=panels[9].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_timeouts{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopTimeout"}))
dashboard=cmode/flexgroup.json path=panels[3].targets[0] use old topk. expr=topk($TopResources, volume_read_data{datacenter=~"$Datacenter",cluster=~"$Cluster",aggr=~"$Aggregate",volume=~"$TopVolumeAvgThroughput"})+topk($TopResources, volume_write_data{datacenter=~"$Datacenter",cluster=~"$Cluster",aggr=~"$Aggregate",volume=~"$TopVolumeAvgThroughput"})
dashboard=cmode/node.json path=panels[2].targets[0] use old topk. expr=topk($TopResources, nic_rx_bytes+nic_tx_bytes{datacenter=~"$Datacenter",cluster=~"$Cluster",node=~"$Node",nic=~"$TopNicXPut"})
dashboard=cmode/node.json path=panels[3].targets[0] use old topk. expr=topk($TopResources, fcp_read_data+fcp_write_data+fcp_nvmf_read_data+fcp_nvmf_write_data{datacenter=~"$Datacenter",cluster=~"$Cluster",node=~"$Node",port=~"$TopFCUtilXPut"})
dashboard=cmode/volume.json path=panels[5].targets[0] use old topk. expr=topk($TopResources, volume_read_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$TopVolumeAvgThroughput"})+topk($TopResources, volume_write_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$TopVolumeAvgThroughput"})
dashboard=storagegrid/overview.json path=panels[0].targets[0] use old topk. expr=topk($TopResources, avg by(cluster,tenant,datacenter)(storagegrid_tenant_usage_data_bytes{datacenter=~"$Datacenter",cluster=~"$Cluster",tenant=~"$TopTenantUsageBytes"}))
dashboard=storagegrid/overview.json path=panels[1].targets[0] use old topk. expr=topk($TopResources,storagegrid_node_cpu_utilization_percentage{datacenter=~"$Datacenter",cluster=~"$Cluster",node=~"$TopNodeByCPU"})

t.Errorf(`%s should not have unit=%s expected=%s %s path=%s title="%s"`,
	metric, unit, defaultLatencyUnit, location[0].dashboard, location[0].path, location[0].title)
if unit != expectedGrafanaUnit && !v.skipValidate {
	if !(strings.EqualFold(metric, "qos_detail_resource_latency") || strings.EqualFold(metric, "qos_detail_service_time_latency")) && unit == "percent" {
Collaborator

I think we can reuse the map with something like this:

// Check if this metric is in the allowedSuffix map and has a matching unit
if slices.Contains(allowedSuffix[metric], unit) {
  continue
}
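A self-contained sketch of that suggestion — the `allowedSuffix` contents, metric names, and the `unitAllowed` helper below are illustrative placeholders, not Harvest's real test data:

```go
package main

import (
	"fmt"
	"slices"
)

// unitAllowed reports whether metric appears in the allowed map with a
// matching unit, mirroring the reviewer's slices.Contains suggestion.
// A missing metric yields a nil slice, so the lookup is safely false.
func unitAllowed(allowed map[string][]string, metric, unit string) bool {
	return slices.Contains(allowed[metric], unit)
}

func main() {
	// Hypothetical allow-list: these latency metrics may legitimately
	// report "percent".
	allowedSuffix := map[string][]string{
		"qos_detail_resource_latency":     {"percent"},
		"qos_detail_service_time_latency": {"percent"},
	}

	checks := []struct{ metric, unit string }{
		{"qos_detail_resource_latency", "percent"}, // allowed: skipped
		{"volume_read_latency", "percent"},         // not allowed: flagged
	}
	for _, c := range checks {
		if unitAllowed(allowedSuffix, c.metric, c.unit) {
			continue
		}
		fmt.Printf("unexpected unit %q for metric %q\n", c.unit, c.metric)
	}
}
```

Running this prints only the `volume_read_latency` line; the allow-listed metric is skipped. Keeping the allowed pairs in one map avoids hardcoding metric names in the test body.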

@@ -783,6 +780,10 @@ func checkTopKRange(t *testing.T, path string, data []byte) {
}
}
}

if strings.Contains(expr.expr, "$__range") && strings.Contains(expr.expr, "@ end()") {
Collaborator

Since the query does not require whitespace, probably safer to go with this check to reduce false positives

noWhitespace := strings.ReplaceAll(expr.expr, " ", "")
if strings.Contains(noWhitespace, "$__range@end()") {
	hasRange = true
}
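Extracted into a helper, the whitespace-insensitive check can be sketched as follows (the function name and example inputs are illustrative, not from the Harvest test suite):

```go
package main

import (
	"fmt"
	"strings"
)

// hasAnchoredRange strips all spaces from a PromQL expression and then
// looks for the "$__range@end()" pattern, so queries written with or
// without whitespace around "@" are both detected, while expressions
// that merely mention $__range somewhere else are not.
func hasAnchoredRange(expr string) bool {
	noWhitespace := strings.ReplaceAll(expr, " ", "")
	return strings.Contains(noWhitespace, "$__range@end()")
}

func main() {
	fmt.Println(hasAnchoredRange("... $__range @ end() ..."))  // spaced form: true
	fmt.Println(hasAnchoredRange("...$__range@end()..."))      // compact form: true
	fmt.Println(hasAnchoredRange(`topk(5, metric{a="b"})`))    // no anchor: false
}
```

Normalizing whitespace first is what reduces the false positives the reviewer mentions: the original two independent `Contains` checks would match any expression containing both substrings anywhere, even unrelated ones.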

@cgrinds cgrinds merged commit fbb3c25 into main Jan 22, 2024
10 checks passed
@cgrinds cgrinds deleted the rg2-modifier-dashboard branch January 22, 2024 14:08

Successfully merging this pull request may close these issues.

Improve dashboard performance / consider simplified top n calculation
3 participants