
feat: use modifier for topk to improve dashboard performance #2590

Merged
merged 2 commits into main from rg2-modifier-dashboard on Jan 22, 2024

Conversation

rahulguptajss
Contributor

@rahulguptajss rahulguptajss commented Jan 19, 2024

Below is a summary of the changes:

  1. Updated topk to use a modifier, leaving some complex queries unchanged.
  2. Fixed variable mapping in Prometheus.
  3. Corrected topk text in several places.
  4. Decided not to change the custom "All" value to .* in this PR, after considering the impact on cases where a dropdown filter depends on a previous filter query. The dropdown may be empty while "All" is still passed, which could lead to incorrect results, especially for flexgroup. We will handle this in a separate PR.
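To illustrate item 1, the shift is from ranking through a hidden chained variable to ranking with a range-anchored modifier. Both queries below are hypothetical sketches, not copied from the dashboards; the new form follows the `$__range` / `@ end()` pattern that the test changes in this PR check for:

```promql
# Old style: rank via a hidden chained variable that itself runs a topk query
topk($TopResources, volume_read_data{volume=~"$TopVolumeAvgThroughput"})

# Modifier style (sketch): rank by the average over the dashboard time range,
# evaluated once at the end of the range instead of per step
topk($TopResources, avg_over_time(volume_read_data{volume=~"$Volume"}[$__range] @ end()))
```

Evaluating the ranking once at `end()` avoids re-running the inner query for every panel refresh step, which is where the dashboard performance win comes from.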

Below are the pending topk queries that still use hidden variables:

dashboard=cmode/cdot.json path=panels[1].targets[0] use old topk. expr=topk($TopResources, volume_read_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$TopVolumeAvgThroughput"})+topk($TopResources, volume_write_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$TopVolumeAvgThroughput"})
dashboard=cmode/cluster.json path=panels[11].targets[0] use old topk. expr=topk($TopResources, sum(node_volume_avg_latency{datacenter=~"$Datacenter",cluster=~"$Cluster", node=~"$TopVolumeAvgLatency"}) by (node))
dashboard=cmode/cluster.json path=panels[12].targets[0] use old topk. expr=topk($TopResources, sum(node_volume_read_data{datacenter=~"$Datacenter",cluster=~"$Cluster", node=~"$TopVolumeTotalData"}) by (node) + sum(node_volume_write_data{datacenter=~"$Datacenter",cluster=~"$Cluster", node=~"$TopVolumeTotalData"}) by(node))
dashboard=cmode/cluster.json path=panels[13].targets[0] use old topk. expr=topk($TopResources, sum by (node)(node_volume_total_ops{datacenter=~"$Datacenter",cluster=~"$Cluster", node=~"$TopVolumeTotalOps"}))
dashboard=cmode/external_service_op.json path=panels[2].targets[0] use old topk. expr=topk($TopResources, avg by (operation, service_name, svm, cluster) (external_service_op_request_latency{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopRequestLatency"}))
dashboard=cmode/external_service_op.json path=panels[4].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_not_found_responses{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopNotFoundResponse"}))
dashboard=cmode/external_service_op.json path=panels[5].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_request_failures{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopRequestFailed"}))
dashboard=cmode/external_service_op.json path=panels[6].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_requests_sent{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopRequestSent"}))
dashboard=cmode/external_service_op.json path=panels[7].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_responses_received{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopRequestReceived"}))
dashboard=cmode/external_service_op.json path=panels[8].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_successful_responses{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopSuccessResponse"}))
dashboard=cmode/external_service_op.json path=panels[9].targets[0] use old topk. expr=topk($TopResources, sum by (operation, service_name, svm, cluster) (external_service_op_num_timeouts{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",service_name=~"$ServiceName",operation=~"$Operation",key=~"$TopTimeout"}))
dashboard=cmode/flexgroup.json path=panels[3].targets[0] use old topk. expr=topk($TopResources, volume_read_data{datacenter=~"$Datacenter",cluster=~"$Cluster",aggr=~"$Aggregate",volume=~"$TopVolumeAvgThroughput"})+topk($TopResources, volume_write_data{datacenter=~"$Datacenter",cluster=~"$Cluster",aggr=~"$Aggregate",volume=~"$TopVolumeAvgThroughput"})
dashboard=cmode/node.json path=panels[2].targets[0] use old topk. expr=topk($TopResources, nic_rx_bytes+nic_tx_bytes{datacenter=~"$Datacenter",cluster=~"$Cluster",node=~"$Node",nic=~"$TopNicXPut"})
dashboard=cmode/node.json path=panels[3].targets[0] use old topk. expr=topk($TopResources, fcp_read_data+fcp_write_data+fcp_nvmf_read_data+fcp_nvmf_write_data{datacenter=~"$Datacenter",cluster=~"$Cluster",node=~"$Node",port=~"$TopFCUtilXPut"})
dashboard=cmode/volume.json path=panels[5].targets[0] use old topk. expr=topk($TopResources, volume_read_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$TopVolumeAvgThroughput"})+topk($TopResources, volume_write_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$TopVolumeAvgThroughput"})
dashboard=storagegrid/overview.json path=panels[0].targets[0] use old topk. expr=topk($TopResources, avg by(cluster,tenant,datacenter)(storagegrid_tenant_usage_data_bytes{datacenter=~"$Datacenter",cluster=~"$Cluster",tenant=~"$TopTenantUsageBytes"}))
dashboard=storagegrid/overview.json path=panels[1].targets[0] use old topk. expr=topk($TopResources,storagegrid_node_cpu_utilization_percentage{datacenter=~"$Datacenter",cluster=~"$Cluster",node=~"$TopNodeByCPU"})

t.Errorf(`%s should not have unit=%s expected=%s %s path=%s title="%s"`,
	metric, unit, defaultLatencyUnit, location[0].dashboard, location[0].path, location[0].title)
if unit != expectedGrafanaUnit && !v.skipValidate {
	if !(strings.EqualFold(metric, "qos_detail_resource_latency") || strings.EqualFold(metric, "qos_detail_service_time_latency")) && unit == "percent" {
Collaborator

I think we can reuse the map with something like this:

// Check if this metric is in the allowedSuffix map and has a matching unit
if slices.Contains(allowedSuffix[metric], unit) {
  continue
}
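A self-contained sketch of that suggestion — the `allowedSuffix` contents, metric names, and the `unitAllowed` helper below are illustrative placeholders, not Harvest's real test data:

```go
package main

import (
	"fmt"
	"slices"
)

// unitAllowed reports whether metric appears in the allowed map with a
// matching unit, mirroring the reviewer's slices.Contains suggestion.
// A missing metric yields a nil slice, so the lookup is safely false.
func unitAllowed(allowed map[string][]string, metric, unit string) bool {
	return slices.Contains(allowed[metric], unit)
}

func main() {
	// Hypothetical allow-list: these latency metrics may legitimately
	// report "percent".
	allowedSuffix := map[string][]string{
		"qos_detail_resource_latency":     {"percent"},
		"qos_detail_service_time_latency": {"percent"},
	}

	checks := []struct{ metric, unit string }{
		{"qos_detail_resource_latency", "percent"}, // allowed: skipped
		{"volume_read_latency", "percent"},         // not allowed: flagged
	}
	for _, c := range checks {
		if unitAllowed(allowedSuffix, c.metric, c.unit) {
			continue
		}
		fmt.Printf("unexpected unit %q for metric %q\n", c.unit, c.metric)
	}
}
```

Running this prints only the `volume_read_latency` line; the allow-listed metric is skipped. Keeping the allowed pairs in one map avoids hardcoding metric names in the test body.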

@@ -783,6 +780,10 @@ func checkTopKRange(t *testing.T, path string, data []byte) {
}
}
}

if strings.Contains(expr.expr, "$__range") && strings.Contains(expr.expr, "@ end()") {
Collaborator

Since the query does not require whitespace, probably safer to go with this check to reduce false positives

noWhitespace := strings.ReplaceAll(expr.expr, " ", "")
if strings.Contains(noWhitespace, "$__range@end()") {
	hasRange = true
}
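Extracted into a helper, the whitespace-insensitive check can be sketched as follows (the function name and example inputs are illustrative, not from the Harvest test suite):

```go
package main

import (
	"fmt"
	"strings"
)

// hasAnchoredRange strips all spaces from a PromQL expression and then
// looks for the "$__range@end()" pattern, so queries written with or
// without whitespace around "@" are both detected, while expressions
// that merely mention $__range somewhere else are not.
func hasAnchoredRange(expr string) bool {
	noWhitespace := strings.ReplaceAll(expr, " ", "")
	return strings.Contains(noWhitespace, "$__range@end()")
}

func main() {
	fmt.Println(hasAnchoredRange("... $__range @ end() ..."))  // spaced form: true
	fmt.Println(hasAnchoredRange("...$__range@end()..."))      // compact form: true
	fmt.Println(hasAnchoredRange(`topk(5, metric{a="b"})`))    // no anchor: false
}
```

Normalizing whitespace first is what reduces the false positives the reviewer mentions: the original two independent `Contains` checks would match any expression containing both substrings anywhere, even unrelated ones.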

@cgrinds cgrinds merged commit fbb3c25 into main Jan 22, 2024
10 checks passed
@cgrinds cgrinds deleted the rg2-modifier-dashboard branch January 22, 2024 14:08

Successfully merging this pull request may close these issues.

Improve dashboard performance / consider simplified top n calculation
3 participants