Degraded mode correctness metrics #2049

sawsa307 · 2023-03-30T19:05:59Z

Add metrics that reports the number of endpoints differ between current endpoint calculation and degraded mode calculation. It is not emitted when current endpoint calculation returns any errors because the returned map will be empty, while in degraded mode calculation, errors are handled and a non-empty map is returned.

k8s-ci-robot · 2023-03-30T19:06:08Z

Hi @sawsa307. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sawsa307 · 2023-03-30T19:06:09Z

/assign @swetharepakula

sawsa307 · 2023-03-30T19:06:24Z

/assign @bowei

sawsa307 · 2023-03-30T19:08:12Z

cc @gauravkghildiyal

bowei · 2023-04-25T18:16:55Z

/ok-to-test

bowei · 2023-04-27T16:21:38Z

pkg/neg/metrics/metrics.go

+			Subsystem: negControllerSubsystem,
+			Name:      degradedModeCorrectnessKey,
+			Help:      "Number of endpoints differed between current endpoint calculation and degraded mode calculation",
+			// custom buckets - [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, +Inf]


Let's make this more fine grained. Can we in increase +Inf range to 20k endpoints, set number of buckets to 20.

Updated. Thanks!

bowei · 2023-04-27T16:22:42Z

pkg/neg/syncers/transaction.go

@@ -280,8 +280,9 @@ func (s *transactionSyncer) syncInternalImpl() error {
 			if len(notInDegraded) == 0 && len(onlyInDegraded) == 0 {
 				s.resetErrorState()
 			}
+		} else {
+			computeDegradedModeCorrectness(notInDegraded, onlyInDegraded, string(s.NegSyncerKey.NegType))


Should we always publish instead of only publish when > 0?

As a histogram, each data point corresponds to ? What is the units/denominator of the metric?

Should we always publish instead of only publish when > 0?

We are trying to collect metrics when the normal calculation doesn't run into error/yields non-empty result, so we can make a meaningful comparison. It is possible that both results are the same or they are different. We are trying to make sure we are not doing comparison against something that is "meaningless" empty(normal calculation runs into error)
We end up in this line with both=0. When we are not in error state, and normal calculation doesn't run into error and produce endpoints, and it is possible that the normal calculation result is the same as the degraded mode result.
We are missing one condition, when we are in error state and the normal calculation doesn't run into error, so I'll change the condition here.

As a histogram, each data point corresponds to ? What is the units/denominator of the metric?

Each data point corresponds to the difference between normal and degraded mode endpoint calculations WHEN the normal calculation doesn't run into error. The unit is #endpoint, and the denominator is total count of endpoints. We are not recording the denominator here because all we care is if the difference is 0.

Add metrics that reports the number of endpoint differed between current endpoint calculation and degraded mode calculation. It is not emitted when current endpoint calculation returns any errors because the returned map will be empty, while in degraded mode calculation, errors are handled and a non-empty map is returned

bowei · 2023-04-27T17:29:08Z

/lgtm
/approve

k8s-ci-robot · 2023-04-27T17:29:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, sawsa307

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [bowei]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from aojea and code-elinka March 30, 2023 19:06

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 30, 2023

k8s-ci-robot assigned swetharepakula Mar 30, 2023

k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 30, 2023

k8s-ci-robot assigned bowei Mar 30, 2023

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2023

sawsa307 force-pushed the degraded-mode-correctness-metrics branch from b5e0b63 to 5a68287 Compare April 10, 2023 23:18

k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 10, 2023

sawsa307 force-pushed the degraded-mode-correctness-metrics branch from 5a68287 to 230e4f9 Compare April 17, 2023 18:29

k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 17, 2023

sawsa307 force-pushed the degraded-mode-correctness-metrics branch from 230e4f9 to 9ad5026 Compare April 18, 2023 03:20

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 18, 2023

sawsa307 force-pushed the degraded-mode-correctness-metrics branch 3 times, most recently from 9ab5138 to 5c46e90 Compare April 25, 2023 16:39

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 25, 2023

bowei reviewed Apr 27, 2023

View reviewed changes

sawsa307 force-pushed the degraded-mode-correctness-metrics branch from 5c46e90 to 85fce4d Compare April 27, 2023 17:05

sawsa307 force-pushed the degraded-mode-correctness-metrics branch from 85fce4d to 040ba0c Compare April 27, 2023 17:08

sawsa307 requested a review from bowei April 27, 2023 17:11

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 27, 2023

k8s-ci-robot merged commit 70a2f17 into kubernetes:master Apr 27, 2023

sawsa307 deleted the degraded-mode-correctness-metrics branch September 2, 2023 20:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Degraded mode correctness metrics #2049

Degraded mode correctness metrics #2049

sawsa307 commented Mar 30, 2023 •

edited

Loading

k8s-ci-robot commented Mar 30, 2023

sawsa307 commented Mar 30, 2023

sawsa307 commented Mar 30, 2023

sawsa307 commented Mar 30, 2023

bowei commented Apr 25, 2023

bowei Apr 27, 2023

sawsa307 Apr 27, 2023

bowei Apr 27, 2023

sawsa307 Apr 27, 2023 •

edited

Loading

bowei commented Apr 27, 2023

k8s-ci-robot commented Apr 27, 2023

Degraded mode correctness metrics #2049

Degraded mode correctness metrics #2049

Conversation

sawsa307 commented Mar 30, 2023 • edited Loading

k8s-ci-robot commented Mar 30, 2023

sawsa307 commented Mar 30, 2023

sawsa307 commented Mar 30, 2023

sawsa307 commented Mar 30, 2023

bowei commented Apr 25, 2023

bowei Apr 27, 2023

Choose a reason for hiding this comment

sawsa307 Apr 27, 2023

Choose a reason for hiding this comment

bowei Apr 27, 2023

Choose a reason for hiding this comment

sawsa307 Apr 27, 2023 • edited Loading

Choose a reason for hiding this comment

bowei commented Apr 27, 2023

k8s-ci-robot commented Apr 27, 2023

sawsa307 commented Mar 30, 2023 •

edited

Loading

sawsa307 Apr 27, 2023 •

edited

Loading