Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Degraded mode correctness metrics #2049

Merged

Conversation

sawsa307
Copy link
Contributor

@sawsa307 sawsa307 commented Mar 30, 2023

Add metrics that reports the number of endpoints differ between current endpoint calculation and degraded mode calculation. It is not emitted when current endpoint calculation returns any errors because the returned map will be empty, while in degraded mode calculation, errors are handled and a non-empty map is returned.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 30, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @sawsa307. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sawsa307
Copy link
Contributor Author

/assign @swetharepakula

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 30, 2023
@sawsa307
Copy link
Contributor Author

/assign @bowei

@sawsa307
Copy link
Contributor Author

cc @gauravkghildiyal

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2023
@sawsa307 sawsa307 force-pushed the degraded-mode-correctness-metrics branch from b5e0b63 to 5a68287 Compare April 10, 2023 23:18
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 10, 2023
@sawsa307 sawsa307 force-pushed the degraded-mode-correctness-metrics branch from 5a68287 to 230e4f9 Compare April 17, 2023 18:29
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 17, 2023
@sawsa307 sawsa307 force-pushed the degraded-mode-correctness-metrics branch from 230e4f9 to 9ad5026 Compare April 18, 2023 03:20
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 18, 2023
@sawsa307 sawsa307 force-pushed the degraded-mode-correctness-metrics branch 3 times, most recently from 9ab5138 to 5c46e90 Compare April 25, 2023 16:39
@bowei
Copy link
Member

bowei commented Apr 25, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 25, 2023
Subsystem: negControllerSubsystem,
Name: degradedModeCorrectnessKey,
Help: "Number of endpoints differed between current endpoint calculation and degraded mode calculation",
// custom buckets - [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, +Inf]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this more fine grained. Can we in increase +Inf range to 20k endpoints, set number of buckets to 20.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Thanks!

@@ -280,8 +280,9 @@ func (s *transactionSyncer) syncInternalImpl() error {
if len(notInDegraded) == 0 && len(onlyInDegraded) == 0 {
s.resetErrorState()
}
} else {
computeDegradedModeCorrectness(notInDegraded, onlyInDegraded, string(s.NegSyncerKey.NegType))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we always publish instead of only publish when > 0?

As a histogram, each data point corresponds to ? What is the units/denominator of the metric?

Copy link
Contributor Author

@sawsa307 sawsa307 Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we always publish instead of only publish when > 0?

We are trying to collect metrics when the normal calculation doesn't run into error/yields non-empty result, so we can make a meaningful comparison. It is possible that both results are the same or they are different. We are trying to make sure we are not doing comparison against something that is "meaningless" empty(normal calculation runs into error)
We end up in this line with both=0. When we are not in error state, and normal calculation doesn't run into error and produce endpoints, and it is possible that the normal calculation result is the same as the degraded mode result.
We are missing one condition, when we are in error state and the normal calculation doesn't run into error, so I'll change the condition here.

As a histogram, each data point corresponds to ? What is the units/denominator of the metric?

Each data point corresponds to the difference between normal and degraded mode endpoint calculations WHEN the normal calculation doesn't run into error. The unit is #endpoint, and the denominator is total count of endpoints. We are not recording the denominator here because all we care is if the difference is 0.

@sawsa307 sawsa307 force-pushed the degraded-mode-correctness-metrics branch from 5c46e90 to 85fce4d Compare April 27, 2023 17:05
Add metrics that reports the number of endpoint differed between current
endpoint calculation and degraded mode calculation. It is not emitted
when current endpoint calculation returns any errors because the
returned map will be empty, while in degraded mode calculation, errors
are handled and a non-empty map is returned
@sawsa307 sawsa307 force-pushed the degraded-mode-correctness-metrics branch from 85fce4d to 040ba0c Compare April 27, 2023 17:08
@sawsa307 sawsa307 requested a review from bowei April 27, 2023 17:11
@bowei
Copy link
Member

bowei commented Apr 27, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, sawsa307

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 27, 2023
@k8s-ci-robot k8s-ci-robot merged commit 70a2f17 into kubernetes:master Apr 27, 2023
@sawsa307 sawsa307 deleted the degraded-mode-correctness-metrics branch September 2, 2023 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants