perf: use concurrent map when storing metrics #2510

Merged
k8s-ci-robot merged 2 commits into kubernetes:main from fix/reduce_locking
Nov 5, 2024

Conversation

rarruda
Contributor

@rarruda rarruda commented Sep 26, 2024

What this PR does / why we need it: In busy or large clusters, this prevents request timeouts caused by long-held locks. Writing to the metrics map takes too long and blocks the metrics-reading thread; because the lock is not released in a timely manner, the scrape request times out.
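The idea can be sketched roughly as follows. This is a minimal illustration, not the code changed in this PR: the `metricsStore` type, its method names, and the string UID key are hypothetical. It assumes a store that previously kept a plain map behind a single mutex, and shows the same operations on Go's `sync.Map`, whose `Store`, `Delete`, and `Range` methods are safe for concurrent use without an external lock.

```go
package main

import (
	"bytes"
	"io"
	"sync"
)

// metricsStore is a hypothetical store: object UID (string) -> rendered
// metrics ([]byte). No mutex field is needed because sync.Map handles
// concurrent access internally.
type metricsStore struct {
	metrics sync.Map
}

// Add inserts or replaces the rendered metrics for one object.
func (s *metricsStore) Add(uid string, rendered []byte) {
	s.metrics.Store(uid, rendered)
}

// Delete removes the metrics for one object.
func (s *metricsStore) Delete(uid string) {
	s.metrics.Delete(uid)
}

// WriteAll streams every stored entry to w. Range does not block writers
// for the whole iteration, and it does not guarantee a consistent snapshot:
// entries added or removed concurrently may or may not be visited.
func (s *metricsStore) WriteAll(w io.Writer) error {
	var err error
	s.metrics.Range(func(_, v any) bool {
		_, err = w.Write(v.([]byte))
		return err == nil // stop iterating on the first write error
	})
	return err
}

// Dump is a small helper that collects WriteAll's output in memory.
func (s *metricsStore) Dump() []byte {
	var buf bytes.Buffer
	_ = s.WriteAll(&buf) // writes to a bytes.Buffer do not fail
	return buf.Bytes()
}
```

With a shape like this, a slow scrape that is iterating the map no longer holds one lock that every writer has to wait on, at the cost of weaker snapshot guarantees during iteration.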

In these graphs you see:

  1. VMagent scrape duration (yellow: with this patch applied, green: no patch)
  2. VMagent samples scraped (yellow: with this patch applied, green: no patch)
  3. Heatmap from selfMetrics with patch applied
  4. Heatmap from selfMetrics with no patch
  5. Number of pods in log scale; max is close to 16k
  6. Number of nodes in cluster

[Image: Screenshot from 2024-09-26 12-48-47]

You can see that with this patch we reduce latency significantly under all scenarios in our cluster.

Inspired by previous PR at #1028

How does this change affect the cardinality of KSM: does not change cardinality

Which issue(s) this PR fixes:
Fixes #995


linux-foundation-easycla bot commented Sep 26, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 26, 2024
@k8s-ci-robot
Contributor

Welcome @rarruda!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 26, 2024
@mrueg
Member

mrueg commented Sep 30, 2024

Thanks for your contribution!
/lgtm
and
/hold
for others to review as well.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 30, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg, rarruda

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 30, 2024
@dgrisonnet
Member

/triage accepted
/assign

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 3, 2024
@dgrisonnet
Member

The changes look good to me, thank you @rarruda for pushing this improvement :)

I'll update our perf-tests and see how much we improve with this.

@dgrisonnet
Member

Please refrain from merging until we have the results.

@rarruda
Contributor Author

rarruda commented Oct 6, 2024

Note: for clusters with very little load, and/or sequential read/write access patterns (synthetic data is generated first and only read afterwards), I would expect the original implementation, with a single mutex for the map, to perform better.

But that's not real-world performance.

In our (real-world, production) case, as the graphs show, lock contention becomes too great because too many things happen in parallel.

More fine-grained locking has an overhead associated with it, and its benefit is only apparent under heavy multithreaded read/write loads. The marginal cost of the extra locking/atomic operations in sync.Map might be measurable in small clusters. (I have not done much testing in small clusters.)
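For anyone who wants to try this trade-off locally, here is a rough sketch of a comparison harness. It is not taken from kube-state-metrics or its perf-tests: the `store` interface, both implementations, and the `churn` helper are hypothetical names, and the workload (several writers plus one reader that iterates continuously) is only an assumed approximation of a busy cluster being scraped.

```go
package main

import (
	"sync"
	"time"
)

// store is a hypothetical minimal interface shared by both variants.
type store interface {
	Put(key int, val []byte)
	Each(fn func(val []byte))
}

// mutexStore guards a plain map with one RWMutex: a reader iterating the
// whole map holds the read lock and delays every writer until it finishes.
type mutexStore struct {
	mu sync.RWMutex
	m  map[int][]byte
}

func newMutexStore() *mutexStore { return &mutexStore{m: make(map[int][]byte)} }

func (s *mutexStore) Put(key int, val []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = val
}

func (s *mutexStore) Each(fn func(val []byte)) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, v := range s.m {
		fn(v)
	}
}

// syncMapStore relies on sync.Map's internal synchronization instead.
type syncMapStore struct{ m sync.Map }

func (s *syncMapStore) Put(key int, val []byte) { s.m.Store(key, val) }

func (s *syncMapStore) Each(fn func(val []byte)) {
	s.m.Range(func(_, v any) bool {
		fn(v.([]byte))
		return true
	})
}

// churn starts one reader that iterates the store in a loop and `writers`
// goroutines that each perform `ops` Put calls on distinct keys. It returns
// the wall-clock time until all writers have finished.
func churn(s store, writers, ops int) time.Duration {
	start := time.Now()
	stop := make(chan struct{})
	readerDone := make(chan struct{})
	go func() {
		defer close(readerDone)
		for {
			select {
			case <-stop:
				return
			default:
				s.Each(func([]byte) {})
			}
		}
	}()
	var wg sync.WaitGroup
	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < ops; i++ {
				s.Put(w*ops+i, []byte("v"))
			}
		}(w)
	}
	wg.Wait()
	elapsed := time.Since(start)
	close(stop)
	<-readerDone // wait for the reader so no goroutine is left behind
	return elapsed
}
```

Calling `churn(newMutexStore(), 8, 100000)` and `churn(&syncMapStore{}, 8, 100000)` gives a rough side-by-side number. Results depend heavily on hardware, map size, and the read/write mix, so a proper measurement would use Go's `testing` benchmarks against a realistic workload.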

As workarounds, we also tried sharding, clustering, and dividing up metrics across different instances, but those were really just palliative measures for the underlying lock contention. They increased the complexity of our setup significantly and felt like ugly hacks.

@mrueg mrueg added this to the v2.14.0 milestone Oct 8, 2024
@mrueg
Member

mrueg commented Oct 15, 2024

@dgrisonnet do you have any results from the perf tests you can share here already?

@dgrisonnet
Member

I need to fix the tests: KSM doesn't seem to be deploying properly in their CI cluster (kubernetes/perf-tests#2920), but I don't have much time on my hands to look at it right now.

@rarruda
Contributor Author

rarruda commented Oct 29, 2024

Anything I can do to help?

@CatherineF-dev
Contributor

CatherineF-dev commented Oct 30, 2024

@dgrisonnet
Member

I don't have enough time on my hands right now to fix the perf-tests, so I'll unblock the PR based on the data you've already shared, which shows the improvements provided by this change.

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 5, 2024
@k8s-ci-robot k8s-ci-robot merged commit dfb688c into kubernetes:main Nov 5, 2024
13 checks passed
@rarruda rarruda deleted the fix/reduce_locking branch November 7, 2024 12:40
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

kube-state-metrics API scraping timeout
5 participants