perf: use concurrent map when storing metrics #2510

rarruda · 2024-09-26T10:55:55Z

What this PR does / why we need it: In busy or large clusters, writing to the metrics map can take a long time while the lock is held. This blocks the metrics-reading thread, and because the lock is not released in a timely manner, the scrape request times out. Storing metrics in a concurrent map prevents these timeouts.

In these graphs you see:

VMagent scrape duration (yellow: with this patch applied, green: no patch)
VMagent samples scraped (yellow: with this patch applied, green: no patch)
Heatmap from selfMetrics, with patch applied
Heatmap from selfMetrics, no patch
Number of pods (log scale; max is close to 16k)
Number of nodes in the cluster

You can see that with this patch we reduce latency significantly under all scenarios in our cluster.
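
For readers unfamiliar with the change, the idea can be sketched roughly as follows. This is a minimal illustration, not the actual kube-state-metrics code: the type names `lockedStore` and `concurrentStore`, the `fill` helper, and the string keys are all assumptions made up for this example. It places a store guarded by one `sync.RWMutex` next to one backed by `sync.Map`.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedStore is a hypothetical store guarded by a single RWMutex.
// While a writer holds the lock, every reader has to wait.
type lockedStore struct {
	mu      sync.RWMutex
	metrics map[string][]byte
}

func newLockedStore() *lockedStore {
	return &lockedStore{metrics: make(map[string][]byte)}
}

func (s *lockedStore) Add(uid string, m []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.metrics[uid] = m
}

func (s *lockedStore) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.metrics)
}

// concurrentStore is the same hypothetical store on top of sync.Map,
// which handles synchronization internally instead of one store-wide lock.
type concurrentStore struct {
	metrics sync.Map // uid (string) -> []byte
}

func (s *concurrentStore) Add(uid string, m []byte) {
	s.metrics.Store(uid, m)
}

func (s *concurrentStore) Len() int {
	n := 0
	s.metrics.Range(func(_, _ any) bool {
		n++
		return true // keep iterating
	})
	return n
}

// fill writes n distinct entries through add, one goroutine per entry.
func fill(n int, add func(string, []byte)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			add(fmt.Sprintf("uid-%d", i), []byte("metric"))
		}(i)
	}
	wg.Wait()
}

func main() {
	ls := newLockedStore()
	cs := &concurrentStore{}
	fill(1000, ls.Add)
	fill(1000, cs.Add)
	fmt.Println(ls.Len(), cs.Len()) // prints "1000 1000"
}
```

Both stores end up with the same contents; the difference is only in how long concurrent readers and writers wait on each other under load.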

Inspired by previous PR at #1028

How does this change affect the cardinality of KSM: does not change cardinality

Which issue(s) this PR fixes:
Fixes #995


linux-foundation-easycla · 2024-09-26T10:56:00Z

The committers listed above are authorized under a signed CLA.

✅ login: rarruda / name: Renato Arruda (f2a7639, 3e2d1e9)

k8s-ci-robot · 2024-09-26T10:56:04Z

Welcome @rarruda!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

mrueg · 2024-09-30T17:50:09Z

Thanks for your contribution!
/lgtm
and
/hold
for others to review as well.

k8s-ci-robot · 2024-09-30T17:50:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg, rarruda

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [mrueg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dgrisonnet · 2024-10-03T16:48:15Z

/triage accepted
/assign

dgrisonnet · 2024-10-03T17:50:30Z

The changes look good to me, thank you @rarruda for pushing that improvement :)

I'll update our perf-tests and see how much we improve with this.

dgrisonnet · 2024-10-03T17:50:48Z

Please refrain from merging until we have the results.

rarruda · 2024-10-06T11:57:35Z

Note: for clusters with very little load, and/or sequential read/write access patterns (synthetic data is generated first and only read afterwards), I would expect the original implementation, with one mutex for the whole map, to perform better.

But that is not real-world performance.

In our (real-world, production) case, as the graphs show, lock contention becomes too great because too many things happen in parallel.

Finer-grained locking has an overhead, and its benefit only becomes apparent under heavy multithreaded read/write loads. The marginal cost of the extra locking/atomic operations in sync.Map might be measurable in small clusters. (I have not done much testing in small clusters.)
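
A rough way to check this trade-off on a given machine is a small timing harness like the sketch below. Everything in it is an assumption for illustration: the `kv` interface, the `mutexMap` and `syncMap` wrappers, the 1-in-10 write ratio, and the key range are arbitrary and do not model the kube-state-metrics workload. Note also that the Go documentation describes `sync.Map` as optimized for write-once/read-many keys or for goroutines working on disjoint key sets, so results depend heavily on the access pattern.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// kv is the minimal interface both hypothetical wrappers satisfy.
type kv interface {
	Store(key, val int)
	Load(key int) (int, bool)
}

// mutexMap guards a plain map with one RWMutex.
type mutexMap struct {
	mu sync.RWMutex
	m  map[int]int
}

func (s *mutexMap) Store(k, v int) {
	s.mu.Lock()
	s.m[k] = v
	s.mu.Unlock()
}

func (s *mutexMap) Load(k int) (int, bool) {
	s.mu.RLock()
	v, ok := s.m[k]
	s.mu.RUnlock()
	return v, ok
}

// syncMap wraps sync.Map behind the same interface.
type syncMap struct{ m sync.Map }

func (s *syncMap) Store(k, v int) { s.m.Store(k, v) }

func (s *syncMap) Load(k int) (int, bool) {
	v, ok := s.m.Load(k)
	if !ok {
		return 0, false
	}
	return v.(int), true
}

// run starts `workers` goroutines that each perform `ops` operations,
// writing on every 10th operation and reading otherwise, and returns
// the wall-clock time until all of them finish.
func run(s kv, workers, ops int) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < ops; i++ {
				k := (w*ops + i) % 1024
				if i%10 == 0 {
					s.Store(k, i)
				} else {
					s.Load(k)
				}
			}
		}(w)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	// Timings vary by machine and Go version; no result is implied here.
	for _, workers := range []int{1, 8, 64} {
		a := run(&mutexMap{m: make(map[int]int)}, workers, 100000)
		b := run(&syncMap{}, workers, 100000)
		fmt.Printf("workers=%d mutex=%v sync.Map=%v\n", workers, a, b)
	}
}
```

For more reliable numbers, the same two wrappers could be driven from Go's `testing` benchmarks instead of hand-rolled timing.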

As workarounds, we also tried sharding, clustering, and dividing up metrics across different instances, but those were really just palliative measures for the underlying lock contention. They increased the complexity of our setup significantly and felt like ugly hacks.

mrueg · 2024-10-15T08:20:29Z

@dgrisonnet do you have any results from the perf tests you can share here already?

dgrisonnet · 2024-10-15T10:49:52Z

I need to fix the tests; KSM doesn't seem to be deploying properly in their CI cluster (kubernetes/perf-tests#2920), but I don't have much time on my hands to look at it right now.

rarruda · 2024-10-29T13:05:34Z

Anything I can do to help?

CatherineF-dev · 2024-10-30T12:06:41Z

@rarruda I think the current blocker is this test failure: kubernetes/perf-tests#2920

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/perf-tests/2920/pull-perf-tests-clusterloader2/1842133505027346432

dgrisonnet · 2024-11-05T19:47:06Z

I don't have enough time on my hands right now to fix the perf-tests, so I'll unblock the PR based on the data you've already shared, which shows the improvements provided by this change.

/unhold

k8s-ci-robot added the cncf-cla: no and needs-triage labels Sep 26, 2024

k8s-ci-robot requested review from logicalhan and mrueg September 26, 2024 10:56

k8s-ci-robot added the size/M and cncf-cla: yes labels and removed the cncf-cla: no label Sep 26, 2024

lint

f2a7639

k8s-ci-robot added the do-not-merge/hold label Sep 30, 2024

k8s-ci-robot assigned mrueg Sep 30, 2024

k8s-ci-robot added the lgtm label Sep 30, 2024

k8s-ci-robot added the approved label Sep 30, 2024

k8s-ci-robot assigned dgrisonnet Oct 3, 2024

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Oct 3, 2024

mrueg added this to the v2.14.0 milestone Oct 8, 2024

k8s-ci-robot removed the do-not-merge/hold label Nov 5, 2024

k8s-ci-robot merged commit dfb688c into kubernetes:main Nov 5, 2024
13 checks passed

rarruda deleted the fix/reduce_locking branch November 7, 2024 12:40
