perf: use concurrent map when storing metrics #2510
Conversation
In busy/large clusters, this prevents timeouts caused by long-lived locks and contention: writing to the map takes too long, blocking the metrics-reading thread, and because the lock is not released in a timely manner, the scrape request times out. Inspired by previous PR at #1028
Welcome @rarruda!
Thanks for your contribution!
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mrueg, rarruda The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/triage accepted
The changes look good to me, thank you @rarruda for pushing that improvement :) I'll update our perf-tests and see how much we improve with this.
Please refrain from merging until we have the results.
Note: for clusters with very little load, and/or sequential read/write access patterns (synthetic data generated first and only read afterwards), I would expect the original implementation with one mutex for the map to perform better. But that isn't real-world performance. In our (real-world, production) case, as seen in the graphs, the lock contention becomes too great because too many things happen in parallel. More fine-grained locking carries the marginal cost of extra locking/atomic operations, and its benefit only becomes apparent under heavy multithreaded read/write loads. As workarounds we also tried sharding, clustering, and dividing metrics across different instances, but those were really just palliative measures for the underlying lock contention: they increased the complexity of our setup significantly and felt like ugly hacks.
@dgrisonnet do you have any results from the perf tests you can share here already? |
I need to fix the tests, KSM doesn't seem to be deploying properly in their CI cluster: kubernetes/perf-tests#2920, but I don't have much time on my hands to look at it right now.
Anything I can do to help? |
@rarruda I think the current blocker is this test failure kubernetes/perf-tests#2920
I don't have enough time on my hands right now to fix the perf-tests, so I'll unblock the PR based on the data you've already shared that shows the improvements provided by this change. /unhold
What this PR does / why we need it: In busy/large clusters, this prevents timeouts caused by long-lived locks and contention: writing to the map takes too long, blocking the metrics-reading thread, and because the lock is not released in a timely manner, the request times out.
In these graphs (omitted here) you can see that with this patch we reduce latency significantly under all scenarios in our cluster.
Inspired by previous PR at #1028
How does this change affect the cardinality of KSM: does not change cardinality
Which issue(s) this PR fixes:
Fixes #995