perf regression #3790
I really doubt that my LB change had anything to do with metadata, but I haven't looked at it in a while (I would take a look at the change and see if building the EDF scheduler somehow involves metadata). I think the main difference is that on every host change we now rebuild the EDF scheduler, whereas before we did not have one at all for the LR LB. There are definitely ways to make this code faster, such as doing incremental changes, but the best solution for right now might be to add an LB option that disables weighting completely, so that the EDF scheduler is never built. I think this would basically bring it back to the behavior before my change (though of course you can't use weighting at that point). |
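For readers following along, here is a minimal sketch of the trade-off described above: rebuilding an EDF scheduler on every host change versus skipping it when weighting is disabled. This is not Envoy's code. `Host`, `EdfScheduler`, `LoadBalancer`, and the `weighting_enabled` flag are hypothetical names, and the unweighted path falls back to plain round robin purely to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Hypothetical host record; not Envoy's Host type.
struct Host {
  std::string address;
  uint32_t weight;  // assumed to be >= 1
};

// Minimal earliest-deadline-first picker: every pick pushes the chosen
// entry's deadline forward by 1/weight, so heavier hosts come up more often.
class EdfScheduler {
public:
  void add(double weight, size_t host_index) {
    queue_.push({current_time_ + 1.0 / weight, order_++, weight, host_index});
  }

  size_t pick() {
    assert(!queue_.empty());
    Entry e = queue_.top();
    queue_.pop();
    current_time_ = e.deadline;
    e.deadline = current_time_ + 1.0 / e.weight;
    e.order = order_++;  // ties are broken by insertion order
    queue_.push(e);
    return e.host_index;
  }

private:
  struct Entry {
    double deadline;
    uint64_t order;
    double weight;
    size_t host_index;
    bool operator>(const Entry& other) const {
      if (deadline != other.deadline) return deadline > other.deadline;
      return order > other.order;
    }
  };
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> queue_;
  double current_time_ = 0.0;
  uint64_t order_ = 0;
};

class LoadBalancer {
public:
  explicit LoadBalancer(bool weighting_enabled) : weighting_enabled_(weighting_enabled) {}

  // Called on every host-set change (for example a health check flip).
  void refresh(const std::vector<Host>& hosts) {
    hosts_ = hosts;
    scheduler_ = EdfScheduler();
    if (!weighting_enabled_) return;  // skip the O(n log n) rebuild entirely
    for (size_t i = 0; i < hosts_.size(); ++i) scheduler_.add(hosts_[i].weight, i);
    ++edf_rebuilds_;
  }

  const Host& choose() {
    assert(!hosts_.empty());
    if (weighting_enabled_) return hosts_[scheduler_.pick()];
    return hosts_[rr_index_++ % hosts_.size()];  // unweighted fallback (assumption)
  }

  int edfRebuilds() const { return edf_rebuilds_; }

private:
  bool weighting_enabled_;
  std::vector<Host> hosts_;
  EdfScheduler scheduler_;
  size_t rr_index_ = 0;
  int edf_rebuilds_ = 0;
};
```

With weighting off, `refresh()` only copies the host list, which is roughly the pre-change cost profile being discussed; an incremental EDF update would be the alternative for clusters that do need weights.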
Hmm, i did side-by-side runs: Before 09c5d35:
After:
so yeah, the calls to |
After reading #2874 we decided to relax our (possibly quite aggressive) health checks and things are back to normal. We'll explore whether permanently relaxing the health checks is reasonable; otherwise we might put some work into #2874. @mattklein123: if you are curious, we were running health checks on > 2k endpoints every 10 seconds, which was triggering the subsets recomputation, which ended up using way too many cycles. Thanks for the quick reply! |
@mattklein123 it's probably worth noting that we're not using weighting |
As I said I think there are quite a few things we can do here to improve performance. If someone is interested in working on it we can discuss. The other option is to add an option to disable weighting support entirely which will remove the EDF rebuilds. cc @brian-pane |
@mattklein123 thanks! I am interested in working on this, because it's blocking one of our main use-cases — lmk what's the best channel to discuss the potential approaches. |
@rgs1 I can put some notes in here sometime today. TBH though, if you aren't using weighting, I would just add an option to disable weighting in the LR LB, which will just completely disable the EDF builds and take perf back to what it was before... |
@mattklein123 cool, let me take a look at 09c5d35 and explore bringing |
@mattklein123 hmm, it looks like From a quick look at the code: https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/subset_lb.cc#L202 it looks like this could be triggering the high cpu usage that we are seeing. The more I look at this, the less I think it's related to 09c5d35. |
Hmm, it's possible that unused subsets aren't being removed too... |
Looks like subsets that shouldn't exist anymore aren't being removed:
Why didn't the gone subsets — for metadata that isn't there anymore — get removed? |
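To make the expected behavior concrete, here is a minimal sketch of a subset index that drops a subset once its last host is removed. It is an illustration only, not the logic in `subset_lb.cc`: `Endpoint`, `SubsetIndex`, and keying subsets by a single metadata key/value pair are all simplifying assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical endpoint with flat string metadata; not Envoy's types.
struct Endpoint {
  std::string address;
  std::vector<std::pair<std::string, std::string>> metadata;
};

class SubsetIndex {
public:
  using SubsetKey = std::pair<std::string, std::string>;  // (metadata key, value)

  // Apply a host update: index the added endpoints, then unindex the removed
  // ones and erase any subset that no longer has hosts.
  void update(const std::vector<Endpoint>& added, const std::vector<Endpoint>& removed) {
    for (const auto& ep : added) {
      for (const auto& kv : ep.metadata) subsets_[kv].insert(ep.address);
    }
    for (const auto& ep : removed) {
      for (const auto& kv : ep.metadata) {
        auto it = subsets_.find(kv);
        if (it == subsets_.end()) continue;
        it->second.erase(ep.address);
        if (it->second.empty()) subsets_.erase(it);  // prune the now-empty subset
      }
    }
  }

  bool hasSubset(const SubsetKey& key) const { return subsets_.count(key) > 0; }
  size_t size() const { return subsets_.size(); }

private:
  std::map<SubsetKey, std::set<std::string>> subsets_;
};
```

If the pruning step is skipped, every metadata value that has ever been seen keeps a subset alive, and each later host update has to walk all of them.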
Ok, added a test with what I am expecting to happen and it fails: I'll open a new issue for this; I think we can close this one for now. Or, we can leave it open to track the perf work, which is still relevant but different from subsets not being removed, which is apparently an actual bug. |
Follow-up in #3803. Thanks! |
@rgs1 I think there are different issues here. For sure there is going to be a perf delta with the LR LB with the referenced change since we now create an EDF schedule on each refresh that you probably never use. So depending on your use case, it might still be worth it to do a PR to add a config option to turn off weighting support. |
@mattklein123 yeah agreed, I just meant that if we are really leaking LB subsets, then that's much worse than the perf tax we get with the EDF scheduler being recreated (and will actually amplify that effect, by a lot). |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions. |
We are heavy users of endpoints with metadata and after we started running a build that includes 09c5d35 we are seeing
EdfLoadBalancerBase::refresh()
consume > 10% of CPU on a totally idle envoy instance. From perf top:
On a production instance, we've seen this going > 80% when endpoints were churning (our biggest cluster has > 2000 endpoints, each with 2 or 3 metadata strings).
If we revert 09c5d35, the calls to
EdfLoadBalancerBase::refresh()
aren't noticeable anymore.
@mattklein123 have you seen anything similar?
cc: @derekargueta