
perf regression #3790

Closed
rgs1 opened this issue Jul 3, 2018 · 17 comments
Labels
question: Questions that are neither investigations, bugs, nor enhancements
stale: stalebot believes this issue/PR has not been touched recently

Comments

@rgs1
Member

rgs1 commented Jul 3, 2018

We are heavy users of endpoints with metadata and after we started running a build that includes 09c5d35 we are seeing EdfLoadBalancerBase::refresh() consume > 10% of CPU on a totally idle envoy instance. From perf top:

  13.59%  envoy                                    [.] Envoy::Upstream::EdfLoadBalancerBase::refresh(unsigned int)::{lambda(Envoy::Upstream::ZoneAwareLoadBalancerBase::HostsSourc
   8.63%  envoy                                    [.] Envoy::Config::Metadata::metadataValue
   4.11%  envoy                                    [.] std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > >::emplace_back
   3.83%  [kernel]                                 [k] do_syscall_64

On a production instance, we've seen this go above 80% when endpoints were churning (our biggest cluster has > 2000 endpoints, each with 2 or 3 metadata strings).

If we revert 09c5d35, the calls to EdfLoadBalancerBase::refresh() aren't noticeable anymore.

@mattklein123 have you seen anything similar?

cc: @derekargueta

mattklein123 added the question label Jul 3, 2018
@mattklein123
Member

I really doubt that my LB change had anything to do with metadata, but I haven't looked at it in a while (I would take a look at the change and see if building the EDF scheduler somehow involves metadata).

I think the main difference is that on every host change we now rebuild the EDF scheduler, whereas before my change the LR LB did not have one at all. There are definitely ways we can make this code faster, such as doing incremental changes, but the best solution for right now might be to add an LB option that disables weighting completely so the EDF scheduler is never built. I think this would basically bring it back to the behavior before my change (though of course you can't use weighting at that point).
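
For readers following along, here is a minimal, self-contained sketch of the cost being discussed: an earliest-deadline-first queue that is thrown away and rebuilt over every host on each refresh. The types and the rebuildEdf function below are hypothetical illustrations, not Envoy's actual classes.

#include <functional>
#include <memory>
#include <queue>
#include <string>
#include <vector>

struct Host {
  std::string address;
  double weight{1.0};
};
using HostPtr = std::shared_ptr<Host>;

// EDF entry: the host with the smallest deadline is picked next; after a pick
// its deadline would advance by 1/weight.
struct EdfEntry {
  double deadline;
  HostPtr host;
  bool operator>(const EdfEntry& other) const { return deadline > other.deadline; }
};
using EdfQueue =
    std::priority_queue<EdfEntry, std::vector<EdfEntry>, std::greater<EdfEntry>>;

// Rebuilding from scratch touches every host, even if all weights are 1 and
// the scheduler's output is never actually used.
EdfQueue rebuildEdf(const std::vector<HostPtr>& hosts) {
  EdfQueue queue;
  for (const auto& host : hosts) {
    queue.push({1.0 / host->weight, host});
  }
  return queue;
}

int main() {
  std::vector<HostPtr> hosts;
  for (int i = 0; i < 2000; ++i) {
    hosts.push_back(std::make_shared<Host>(Host{"10.0.0." + std::to_string(i % 255), 1.0}));
  }
  // Every host-set update (for example a health-check state flip) triggers a
  // full rebuild, so frequent churn multiplies this O(n log n) cost.
  EdfQueue queue = rebuildEdf(hosts);
  return queue.empty() ? 1 : 0;
}

With a couple of thousand endpoints and health checks flipping host state every few seconds, this rebuild runs often enough to show up in perf top, which is consistent with the profiles above.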

@rgs1
Member Author

rgs1 commented Jul 4, 2018

Hmm, I did side-by-side runs:

Before 09c5d35:

   7.45%  envoy                                    [.] Envoy::Config::Metadata::metadataValue
   3.10%  perf-15907.map                           [.] 0x00007f5c117f0c04
   3.06%  envoy                                    [.] std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > >::emplace_back
   2.28%  envoy                                    [.] google::protobuf::internal::MapFieldBase::SyncMapWithRepeatedField
   2.27%  [kernel]                                 [k] do_syscall_64
   2.17%  envoy                                    [.] operator new[]
   2.00%  [kernel]                                 [k] audit_filter_syscall
   2.00%  [kernel]                                 [k] __lock_text_start
   1.97%  envoy                                    [.] std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>

After:

  8.69%  envoy                                    [.] Envoy::Config::Metadata::metadataValue                                                                                    
   6.49%  envoy                                    [.] Envoy::Upstream::EdfLoadBalancerBase::refresh(unsigned int)::{lambda(Envoy::Upstream::ZoneAwareLoadBalancerBase::HostsSourc
   2.73%  perf-129552.map                          [.] 0x00007fdb9978357c
   2.53%  envoy                                    [.] google::protobuf::internal::MapFieldBase::SyncMapWithRepeatedField                                                        
   2.31%  [kernel]                                 [k] do_syscall_64
   2.30%  envoy                                    [.] std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > >::emplace_back
   2.09%  [kernel]                                 [k] established_get_first.isra.32
   1.84%  envoy                                    [.] operator new[]
   1.73%  envoy                                    [.] Envoy::Upstream::SubsetLoadBalancer::hostMatches                       

so yeah, the calls to metadataValue() didn't change that much, but the calls to EdfLoadBalancerBase::refresh() are new and they add up. There might be something else at play too; I'll keep looking.

@rgs1
Member Author

rgs1 commented Jul 4, 2018

After reading #2874 we decided to relax our possibly quite aggressive health checks, and things are back to normal. We'll explore whether permanently relaxing the health checks is reasonable; otherwise we might put some work into #2874.

@mattklein123: if you are curious, we were running health checks on > 2k endpoints every 10 seconds, which was triggering the subset recomputation, which ended up using way too many cycles. Thanks for the quick reply!

@brian-pane
Contributor

@mattklein123 it's probably worth noting that we're not using weighting

@mattklein123
Member

As I said, I think there are quite a few things we can do here to improve performance. If someone is interested in working on it we can discuss. The other option is to add a setting that disables weighting support entirely, which will remove the EDF rebuilds. cc @brian-pane

@rgs1
Member Author

rgs1 commented Jul 5, 2018

@mattklein123 thanks! I am interested in working on this, because it's blocking one of our main use cases; lmk what's the best channel to discuss the potential approaches.

@mattklein123
Member

@rgs1 I can put some notes in here sometime today. TBH though, if you aren't using weighting, I would just add an option to disable weighting in the LR LB, which would completely disable the EDF builds and take perf back to what it was before...
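
For illustration, a rough sketch of what such an opt-out could look like. The weight_enabled flag, the LeastRequestLb struct, and the rest of the names here are made up for the sake of the example, not an existing Envoy option or class.

#include <cstddef>
#include <string>
#include <vector>

struct Host { std::string address; };

// Hypothetical least-request-style LB with a switch that skips the weighted
// (EDF) machinery entirely when weighting isn't used.
struct LeastRequestLb {
  bool weight_enabled{true};   // made-up config knob
  std::vector<Host> hosts_;
  std::size_t rr_index_{0};

  void refresh() {
    if (!weight_enabled) {
      return;  // no scheduler to rebuild, so host churn is cheap again
    }
    // ... rebuild the weighted scheduler over hosts_ here (omitted) ...
  }

  const Host& pickUnweighted() {
    // unweighted picks just rotate through the host vector
    const Host& host = hosts_[rr_index_ % hosts_.size()];
    ++rr_index_;
    return host;
  }
};

int main() {
  LeastRequestLb lb;
  lb.weight_enabled = false;
  lb.hosts_ = {{"10.0.0.1"}, {"10.0.0.2"}};
  lb.refresh();  // no-op with weighting disabled
  return lb.pickUnweighted().address.empty() ? 1 : 0;
}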

@rgs1
Member Author

rgs1 commented Jul 5, 2018

@mattklein123 cool, let me take a look at 09c5d35 and explore bringing upstream.weight_enabled back. Thanks!

@rgs1
Member Author

rgs1 commented Jul 6, 2018

@mattklein123 hmm, it looks like envoy.cluster.$cluster.lb_subsets_created is growing unbounded, and after a day or two of uptime it goes into the thousands... which is weird, since we should only have 6 subsets at most. The growth in lb_subsets_created is aligned with deploys, so maybe there's something with our EDS service (though I don't think so).

From a quick look at the code:

https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/subset_lb.cc#L202

it looks like this could be triggering the high CPU usage that we are seeing. The more I look at this, the less I think it's related to 09c5d35.
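
To make the suspicion concrete, here is a minimal sketch of how an on-demand subset map with no removal path produces exactly this counter pattern. This is not the actual subset_lb.cc code; the names, counters, and the deploy/metadata scheme are invented for illustration.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Metadata value -> subset map. If subsets are created on demand but never
// erased when their metadata value disappears from the host set, the
// created/active counters only grow while removed stays at zero.
struct SubsetMap {
  std::map<std::string, int> subsets_;  // value -> host count (simplified)
  std::uint64_t created_{0};
  std::uint64_t removed_{0};

  void update(const std::string& metadata_value, int host_count) {
    auto it = subsets_.find(metadata_value);
    if (it == subsets_.end()) {
      ++created_;  // mirrors lb_subsets_created
      it = subsets_.emplace(metadata_value, 0).first;
    }
    it->second = host_count;
    // Missing: erase (and ++removed_) when host_count drops to zero.
  }
};

int main() {
  SubsetMap subsets;
  // Each deploy rotates a metadata value (e.g. some build id), so every deploy
  // mints a new subset that nothing ever cleans up.
  for (int deploy = 0; deploy < 396; ++deploy) {
    subsets.update("build-" + std::to_string(deploy), 2000);
  }
  std::cout << "created=" << subsets.created_
            << " active=" << subsets.subsets_.size()
            << " removed=" << subsets.removed_ << "\n";
  return 0;
}

This prints created=396 active=396 removed=0, the same shape as the stats reported a few comments below.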

@rgs1
Member Author

rgs1 commented Jul 6, 2018

Hmm, it's possible that unused subsets aren't being removed either...

@rgs1
Member Author

rgs1 commented Jul 6, 2018

Looks like subsets that shouldn't exist anymore aren't being removed:

$ curl -s localhost:9901/stats | grep subset | grep myapp | sort -nrk 2
cluster.myapp.lb_subsets_selected: 37219203
cluster.myapp.lb_subsets_fallback: 934700
cluster.myapp.lb_subsets_created: 396
cluster.myapp.lb_subsets_active: 396
cluster.myapp.lb_subsets_removed: 0

Why didn't the stale subsets (for metadata that isn't there anymore) get removed?

@rgs1
Member Author

rgs1 commented Jul 6, 2018

Ok, I added a test for what I expect to happen, and it fails:

#3802

I'll open a new issue for this; I think we can close this one for now. Or we can leave it open to track the perf work, which is still relevant but separate from the subsets not being removed, which apparently is an actual bug.

@rgs1
Member Author

rgs1 commented Jul 6, 2018

Follow-up in #3803. Thanks!

@mattklein123
Member

@rgs1 I think there are different issues here. For sure there is going to be a perf delta in the LR LB with the referenced change, since we now create an EDF scheduler on each refresh that you probably never use. So depending on your use case, it might still be worth doing a PR to add a config option to turn off weighting support.

@rgs1
Member Author

rgs1 commented Jul 6, 2018

@mattklein123 yeah, agreed. I just meant that if we are really leaking LB subsets, then that's much worse than the perf tax we get from the EDF scheduler being recreated (and it will actually amplify that effect, by a lot).

@stale

stale bot commented Aug 5, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale bot added the stale label Aug 5, 2018
@stale

stale bot commented Aug 12, 2018

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
