
perf regression #3790

Closed
rgs1 opened this issue Jul 3, 2018 · 17 comments
Labels
question: Questions that are neither investigations, bugs, nor enhancements
stale: stalebot believes this issue/PR has not been touched recently

Comments

@rgs1
Member

rgs1 commented Jul 3, 2018

We are heavy users of endpoints with metadata and after we started running a build that includes 09c5d35 we are seeing EdfLoadBalancerBase::refresh() consume > 10% of CPU on a totally idle envoy instance. From perf top:

  13.59%  envoy                                    [.] Envoy::Upstream::EdfLoadBalancerBase::refresh(unsigned int)::{lambda(Envoy::Upstream::ZoneAwareLoadBalancerBase::HostsSourc
   8.63%  envoy                                    [.] Envoy::Config::Metadata::metadataValue
   4.11%  envoy                                    [.] std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > >::emplace_back
   3.83%  [kernel]                                 [k] do_syscall_64

On a production instance, we've seen this go above 80% when endpoints were churning (our biggest cluster has > 2000 endpoints, each with 2 or 3 metadata strings).

If we revert 09c5d35, the calls to EdfLoadBalancerBase::refresh() aren't noticeable anymore.

@mattklein123 have you seen anything similar?

cc: @derekargueta

mattklein123 added the question label Jul 3, 2018
@mattklein123
Member

I really doubt that my LB change had anything to do with metadata, but I haven't looked at it in a while (I would take a look at the change and see if building the EDF scheduler somehow involves metadata).

I think the main difference is that on every host change we now rebuild the EDF scheduler, whereas before my change the LR LB did not have one at all. There are definitely ways we can make this code faster, such as doing incremental changes, but the best solution for right now might be to add an LB option that disables weighting completely so the EDF scheduler is never built. I think this would basically bring it back to the behavior before my change (though of course you can't use weighting at that point).
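
For readers following along, here is a minimal, self-contained sketch of the cost being discussed: an earliest-deadline-first queue that is thrown away and rebuilt over every host on each refresh. The types and the rebuildEdf function below are hypothetical illustrations, not Envoy's actual classes.

#include <functional>
#include <memory>
#include <queue>
#include <string>
#include <vector>

struct Host {
  std::string address;
  double weight{1.0};
};
using HostPtr = std::shared_ptr<Host>;

// EDF entry: the host with the smallest deadline is picked next; after a pick
// its deadline would advance by 1/weight.
struct EdfEntry {
  double deadline;
  HostPtr host;
  bool operator>(const EdfEntry& other) const { return deadline > other.deadline; }
};
using EdfQueue =
    std::priority_queue<EdfEntry, std::vector<EdfEntry>, std::greater<EdfEntry>>;

// Rebuilding from scratch touches every host, even if all weights are 1 and
// the scheduler's output is never actually used.
EdfQueue rebuildEdf(const std::vector<HostPtr>& hosts) {
  EdfQueue queue;
  for (const auto& host : hosts) {
    queue.push({1.0 / host->weight, host});
  }
  return queue;
}

int main() {
  std::vector<HostPtr> hosts;
  for (int i = 0; i < 2000; ++i) {
    hosts.push_back(std::make_shared<Host>(Host{"10.0.0." + std::to_string(i % 255), 1.0}));
  }
  // Every host-set update (for example a health-check state flip) triggers a
  // full rebuild, so frequent churn multiplies this O(n log n) cost.
  EdfQueue queue = rebuildEdf(hosts);
  return queue.empty() ? 1 : 0;
}

With a couple of thousand endpoints and health checks flipping host state every few seconds, this rebuild runs often enough to show up in perf top, which is consistent with the profiles above.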

@rgs1
Member Author

rgs1 commented Jul 4, 2018

Hmm, I did side-by-side runs:

Before 09c5d35:

   7.45%  envoy                                    [.] Envoy::Config::Metadata::metadataValue
   3.10%  perf-15907.map                           [.] 0x00007f5c117f0c04
   3.06%  envoy                                    [.] std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > >::emplace_back
   2.28%  envoy                                    [.] google::protobuf::internal::MapFieldBase::SyncMapWithRepeatedField
   2.27%  [kernel]                                 [k] do_syscall_64
   2.17%  envoy                                    [.] operator new[]
   2.00%  [kernel]                                 [k] audit_filter_syscall
   2.00%  [kernel]                                 [k] __lock_text_start
   1.97%  envoy                                    [.] std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>

After:

  8.69%  envoy                                    [.] Envoy::Config::Metadata::metadataValue                                                                                    
   6.49%  envoy                                    [.] Envoy::Upstream::EdfLoadBalancerBase::refresh(unsigned int)::{lambda(Envoy::Upstream::ZoneAwareLoadBalancerBase::HostsSourc
   2.73%  perf-129552.map                          [.] 0x00007fdb9978357c
   2.53%  envoy                                    [.] google::protobuf::internal::MapFieldBase::SyncMapWithRepeatedField                                                        
   2.31%  [kernel]                                 [k] do_syscall_64
   2.30%  envoy                                    [.] std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > >::emplace_back
   2.09%  [kernel]                                 [k] established_get_first.isra.32
   1.84%  envoy                                    [.] operator new[]
   1.73%  envoy                                    [.] Envoy::Upstream::SubsetLoadBalancer::hostMatches                       

so yeah, the calls to metadataValue() didn't change that much, but the calls to EdfLoadBalancerBase::refresh() are new and they add up. There might be something else at play too; I'll keep looking.

@rgs1
Member Author

rgs1 commented Jul 4, 2018

After reading #2874 we decided to relax our possibly quite aggressive health checks, and things are back to normal. We'll explore whether permanently relaxing the health checks is reasonable; otherwise we might put some work into #2874.

@mattklein123: if you are curious, we were running health checks on > 2k endpoints every 10 seconds, which was triggering the subset recomputation, which ended up using way too many cycles. Thanks for the quick reply!

@brian-pane
Contributor

@mattklein123 it's probably worth noting that we're not using weighting

@mattklein123
Member

As I said, I think there are quite a few things we can do here to improve performance. If someone is interested in working on it we can discuss. The other option is to add a setting that disables weighting support entirely, which will remove the EDF rebuilds. cc @brian-pane

@rgs1
Member Author

rgs1 commented Jul 5, 2018

@mattklein123 thanks! I am interested in working on this, because it's blocking one of our main use cases; lmk what's the best channel to discuss the potential approaches.

@mattklein123
Member

@rgs1 I can put some notes in here sometime today. TBH though, if you aren't using weighting, I would just add an option to disable weighting in the LR LB, which would completely disable the EDF builds and take perf back to what it was before...
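
For illustration, a rough sketch of what such an opt-out could look like. The weight_enabled flag, the LeastRequestLb struct, and the rest of the names here are made up for the sake of the example, not an existing Envoy option or class.

#include <cstddef>
#include <string>
#include <vector>

struct Host { std::string address; };

// Hypothetical least-request-style LB with a switch that skips the weighted
// (EDF) machinery entirely when weighting isn't used.
struct LeastRequestLb {
  bool weight_enabled{true};   // made-up config knob
  std::vector<Host> hosts_;
  std::size_t rr_index_{0};

  void refresh() {
    if (!weight_enabled) {
      return;  // no scheduler to rebuild, so host churn is cheap again
    }
    // ... rebuild the weighted scheduler over hosts_ here (omitted) ...
  }

  const Host& pickUnweighted() {
    // unweighted picks just rotate through the host vector
    const Host& host = hosts_[rr_index_ % hosts_.size()];
    ++rr_index_;
    return host;
  }
};

int main() {
  LeastRequestLb lb;
  lb.weight_enabled = false;
  lb.hosts_ = {{"10.0.0.1"}, {"10.0.0.2"}};
  lb.refresh();  // no-op with weighting disabled
  return lb.pickUnweighted().address.empty() ? 1 : 0;
}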

@rgs1
Member Author

rgs1 commented Jul 5, 2018

@mattklein123 cool, let me take a look at 09c5d35 and explore bringing upstream.weight_enabled back. Thanks!

@rgs1
Member Author

rgs1 commented Jul 6, 2018

@mattklein123 hmm, it looks like envoy.cluster.$cluster.lb_subsets_created is growing unbounded, and after a day or two of uptime it goes into the thousands... which is weird, since we should only have 6 subsets at most. The growth in lb_subsets_created is aligned with deploys, so maybe there's something with our EDS service (though I don't think so).

From a quick look at the code:

https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/subset_lb.cc#L202

it looks like this could be triggering the high CPU usage that we are seeing. The more I look at this, the less I think it's related to 09c5d35.
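
To make the suspicion concrete, here is a minimal sketch of how an on-demand subset map with no removal path produces exactly this counter pattern. This is not the actual subset_lb.cc code; the names, counters, and the deploy/metadata scheme are invented for illustration.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Metadata value -> subset map. If subsets are created on demand but never
// erased when their metadata value disappears from the host set, the
// created/active counters only grow while removed stays at zero.
struct SubsetMap {
  std::map<std::string, int> subsets_;  // value -> host count (simplified)
  std::uint64_t created_{0};
  std::uint64_t removed_{0};

  void update(const std::string& metadata_value, int host_count) {
    auto it = subsets_.find(metadata_value);
    if (it == subsets_.end()) {
      ++created_;  // mirrors lb_subsets_created
      it = subsets_.emplace(metadata_value, 0).first;
    }
    it->second = host_count;
    // Missing: erase (and ++removed_) when host_count drops to zero.
  }
};

int main() {
  SubsetMap subsets;
  // Each deploy rotates a metadata value (e.g. some build id), so every deploy
  // mints a new subset that nothing ever cleans up.
  for (int deploy = 0; deploy < 396; ++deploy) {
    subsets.update("build-" + std::to_string(deploy), 2000);
  }
  std::cout << "created=" << subsets.created_
            << " active=" << subsets.subsets_.size()
            << " removed=" << subsets.removed_ << "\n";
  return 0;
}

This prints created=396 active=396 removed=0, the same shape as the stats reported a few comments below.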

@rgs1
Member Author

rgs1 commented Jul 6, 2018

Hmm, it's possible that unused subsets aren't being removed either...

@rgs1
Member Author

rgs1 commented Jul 6, 2018

Looks like subsets that shouldn't exist anymore aren't being removed:

$ curl -s localhost:9901/stats | grep subset | grep myapp | sort -nrk 2
cluster.myapp.lb_subsets_selected: 37219203
cluster.myapp.lb_subsets_fallback: 934700
cluster.myapp.lb_subsets_created: 396
cluster.myapp.lb_subsets_active: 396
cluster.myapp.lb_subsets_removed: 0

Why didn't the stale subsets (for metadata that isn't there anymore) get removed?

@rgs1
Member Author

rgs1 commented Jul 6, 2018

Ok, I added a test for what I expect to happen, and it fails:

#3802

I'll open a new issue for this; I think we can close this one for now. Or we can leave it open to track the perf work, which is still relevant but separate from the subsets not being removed, which apparently is an actual bug.

@rgs1
Member Author

rgs1 commented Jul 6, 2018

Follow-up in #3803. Thanks!

@mattklein123
Member

@rgs1 I think there are different issues here. For sure there is going to be a perf delta in the LR LB with the referenced change, since we now create an EDF scheduler on each refresh that you probably never use. So depending on your use case, it might still be worth doing a PR to add a config option to turn off weighting support.

@rgs1
Member Author

rgs1 commented Jul 6, 2018

@mattklein123 yeah, agreed. I just meant that if we are really leaking LB subsets, then that's much worse than the perf tax we get from the EDF scheduler being recreated (and it will actually amplify that effect, by a lot).

@stale

stale bot commented Aug 5, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale bot added the stale label Aug 5, 2018
@stale

stale bot commented Aug 12, 2018

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
