upstream: redesign per priority load calculation when all priority levels are in panic mode #4685

cpakulski · 2018-10-10T22:57:46Z

Description:
Currently, the per-priority traffic load distribution is calculated from the number of healthy hosts in each priority level. When there are zero healthy hosts in all priority levels, 100% of traffic goes to priority 0.
This behavior should change when all priority levels are in panic mode, that is, when each priority has a very low number of healthy hosts, possibly none. In this scenario the load distribution algorithm should use the total number of hosts in each priority, not the number of healthy hosts. For example, if there are 3 priorities with 5 hosts each and the number of healthy hosts is 0 (P0), 1 (P1) and 1 (P2), the load will be 34% (33% plus the rounding remainder needed to reach 100%), 33% and 33% respectively.
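The arithmetic in the example above can be sketched as follows. This is only an illustration of the proposed rule, not Envoy code: the function name `total_panic_load` is hypothetical, and giving the rounding remainder to the first non-empty priority is an assumption inferred from the 34% / 33% / 33% example.

```python
def total_panic_load(total_hosts):
    """Split 100% across priorities in proportion to total host counts.

    total_hosts: number of hosts (healthy or not) per priority, P0 first.
    """
    all_hosts = sum(total_hosts)
    if all_hosts == 0:
        raise ValueError("cluster has no hosts")
    # Integer percentages, rounded down.
    load = [100 * n // all_hosts for n in total_hosts]
    # Rounding down can leave a few percent unassigned. Assumption: the
    # remainder goes to the first non-empty priority, as in the example.
    remainder = 100 - sum(load)
    first = next(i for i, n in enumerate(total_hosts) if n > 0)
    load[first] += remainder
    return load


print(total_panic_load([5, 5, 5]))  # [34, 33, 33]
```

Note that the health of the hosts does not enter the calculation at all, which is the point of the proposal.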

htuch · 2018-10-11T01:27:41Z

CC @mattklein123 @alyssawilk

stale · 2018-11-10T02:11:57Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

cpakulski · 2018-11-10T20:10:32Z

Work in progress....

cpakulski · 2018-12-20T19:21:10Z

DESIGN PROPOSAL:
As the number of healthy hosts in a cluster decreases, the behavior of the cluster changes. The change usually happens when the number of healthy hosts drops below a threshold. The following list describes the behavior:

Cluster state: Priority=0 has more than 72% of its hosts (100% divided by the overprovisioning factor) in a healthy state.
Behavior: 100% of the total traffic is handled by Priority=0. The other priority levels are not used.
Cluster state: Priority=0 has less than 72% of its hosts in a healthy state, but there are enough healthy hosts in other priority levels to handle the load (normalized total health across the cluster is 100%).
Behavior: Load is distributed across priority levels. None of the levels enters panic mode.
Cluster state: There are not enough healthy hosts across all priority levels to handle the load (normalized total health across the cluster < 100%), but each level has enough healthy hosts to avoid entering panic mode.
Behavior: Load is distributed across priority levels according to the number of healthy hosts in each priority level relative to the total number of healthy hosts in the entire cluster.
Cluster state: There are not enough healthy hosts across all priority levels to handle the load, and some levels have entered panic mode because the number of healthy hosts in those levels dropped below the panic level (50%).
Behavior: Load is distributed across priority levels according to the number of healthy hosts in each priority level relative to the total number of healthy hosts in the entire cluster. Priority levels in panic mode distribute the traffic to all hosts in that priority level. Note: if a priority level has zero healthy hosts, that priority level receives no traffic.
Cluster state: There are not enough healthy hosts across all priority levels to handle the load, and ALL levels have entered panic mode because the number of healthy hosts in each level dropped below the panic level (50%).
Behavior: Load is distributed across priority levels according to the number of healthy hosts in each priority level relative to the total number of healthy hosts in the entire cluster. Each priority level distributes the traffic it receives to all hosts in that priority (panic mode). Note: if a priority level has ZERO healthy hosts, that priority level receives no traffic.
Cluster state: There are no healthy hosts in the cluster.
Behavior: 100% of the traffic is sent to Priority=0, which is in panic mode and distributes the traffic to all hosts in Priority=0.
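The healthy-host-based calculation described in the list above can be approximated with a short sketch. It is a simplified model written for this discussion, not Envoy's implementation: the function name `healthy_based_load` is hypothetical, the overprovisioning factor is assumed to be 1.4 (expressed as 140 percent to keep the arithmetic in integers), and assigning the rounding remainder to the first priority with non-zero load is an assumption.

```python
OVERPROVISIONING_PERCENT = 140  # assumption: overprovisioning factor of 1.4


def healthy_based_load(hosts):
    """Per-priority load in percent, based on healthy hosts only.

    hosts: list of (healthy, total) tuples, one per priority, P0 first.
    """
    # Health of each level, scaled by the overprovisioning factor, capped at 100.
    health = [
        min(100, OVERPROVISIONING_PERCENT * healthy // total) if total else 0
        for healthy, total in hosts
    ]
    normalized_total = min(100, sum(health))
    if normalized_total == 0:
        # No healthy hosts anywhere: everything goes to P0.
        return [100] + [0] * (len(hosts) - 1)
    load, remaining = [], 100
    for h in health:
        # Higher priorities take what their health allows; the rest spills over.
        share = min(remaining, h * 100 // normalized_total)
        load.append(share)
        remaining -= share
    if remaining:
        # Assumption: rounding leftovers go to the first level that has load.
        first = next(i for i, share in enumerate(load) if share > 0)
        load[first] += remaining
    return load


print(healthy_based_load([(5, 5), (5, 5)]))          # [100, 0]
print(healthy_based_load([(3, 5), (5, 5)]))          # [84, 16]
print(healthy_based_load([(0, 5), (1, 5), (1, 5)]))  # [0, 50, 50]
print(healthy_based_load([(0, 5), (0, 5), (0, 5)]))  # [100, 0, 0]
```

The last two calls show the jump discussed in this issue: with one healthy host each in P1 and P2, P0 receives nothing, but once those two hosts fail, P0 suddenly receives everything.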

I would like to propose changing the behavior for the last two situations (when all priority levels are in panic mode) to the following:

Cluster state: There are not enough healthy hosts across all priority levels to handle the load, and ALL levels have entered panic mode because the number of healthy hosts in each level dropped below the panic level (50%).
Behavior: Load is distributed across priority levels according to the number of all hosts (healthy and not healthy) in each priority level relative to the total number of hosts (healthy and not healthy) in the entire cluster. Each priority level distributes the traffic it receives to all hosts in that priority level (panic mode).

The problem with the current solution is that if the number of healthy hosts in a priority drops to zero, that priority is excluded from load distribution. When the number of healthy hosts in a cluster is very low, most of the traffic will not be handled anyway, but there is a risk that the remaining healthy hosts become overloaded. In essence, I am suggesting that once all levels enter panic mode, the load calculation algorithm distributes load based on the total number of hosts in each priority level rather than on the number of healthy hosts.
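A minimal sketch of the proposed switch, under stated assumptions: the names `all_levels_in_panic` and `proposed_total_panic_load` are hypothetical, the panic threshold is taken as 50% and the overprovisioning factor as 1.4 (140 percent), and a level is treated as being in panic only when the normalized total health is below 100% and fewer than 50% of its hosts are healthy. Returning `None` outside total panic stands in for "keep using the existing healthy-host-based calculation".

```python
PANIC_THRESHOLD_PERCENT = 50    # assumption: panic level of 50%
OVERPROVISIONING_PERCENT = 140  # assumption: overprovisioning factor of 1.4


def all_levels_in_panic(hosts):
    """hosts: list of (healthy, total) tuples, one per priority, P0 first."""
    health = [
        min(100, OVERPROVISIONING_PERCENT * healthy // total) if total else 0
        for healthy, total in hosts
    ]
    if sum(health) >= 100:
        # The cluster as a whole can carry the load: no level panics.
        return False
    # Empty levels are skipped; every other level must be below the threshold.
    return all(
        total == 0 or 100 * healthy < PANIC_THRESHOLD_PERCENT * total
        for healthy, total in hosts
    )


def proposed_total_panic_load(hosts):
    """Load in percent by total host count, or None if not in total panic."""
    totals = [total for _, total in hosts]
    if sum(totals) == 0 or not all_levels_in_panic(hosts):
        return None
    load = [100 * total // sum(totals) for total in totals]
    # Assumption: the rounding remainder goes to the first non-empty level.
    first = next(i for i, total in enumerate(totals) if total > 0)
    load[first] += 100 - sum(load)
    return load


print(proposed_total_panic_load([(0, 5), (1, 5), (1, 5)]))  # [34, 33, 33]
print(proposed_total_panic_load([(0, 100), (1, 100)]))      # [50, 50]
print(proposed_total_panic_load([(3, 5), (5, 5)]))          # None
```

In the second call, a level with zero healthy hosts still receives half of the traffic instead of handing all of it to a neighbor with a single healthy host.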

Adding: @mattklein123 @alyssawilk @snowp @fredlas

fredlas · 2018-12-20T20:05:15Z

Not that I know anything about this beyond having studied Envoy's documentation a bit, but you wrote @fredlas so I will chime in!

That makes perfect sense to me. In the existing behavior, the logic of the last case is sort of an arbitrary exception to what the logic of the second to last case would suggest. And

> most of the traffic will not be handled anyways, but there is a risk that remaining healthy hosts become overloaded

makes perfect sense to me as motivation for this change.

cpakulski · 2018-12-20T20:15:04Z

@fredlas Thanks! You did a great job clarifying the load balancing logic in #4817, and this is definitely on the same topic.

alyssawilk · 2019-01-02T21:57:01Z

I don't object to this change, but I counter-propose that if most hosts are unhealthy and all priorities are in panic mode, perhaps Envoy should simply fail to choose hosts for some percentage of load and return 50xs instead? I agree overwhelming P=0 isn't optimal, but at that point everything has melted down and overwhelming the lingering subset of P=1 to P=N may not be helping matters.

mattklein123 · 2019-01-03T20:57:41Z

I agree with the proposal.

> I don't object to this change, but I counter-propose that if most hosts are unhealthy and all priorities are in panic mode, perhaps Envoy should simply fail to choose hosts for some percentage of load and return 50xs instead? I agree overwhelming P=0 isn't optimal, but at that point everything has melted down and overwhelming the lingering subset of P=1 to P=N may not be helping matters.

I like this idea, but I wonder if we should treat it as a totally different feature/issue/option, which would basically be an "auto maintenance mode"? It could then be configured in place of or alongside panic mode.

cpakulski · 2019-01-03T21:38:47Z

My major concern was a situation where a priority level gets excluded from the load calculation because it has zero healthy hosts, while the priority next to it, with, let us say, only 1% of its hosts in a healthy state, receives all the load.

I believe that both mechanisms can co-exist. The rejection logic described by @alyssawilk would start working when all levels enter panic mode, and the percentage of rejected requests would be in some way inversely related to the number of remaining healthy hosts.

I am a bit lost when it comes to returning an error for rejected traffic. The load balancer is a layer 3 concept and it provides connectivity to various services: HTTP-based ones and non-HTTP ones like Redis. Wouldn't always returning a 5xx error violate this?

mattklein123 · 2019-01-04T16:53:06Z

> I am a bit lost when it comes to returning an error for rejected traffic. The load balancer is a layer 3 concept and it provides connectivity to various services: HTTP-based ones and non-HTTP ones like Redis. Wouldn't always returning a 5xx error violate this?

Depending on the LB error/return code, we could do different things, like return a 5xx in the router. Either way, I would recommend opening a separate issue to track this.

htuch · 2019-01-06T23:44:40Z

I'd be in favor of a solution that doesn't produce wild swings in load distribution across priority levels when a single host goes healthy/unhealthy. It sounds like this can happen in the current situation and that @cpakulski's proposed solution solves it. OTOH, anything that simplifies the existing (very) complicated set of behaviors, e.g. 5xx load shedding, might be useful. I like to think of this from a control system perspective; you don't want to have non-linear discontinuities in behavior.

cpakulski · 2019-01-07T18:49:11Z

Thanks for the replies. If there are no objections, I will implement the proposed solution for the case when all levels are in panic mode, and I will create a new issue to design a simpler mechanism to shed load in a linear fashion.

snowp · 2019-12-12T22:09:21Z

@cpakulski Just checking in to see if you're still planning on working on this? Or should I unassign you?

cpakulski · 2019-12-12T22:23:33Z

@snowp You must have a sixth sense, because I started working on it today! I am at the unit testing stage now and should create a PR within a day or two.

Changed how load is calculated when all priority levels are in panic mode. Each priority level receives percentage of the traffic related to the number of hosts in that priority regardless of the health status of the hosts. This smooths out how traffic is shifted when hosts become unhealthy. See #4685 for design proposal and discussion. Signed-off-by: Christoph Pakulski <[email protected]>

cpakulski mentioned this issue Oct 10, 2018

upstream: changed how load calculation and panic level interact #4442

Merged

htuch assigned cpakulski Oct 11, 2018

htuch added the design proposal label Oct 11, 2018

cpakulski mentioned this issue Oct 24, 2018

docs: clarifications to Priority Levels section of load_balancing.rst #4817

Merged

stale bot added the stale label Nov 10, 2018

mattklein123 added the help wanted label Nov 10, 2018

stale bot removed the stale label Nov 10, 2018

mattklein123 added this to the 1.10.0 milestone Jan 11, 2019

htuch mentioned this issue Jan 14, 2019

Load balancer extensibility #5598

Closed

cpakulski mentioned this issue Jan 17, 2019

upstream: account for degraded hosts in panic mode calculations #5630

Merged

mattklein123 modified the milestones: 1.10.0, 1.11.0 Mar 11, 2019

mattklein123 modified the milestones: 1.11.0, 1.12.0 Jul 3, 2019

mattklein123 modified the milestones: 1.12.0, 1.13.0 Oct 10, 2019

mattklein123 modified the milestones: 1.13.0, 1.14.0 Dec 5, 2019

snowp added the area/load balancing label Dec 12, 2019

cpakulski mentioned this issue Dec 13, 2019

upstream: load distribution in Total Panic mode #9343

Merged

snowp closed this as completed in #9343 Jan 24, 2020
