upstream: redesign per priority load calculation when all priority levels are in panic mode #4685

Closed
cpakulski opened this issue Oct 10, 2018 · 14 comments · Fixed by #9343
Labels
area/load balancing design proposal Needs design doc/proposal before implementation help wanted Needs help!

@cpakulski
Contributor

Description:
Currently, the per-priority traffic load distribution is calculated from the number of healthy hosts in each priority level. When there are zero healthy hosts in all priority levels, 100% of traffic goes to priority 0.
This behavior should change for the situation where all priority levels are in panic mode, meaning that each priority has a very low number of healthy hosts, possibly none. In that scenario the load distribution algorithm should use the total number of hosts in each priority, not the number of healthy hosts. For example, if there are 3 priorities with 5 hosts each and the number of healthy hosts is 0 (P0), 1 (P1) and 1 (P2), the load will be 34% (33% plus the rounding remainder needed to reach 100%), 33% and 33% respectively.
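The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of the proposed rule, not Envoy's code: the function name `load_by_total_hosts` is hypothetical, and giving the rounding remainder to P0 is an assumption inferred from the "33% plus rounding" remark.

```python
def load_by_total_hosts(total_hosts):
    """Return integer load percentages, one per priority level (P0 first).

    total_hosts: number of hosts (healthy or not) in each priority level.
    """
    cluster_total = sum(total_hosts)
    if cluster_total == 0:
        raise ValueError("cluster has no hosts")
    # Integer share of 100% for each level, rounded down.
    loads = [100 * n // cluster_total for n in total_hosts]
    # Rounding down can leave a few percent unassigned; this sketch
    # assumes the remainder is handed to P0.
    loads[0] += 100 - sum(loads)
    return loads


# Three priorities with 5 hosts each; health is ignored on purpose.
print(load_by_total_hosts([5, 5, 5]))  # [34, 33, 33]
```

Note that the healthy-host counts (0, 1 and 1) do not appear in the calculation at all, which is the point of the proposal.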

@htuch htuch added the design proposal Needs design doc/proposal before implementation label Oct 11, 2018
@htuch
Member

htuch commented Oct 11, 2018

CC @mattklein123 @alyssawilk

@stale

stale bot commented Nov 10, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 10, 2018
@mattklein123 mattklein123 added the help wanted Needs help! label Nov 10, 2018
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Nov 10, 2018
@cpakulski
Contributor Author

Work in progress....

@cpakulski
Contributor Author

DESIGN PROPOSAL:
As the number of healthy hosts in a cluster decreases, the behavior of the cluster changes. The change usually happens when the number of healthy hosts drops below a threshold. The following list describes the behavior:

  • Cluster State: Priority=0 has more than about 72% of its hosts in a healthy state (100% divided by the overprovisioning factor; with Envoy's default factor of 1.4 this is roughly 71.4%):
    Behavior: 100% of the total traffic is handled by Priority=0. The rest of priority levels are not used.

  • Cluster State: Priority=0 has fewer than about 72% of its hosts in a healthy state, but there are enough hosts in other priority levels to handle the load (normalized total health across the cluster is 100%).
    Behavior: Load is distributed across priority levels. None of the levels enters panic mode.

  • Cluster State: there are not enough healthy hosts across all priority levels to handle the load (normalized total health across the cluster < 100%), but each level has enough healthy hosts to avoid entering panic mode.
    Behavior: Load is distributed across priority levels according to the number of healthy hosts in each priority in relation to the total number of healthy hosts in the entire cluster.

  • Cluster State: there are not enough healthy hosts across all priority levels to handle the load, and some levels have entered panic mode because their share of healthy hosts dropped below the panic level (50%).
    Behavior: Load is distributed across priority levels according to the number of healthy hosts in each priority level in relation to the total number of healthy hosts in the entire cluster. Priority levels in panic mode distribute the traffic to all hosts in that priority level. Note: if a priority level has zero healthy hosts, that priority level receives no traffic.

  • Cluster State: there are not enough healthy hosts across all priority levels to handle the load, and ALL levels have entered panic mode because their share of healthy hosts dropped below the panic level (50%).
    Behavior: Load is distributed across priority levels according to the number of healthy hosts in each priority level in relation to the total number of healthy hosts in the entire cluster. Each priority level distributes received traffic to all hosts in that priority (panic mode). Note: if a priority level has ZERO healthy hosts, that priority level receives no traffic.

  • Cluster State: There are no healthy hosts in the cluster.
    Behavior: 100% of the traffic is sent to Priority=0, which is in panic mode and distributes the traffic to all hosts in Priority=0.
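The states listed above can be approximated with a short Python sketch. It is an illustration under stated assumptions, not Envoy's implementation: the name `current_behavior`, the integer rounding, and the choice to give rounding leftovers to the first loaded level are all hypothetical, and the overprovisioning factor of 1.4 and the 50% panic level are taken from the figures quoted in the list.

```python
OVERPROVISIONING_PERCENT = 140  # assumption: factor 1.4, so 100 / 1.4 is about 72%
PANIC_THRESHOLD_PERCENT = 50    # panic level quoted in the list above


def current_behavior(levels):
    """levels: list of (healthy, total) host counts, P0 first.

    Returns (loads, panic): integer load percentages per level and a
    per-level flag saying whether that level is in panic mode.
    """
    # Per-level health, boosted by the overprovisioning factor, capped at 100.
    health = [min(100, OVERPROVISIONING_PERCENT * h // t) if t else 0
              for h, t in levels]
    normalized_total = min(100, sum(health))

    if normalized_total == 0:
        # No healthy hosts anywhere: P0 takes all traffic, in panic mode.
        return [100] + [0] * (len(levels) - 1), [True] * len(levels)

    loads, remaining = [], 100
    for h in health:
        # Higher priorities are served first; lower ones get what is left.
        load = min(remaining, 100 * h // normalized_total)
        loads.append(load)
        remaining -= load
    # Assumption: rounding leftovers go to the first level carrying load.
    first = next(i for i, load in enumerate(loads) if load > 0)
    loads[first] += remaining

    # Panic is only considered when the cluster as a whole is short of health.
    short = normalized_total < 100
    panic = [short and 100 * h < PANIC_THRESHOLD_PERCENT * t for h, t in levels]
    return loads, panic


print(current_behavior([(5, 5), (5, 5)]))          # P0 healthy: P0 takes everything
print(current_behavior([(2, 5), (5, 5)]))          # load spills over to P1, no panic
print(current_behavior([(0, 5), (1, 5), (1, 5)]))  # all in panic, P0 gets nothing
```

The last call shows the case the proposal targets: every level is in panic mode, yet P0 is excluded because it has no healthy hosts.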

I propose changing the behavior for the last two situations (when all priority levels are in panic mode; a cluster with no healthy hosts is a special case of this) to the following:

  • Cluster State: there are not enough healthy hosts across all priority levels to handle the load, and ALL levels have entered panic mode because their share of healthy hosts dropped below the panic level (50%).
    Behavior: Load is distributed across priority levels according to the number of all hosts (healthy and not healthy) in a priority level in relation to the total number of hosts (healthy and not healthy) in the entire cluster. Each priority level distributes received traffic to all hosts in that priority level (panic mode).

The problem with the current solution is that if the number of healthy hosts in a priority drops to zero, that priority is excluded from load distribution. When the number of healthy hosts in a cluster is very low, most of the traffic will not be handled anyway, but there is a risk that the remaining healthy hosts become overloaded. In essence, I am suggesting that once all levels enter panic mode, the load calculation algorithm should distribute traffic based on the total number of hosts in each priority level rather than on the number of healthy hosts.

Adding: @mattklein123 @alyssawilk @snowp @fredlas

@fredlas
Contributor

fredlas commented Dec 20, 2018

Not that I know anything about this beyond having studied Envoy's documentation a bit, but you wrote @fredlas so I will chime in!

That makes perfect sense to me. In the existing behavior, the logic of the last case is sort of an arbitrary exception to what the logic of the second to last case would suggest. And

most of the traffic will not be handled anyways, but there is a risk that remaining healthy hosts become overloaded

makes perfect sense to me as motivation for this change.

@cpakulski
Contributor Author

@fredlas Thanks! You did a great job clarifying the load balancing logic in #4817, and this is definitely on the same topic.

@alyssawilk
Contributor

I don't object to this change, but I counter-propose that if most hosts are unhealthy and all priorities are in panic mode, perhaps Envoy should simply fail to choose hosts for some percentage of load and return 50xs instead? I agree overwhelming P=0 isn't optimal but at that point everything has melted down and overwhelming the lingering subset of P=1 to P=N may not be helping matters.

@mattklein123
Member

I agree with the proposal.

I don't object to this change, but I counter-propose that if most hosts are unhealthy and all priorities are in panic mode, perhaps Envoy should simply fail to choose hosts for some percentage of load and return 50xs instead? I agree overwhelming P=0 isn't optimal but at that point everything has melted down and overwhelming the lingering subset of P=1 to P=N may not be helping matters.

I like this idea, but I wonder if we should look at this as a totally different feature/issue/option, which would basically be an "auto maintenance mode". This could then be configured in place of or alongside panic mode.

@cpakulski
Contributor Author

My major concern was the situation where a priority level is excluded from the load calculation because it has zero healthy hosts, while the next priority level, with, say, only 1% of its hosts healthy, receives all the load.

I believe that both mechanisms can co-exist. The rejection logic described by @alyssawilk would start working when all levels enter panic mode and the percentage of rejected requests would be somehow negatively related to the number of remaining healthy hosts.

I am a bit lost when it comes to returning an error for rejected traffic. The load balancer is a layer 3 concept and provides connectivity to various services: HTTP-based ones and non-HTTP ones like Redis. Wouldn't always returning a 5xx error violate this?

@mattklein123
Member

I am a bit lost when it comes to returning an error for rejected traffic. The load balancer is a layer 3 concept and provides connectivity to various services: HTTP-based ones and non-HTTP ones like Redis. Wouldn't always returning a 5xx error violate this?

Depending on the LB error/return code, we could do different things, like return a 5xx in the router. Either way, I would recommend opening a separate issue to track this.

@htuch
Member

htuch commented Jan 6, 2019

I'd be in favor of a solution that doesn't have wild swings in load distribution across priority levels as a single host goes healthy/unhealthy. It sounds like this can happen in the current situation and that @cpakulski's proposed solution solves it. OTOH, anything that simplifies the existing (very) complicated set of behaviors, e.g. 5xx load shedding, might be useful. I like to think of this from a control system perspective; you don't want to have non-linear discontinuities in behavior.
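The swings mentioned here can be illustrated with a small Python comparison. This is a simplified sketch, not Envoy's code: the helper names `split`, `by_healthy` and `by_total` are hypothetical, the healthy-based rule is reduced to "proportional to healthy hosts, everything to P0 when there are none" as described earlier in the thread, and every level is assumed to be in panic mode with at least one host.

```python
def split(weights):
    """Integer percentages proportional to weights (at least one must be > 0)."""
    total = sum(weights)
    loads = [100 * w // total for w in weights]
    # Assumption: the rounding remainder goes to the first non-zero weight.
    first = next(i for i, w in enumerate(weights) if w > 0)
    loads[first] += 100 - sum(loads)
    return loads


def by_healthy(levels):
    """Existing rule (simplified): weight each level by its healthy hosts."""
    healthy = [h for h, _ in levels]
    if sum(healthy) == 0:
        return [100] + [0] * (len(levels) - 1)  # nothing healthy: all to P0
    return split(healthy)


def by_total(levels):
    """Proposed rule: weight each level by all of its hosts."""
    return split([t for _, t in levels])


# Two levels of 5 hosts each, losing one healthy host per step.
steps = [
    [(1, 5), (1, 5)],  # one healthy host in each level
    [(0, 5), (1, 5)],  # P0 loses its last healthy host
    [(0, 5), (0, 5)],  # P1 loses its last healthy host
]
for levels in steps:
    print(levels, by_healthy(levels), by_total(levels))
```

Under the healthy-based rule the split moves from 50/50 to 0/100 and then to 100/0 as individual hosts fail, while the total-based rule stays at 50/50 throughout.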

@cpakulski
Contributor Author

Thanks for the replies. If there are no objections, I will implement the proposed solution for the case when all levels are in panic mode and will create a new issue to design a simpler mechanism to shed load in a linear fashion.

@snowp
Contributor

snowp commented Dec 12, 2019

@cpakulski Just checking in to see if you're still planning on working on this? Or should I unassign you?

@cpakulski
Contributor Author

@snowp You must have a sixth sense, because I started working on it today! I am at the unit testing stage now and should create a PR within a day or two.

snowp pushed a commit that referenced this issue Jan 24, 2020
Changed how load is calculated when all priority levels are in panic mode. Each priority level receives percentage of the traffic related to the number of hosts in that priority regardless of the health status of the hosts. This smooths out how traffic is shifted when hosts become unhealthy. See #4685 for design proposal and discussion.

Signed-off-by: Christoph Pakulski <[email protected]>