upstream: redesign per priority load calculation when all priority levels are in panic mode #4685
Comments
Work in progress.
DESIGN PROPOSAL:
I propose changing the behavior for the last two situations (when all priority levels are in panic mode) to the following:
The problem with the current solution is that if the number of healthy hosts in a priority drops to zero, that priority is excluded from load distribution. When the number of healthy hosts in a cluster is very low, most of the traffic will not be handled anyway, but there is a risk that the remaining healthy hosts become overloaded. In essence, I am suggesting that once all levels enter panic mode, the load calculation algorithm distributes load based not on the number of healthy hosts, but on the total number of hosts in each priority level. Adding: @mattklein123 @alyssawilk @snowp @fredlas
Not that I know anything about this beyond having studied Envoy's documentation a bit, but you wrote @fredlas so I will chime in! That makes perfect sense to me. In the existing behavior, the logic of the last case is sort of an arbitrary exception to what the logic of the second to last case would suggest. And
makes perfect sense to me as motivation for this change.
I don't object to this change, but I counter-propose that if most hosts are unhealthy and all priorities are in panic mode, perhaps Envoy should simply fail to choose hosts for some percentage of load and return 5xx responses instead? I agree overwhelming P=0 isn't optimal, but at that point everything has melted down, and overwhelming the lingering subset of P=1 to P=N may not be helping matters.
I agree with the proposal.
I like this idea, but I wonder if we should look at this as a totally different feature/issue/option, which would basically be an "auto maintenance mode". This could then be configured in place of or alongside panic mode?
My major concern was the situation where a priority level gets excluded from load calculation because it has zero healthy hosts, while the priority next to it, with say only 1% of its hosts healthy, receives all the load. I believe that both mechanisms can coexist. The rejection logic described by @alyssawilk would kick in when all levels enter panic mode, and the percentage of rejected requests would be inversely related in some way to the number of remaining healthy hosts. I am a bit lost when it comes to returning an error for rejected traffic. The load balancer is a layer 3 concept and it provides connectivity to various services: HTTP-based ones and non-HTTP ones like Redis. Wouldn't always returning a 5xx error violate this?
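To make the concern above concrete, here is a minimal sketch of a purely healthy-host-proportional split. It is a simplified model for illustration only and does not reproduce Envoy's actual calculation; the function name `healthy_proportional_loads` and the fallback to priority 0 are assumptions based on the behavior described in this issue.

```python
def healthy_proportional_loads(healthy_hosts):
    """Split 100% of traffic across priorities by healthy host count.

    Simplified illustrative model (assumption), not Envoy's real algorithm.
    healthy_hosts[i] is the number of healthy hosts in priority i.
    """
    if not healthy_hosts:
        return []
    total_healthy = sum(healthy_hosts)
    if total_healthy == 0:
        # As described in the issue: with no healthy hosts anywhere,
        # all traffic goes to priority 0.
        return [100.0] + [0.0] * (len(healthy_hosts) - 1)
    return [100.0 * h / total_healthy for h in healthy_hosts]


# P0 has 0 healthy hosts, P1 has a single healthy host:
print(healthy_proportional_loads([0, 1]))  # P1 takes all the load
# One host in P0 recovers and the split changes abruptly:
print(healthy_proportional_loads([1, 1]))
```

In this model a single host changing state moves half of the traffic between priority levels, which is the kind of swing the proposal aims to avoid.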
Depending on the LB error/return code, we could do different things, like return a 5xx in the router. Either way, I would recommend opening a separate issue to track this.
I'd be in favor of a solution that doesn't have wild swings in load distribution across priority levels as a single host goes healthy/unhealthy. It sounds like this can happen in the current situation, and @cpakulski's proposed solution solves it. OTOH, anything that simplifies the existing (very) complicated set of behaviors, e.g. 5xx load shedding, might be useful. I like to think of this from a control system perspective: you don't want non-linear discontinuities in behavior.
Thanks for the replies. If there are no objections, I will implement the proposed solution for the case when all levels are in panic mode and will create a new issue to design a simpler mechanism to shed load in a linear fashion.
@cpakulski Just checking in to see if you're still planning on working on this? Or should I unassign you?
@snowp You must have a sixth sense, because I started to work on it today! I am at the unit testing stage now and should create a PR within a day or two.
Changed how load is calculated when all priority levels are in panic mode. Each priority level receives a percentage of the traffic proportional to the number of hosts in that priority, regardless of the health status of the hosts. This smooths out how traffic is shifted when hosts become unhealthy. See #4685 for the design proposal and discussion. Signed-off-by: Christoph Pakulski <[email protected]>
Description:
Currently, per-priority traffic load distribution is calculated based on the number of healthy hosts in each priority level. When there are zero healthy hosts in all priority levels, 100% of traffic goes to priority 0.
This behavior should be changed for the situation when all priority levels are in panic mode, meaning that there is a very low number of healthy hosts in each priority, possibly none. For this scenario, the load distribution algorithm should use the total number of hosts in each priority, not the number of healthy hosts. For example, if there are 3 priorities with 5 hosts each and the number of healthy hosts is 0 (P0), 1 (P1) and 1 (P2), the load will be 34% (33% plus the rounding remainder needed to reach 100%), 33% and 33% respectively.
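The arithmetic in the example above can be sketched as follows. This is only an illustration of the proposed rule, not the actual Envoy implementation; the function name `panic_mode_loads`, the use of integer percentages, and assigning the rounding remainder to priority 0 are assumptions inferred from the 34/33/33 example.

```python
def panic_mode_loads(total_hosts):
    """Per-priority load percentages when all priorities are in panic mode.

    Illustrative sketch (assumption): load is proportional to the total
    number of hosts in each priority, regardless of health, and the
    rounding remainder is given to priority 0 so the sum is exactly 100.
    """
    if not total_hosts:
        return []
    all_hosts = sum(total_hosts)
    if all_hosts == 0:
        # Assumption: with no hosts at all, keep everything on priority 0.
        return [100] + [0] * (len(total_hosts) - 1)
    # Integer share of each priority, rounded down.
    loads = [100 * n // all_hosts for n in total_hosts]
    # Hand the leftover percentage points to priority 0.
    loads[0] += 100 - sum(loads)
    return loads


# 3 priorities with 5 hosts each; the healthy counts (0, 1, 1) are ignored.
print(panic_mode_loads([5, 5, 5]))  # [34, 33, 33]
```

Because the result depends only on total host counts, a single host flipping between healthy and unhealthy does not change the split while every priority remains in panic mode.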