Support fail_traffic_on_panic without zone aware routing #17659

hochuenw-dd · 2021-08-10T17:54:10Z

Hello folks,
I'm trying to implement client-side throttling using Envoy. My use case is to avoid further overwhelming the upstream cluster when it's degraded or down. To do this, apart from admission control (which is awesome but currently isn't supported per cluster/vhost), I can see 2 potential solutions using outlier detection:

1. outlier detection + disabled panic mode
2. outlier detection + enabled panic mode + enabled fail_traffic_on_panic

In our environment, we have to disable panic mode, because enabling fail_traffic_on_panic requires enabling zone aware routing. According to this doc, zone aware routing in turn requires users to define a static local cluster in the bootstrap file, and it also requires changes in the upstream service. These are hard to do in our environment.

But I think a potential concern with the panic-mode-disabled approach is that Envoy returns 503 only after every upstream host is down. Let's say 80% of the upstream cluster is down: if we disable panic mode, all of the traffic goes to the remaining 20% of hosts, which may overwhelm them and bring the whole upstream service down. That is the opposite of what we want here. Enabled panic mode + fail_traffic_on_panic doesn't have this issue, because Envoy can start failing all requests once x% of the hosts are down.
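To put rough numbers on that concern, here is a minimal arithmetic sketch. It assumes traffic is spread evenly across the hosts that are still healthy; the request rate and host counts are hypothetical and not taken from any real deployment.

```python
def per_host_rps(total_rps: float, healthy_hosts: int) -> float:
    """Requests per second each healthy host receives, assuming all
    traffic is spread evenly across the healthy hosts only."""
    if healthy_hosts <= 0:
        raise ValueError("no healthy hosts left to receive traffic")
    return total_rps / healthy_hosts


# Hypothetical numbers for illustration only.
total_rps = 1000.0
total_hosts = 10

baseline = per_host_rps(total_rps, total_hosts)  # all 10 hosts healthy
degraded = per_host_rps(total_rps, 2)            # 80% of hosts ejected
amplification = degraded / baseline              # load multiplier per host

print(f"baseline={baseline} degraded={degraded} amplification={amplification}x")
```

Under these assumptions each surviving host sees five times its normal load, which is the cascading-failure risk described above.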

I don't know whether this is just a rare case in reality. I guess that in most real outages 100% of the service goes down, so disabling panic mode should work fine. But I feel the partial case may happen when the upstream cluster does a partial rollout (say 80% on v2, 20% on v1).

Is my understanding correct? If so, is there a way to use fail_traffic_on_panic without zone aware routing?

Thanks!!

tonya11en · 2021-08-10T20:06:44Z

Can you link to the docs you're referencing? From my understanding, you don't need to have zone-aware routing enabled for this since it's a common LB config.

That being said, I really think what you want here is admission control: this is exactly the scenario it's meant for, but it's just missing the per-cluster state tracking. The idea is based on client-side throttling from the Google SRE book.
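For readers unfamiliar with the idea, the client-side throttling rule in the Google SRE book rejects a request locally with probability max(0, (requests - K * accepts) / (requests + 1)). The sketch below illustrates that formula only. It is not the admission control filter's code: the class and function names are hypothetical, the book's sliding time window is omitted for brevity, and the per-cluster dictionary is just one assumed way to keep the per-cluster state mentioned above.

```python
from collections import defaultdict


class ClientThrottle:
    """Simplified client-side throttle (no sliding window)."""

    def __init__(self, k: float = 2.0):
        self.k = k          # multiplier K; the SRE book suggests 2 as a typical value
        self.requests = 0   # requests the application attempted
        self.accepts = 0    # requests the backend accepted

    def record(self, accepted: bool) -> None:
        self.requests += 1
        if accepted:
            self.accepts += 1

    def reject_probability(self) -> float:
        # max(0, (requests - K * accepts) / (requests + 1))
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))


# Hypothetical per-cluster state: one throttle per upstream cluster name.
throttles = defaultdict(ClientThrottle)


def throttle_for(cluster: str) -> ClientThrottle:
    return throttles[cluster]
```

A caller would draw a uniform random number per request and reject locally when it falls below `reject_probability()`. With K = 2, a backend that accepts every request yields a probability of 0, so healthy clusters are never throttled.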

Are you trying to accomplish this without any code changes? If so, I think you may be limited to what you mention in the description. Otherwise, you'll have a much better experience by adding this functionality to the admission control filter.

hochuenw-dd · 2021-08-11T03:49:37Z

@tonya11en Thanks for the reply!

> Can you link to the docs you're referencing? From my understanding, you don't need to have zone-aware routing enabled for this since it's a common LB config.

To disable panic mode (set healthy_panic_threshold to 0), users don't need zone-aware routing. But that may cause the issue I described above (clients may overwhelm the servers unless ALL servers are down).
To enable panic mode and fail requests immediately while in panic mode (set fail_traffic_on_panic to true), I think users need zone-aware routing, since fail_traffic_on_panic is a zone-aware lb config field (#8024). But I feel that technically zone-aware routing shouldn't be a requirement. That's why I created this issue, asking whether the field can be extended to the common lb config. If it can, clients wouldn't overwhelm the servers, because Envoy would start returning 503 directly once x% of the servers are down (vs 100% when panic mode is disabled).
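The difference between the two options can be sketched as a small decision function. This is a simplified model of the behavior described in this thread, not Envoy's load balancer code: it ignores priority levels and locality, assumes panic is entered when the healthy percentage drops below the threshold, and all names are hypothetical.

```python
from enum import Enum


class Routing(Enum):
    HEALTHY_ONLY = "route to healthy hosts only"
    ALL_HOSTS = "panic: route to all hosts, healthy or not"
    FAIL = "fail the request with 503"


def routing_decision(healthy: int, total: int, panic_threshold_pct: float,
                     fail_traffic_on_panic: bool = False) -> Routing:
    """Simplified model; a threshold of 0 means panic mode is disabled."""
    if total <= 0:
        raise ValueError("cluster has no hosts")
    healthy_pct = 100.0 * healthy / total
    in_panic = panic_threshold_pct > 0 and healthy_pct < panic_threshold_pct
    if in_panic:
        return Routing.FAIL if fail_traffic_on_panic else Routing.ALL_HOSTS
    if healthy == 0:
        # Panic disabled and nothing healthy: no host can be selected.
        return Routing.FAIL
    return Routing.HEALTHY_ONLY


# 2 of 10 hosts healthy, panic mode disabled: traffic piles onto the 2 survivors.
print(routing_decision(2, 10, panic_threshold_pct=0))
# Same cluster, 50% threshold with fail_traffic_on_panic: requests fail fast.
print(routing_decision(2, 10, panic_threshold_pct=50, fail_traffic_on_panic=True))
```

In this model, disabling panic mode only fails requests at 0 healthy hosts, while a threshold combined with fail_traffic_on_panic fails them as soon as the healthy share drops below the threshold.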

> That being said, I really think what you want here is admission control: this is exactly the scenario it's meant for, but it's just missing the per-cluster state tracking. The idea is based on client-side throttling from the Google SRE book.
>
> Are you trying to accomplish this without any code changes? If so, I think you may be limited to what you mention in the description. Otherwise, you'll have a much better experience by adding this functionality to the admission control filter.

I agree admission control would be better for our use case if it supported per-cluster state tracking. We are trying to achieve this goal on a very short timeline (by the end of August), so for now contributing to the admission control filter is not an option for me, given my limited knowledge of the Envoy codebase. But I may look into this later.

hochuenw-dd · 2021-08-14T02:04:48Z

Hi @csssuf, I see you added the fail_traffic_on_panic feature. Is my understanding above correct (users have to turn on zone-aware routing if they want to use fail_traffic_on_panic)? Does this feature request sound reasonable to you? Thanks

github-actions · 2021-09-13T04:01:37Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions · 2021-09-20T08:01:19Z

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

gongcon · 2024-04-12T17:39:09Z

Can you reopen this issue? We use locality weighted lb instead of zone aware lb. Per the CommonLbConfig doc, only one of zone_aware_lb_config and locality_weighted_lb_config may be set, so we cannot use fail_traffic_on_panic.

The question is why zone aware lb is required for fail_traffic_on_panic.

hochuenw-dd added the enhancement (Feature requests. Not bugs or questions.) and triage (Issue requires triage) labels Aug 10, 2021

ggreenway removed the triage (Issue requires triage) label Aug 11, 2021

github-actions bot added the stale (stalebot believes this issue/PR has not been touched recently) label Sep 13, 2021

github-actions bot closed this as completed Sep 20, 2021

reddi mentioned this issue May 2, 2024

Support fail_traffic_on_panic for locality_weighted_lb_config #33926

Closed
