"Fail Safe" control knob for Extension Server #4155

Open
logan-hcg opened this issue Sep 4, 2024 · 5 comments
Labels
help wanted (extra attention is needed), kind/decision (a record of a decision made by the community)
Comments

@logan-hcg

logan-hcg commented Sep 4, 2024

Currently, the processing of Extension Server logic (i.e., the Translate() step) is "best effort". If the Extension Server fails in some way (for example, it is unavailable or crashes during request handling), then the Envoy configuration is not impacted.

In some situations, this could be a "bad thing". For example, if the Extension Server is being used to add a default Authz filter to all Listeners and the gRPC call to the Extension Server fails, then the Listener will still be activated but will not have the Authz filter, incorrectly exposing the resource without the desired protection.

Ideally, similar to #3873, there would be an option added to ExtensionManager which would allow either "fail open" (current behavior of best effort) or "fail closed" (alternate behavior of disabling the resource associated with the failed hook).
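To make the two modes concrete, here is a minimal Go sketch of how such an option might be resolved when a hook fails. Everything in it is an assumption for illustration: `FailureMode`, `FailOpen`, `FailClosed`, and `applyHook` are hypothetical names, not existing Envoy Gateway fields or functions, and a resource is modeled as a plain string.

```go
package main

import (
	"errors"
	"fmt"
)

// FailureMode is a hypothetical setting for this sketch; it is not an
// existing ExtensionManager field.
type FailureMode string

const (
	FailOpen   FailureMode = "FailOpen"   // keep the unmodified resource (best effort)
	FailClosed FailureMode = "FailClosed" // drop the resource whose hook failed
)

// errUnavailable simulates an Extension Server that cannot be reached.
var errUnavailable = errors.New("extension server unavailable")

// applyHook runs hook on a resource and resolves a hook error according to
// mode. It returns the resource to program and whether to keep it at all.
func applyHook(mode FailureMode, resource string, hook func(string) (string, error)) (string, bool) {
	modified, err := hook(resource)
	if err == nil {
		return modified, true
	}
	if mode == FailClosed {
		return "", false
	}
	return resource, true
}

func main() {
	failing := func(string) (string, error) { return "", errUnavailable }
	for _, mode := range []FailureMode{FailOpen, FailClosed} {
		out, keep := applyHook(mode, "listener-a", failing)
		fmt.Printf("mode=%s keep=%t resource=%q\n", mode, keep, out)
	}
}
```

In this sketch, fail open returns the original resource untouched, which matches the current best-effort behavior described above, while fail closed reports that the resource should be dropped.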

arkodg added the help wanted and kind/decision labels and removed the triage label on Sep 4, 2024
arkodg added this to the v1.2.0-rc1 milestone on Sep 4, 2024
@arkodg
Contributor

arkodg commented Sep 4, 2024

We fail closed by default for all Policy APIs. Should we do the same for the extension manager (if it is unable to connect or unable to get a response in time)? PTAL @envoyproxy/gateway-maintainers. If it is unable to connect during startup, no config would ever be programmed in Envoy Proxy; if the gRPC request times out, we could skip that xDS update, and the data plane would keep the last good config.
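The "skip the update, keep the last good config" idea can be sketched in Go roughly as follows. This is only an illustration under stated assumptions: `snapshotCache`, `update`, and `errTimeout` are hypothetical names, and an xDS snapshot is modeled as a plain string rather than any real Envoy Gateway type.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout simulates a gRPC call to the Extension Server timing out.
var errTimeout = errors.New("extension server: gRPC request timed out")

// snapshotCache is a hypothetical holder for the last snapshot that was
// translated successfully.
type snapshotCache struct {
	lastGood string
	hasGood  bool
}

// update runs translate and replaces the stored snapshot only on success.
// It returns the snapshot to serve and false if nothing was ever programmed.
func (c *snapshotCache) update(translate func() (string, error)) (string, bool) {
	snap, err := translate()
	if err != nil {
		// Fail closed: skip this update and keep serving the last good config.
		return c.lastGood, c.hasGood
	}
	c.lastGood, c.hasGood = snap, true
	return snap, true
}

func main() {
	cache := &snapshotCache{}
	fail := func() (string, error) { return "", errTimeout }
	ok := func() (string, error) { return "snapshot-v1", nil }

	for _, step := range []func() (string, error){fail, ok, fail} {
		snap, programmed := cache.update(step)
		fmt.Printf("programmed=%t snapshot=%q\n", programmed, snap)
	}
}
```

The first failing call models the startup case, where no config is ever programmed; the failure after a success models a timeout, where the data plane keeps the previous snapshot.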

@liorokman
Contributor

Maybe add a configuration flag in the extension manager configuration section to specify if the extension manager should fail-open or fail-close?

@arkodg
Contributor

arkodg commented Sep 4, 2024

> Maybe add a configuration flag in the extension manager configuration section to specify if the extension manager should fail-open or fail-close?

Yeah, that's a good home for the mode config if user expectations vary. We still need to figure out what the default mode should be: fail open or fail closed 😄

I vote for fail closed by default, to ensure we fail fast and early, as we do for Filters and Policies.

@ardikabs
Contributor

ardikabs commented Sep 19, 2024

If I understand correctly, since the extension has several hooks (HTTPListener, VirtualHost, Route, and Translation), it would make sense to use the 500 behavior when the Route hook fails. But what about the other hooks?
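For the Route case, the 500 behavior could look something like the Go sketch below. It is an assumption-laden illustration: the `route` type, its `DirectResponseStatus` field, and `applyRouteHook` are hypothetical and do not correspond to real Envoy Gateway or xDS types.

```go
package main

import (
	"errors"
	"fmt"
)

// errHookFailed simulates a failed Route hook call.
var errHookFailed = errors.New("route hook failed")

// route is a hypothetical, simplified route for this sketch.
type route struct {
	Name                 string
	Backend              string
	DirectResponseStatus int // 0 means "forward to Backend"
}

// applyRouteHook returns the hook's result, or, if the hook fails, a route
// with the same name that answers every request with HTTP 500.
func applyRouteHook(r route, hook func(route) (route, error)) route {
	modified, err := hook(r)
	if err != nil {
		return route{Name: r.Name, DirectResponseStatus: 500}
	}
	return modified
}

func main() {
	failing := func(route) (route, error) { return route{}, errHookFailed }
	r := applyRouteHook(route{Name: "api", Backend: "svc-a"}, failing)
	fmt.Printf("%+v\n", r)
}
```

The route stays in place so clients get an explicit error instead of reaching a backend without the extension's changes.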

@logan-hcg
Author

logan-hcg commented Sep 19, 2024

I'm not sure about the others, but in the case of HTTPListener, in "failsafe mode" the listener should not be added to the list of active listener resources if the hook fails.
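A rough Go sketch of that listener behavior, under stated assumptions: `listener`, `activeListeners`, and `errHookFailed` are hypothetical names used only for illustration, not the actual translator code.

```go
package main

import (
	"errors"
	"fmt"
)

// errHookFailed simulates a failed HTTPListener hook call.
var errHookFailed = errors.New("listener hook failed")

// listener is a hypothetical, simplified listener resource.
type listener struct{ Name string }

// activeListeners runs hook on every listener and keeps only those whose
// hook succeeded. Names of dropped listeners are returned for reporting.
func activeListeners(in []listener, hook func(listener) (listener, error)) (active []listener, skipped []string) {
	for _, l := range in {
		modified, err := hook(l)
		if err != nil {
			skipped = append(skipped, l.Name)
			continue
		}
		active = append(active, modified)
	}
	return active, skipped
}

func main() {
	hook := func(l listener) (listener, error) {
		if l.Name == "internal" {
			return listener{}, errHookFailed
		}
		return l, nil
	}
	active, skipped := activeListeners([]listener{{Name: "public"}, {Name: "internal"}}, hook)
	fmt.Println("active:", active, "skipped:", skipped)
}
```

Returning the skipped names separately leaves room to surface the failure, for instance in resource status, though how that would be reported is not settled in this thread.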

5 participants
@liorokman @arkodg @ardikabs @logan-hcg and others