[Research] Retries for certain HTTP codes #100
Comments
Interesting! I understand your concerns with the ack window, and in our case we can increase it by the maximum retry time. I'll report back here with our experiences under high load.
I'd encourage you to read up on Linkerd's caution against arbitrary retries too.
Will do for sure :) Our use case is that we want to ensure that function pods don't run out of memory, so we limit the number of concurrent function calls with max_inflight.
Once OpenFaaS switches to JetStream, dynamically extending ack_wait becomes an option. While looking into this I noticed there is a choice to make when the function returns an error status code, or when the call fails outright: either send an ack, or withhold it and wait for JetStream to redeliver the message. If I have a list of status codes for which a retry is desired, I can add that together with JetStream support. JetStream can also limit the number of redelivery attempts, which would go hand in hand with this.
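To make the ack/no-ack choice concrete, here is a minimal sketch assuming a move to JetStream via the nats.go client: the handler only acks when the invocation returns a non-retryable status, and otherwise lets JetStream redeliver the message up to a capped number of attempts. The subject name, status-code list and invokeFunction helper are illustrative, not the queue-worker's actual code.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// retryable lists HTTP status codes for which we want JetStream to redeliver.
var retryable = map[int]bool{429: true, 502: true, 503: true}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	_, err = js.Subscribe("faas-request", func(m *nats.Msg) {
		status := invokeFunction(m.Data) // placeholder for the actual HTTP call

		if retryable[status] {
			// No ack: JetStream redelivers after AckWait expires,
			// up to MaxDeliver attempts in total.
			return
		}
		m.Ack()
	},
		nats.ManualAck(),
		nats.AckWait(30*time.Second),
		nats.MaxDeliver(5),
	)
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever
}

func invokeFunction(body []byte) int { return 200 } // stub
```

Calling m.Nak() instead of silently withholding the ack would ask JetStream to redeliver immediately rather than waiting for AckWait to expire.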
This issue is to gather research and opinions on how to tackle retries for certain HTTP codes.
Expected Behaviour
If a function returns certain errors such as 429 (Too Many Requests), which the function's watchdog returns once its max_inflight limit is reached, then the queue-worker could retry the request a number of times.
Current Behaviour
The failure will be logged, but not retried.
It does seem like retries are already made implicitly if the function invocation takes longer than the "ack window".
So if a function takes 2m to finish and the ack window is 30s, the message will be redelivered and the invocation retried, possibly indefinitely, in our current implementation.
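For reference, this is roughly how a NATS Streaming (stan.go) subscription with manual acks behaves; the cluster, subject, queue-group and durable names are illustrative. If the handler runs past AckWait, the server treats the message as unacknowledged and redelivers it, which is the implicit retry described above.

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

func main() {
	sc, err := stan.Connect("faas-cluster", "queue-worker-1")
	if err != nil {
		log.Fatal(err)
	}

	_, err = sc.QueueSubscribe("faas-request", "faas", func(m *stan.Msg) {
		handleMessage(m.Data) // if this takes 2m, the message is redelivered every 30s
		m.Ack()               // only acked once the invocation has finished
	},
		stan.SetManualAckMode(),
		stan.AckWait(30*time.Second),
		stan.DurableName("faas"),
	)
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever
}

func handleMessage(body []byte) {} // stub
```

Unlike JetStream, NATS Streaming has no per-message cap on redeliveries, which is why the retries can continue indefinitely.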
Possible Solution
I'd like to gather some use-cases and requests from users on how they expect this to work.
Context
@matthiashanel also has some suggestions on how the new NATS JetStream project could help with this use-case.
@andeplane recently told me about a custom fork / patch that retries up to 5 times with an exponential back-off whenever a 429 error is received. My concern with an exponential back-off in the current implementation is that the time spent waiting between retries eats into the ack window, which could cause undefined results.
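A small worked example of that concern, with illustrative numbers: if the retries and their back-off sleeps all happen inside the message handler, the time spent sleeping counts against the ack window, so the window can expire and trigger a redelivery while the handler is still retrying.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ackWindow := 30 * time.Second
	backoff := time.Second
	elapsed := time.Duration(0)

	for attempt := 1; attempt <= 5; attempt++ {
		elapsed += backoff
		fmt.Printf("attempt %d: slept %v in total, %v of the ack window left\n",
			attempt, elapsed, ackWindow-elapsed)
		backoff *= 2 // exponential back-off: 1s, 2s, 4s, 8s, 16s
	}
	// After 5 attempts we have slept 31s in total - longer than the 30s ack
	// window - so NATS Streaming would already have redelivered the message.
}
```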
The Linkerd team also caution against automatic retries in their documentation, citing the risk of cascading failure. From "How Retries Can Go Wrong": "Choosing a maximum number of retry attempts is a guessing game" and "Systems configured this way are vulnerable to retry storms" -> https://linkerd.io/2/features/retries-and-timeouts/
The team discuss a "retry budget"; should we look into this?
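For discussion, here is a minimal sketch of what a Linkerd-style retry budget could look like: retries are only permitted while they stay below a fixed ratio of recent regular requests, rather than granting every request a fixed number of attempts. The 20% ratio and the counter fields are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

type RetryBudget struct {
	mu       sync.Mutex
	ratio    float64 // e.g. 0.2 -> at most 20% extra load from retries
	requests int
	retries  int
}

// CanRetry reports whether another retry still fits within the budget.
func (b *RetryBudget) CanRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return float64(b.retries+1) <= b.ratio*float64(b.requests)
}

func (b *RetryBudget) RecordRequest() {
	b.mu.Lock()
	b.requests++
	b.mu.Unlock()
}

func (b *RetryBudget) RecordRetry() {
	b.mu.Lock()
	b.retries++
	b.mu.Unlock()
}

func main() {
	b := &RetryBudget{ratio: 0.2}
	for i := 0; i < 100; i++ {
		b.RecordRequest()
	}
	fmt.Println(b.CanRetry()) // true: one retry against 100 requests is within 20%
}
```

A real budget, as Linkerd describes it, would also decay the counters over a sliding time window and allow a minimum number of retries per second so that low-traffic services can still retry.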
Should individual functions be able to express an annotation with retry data? For example, a back-off of 2, 4, then 8 seconds may be valid for processing an image, but retrying a Tweet on that schedule when Twitter's API has rate-limited us for 4 hours clearly will not work.
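As a starting point for that discussion, a sketch of how per-function annotations could be parsed into a retry policy; the annotation names, defaults and RetryPolicy struct are hypothetical, not an existing OpenFaaS convention.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

type RetryPolicy struct {
	Codes       []int         // HTTP status codes worth retrying
	MaxAttempts int           // cap on redeliveries
	InitialWait time.Duration // first back-off step, doubled on each retry
}

// policyFromAnnotations builds a retry policy from a function's annotations,
// falling back to defaults when a key is missing or malformed.
func policyFromAnnotations(a map[string]string) RetryPolicy {
	p := RetryPolicy{Codes: []int{429}, MaxAttempts: 5, InitialWait: 2 * time.Second}

	if v, ok := a["com.openfaas.retry.codes"]; ok {
		p.Codes = nil
		for _, s := range strings.Split(v, ",") {
			if code, err := strconv.Atoi(strings.TrimSpace(s)); err == nil {
				p.Codes = append(p.Codes, code)
			}
		}
	}
	if v, ok := a["com.openfaas.retry.attempts"]; ok {
		if n, err := strconv.Atoi(v); err == nil {
			p.MaxAttempts = n
		}
	}
	if v, ok := a["com.openfaas.retry.initial-wait"]; ok {
		if d, err := time.ParseDuration(v); err == nil {
			p.InitialWait = d
		}
	}
	return p
}

func main() {
	p := policyFromAnnotations(map[string]string{
		"com.openfaas.retry.codes":        "429,503",
		"com.openfaas.retry.attempts":     "3",
		"com.openfaas.retry.initial-wait": "4s",
	})
	fmt.Printf("%+v\n", p)
}
```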
What happens if we cannot retry a function call, as in the Twitter example above? Where does the message go, and how is it persisted? See also the call for a dead-letter queue in #81
Finally, if we do start retrying, that metadata seems key to operational tuning and auto-scaling of the system. Should this be exposed via Prometheus metrics on an HTTP /metrics endpoint?
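If we do expose this, here is a sketch of what that could look like with the Prometheus client_golang library; the metric name, labels and port are illustrative.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var retriesTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "queue_worker_retries_total",
		Help: "Number of retried invocations, by function and HTTP status code.",
	},
	[]string{"function_name", "code"},
)

func main() {
	prometheus.MustRegister(retriesTotal)

	// Somewhere in the retry path:
	retriesTotal.WithLabelValues("figlet", "429").Inc()

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8081", nil)
}
```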