[Research] Retries for certain HTTP codes #100
Comments
Interesting! I understand your concerns with the ack window, and in our case we can increase it by the maximum retry time. I'll report back here with our experiences under high load.
I'd encourage you to read up on Linkerd's caution against arbitrary retries too.
Will do for sure :) Our use case is that we want to ensure that function pods don't run out of memory, so we limit the number of concurrent function calls with max_inflight.
Once OpenFaaS switches to JetStream, dynamically extending ack_wait becomes an option. While looking into this I noticed there is a choice to make when the function returns an error status code, or when the call fails outright: either send an ack, or withhold it and wait for JetStream to redeliver the message. If I have a list of status codes for which a retry is desired, I can add that together with JetStream support. JetStream can also limit the number of redelivery attempts, which would go hand in hand with this.
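To make the ack/no-ack choice concrete, here is a minimal sketch assuming a move to JetStream via the nats.go client: the handler only acks when the invocation returns a non-retryable status, and otherwise lets JetStream redeliver the message up to a capped number of attempts. The subject name, status-code list and invokeFunction helper are illustrative, not the queue-worker's actual code.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// retryable lists HTTP status codes for which we want JetStream to redeliver.
var retryable = map[int]bool{429: true, 502: true, 503: true}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	_, err = js.Subscribe("faas-request", func(m *nats.Msg) {
		status := invokeFunction(m.Data) // placeholder for the actual HTTP call

		if retryable[status] {
			// No ack: JetStream redelivers after AckWait expires,
			// up to MaxDeliver attempts in total.
			return
		}
		m.Ack()
	},
		nats.ManualAck(),
		nats.AckWait(30*time.Second),
		nats.MaxDeliver(5),
	)
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever
}

func invokeFunction(body []byte) int { return 200 } // stub
```

Calling m.Nak() instead of silently withholding the ack would ask JetStream to redeliver immediately rather than waiting for AckWait to expire.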
This issue is to gather research and opinions on how to tackle retries for certain HTTP codes.
Expected Behaviour
If a function returns certain errors such as 429 (Too Many Requests), which the function's watchdog returns once its max_inflight limit is reached, then the queue-worker could retry the request a number of times.
Current Behaviour
The failure will be logged, but not retried.
It does seem like retries are already made implicitly if the function invocation takes longer than the "ack window".
So if a function takes 2m to finish and the ack window is 30s, the message will be redelivered and the invocation retried, possibly indefinitely, in our current implementation.
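For reference, this is roughly how a NATS Streaming (stan.go) subscription with manual acks behaves; the cluster, subject, queue-group and durable names are illustrative. If the handler runs past AckWait, the server treats the message as unacknowledged and redelivers it, which is the implicit retry described above.

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

func main() {
	sc, err := stan.Connect("faas-cluster", "queue-worker-1")
	if err != nil {
		log.Fatal(err)
	}

	_, err = sc.QueueSubscribe("faas-request", "faas", func(m *stan.Msg) {
		handleMessage(m.Data) // if this takes 2m, the message is redelivered every 30s
		m.Ack()               // only acked once the invocation has finished
	},
		stan.SetManualAckMode(),
		stan.AckWait(30*time.Second),
		stan.DurableName("faas"),
	)
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever
}

func handleMessage(body []byte) {} // stub
```

Unlike JetStream, NATS Streaming has no per-message cap on redeliveries, which is why the retries can continue indefinitely.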
Possible Solution
I'd like to gather some use-cases and requests from users on how they expect this to work.
Context
@matthiashanel also has some suggestions on how the new NATS JetStream project could help with this use-case.
@andeplane recently told me about a custom fork / patch that retries up to 5 times with an exponential back-off whenever a 429 error is received. My concern with an exponential back-off in the current implementation is that the time spent waiting between retries eats into the ack window, which could cause undefined results.
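A small worked example of that concern, with illustrative numbers: if the retries and their back-off sleeps all happen inside the message handler, the time spent sleeping counts against the ack window, so the window can expire and trigger a redelivery while the handler is still retrying.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ackWindow := 30 * time.Second
	backoff := time.Second
	elapsed := time.Duration(0)

	for attempt := 1; attempt <= 5; attempt++ {
		elapsed += backoff
		fmt.Printf("attempt %d: slept %v in total, %v of the ack window left\n",
			attempt, elapsed, ackWindow-elapsed)
		backoff *= 2 // exponential back-off: 1s, 2s, 4s, 8s, 16s
	}
	// After 5 attempts we have slept 31s in total - longer than the 30s ack
	// window - so NATS Streaming would already have redelivered the message.
}
```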
The Linkerd team also caution against automatic retries in their documentation, citing the risk of cascading failure. From "How Retries Can Go Wrong": "Choosing a maximum number of retry attempts is a guessing game" and "Systems configured this way are vulnerable to retry storms" -> https://linkerd.io/2/features/retries-and-timeouts/
The team discuss a "retry budget"; should we look into this?
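For discussion, here is a minimal sketch of what a Linkerd-style retry budget could look like: retries are only permitted while they stay below a fixed ratio of recent regular requests, rather than granting every request a fixed number of attempts. The 20% ratio and the counter fields are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

type RetryBudget struct {
	mu       sync.Mutex
	ratio    float64 // e.g. 0.2 -> at most 20% extra load from retries
	requests int
	retries  int
}

// CanRetry reports whether another retry still fits within the budget.
func (b *RetryBudget) CanRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return float64(b.retries+1) <= b.ratio*float64(b.requests)
}

func (b *RetryBudget) RecordRequest() {
	b.mu.Lock()
	b.requests++
	b.mu.Unlock()
}

func (b *RetryBudget) RecordRetry() {
	b.mu.Lock()
	b.retries++
	b.mu.Unlock()
}

func main() {
	b := &RetryBudget{ratio: 0.2}
	for i := 0; i < 100; i++ {
		b.RecordRequest()
	}
	fmt.Println(b.CanRetry()) // true: one retry against 100 requests is within 20%
}
```

A real budget, as Linkerd describes it, would also decay the counters over a sliding time window and allow a minimum number of retries per second so that low-traffic services can still retry.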
Should individual functions be able to express an annotation with retry data? For example, a back-off of 2, 4, then 8 seconds may be valid for processing an image, but retrying a Tweet on that schedule when Twitter's API has rate-limited us for 4 hours clearly will not work.
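As a starting point for that discussion, a sketch of how per-function annotations could be parsed into a retry policy; the annotation names, defaults and RetryPolicy struct are hypothetical, not an existing OpenFaaS convention.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

type RetryPolicy struct {
	Codes       []int         // HTTP status codes worth retrying
	MaxAttempts int           // cap on redeliveries
	InitialWait time.Duration // first back-off step, doubled on each retry
}

// policyFromAnnotations builds a retry policy from a function's annotations,
// falling back to defaults when a key is missing or malformed.
func policyFromAnnotations(a map[string]string) RetryPolicy {
	p := RetryPolicy{Codes: []int{429}, MaxAttempts: 5, InitialWait: 2 * time.Second}

	if v, ok := a["com.openfaas.retry.codes"]; ok {
		p.Codes = nil
		for _, s := range strings.Split(v, ",") {
			if code, err := strconv.Atoi(strings.TrimSpace(s)); err == nil {
				p.Codes = append(p.Codes, code)
			}
		}
	}
	if v, ok := a["com.openfaas.retry.attempts"]; ok {
		if n, err := strconv.Atoi(v); err == nil {
			p.MaxAttempts = n
		}
	}
	if v, ok := a["com.openfaas.retry.initial-wait"]; ok {
		if d, err := time.ParseDuration(v); err == nil {
			p.InitialWait = d
		}
	}
	return p
}

func main() {
	p := policyFromAnnotations(map[string]string{
		"com.openfaas.retry.codes":        "429,503",
		"com.openfaas.retry.attempts":     "3",
		"com.openfaas.retry.initial-wait": "4s",
	})
	fmt.Printf("%+v\n", p)
}
```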
What happens if we cannot retry a function call, as in the Twitter example above? Where does the message go, and how is it persisted? See also the call for a dead-letter queue in #81
Finally, if we do start retrying, that metadata seems key to operational tuning and auto-scaling of the system. Should this be exposed via Prometheus metrics on an HTTP /metrics endpoint?
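If we do expose this, here is a sketch of what that could look like with the Prometheus client_golang library; the metric name, labels and port are illustrative.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var retriesTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "queue_worker_retries_total",
		Help: "Number of retried invocations, by function and HTTP status code.",
	},
	[]string{"function_name", "code"},
)

func main() {
	prometheus.MustRegister(retriesTotal)

	// Somewhere in the retry path:
	retriesTotal.WithLabelValues("figlet", "429").Inc()

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8081", nil)
}
```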