Service failure semantics #294

tillrohrmann · 2023-04-17T12:31:47Z

The failure semantics are defined as follows: #294 (comment)

slinkydeveloper · 2023-04-17T12:42:13Z

Do we have fail points that are retried infinitely right now? RetryPolicy always has a configurable upper limit.
Every user failure and every "retry limit exceeded" failure is propagated as a service failure back to the caller.
Some failures are not retried, per "Notifying non-recoverable errors back to the runtime" (service-protocol#19).

slinkydeveloper · 2023-05-09T08:39:21Z

Document: https://docs.google.com/document/d/1FX3yW0fSyRiYVMfGxq3MvZlJalwmruIljSHGoEL2_Hw/edit?usp=sharing

slinkydeveloper · 2023-05-22T08:45:57Z

I'm gonna move this out of 1A: for the time being we have set up infinite retries as described in the document, we retry on every error, and we have the ability to kill invocations.

In the long term we probably need to address cancellations, and eventually restatedev/service-protocol#19 as well.

tillrohrmann · 2023-06-08T16:23:56Z

Based on an offline discussion we concluded the following failure semantics for Restate:

General idea

All errors while processing an invocation (SDK bugs, determinism errors, infrastructure errors, application bugs) are considered retryable and will be retried indefinitely by the runtime. Only if the service returns a special non-retryable error (a terminal error) is the invocation completed with this error and the error propagated to the caller. Returning a terminal error requires an explicit action by the implementor of the service (e.g. annotating the error type, wrapping the actual error type with a terminal wrapper, returning a special status code, tbd.). That way, we make sure that failing with an error does not happen accidentally but is an explicit action (the error becomes, in effect, part of the service's signature).
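To make the retry rule concrete, here is a minimal sketch of it in Python. It is not the runtime's implementation: `TerminalError`, `run_invocation`, and the two demo handlers are hypothetical names, and the "terminal wrapper" option is only one of the mechanisms listed above (the actual mechanism is still tbd.).

```python
class TerminalError(Exception):
    """Hypothetical marker: the service explicitly opts out of retries."""


def run_invocation(handler):
    """Call `handler` until it succeeds or raises TerminalError.

    Assumption: the handler receives the attempt number, which is only
    there so the demo handlers below can behave differently per attempt.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return ("ok", handler(attempt))
        except TerminalError as err:
            # Explicit failure: complete the invocation and propagate it.
            return ("terminal_error", str(err))
        except Exception:
            # Anything else (bugs, infrastructure errors, ...) is retryable.
            # A real runtime would back off here instead of spinning.
            continue


def flaky(attempt):
    # Fails twice with an infrastructure-style error, then succeeds.
    if attempt < 3:
        raise ConnectionError("endpoint unavailable")
    return "done after %d attempts" % attempt


def rejects(attempt):
    # The implementor deliberately fails the invocation.
    raise TerminalError("invalid order")
```

With these definitions, `run_invocation(flaky)` only returns once the handler stops failing, while `run_invocation(rejects)` completes on the first attempt with the terminal error. A handler that always raises a non-terminal error would loop forever, which is the case manual intervention is meant for.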

Manual intervention

By retrying indefinitely, we make sure that Restate only rolls forward. However, there can be situations where the system cannot make progress anymore (e.g. a service endpoint being permanently gone). In this case, we require the user to manually intervene by cancelling or killing the invocation. Manual intervention terminates an invocation with a terminal error (unless a retryable error occurs in an error-defer handler), which propagates up the call stack. See #304 for more details.

Boy scout rule: Clean up your actions before you leave

In order to provide well-defined failure semantics for calling services, called services that fail with a terminal error should clean up their partial changes. The way it could work is that the user registers error-defer handlers for all actions the service has performed (e.g. setting state, calling other services, side effects, etc.). On a terminal error, these error-defer handlers (sagas/compensations) need to be executed. When a service is cancelled, the system must also run the error-defer handlers. Only if the service gets killed is it ok not to run these handlers, because the user has explicitly agreed to this behavior.

Side note: On retryable errors, the error-defer handlers must not be executed because the invocation will be retried.
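The rules for when error-defer handlers run can be sketched as a small decision table. This is only an illustration: `Outcome`, `Invocation`, `on_error_defer`, and `finish` are hypothetical names, and running the handlers in reverse registration order is an assumption borrowed from typical saga implementations, not something decided above. The case of a retryable error inside an error-defer handler is not modelled.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    RETRYABLE_ERROR = "retryable_error"
    TERMINAL_ERROR = "terminal_error"
    CANCELLED = "cancelled"
    KILLED = "killed"


# Only these outcomes trigger clean-up. Retryable errors do not, because
# the invocation will be retried; kills do not, because the user has
# explicitly agreed to skip clean-up.
CLEANUP_OUTCOMES = {Outcome.TERMINAL_ERROR, Outcome.CANCELLED}


class Invocation:
    def __init__(self):
        self.error_defer_handlers = []

    def on_error_defer(self, handler):
        """Register a compensation for an action the service has performed."""
        self.error_defer_handlers.append(handler)

    def finish(self, outcome):
        """Run the compensations if the outcome requires it.

        Returns the number of handlers that were executed.
        """
        if outcome not in CLEANUP_OUTCOMES:
            return 0
        ran = 0
        # Assumption: undo the most recent action first.
        while self.error_defer_handlers:
            self.error_defer_handlers.pop()()
            ran += 1
        return ran
```

A service would call `on_error_defer` right after each action it performs (setting state, calling another service), so that the list always mirrors the partial changes made so far.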

Additional side note: We might want to introduce a retry, as a new error or command, that runs the error-defer handlers to clean up partial state and then re-invokes the same invocation to start from scratch.

tillrohrmann · 2023-06-09T16:06:35Z

Given the behavior described above, we should also be able to implement queue semantics for one-way calls (#485):

Producing side: If service resolution fails, we treat it as a retryable error of the service that wants to send the one-way call. This ensures that we retry enqueuing the message until it succeeds or a user manually intervenes, so messages won't be dropped when enqueuing them.

Consuming side: All user code exceptions except terminal exceptions will trigger a retry of the invocation. This means that a message will only be removed from the queue (here our inbox) if the invocation either succeeds (does not throw an exception) or if the user explicitly says to drop this message by throwing a terminal exception.
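The consuming side can be sketched as follows. This is a toy model, not the actual inbox: `TerminalError`, `drain_inbox`, and the `"processed"` / `"dropped"` labels are hypothetical, and the inbox is modelled as an in-memory `deque`.

```python
from collections import deque


class TerminalError(Exception):
    """Hypothetical marker: the user explicitly asks to drop the message."""


def drain_inbox(inbox, handler):
    """Process messages in order and return (message, result) pairs.

    A message leaves the inbox only after the handler succeeds or
    raises TerminalError; any other exception retries the same message.
    """
    results = []
    while inbox:
        message = inbox[0]  # peek: the message stays queued for now
        try:
            handler(message)
            results.append((message, "processed"))
        except TerminalError:
            results.append((message, "dropped"))
        except Exception:
            # Retryable: leave the message at the head and try again.
            # A real runtime would back off between attempts.
            continue
        inbox.popleft()  # resolved either way, so remove it
    return results
```

Note that a message whose handler never stops failing with a retryable error blocks everything behind it in this model, which again is where cancelling or killing the invocation comes in.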

tillrohrmann added this to the 1A milestone Apr 17, 2023

tillrohrmann changed the title ~~Default failure behaviour~~ Default service failure behaviour Apr 17, 2023

tillrohrmann changed the title ~~Default service failure behaviour~~ Service failure semantics Apr 17, 2023

slinkydeveloper modified the milestones: 1A, 1B May 22, 2023

tillrohrmann mentioned this issue Jun 11, 2023

Status inspection #505

Closed

slinkydeveloper added the umbrella label Jun 21, 2023

slinkydeveloper self-assigned this Jun 21, 2023

slinkydeveloper added the semantics System semantics and behaviour label Jun 21, 2023

tillrohrmann closed this as completed Aug 11, 2023
