Service failure semantics #294
Comments
I'm going to move this out of 1A: for the time being we have infinite retries set up as described in the document, we retry on every error, and we have the ability to kill invocations. In the long term we probably need to address cancellations, and eventually restatedev/service-protocol#19 as well.
Based on an offline discussion, we concluded the following failure semantics for Restate:

General idea

All errors while processing an invocation (SDK bugs, determinism errors, infrastructure errors, application bugs) are considered retryable and will be retried indefinitely by the runtime. The invocation is completed with an error, and that error is propagated to the caller, only if the service returns a special non-retryable error (a terminal error). Returning one requires an explicit action by the implementor of the service (e.g. annotating the error type, wrapping the actual error type with a terminal wrapper, returning a special status code, tbd.). That way, failing with an error never happens accidentally; it is an explicit action, and the error becomes in effect part of the service's signature.

Manual intervention

By retrying indefinitely, we make sure that Restate only rolls forward. However, there can be situations where the system cannot make progress anymore (e.g. a service endpoint being permanently gone). In this case, we require the user to intervene manually by cancelling or killing the invocation. Manual intervention terminates an invocation with a terminal error (unless a retryable error occurs in the errorDefer handler), which propagates up the call stack. See #304 for some more details.

Boy scout rule: Clean up your actions before you leave

In order to provide well-defined failure semantics for calling services, called services that fail with a terminal error should clean up their partial changes. The way it could work is that the user registers error-defer handlers for all actions the service has performed (e.g. setting state, calling other services, side effects, etc.). On a terminal error, these error-defer handlers (sagas/compensations) need to be executed. When a service is cancelled, the system must also run the error-defer handlers. Only if the service gets killed is it OK not to run these handlers, because the user has explicitly agreed to this behavior.

Side note: On retryable errors, the error-defer handlers must not be executed, because the invocation will be retried.

Additional side note: We might want to introduce a retry as a new error or command that runs the error-defer handlers to clean up partial state and then re-invokes the same invocation to start from scratch.
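To make the retryable vs. terminal distinction and the error-defer handlers concrete, here is a minimal, self-contained Python sketch. It is not Restate's API: `TerminalError`, `InvocationContext`, `error_defer` and `invoke` are hypothetical names invented for illustration, and `max_attempts` exists only so the loop can be bounded in a test (the semantics above call for unbounded retries). It also simplifies by starting every attempt with a fresh context, and it does not model kill, cancellation, or a retryable error inside an error-defer handler.

```python
class TerminalError(Exception):
    """Hypothetical marker for an explicit, non-retryable failure."""


class InvocationContext:
    """Hypothetical context that records error-defer handlers (compensations)."""

    def __init__(self):
        self._error_defers = []

    def error_defer(self, handler):
        # Register a compensation for an action that was just performed.
        self._error_defers.append(handler)

    def run_error_defers(self):
        # Undo in reverse order of registration, as in a saga.
        while self._error_defers:
            self._error_defers.pop()()


def invoke(handler, max_attempts=None):
    """Retry `handler` until it succeeds or raises TerminalError.

    max_attempts=None means "retry indefinitely"; a bound is an
    assumption added only to keep the sketch testable.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        ctx = InvocationContext()
        try:
            return ("ok", handler(ctx, attempt))
        except TerminalError as err:
            # Explicit failure: clean up, then propagate to the caller.
            ctx.run_error_defers()
            return ("terminal", str(err))
        except Exception:
            # Any other error is retryable: error-defer handlers must
            # not run, because the invocation will be retried.
            continue
    return ("gave_up", None)


log = []


def flaky_then_ok(ctx, attempt):
    # Fails twice with an ordinary (retryable) error, then succeeds.
    ctx.error_defer(lambda: log.append("undo"))
    if attempt < 3:
        raise RuntimeError("transient")
    return attempt


def rejects(ctx, attempt):
    # Performs two actions, then fails explicitly with a terminal error.
    ctx.error_defer(lambda: log.append("undo state"))
    ctx.error_defer(lambda: log.append("undo call"))
    raise TerminalError("invalid input")
```

In this sketch, `invoke(flaky_then_ok)` retries through the two ordinary errors without running any compensation, while `invoke(rejects)` runs both compensations in reverse order before reporting the terminal error to the caller.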
Given the behavior described above, we should also be able to implement queue semantics for one-way calls (#485):

Producing side: Making a failed service resolution a retryable error of the service that wants to send the one-way call ensures that we retry enqueuing the message until it succeeds or a user manually intervenes. This makes sure that messages won't be dropped when enqueuing them.

Consuming side: All user code exceptions except terminal exceptions will trigger a retry of the invocation. This means that a message will only be removed from the queue (here, our inbox) if the invocation either succeeds (does not throw an exception) or the user explicitly says to drop the message by throwing a terminal exception.
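The queue semantics for one-way calls can be sketched the same way. Again this is an illustration under assumptions, not Restate code: `enqueue`, `drain`, `resolve`, the use of `LookupError` for a failed service resolution, and the attempt bounds are all hypothetical, and the inbox is modelled as a plain in-memory `deque`.

```python
from collections import deque


class TerminalError(Exception):
    """Hypothetical marker: the consumer explicitly drops the message."""


def enqueue(inbox, message, resolve, max_attempts=None):
    """Producing side: retry until service resolution succeeds.

    `resolve` is a hypothetical callable that raises LookupError while
    the target service cannot be resolved. max_attempts=None means
    "retry until it succeeds"; the bound is only for testing.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            resolve(message)
        except LookupError:
            continue  # retryable: the message is never silently dropped
        inbox.append(message)
        return True
    return False


def drain(inbox, handler, max_attempts_per_message=None):
    """Consuming side: a message leaves the inbox only on success or on
    an explicit TerminalError; any other exception retries it in place."""
    processed, dropped = [], []
    while inbox:
        message = inbox[0]  # peek; do not remove yet
        attempt = 0
        while True:
            attempt += 1
            try:
                handler(message, attempt)
                processed.append(message)
            except TerminalError:
                dropped.append(message)
            except Exception:
                if (max_attempts_per_message is not None
                        and attempt >= max_attempts_per_message):
                    # Stop early; the message stays at the head of the inbox.
                    return processed, dropped
                continue
            break
        inbox.popleft()
    return processed, dropped
```

A message that keeps failing with an ordinary exception stays at the head of the inbox, which mirrors the point above that only success or an explicit terminal exception removes it.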
The failure semantics are defined as follows: #294 (comment)
Tasks