Fault Tolerance
When it comes to data streams, fault tolerance becomes an especially important aspect, since unprocessed messages may be lost for good or cause significant trouble.
The processing elements of the SPQR framework (micro pipelines, components, ...) do not provide any special fault tolerance features of their own. But since the framework internally relies on Apache Kafka to exchange data between micro pipelines, all fault tolerance features provided by Kafka are implicitly available to SPQR.
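For illustration, a producer handing a message from one micro pipeline to the next could be configured for durability as sketched below. This is a minimal sketch using the standard Kafka Java client; the topic name and settings are assumptions, not SPQR internals:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PipelineBridgeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // wait for all in-sync replicas to acknowledge each write, so a
        // single broker failure does not lose messages in transit
        props.put("acks", "all");
        props.put("retries", 3);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // hand a message from one micro pipeline to the next via a replicated topic
            producer.send(new ProducerRecord<>("pipeline-bridge-topic", "event-key", "event-payload"));
        }
    }
}
```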
If a setup selects a different communication layer (by providing appropriate source and emitter components), the fault tolerance features of that layer apply instead.
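A hypothetical sketch of such pluggable components; the actual SPQR interfaces may carry different names and signatures:

```java
// Hypothetical component contracts; illustrative only, not the SPQR API.
public interface IncomingMessageCallback {
    void onMessage(byte[] message);
}

public interface Source extends Runnable {
    // register a callback that receives each message pulled from the transport
    void setIncomingMessageCallback(IncomingMessageCallback callback);
}

public interface Emitter {
    // write a processed message back to the transport (Kafka, another broker, ...)
    void onMessage(byte[] message);
}
```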
When a message enters a pipeline, it leaves the boundaries of the underlying Kafka layer and is thus out of reach for Kafka and its fault tolerance features. But there are two different approaches to make processing as safe as possible (the second being the better one):
- enable message commits on Kafka and ensure that incoming messages are committed when they leave the micro pipeline - a bit tricky to solve, especially when delayed response operators are deployed and at-most-once delivery may be spoiled (see the sketch after this list)
- internal queues used for passing data between components are based on the OpenHFT Chronicle queue library, which persists its contents to disk - replay is thus possible even on the internal queue layer (work in progress!!)
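A minimal sketch of the first approach, assuming the modern Kafka Java consumer API; topic and group names are illustrative. Auto commit is disabled so that offsets advance only after a message has fully left the pipeline:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitAfterPipelineExit {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "spqr-pipeline-1");
        // disable auto commit so offsets advance only after pipeline processing
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("pipeline-input-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // run the message through the micro pipeline
                }
                // commit only once the batch has fully left the pipeline; a crash
                // before this point causes redelivery rather than message loss
                consumer.commitSync();
            }
        }
    }

    private static void process(String message) { /* pipeline work */ }
}
```

Note the catch mentioned above: with delayed response operators, a message may still sit inside the pipeline while later messages trigger a commit, which is what makes the guarantee tricky to uphold.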
If a node fails for any reason and another node executes the same pipeline specification (and uses the same client identifier for connecting with Kafka), it naturally takes over all traffic previously handled by the failed node, due to the nature of Kafka and its client implementation.
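This takeover behavior comes from Kafka's consumer groups: all nodes executing the same pipeline share one group id, and Kafka rebalances partitions whenever a group member appears or disappears. A hedged sketch, again using the modern consumer API with illustrative names:

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class FailoverConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // every node executing the same pipeline uses the same group id, so
        // Kafka spreads partitions across them and reassigns the partitions
        // of a failed node to the survivors automatically
        props.put("group.id", "spqr-pipeline-1");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("pipeline-input-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // called on takeover: this node now owns the listed partitions
                System.out.println("assigned: " + partitions);
            }
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("revoked: " + partitions);
            }
        });
        while (true) {
            consumer.poll(Duration.ofMillis(500)); // records would feed the pipeline here
        }
    }
}
```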
If a pipeline fails on an arbitrary node, it may be restarted and continue working where it was interrupted by simply resetting the Chronicle offset (work in progress!!).
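A sketch of what such a resume could look like, assuming the current OpenHFT Chronicle Queue API (the version used inside SPQR may differ); directory and tailer names are illustrative:

```java
import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;

public class ChronicleResume {
    public static void main(String[] args) {
        // the queue directory survives process crashes; entries are persisted to disk
        try (ChronicleQueue queue = ChronicleQueue.singleBuilder("pipeline-queue-dir").build()) {
            ExcerptAppender appender = queue.acquireAppender();
            appender.writeText("message-1");
            appender.writeText("message-2");

            // a named tailer persists its read position inside the queue, so a
            // restarted pipeline resumes at the first unprocessed entry
            ExcerptTailer tailer = queue.createTailer("pipeline-reader");
            System.out.println(tailer.readText()); // message-1

            // if the process crashed here, reopening the same named tailer
            // would continue with "message-2" instead of starting over
        }
    }
}
```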