Updated Kafka and Flume guides with reliability information.
tdas committed Dec 11, 2014
1 parent 2f3178c commit 2184729
Showing 2 changed files with 26 additions and 3 deletions.
13 changes: 10 additions & 3 deletions docs/streaming-flume-integration.md
@@ -66,9 +66,16 @@ configuring Flume agents.

## Approach 2 (Experimental): Pull-based Approach using a Custom Sink
Instead of Flume pushing data directly to Spark Streaming, this approach runs a custom Flume sink that allows the following.

- Flume pushes data into the sink, and the data stays buffered.
-- Spark Streaming uses transactions to pull data from the sink. Transactions succeed only after data is received and replicated by Spark Streaming.
-This ensures that better reliability and fault-tolerance than the previous approach. However, this requires configuring Flume to run a custom sink. Here are the configuration steps.
+- Spark Streaming uses a [reliable Flume receiver](streaming-programming-guide.html#receiver-reliability)
+  and transactions to pull data from the sink. Transactions succeed only after data is received and
+  replicated by Spark Streaming.
+
+This ensures stronger reliability and
+[fault-tolerance guarantees](streaming-programming-guide.html#fault-tolerance-semantics)
+than the previous approach. However, this requires configuring Flume to run a custom sink.
+Here are the configuration steps.

#### General Requirements
Choose a machine that will run the custom sink in a Flume agent. The rest of the Flume pipeline is configured to send data to that agent. Machines in the Spark cluster should have access to the chosen machine running the custom sink.
@@ -104,7 +111,7 @@ See the [Flume's documentation](https://flume.apache.org/documentation.html) for
configuring Flume agents.

#### Configuring Spark Streaming Application
-1. **Linking:** In your SBT/Maven projrect definition, link your streaming application against the `spark-streaming-flume_{{site.SCALA_BINARY_VERSION}}` (see [Linking section](streaming-programming-guide.html#linking) in the main programming guide).
+1. **Linking:** In your SBT/Maven project definition, link your streaming application against the `spark-streaming-flume_{{site.SCALA_BINARY_VERSION}}` (see [Linking section](streaming-programming-guide.html#linking) in the main programming guide).

2. **Programming:** In the streaming application code, import `FlumeUtils` and create an input DStream as follows.

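A rough Scala sketch of what this step can look like, assuming the custom sink from the
previous section is reachable at a placeholder host and port (the host name, port, and object
name below are illustrative, not taken from the guide):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePollingSketch {
  def main(args: Array[String]): Unit = {
    // Placeholders: the machine and port on which the custom Spark sink is running.
    val sinkHost = "flume-sink-host"
    val sinkPort = 9999

    val conf = new SparkConf().setAppName("FlumePollingSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Pull data from the custom sink using the transaction-based, reliable receiver.
    val flumeStream = FlumeUtils.createPollingStream(ssc, sinkHost, sinkPort)

    // Each record is a SparkFlumeEvent wrapping the original Flume event body.
    flumeStream.map(event => new String(event.event.getBody.array())).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```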
16 changes: 16 additions & 0 deletions docs/streaming-kafka-integration.md
@@ -40,3 +40,19 @@ title: Spark Streaming + Kafka Integration Guide
- Multiple Kafka input DStreams can be created with different groups and topics for parallel receiving of data using multiple receivers.

3. **Deploying:** Package `spark-streaming-kafka_{{site.SCALA_BINARY_VERSION}}` and its dependencies (except `spark-core_{{site.SCALA_BINARY_VERSION}}` and `spark-streaming_{{site.SCALA_BINARY_VERSION}}` which are provided by `spark-submit`) into the application JAR. Then use `spark-submit` to launch your application (see [Deploying section](streaming-programming-guide.html#deploying-applications) in the main programming guide).
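
For illustration only, one possible `build.sbt` along these lines (the version number, Scala
version, and project name are placeholders, not values from the guide):

```scala
// Sketch of a build definition: the Kafka connector is bundled into the application JAR,
// while spark-core and spark-streaming are marked "provided" because spark-submit
// supplies them at runtime.
name := "kafka-streaming-app"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.2.0"
)
```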

Note that the Kafka receiver used by default is an *unreliable* receiver (see the
[receiver reliability](streaming-programming-guide.html#receiver-reliability) section in the
programming guide). In Spark 1.2, we have added an experimental *reliable* Kafka receiver that
provides stronger
[fault-tolerance guarantees](streaming-programming-guide.html#fault-tolerance-semantics) of zero
data loss on failures. This receiver is automatically used when the write ahead log
(also introduced in Spark 1.2) is enabled
(see the [Deploying section](streaming-programming-guide.html#deploying-applications) in the
programming guide). This may reduce the receiving throughput of individual Kafka receivers
compared to the unreliable receivers, but this can be offset by running
[more receivers in parallel](streaming-programming-guide.html#level-of-parallelism-in-data-receiving)
to increase aggregate throughput. It is also strongly recommended that replication in the
storage level be disabled when the write ahead log is enabled, because the log is already stored
in a replicated storage system. This is done using `KafkaUtils.createStream(...,
StorageLevel.MEMORY_AND_DISK_SER)`.
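
As a rough illustration of the above, a minimal Scala sketch; the write ahead log configuration
key is assumed to be `spark.streaming.receiver.writeAheadLog.enable`, and the checkpoint
directory, ZooKeeper quorum, group id, and topic name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReliableKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ReliableKafkaSketch")
      // Assumed config key: enables the write ahead log so the reliable receiver is used.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // The write ahead log is stored under the checkpoint directory (placeholder path).
    ssc.checkpoint("hdfs://namenode:8020/spark-checkpoints")

    val zkQuorum = "zk-host:2181"            // placeholder ZooKeeper quorum
    val groupId  = "example-consumer-group"  // placeholder consumer group
    val topics   = Map("example-topic" -> 1) // topic -> number of receiver threads

    // Non-replicated storage level: the write ahead log already lives in
    // replicated storage, so in-memory replication is unnecessary.
    val kafkaStream = KafkaUtils.createStream(
      ssc, zkQuorum, groupId, topics, StorageLevel.MEMORY_AND_DISK_SER)

    kafkaStream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Several such streams can be created (and unioned) to obtain the parallel receivers mentioned
above.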
