Apache Spark Reference Architecture - Streaming ETL

Ingesting and performing data transformations via continously streaming of data is one of the common use cases in Apache Spark.

When talking about Streaming ETL, data could be coming from IoT devices, social media interactions, historical data etc. and can be classified into two main types :

Controlled Streaming : In this scenario, the data set volume is known ahead of time, and the data stream data is produce in a steady and controlled way (e.g. fixed volume of historical data). When dealing with controlled streaming data source, the addition of a message broker

Variable Streaming : In this scenario, the data set is being dynamically produced, for a long period, and the streaming volume of data vary with time (e.g. social data volume increases during a special event).

It's very important to understand the characteristics of your data stream, as it will dictate if your application should use a message broker such as Apache Kafka to help you control a steady flow of data even when the volume of data being pumped to the stream increases.

Common Streaming ETL use cases

Processing Logs

Processing Operational Metrics

Processing Social Data

Deployment Architecture

Deployment topology

Hardware

Softlayer machines with the following configuration : 24 cores, 128 GB RAM, 4 TB disk (4x1TB disks)

Enterprise configuration considerations

HDFS Durability

Data availability and dirability in HDFS is acomplished by keeping replication at three; which would enable data recovery in case of disk failures.

High Availability

HDFS availability can be accomplished by deploying the NameNode with high availability with at least three JournalNodes. The nodes should be placed into different racks to ensure they don’t all go down due to hardware failure or host maintenance.

Apache Spark availability can be acomplished by deploying multiple Spark Master nodes in conjunction with a Zookepper korum that would be used to master election.

ZooKeeper

It is recommended to deploy at least three ZooKeeper instances for availability and durability reasons. It is also recommended using temporary disk storage for ZooKeeper data.

References

https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Apache Spark Reference Architecture - Streaming ETL

Common Streaming ETL use cases

Deployment Architecture

Deployment topology

Hardware

Enterprise configuration considerations

HDFS Durability

High Availability

ZooKeeper

References

Files

README.md

Latest commit

History

README.md

File metadata and controls

Apache Spark Reference Architecture - Streaming ETL

Common Streaming ETL use cases

Deployment Architecture

Deployment topology

Hardware

Enterprise configuration considerations

HDFS Durability

High Availability

ZooKeeper

References