
Data Engineering Project: ETL, Spark and Airflow

This is a data engineering project that uses Apache Spark and Apache Airflow to perform ETL (Extract, Transform, Load) operations on data from various sources. It demonstrates how to build a scalable, automated data pipeline that can process large volumes of data efficiently.

Overview

The project consists of three main components:

ETL: Extract data from various sources, transform it, and load it into a target database. The source data can be structured or unstructured and can arrive in formats such as CSV, JSON, or XML.

Spark: Use Apache Spark to process the data. Spark is a fast, general-purpose framework for processing large amounts of data in a distributed environment, with a unified API for batch and stream processing, machine learning, and graph processing. A minimal PySpark sketch of the ETL flow follows this overview.

Airflow: Use Apache Airflow to schedule and orchestrate the ETL pipeline. Airflow is a platform to programmatically author, schedule, and monitor workflows. It provides a web-based user interface for creating and managing workflows and tasks.
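To make the ETL and Spark pieces concrete, here is a minimal PySpark sketch of an extract-transform-load job. The file path, column names, table name, and connection settings are illustrative assumptions, not values taken from this repository.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (cluster settings are an assumption).
spark = SparkSession.builder.appName("etl_example").getOrCreate()

# Extract: read a CSV source; the path and columns are hypothetical.
orders = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

# Transform: drop incomplete rows and aggregate revenue per day (example logic).
daily = (
    orders
    .dropna(subset=["order_date", "amount"])
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
)

# Load: write the result to PostgreSQL over JDBC; the URL and credentials are
# placeholders, and the PostgreSQL JDBC driver must be on the Spark classpath.
(
    daily.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/warehouse")
    .option("dbtable", "daily_revenue")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .mode("overwrite")
    .save()
)

spark.stop()

In practice a job like this is launched with spark-submit, which is also how Airflow can invoke it (see the example DAG under Getting Started).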

Prerequisites

Before running the project, you need to have the following prerequisites installed:

Apache Spark
Apache Airflow
Python 3.x
PostgreSQL (or another target database)
Git

Getting Started

To get started with the project, follow these steps:

  1. Clone the repository using the following command:

git clone https://github.com/fadebowaley/Data-Modelling-Spark.git

  2. Create a virtual environment for the project using the following command:

python3 -m venv env

  3. Activate the virtual environment using the following command:

source env/bin/activate

  4. Install the project dependencies using the following command:

pip install -r requirements.txt

  5. Configure the database connection in config.py.
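As a rough illustration, config.py might hold settings like the following; all names and values here are assumptions, not taken from the repository.

# config.py - example connection settings (all values are placeholders)
DB_HOST = "localhost"
DB_PORT = 5432
DB_NAME = "warehouse"
DB_USER = "etl_user"
DB_PASSWORD = "etl_password"

# JDBC URL used by the Spark load step
JDBC_URL = f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}"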

  6. Run the Airflow webserver using the following command:

airflow webserver

  7. Run the Airflow scheduler using the following command:

airflow scheduler

  8. Open the Airflow web interface at http://localhost:8080 and locate the project's DAG (Directed Acyclic Graph) in the "DAGs" list.

  9. Check the DAG configuration, including the source and target database connections and the ETL steps.

  10. Trigger the DAG to start the ETL pipeline (a minimal example DAG definition is sketched below).
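For reference, a minimal Airflow DAG for a pipeline like this could look as follows. The DAG id, schedule, and spark-submit command are illustrative assumptions; the repository's actual DAG may differ.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Example DAG that runs the Spark ETL job once per day (names are assumptions).
with DAG(
    dag_id="spark_etl_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark_etl = BashOperator(
        task_id="run_spark_etl",
        # spark-submit launches the PySpark job sketched in the Overview;
        # the script path is a placeholder.
        bash_command="spark-submit --master local[*] jobs/etl_job.py",
    )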

Conclusion

This project demonstrates how to build a scalable and automated ETL pipeline using Apache Spark and Airflow. With this pipeline, you can process large volumes of data efficiently and reliably. Feel free to modify and extend the project to suit your needs.
