Spark + AWS Data Lake and ETL

Project Summary

Sparkify, a music streaming startup, wanted to centralize its user-activity and song-metadata logs in a single analytical store so they can run analytics. This AWS S3 data lake, organized as a star schema, lets them access their data intuitively and start drawing rich insights about their user base.

I set up an EMR cluster running Spark to process their logs, reading them in from an S3 bucket. Spark then transforms the raw data, splitting it out into separate tables and writing them back to an S3 data lake.
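For a concrete picture of that flow, here is a minimal PySpark sketch of reading raw JSON from S3 and writing one table back as partitioned parquet. The bucket names and the song_data path layout are assumptions for illustration; the full logic lives in etl.py.

```python
from pyspark.sql import SparkSession

# Hypothetical bucket paths -- substitute your own.
input_data = "s3a://sparkify-source-bucket/"
output_data = "s3a://sparkify-lake-bucket/"

spark = SparkSession.builder.appName("sparkify_etl").getOrCreate()

# Read the raw JSON song metadata from S3 (path layout assumed here).
song_df = spark.read.json(input_data + "song_data/*/*/*/*.json")

# Project a songs dimension table and write it back as partitioned parquet.
songs_table = (
    song_df
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet(output_data + "songs/")
```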

Why this Database and ETL design?

My client Sparkify has moved to a cloud-based system and now keeps its big data logs in an AWS S3 bucket. The end goal was to get that raw .json log data into fact and dimension tables stored as parquet files in an S3 data lake.

Database structure overview

ER Diagram From Udacity
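The schema assumed here is the standard Sparkify star layout: a songplays fact table surrounded by users, songs, artists, and time dimensions. Continuing the sketch above (reusing spark, input_data, and song_df), the fact table comes from joining the filtered log events against the song metadata; the log path and column names (ts, userId, sessionId, and so on) are assumptions based on the typical log dataset and may differ from etl.py.

```python
from pyspark.sql import functions as F

# Read the raw event logs and keep only actual song plays
# (log path layout and column names are assumed for illustration).
log_df = (
    spark.read.json(input_data + "log_data/*/*/*.json")
    .filter(F.col("page") == "NextSong")
)

# Join log events to song metadata to resolve song_id / artist_id,
# producing the songplays fact table of the star schema.
songplays_table = (
    log_df.join(
        song_df,
        (log_df.song == song_df.title) & (log_df.artist == song_df.artist_name),
        how="left",
    )
    .select(
        F.monotonically_increasing_id().alias("songplay_id"),
        (F.col("ts") / 1000).cast("timestamp").alias("start_time"),
        F.col("userId").alias("user_id"),
        "level",
        "song_id",
        "artist_id",
        F.col("sessionId").alias("session_id"),
        "location",
        F.col("userAgent").alias("user_agent"),
    )
)
```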

How to run

  • Start by cloning this repository
  • Install the Python requirements from requirements.txt
  • Create an S3 bucket and point the output_data variable in etl.py's main() at it (see the sketch after this list)
  • Initialize an EMR cluster with Spark
  • Fill in dl_template.cfg with your own details
  • SSH into the EMR cluster and upload your dl_template.cfg and etl.py files
  • Run spark-submit etl.py to start the Spark job and write the resulting tables to parquet files in your S3 output path
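As a reference for the configuration step, here is a hedged sketch of how etl.py might load credentials from dl_template.cfg with configparser; the section and key names below are assumptions, so match them to the actual template in this repo.

```python
import os
import configparser

# Load AWS credentials from the filled-in template
# (section/key names are assumed and may differ from dl_template.cfg).
config = configparser.ConfigParser()
config.read("dl_template.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

# Point output_data in main() at the bucket you created, e.g.:
output_data = "s3a://<your-output-bucket>/"
```

With the config in place on the cluster, spark-submit etl.py runs the job and writes the parquet tables to that output path.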
