
Data Lake Project

Build an ETL pipeline for a data lake hosted on S3 for Sparkify. The pipeline loads song and log data from S3, processes the data into analytics tables using Spark, and writes the data back to S3 as a set of dimensional tables, allowing the analytics team to continue finding insights into what songs their users are listening to.

Dataset

The project works with two datasets that reside in S3:

s3://udacity-dend/song_data

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.

s3://udacity-dend/log_data

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above.
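
As a minimal sketch of how these datasets can be loaded into Spark (the s3a:// scheme, wildcard depth, and hadoop-aws package version are assumptions, not taken from the project code):

```python
from pyspark.sql import SparkSession

# Create a Spark session able to read from S3 (the hadoop-aws version is an assumption)
spark = SparkSession.builder \
    .appName("sparkify-data-lake") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4") \
    .getOrCreate()

# Read the raw JSON datasets; the nested wildcard layout reflects the folder
# structure typically used in these buckets and is an assumption here
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
```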

Solution

Since the goal is to help the analytics team continue finding insights into what songs their users are listening to, the songplays table serves as the fact table, supported by users, artists, songs, and time dimension tables. The ETL pipeline loads the song data and song play logs from S3, processes both datasets with Spark (transforming them into the correct format and creating temporary views to populate the star schema for song play analysis), and then writes all tables to parquet files in another directory on S3.
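
A rough sketch of that flow for one dimension table and the fact table (column names, join keys, and the output bucket are assumptions based on the description above, not the project's exact schema):

```python
# Assumes the Spark session and the song_df / log_df DataFrames from the sketch above

# Songs dimension table: select song metadata and write it partitioned by year and artist
songs_table = song_df.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates()
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3a://your-output-bucket/songs/")  # hypothetical output location

# Fact table: join song play events from the logs against the song metadata
log_df.createOrReplaceTempView("log_events")
song_df.createOrReplaceTempView("song_data")
songplays_table = spark.sql("""
    SELECT e.ts, e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM log_events e
    JOIN song_data s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong'
""")
songplays_table.write.mode("overwrite") \
    .parquet("s3a://your-output-bucket/songplays/")  # hypothetical output location
```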

Scripts

etl.py reads data from S3, processes it using Spark, and writes the resulting tables back to S3

dl.cfg contains the AWS credentials
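
The exact keys in dl.cfg are not listed here; a common pattern (an assumption, not the project's confirmed layout) is to read an [AWS] section with configparser and export the credentials as environment variables before Spark touches S3:

```python
import configparser
import os

# Read AWS credentials from dl.cfg; the [AWS] section and key names are assumptions
config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```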

Running the pipeline

Run etl.py in a terminal (for example, python etl.py) to read, process, and write the data.
