Build an ETL pipeline for a data lake hosted on S3 for Sparkify. The pipeline loads song and log data from S3, processes the data into analytics tables using Spark, and writes the data back to S3 as a set of dimensional tables, allowing the analytics team to continue finding insights into what songs their users are listening to.
The pipeline works with two datasets that reside in S3:
s3://udacity-dend/song_data
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
s3://udacity-dend/log_data
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above.
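Both datasets can be read directly with Spark's JSON reader. Below is a minimal loading sketch; the wildcard depths are assumptions about how the files are nested under each prefix:

```python
# Minimal sketch of loading both datasets from S3 with PySpark.
# The wildcard depths below are assumptions about the folder layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkifyDataLake").getOrCreate()

# Song data: one JSON object per file, nested several folders deep.
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

# Log data: newline-delimited JSON files grouped by year and month.
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
```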
Since the goal is to help the analytics team continue finding insights into what songs their users are listening to, songplays serves as the fact table, supported by users, artists, songs, and time dimension tables. The ETL pipeline loads the song data and the song play logs from S3, processes both datasets using Spark (transforming them into the correct format and creating temp views to populate the star schema for song play analysis), and then writes all tables to parquet files in another directory on S3.
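As a rough illustration of this step, the sketch below joins the two datasets through temp views to build the songplays fact table and writes it out partitioned by year and month. The column names (ts, page, song, artist, userId, and so on) and the output bucket are assumptions based on the dataset descriptions above, not the exact code in etl.py:

```python
# Illustrative sketch only; column names and paths are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("SparkifyDataLake").getOrCreate()
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# Keep only actual song plays and derive a proper timestamp from the
# millisecond epoch column.
events = (log_df
          .filter(F.col("page") == "NextSong")
          .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp")))

song_df.createOrReplaceTempView("staging_songs")
events.createOrReplaceTempView("staging_events")

# Populate the fact table by matching log entries to songs on title
# and artist name.
songplays = spark.sql("""
    SELECT e.start_time,
           e.userId            AS user_id,
           e.level,
           s.song_id,
           s.artist_id,
           e.sessionId         AS session_id,
           e.location,
           e.userAgent         AS user_agent,
           year(e.start_time)  AS year,
           month(e.start_time) AS month
    FROM staging_events e
    JOIN staging_songs  s
      ON e.song = s.title AND e.artist = s.artist_name
""")

# Partitioning by year and month keeps the parquet output easy to query.
(songplays.write
          .mode("overwrite")
          .partitionBy("year", "month")
          .parquet("s3a://<output-bucket>/songplays/"))  # bucket is a placeholder
```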
etl.py reads data from S3, processes it using Spark, and writes the results back to S3.
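A plausible top-level skeleton for the script is shown below; the function names and the hadoop-aws package version are assumptions, and the actual file may organize the steps differently:

```python
# Hypothetical skeleton of etl.py; names and config values are assumptions.
from pyspark.sql import SparkSession

def create_spark_session():
    """Create (or reuse) a SparkSession able to talk to S3."""
    return (SparkSession.builder
            .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
            .getOrCreate())

def process_song_data(spark, input_data, output_data):
    """Build the songs and artists dimension tables; write them as parquet."""
    ...

def process_log_data(spark, input_data, output_data):
    """Build the users, time, and songplays tables; write them as parquet."""
    ...

def main():
    spark = create_spark_session()
    process_song_data(spark, "s3a://udacity-dend/", "s3a://<output-bucket>/")
    process_log_data(spark, "s3a://udacity-dend/", "s3a://<output-bucket>/")

if __name__ == "__main__":
    main()
```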
dl.cfg contains the AWS credentials
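One way the credentials might be consumed, assuming an [AWS] section with the standard key names (the exact layout of dl.cfg is an assumption):

```python
# Sketch of reading dl.cfg; section and key names are assumptions.
import configparser
import os

config = configparser.ConfigParser()
config.read("dl.cfg")

# Make the credentials visible to hadoop-aws via environment variables.
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```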
Run python etl.py in a terminal to read, process, and write the data.