Build an ETL pipeline for a data lake hosted on S3 for Sparkify. The pipeline loads song and log data from S3, processes the data into analytics tables using Spark, and writes the data back to S3 as a set of dimensional tables, allowing the analytics team to continue finding insights into what songs their users are listening to.
The pipeline works with two datasets that reside in S3:
s3://udacity-dend/song_data
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
s3://udacity-dend/log_data
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above.
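Both datasets can be read directly with Spark's JSON reader. Below is a minimal loading sketch; the wildcard depths are assumptions about how the files are nested under each prefix:

```python
# Minimal sketch of loading both datasets from S3 with PySpark.
# The wildcard depths below are assumptions about the folder layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkifyDataLake").getOrCreate()

# Song data: one JSON object per file, nested several folders deep.
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

# Log data: newline-delimited JSON files grouped by year and month.
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
```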
Since the goal is to help the analytics team continue finding insights into what songs their users are listening to, songplays serves as the fact table, supported by users, artists, songs, and time dimension tables. The ETL pipeline loads the song data and the song play logs from S3, processes both datasets using Spark (transforming them into the correct format and creating temp views to populate the star schema for song play analysis), and then writes all tables to parquet files in another directory on S3.
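As a rough illustration of this step, the sketch below joins the two datasets through temp views to build the songplays fact table and writes it out partitioned by year and month. The column names (ts, page, song, artist, userId, and so on) and the output bucket are assumptions based on the dataset descriptions above, not the exact code in etl.py:

```python
# Illustrative sketch only; column names and paths are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("SparkifyDataLake").getOrCreate()
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# Keep only actual song plays and derive a proper timestamp from the
# millisecond epoch column.
events = (log_df
          .filter(F.col("page") == "NextSong")
          .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp")))

song_df.createOrReplaceTempView("staging_songs")
events.createOrReplaceTempView("staging_events")

# Populate the fact table by matching log entries to songs on title
# and artist name.
songplays = spark.sql("""
    SELECT e.start_time,
           e.userId            AS user_id,
           e.level,
           s.song_id,
           s.artist_id,
           e.sessionId         AS session_id,
           e.location,
           e.userAgent         AS user_agent,
           year(e.start_time)  AS year,
           month(e.start_time) AS month
    FROM staging_events e
    JOIN staging_songs  s
      ON e.song = s.title AND e.artist = s.artist_name
""")

# Partitioning by year and month keeps the parquet output easy to query.
(songplays.write
          .mode("overwrite")
          .partitionBy("year", "month")
          .parquet("s3a://<output-bucket>/songplays/"))  # bucket is a placeholder
```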
etl.py reads data from S3, processes it using Spark, and writes the results back to S3.
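A plausible top-level skeleton for the script is shown below; the function names and the hadoop-aws package version are assumptions, and the actual file may organize the steps differently:

```python
# Hypothetical skeleton of etl.py; names and config values are assumptions.
from pyspark.sql import SparkSession

def create_spark_session():
    """Create (or reuse) a SparkSession able to talk to S3."""
    return (SparkSession.builder
            .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
            .getOrCreate())

def process_song_data(spark, input_data, output_data):
    """Build the songs and artists dimension tables; write them as parquet."""
    ...

def process_log_data(spark, input_data, output_data):
    """Build the users, time, and songplays tables; write them as parquet."""
    ...

def main():
    spark = create_spark_session()
    process_song_data(spark, "s3a://udacity-dend/", "s3a://<output-bucket>/")
    process_log_data(spark, "s3a://udacity-dend/", "s3a://<output-bucket>/")

if __name__ == "__main__":
    main()
```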
dl.cfg contains the AWS credentials
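One way the credentials might be consumed, assuming an [AWS] section with the standard key names (the exact layout of dl.cfg is an assumption):

```python
# Sketch of reading dl.cfg; section and key names are assumptions.
import configparser
import os

config = configparser.ConfigParser()
config.read("dl.cfg")

# Make the credentials visible to hadoop-aws via environment variables.
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```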
Run python etl.py in a terminal to read, process, and write the data.