This is my first project for the Udacity Data Engineering Nanodegree. It implements a Postgres-based ETL process for Sparkify.
Sparkify is a simulated (not real) online music streaming service.
This Git repository shows how to script an ETL process that loads raw JSON data into a PostgreSQL database and builds the corresponding fact and dimension tables.
This is done with Python, mainly using pandas and psycopg2.
The sparkifydb database is PostgreSQL-based and stores information about songs and the listening behaviour of the users.
The analytical goal of this database is to get all kinds of insight into user behaviour.
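As a minimal sketch, this is the kind of insight the database can answer once it is loaded; the connection parameters and the songplays/user_id names are assumptions and may need to be adapted to your setup:

```python
import psycopg2

# Connect to the local sparkifydb; credentials are an assumption, adjust to your environment.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Example analytical question: which users listen the most?
# (assumes the songplays fact table described further below)
cur.execute("""
    SELECT user_id, COUNT(*) AS plays
    FROM songplays
    GROUP BY user_id
    ORDER BY plays DESC
    LIMIT 5;
""")
for user_id, plays in cur.fetchall():
    print(user_id, plays)

conn.close()
```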
Raw data comes in JSON format and is stored in several subdirectories under the /data directory.
This directory contains JSON files that capture the daily user activity on Sparkify.
This directory contains JSON files that describe the songs and artists available on Sparkify.
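A minimal sketch of how these files can be read with pandas; the subdirectory names data/log_data and data/song_data are assumptions based on the usual layout of the dataset:

```python
import os
import glob
import pandas as pd

def get_files(filepath):
    """Collect the absolute paths of all .json files below filepath."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        for f in glob.glob(os.path.join(root, "*.json")):
            all_files.append(os.path.abspath(f))
    return all_files

# Log files hold one JSON record per line, hence lines=True.
log_files = get_files("data/log_data")
log_df = pd.read_json(log_files[0], lines=True)

# Song files hold a single JSON object each.
song_files = get_files("data/song_data")
song_df = pd.read_json(song_files[0], lines=True)
print(song_df.head())
```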
The shell script RunScripts.sh ties the ETL steps together. Running it (./RunScripts.sh) resets the sparkify database to an empty state and recreates all table structures (create_tables.py). After that, etl.py extracts the necessary data from the JSON files under the /data directory and loads it into the freshly created tables. Both scripts use the queries defined in sql_queries.py for dropping, creating and inserting into the Postgres tables.
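A rough sketch of what the drop/create step in create_tables.py can look like; the list names drop_table_queries and create_table_queries are assumptions about how sql_queries.py is organised:

```python
import psycopg2
from sql_queries import create_table_queries, drop_table_queries

def main():
    # Connection string is an assumption; adjust credentials to your environment.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    # Drop everything first so the database always starts from an empty state,
    # then recreate the tables from the CREATE statements in sql_queries.py.
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()

if __name__ == "__main__":
    main()
```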
A notebook for testing the contents of the SQL tables.
Please find descriptions of the final tables below.
This is the fact table; each row represents a single songplay event, i.e. a user listened to a specific song at a specific time. It has an artificial primary key via an identity column plus the remaining attributes describing the songplay (see the DDL sketch after the table descriptions).
This table contains master data on users.
This table contains master data on songs.
This table contains master data on artists.
This table contains master data on the timestamp, i.e. what hour, day, month, etc.
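To make the fact table description concrete, here is a possible shape of the songplays DDL and insert statement in sql_queries.py; the exact column names are assumptions derived from the descriptions above:

```python
# sql_queries.py (excerpt) -- column names are assumptions based on the descriptions above
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,  -- artificial identity key
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

songplay_table_insert = """
INSERT INTO songplays
    (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
"""
```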
This notebook shows some basic analysis.
1. Use a bulk loading method (e.g. Postgres COPY) instead of row-by-row inserts. This would be necessary with larger amounts of data; see the sketch after this list.
2. Create a way to run incremental loads only.
3. Create alerts or similar monitoring for the ETL pipeline.
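A minimal sketch of what the bulk loading idea from point 1 could look like with psycopg2's copy_expert; the table and DataFrame names are hypothetical:

```python
import io
import psycopg2
import pandas as pd

def bulk_copy(cur, df, table):
    """Stream a DataFrame into Postgres via COPY instead of row-by-row INSERTs."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False, header=False)
    buffer.seek(0)
    cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buffer)

# Usage sketch (the DataFrame columns must match the table layout):
# conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
# cur = conn.cursor()
# bulk_copy(cur, time_df, "time")
# conn.commit()
```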