This is my first project for the Udacity Data Engineering Nanodegree. It implements a Postgres-based ETL process for Sparkify.
Sparkify is a simulated (not real) online music streaming service.
This Git repository shows how to script an ETL process that loads raw JSON data into a PostgreSQL database and builds the corresponding fact and dimension tables.
This is done with Python, mainly using pandas and psycopg2.
The sparkifydb database is PostgreSQL-based and stores information about songs and the listening behaviour of the users.
The analytical goal of this database is to get all kinds of insight into user behaviour.
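As a minimal sketch, this is the kind of insight the database can answer once it is loaded; the connection parameters and the songplays/user_id names are assumptions and may need to be adapted to your setup:

```python
import psycopg2

# Connect to the local sparkifydb; credentials are an assumption, adjust to your environment.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Example analytical question: which users listen the most?
# (assumes the songplays fact table described further below)
cur.execute("""
    SELECT user_id, COUNT(*) AS plays
    FROM songplays
    GROUP BY user_id
    ORDER BY plays DESC
    LIMIT 5;
""")
for user_id, plays in cur.fetchall():
    print(user_id, plays)

conn.close()
```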
Raw data comes in JSON format and is stored in several subdirectories under the /data directory.
This directory contains JSON files that capture the daily user activity on Sparkify.
This directory contains JSON files that describe the songs and artists available on Sparkify.
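A minimal sketch of how these files can be read with pandas; the subdirectory names data/log_data and data/song_data are assumptions based on the usual layout of the dataset:

```python
import os
import glob
import pandas as pd

def get_files(filepath):
    """Collect the absolute paths of all .json files below filepath."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        for f in glob.glob(os.path.join(root, "*.json")):
            all_files.append(os.path.abspath(f))
    return all_files

# Log files hold one JSON record per line, hence lines=True.
log_files = get_files("data/log_data")
log_df = pd.read_json(log_files[0], lines=True)

# Song files hold a single JSON object each.
song_files = get_files("data/song_data")
song_df = pd.read_json(song_files[0], lines=True)
print(song_df.head())
```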
The shell script RunScripts.sh ties the ETL steps together. Running it (./RunScripts.sh) resets the sparkify database to an empty state and recreates all table structures (create_tables.py). After that, etl.py extracts the necessary data from the JSON files under the /data directory and loads it into the freshly created tables. Both scripts use the queries defined in sql_queries.py for dropping, creating and inserting into the Postgres tables.
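A rough sketch of what the drop/create step in create_tables.py can look like; the list names drop_table_queries and create_table_queries are assumptions about how sql_queries.py is organised:

```python
import psycopg2
from sql_queries import create_table_queries, drop_table_queries

def main():
    # Connection string is an assumption; adjust credentials to your environment.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    # Drop everything first so the database always starts from an empty state,
    # then recreate the tables from the CREATE statements in sql_queries.py.
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()

if __name__ == "__main__":
    main()
```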
A notebook for testing the contents of the SQL tables.
Please find descriptions of the final tables below.
This is the fact table; each row represents a single songplay event, i.e. a user listened to a specific song at a specific time. It has an artificial primary key via an identity column plus the remaining attributes describing the songplay (see the DDL sketch after the table descriptions).
This table contains master data on users.
This table contains master data on songs.
This table contains master data on artists.
This table contains master data on the timestamp, i.e. what hour, day, month, etc.
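To make the fact table description concrete, here is a possible shape of the songplays DDL and insert statement in sql_queries.py; the exact column names are assumptions derived from the descriptions above:

```python
# sql_queries.py (excerpt) -- column names are assumptions based on the descriptions above
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,  -- artificial identity key
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

songplay_table_insert = """
INSERT INTO songplays
    (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
"""
```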
This notebook shows some basic analysis.
1. Use a bulk loading method (e.g. Postgres COPY) instead of row-by-row inserts. This would be necessary with larger amounts of data; see the sketch after this list.
2. Create a way to run incremental loads only.
3. Create alerts or similar monitoring for the ETL pipeline.
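A minimal sketch of what the bulk loading idea from point 1 could look like with psycopg2's copy_expert; the table and DataFrame names are hypothetical:

```python
import io
import psycopg2
import pandas as pd

def bulk_copy(cur, df, table):
    """Stream a DataFrame into Postgres via COPY instead of row-by-row INSERTs."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False, header=False)
    buffer.seek(0)
    cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buffer)

# Usage sketch (the DataFrame columns must match the table layout):
# conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
# cur = conn.cursor()
# bulk_copy(cur, time_df, "time")
# conn.commit()
```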