Sparkify is a digital music application that lets users stream music online. The application aims to offer users new content based on their song preferences. However, the data collected from users, including song metadata and user activity logs, is stored in JSON files, which makes it difficult to query the data and therefore to run analytics on it.
Example of a JSON file containing a song's information:

```json
{
    "num_songs": 1,
    "artist_id": "ARD7TVE1187B99BFB1",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "California - LA",
    "artist_name": "Casual",
    "song_id": "SOMZWCG12A8C13C480",
    "title": "I Didn't Mean To",
    "duration": 218.93179,
    "year": 0
}
```
Example of a JSON file containing user log information:

```json
{
    "artist": null,
    "auth": "Logged In",
    "firstName": "Walter",
    "gender": "M",
    "itemInSession": 0,
    "lastName": "Frye",
    "length": null,
    "level": "free",
    "location": "San Francisco-Oakland-Hayward, CA",
    "method": "GET",
    "page": "Home",
    "registration": 1540919166796.0,
    "sessionId": 38,
    "song": null,
    "status": 200,
    "ts": 1541105830796,
    "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
    "userId": "39"
}
```
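As a quick illustration of how these records map to tables, the sketch below parses a song record like the one above and converts the log file's `ts` field, which stores the event time in milliseconds since the epoch, into a datetime. The tuple layout is an assumption for illustration; the actual extraction lives in `etl.py`.

```python
import json
from datetime import datetime, timezone

# A song record with the same shape as the data/song_data example above.
song_json = '''{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1",
"artist_latitude": null, "artist_longitude": null,
"artist_location": "California - LA", "artist_name": "Casual",
"song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To",
"duration": 218.93179, "year": 0}'''

song = json.loads(song_json)

# Fields destined for the songs dimension table (illustrative ordering).
song_row = (song["song_id"], song["title"], song["artist_id"],
            song["year"], song["duration"])

# The log files store "ts" in milliseconds since the epoch; dividing by
# 1000 yields seconds, which datetime can convert for the time table.
ts_ms = 1541105830796
event_time = datetime.fromtimestamp(ts_ms / 1000.0, tz=timezone.utc)

print(song_row)
print(event_time.isoformat())
```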
The solution proposed in this repository consists of migrating the source data from JSON format into a PostgreSQL database named Sparkify that follows a star schema design, and of implementing an ETL pipeline to populate it.
The star schema that the Sparkify database follows is described in Figure 1. It consists of a FACT table surrounded by DIMENSION tables that hold descriptive attributes. A star schema is simple to implement because the data in the fact table is denormalized. It also speeds up data retrieval, since queries require only direct joins between the fact table and its dimension tables.
Figure 1 Sparkify star schema design.
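To make the schema concrete, the following is a sketch of what two of the `CREATE TABLE` statements in `sql_queries.py` might look like, one for the fact table and one for a dimension table. Column names and types here are illustrative assumptions; consult `sql_queries.py` for the authoritative definitions.

```python
# Illustrative DDL only -- the authoritative statements live in sql_queries.py.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,  -- surrogate key for each play event
    start_time  TIMESTAMP NOT NULL,  -- matches time(start_time)
    user_id     INT NOT NULL,        -- matches users(user_id)
    level       VARCHAR,             -- free or paid
    song_id     VARCHAR,             -- matches songs(song_id)
    artist_id   VARCHAR,             -- matches artists(artist_id)
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

song_table_create = """
CREATE TABLE IF NOT EXISTS songs (
    song_id   VARCHAR PRIMARY KEY,
    title     VARCHAR,
    artist_id VARCHAR,
    year      INT,
    duration  FLOAT
);
"""
```

Queries on this schema join the dimension tables directly to `songplays`, as the sample queries at the end of this README show.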
The ETL pipeline that migrates the source data into the Sparkify database performs the following tasks after the database and required tables have been created.
- Connect to the Sparkify database and get a cursor.
- Process the song_data files:
  - Collect the paths of all JSON files.
  - Get the total number of files to be processed.
  - Process each JSON file:
    - Read the JSON file.
    - Drop records that contain any NULL fields.
    - Extract the fields for the artists table.
    - Insert the records into the artists table.
    - Extract the fields for the songs table.
    - Insert the records into the songs table.
- Process the log_data files:
  - Collect the paths of all JSON files.
  - Get the total number of files to be processed.
  - Process each JSON file:
    - Read the JSON file.
    - Drop records that contain any NULL fields.
    - Extract the timestamp field of the records filtered by the NextSong page.
    - Transform the timestamp fields into datetime format.
    - Insert the datetime records into the time table.
    - Extract the fields for the users table.
    - Insert the records into the users table.
    - Extract the records for the songplays table, including the song_id and artist_id fields.
    - Transform the timestamp and user_id fields.
    - Insert the records into the songplays table.
- Close the connection.
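The file-gathering and per-file loop described above can be sketched as follows. The function and variable names are assumptions for illustration; the actual implementation lives in `etl.py`.

```python
import glob
import json
import os


def get_files(filepath):
    """Collect the paths of all JSON files under filepath, recursively."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    return all_files


def process_data(filepath, func):
    """Apply func to every JSON file found under filepath, with progress output."""
    all_files = get_files(filepath)
    num_files = len(all_files)
    print(f"{num_files} files found in {filepath}")
    for i, datafile in enumerate(all_files, 1):
        func(datafile)
        print(f"{i}/{num_files} files processed.")


def process_song_file(datafile):
    # Placeholder: a real version would extract the artist and song
    # fields and insert them through a psycopg cursor.
    with open(datafile) as f:
        return json.load(f)
```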
The following table describes the content of this repository.
| File | Description |
| --- | --- |
| create_tables.py | Python script that drops any existing database and tables, and creates the required Sparkify database and tables. |
| etl.ipynb | Jupyter notebook that performs a step-by-step extraction, transformation, and loading of single records into the Sparkify database. |
| etl.py | Python script that performs the extraction, transformation, and loading of all records into the Sparkify database. |
| README.md | File that contains the main information and instructions on how to use this repository. |
| sample_queries.ipynb | Jupyter notebook that includes example queries for analyzing data from the Sparkify database. |
| sql_queries.py | Python script that contains the queries for dropping, creating, and inserting data into the Sparkify tables. |
| test.ipynb | Jupyter notebook that helps validate the data model by querying loaded data from the Sparkify database. |
| data/log_data | Source folder that contains the user log files in JSON format. |
| data/song_data | Source folder that contains the song files in JSON format. |
Before using this repository, you must comply with the following prerequisites:

- Install PostgreSQL on your local machine. For macOS, follow this tutorial to install PostgreSQL.
- Install Psycopg, a PostgreSQL adapter for the Python programming language.
- Clone this repository.
After you clone this repository:

1. Go to the root folder of this repository.

2. Run the following command on your terminal to create the Sparkify database and required tables:

   ```
   python create_tables.py
   ```

3. Open and run the commands in the test.ipynb notebook to validate that the required tables were created.

4. Restart the kernel of the test.ipynb notebook to close the psycopg connection.

5. Run the following command on your terminal to perform the extraction, transformation, and loading of records from the source files into the Sparkify database:

   ```
   python etl.py
   ```
This is an excerpt of the output you should get:

```
71 files found in data/song_data
1/71 files processed.
2/71 files processed.
3/71 files processed.
4/71 files processed.
5/71 files processed.
…
30 files found in data/log_data
1/30 files processed.
2/30 files processed.
3/30 files processed.
4/30 files processed.
5/30 files processed.
…
```
6. Validate that your data was correctly loaded by running the commands in the test.ipynb notebook again, and then restarting its kernel.
The following are sample queries that run some analytics on the Sparkify database. Consult the sample_queries.ipynb notebook for more details.
```sql
%%sql
SELECT user_id,
       COUNT(user_id) AS total_user
FROM songplays
GROUP BY user_id
ORDER BY total_user DESC
LIMIT 5;
```

Results:

| user_id | total_user |
| --- | --- |
| 80 | 74 |
| 44 | 61 |
| 49 | 57 |
| 29 | 39 |
| 88 | 36 |
```sql
%%sql
SELECT location,
       COUNT(location) AS total_number
FROM songplays
GROUP BY location
ORDER BY total_number DESC
LIMIT 5;
```

Results:

| location | total_number |
| --- | --- |
| Portland-South Portland, ME | 74 |
| Waterloo-Cedar Falls, IA | 61 |
| San Francisco-Oakland-Hayward, CA | 57 |
| San Jose-Sunnyvale-Santa Clara, CA | 41 |
| Atlanta-Sandy Springs-Roswell, GA | 39 |
```sql
%%sql
SELECT user_agent,
       COUNT(user_agent) AS total_agent
FROM songplays
GROUP BY user_agent
ORDER BY total_agent DESC
LIMIT 5;
```

Results:

| user_agent | total_agent |
| --- | --- |
| "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36" | 111 |
| Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Firefox/31.0 | 70 |
| Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0 | 57 |
| "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2" | 52 |
| "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36" | 36 |
```sql
%%sql
SELECT level,
       COUNT(level) AS total_level
FROM songplays
GROUP BY level
ORDER BY total_level DESC;
```

Results:

| level | total_level |
| --- | --- |
| paid | 410 |
| free | 122 |
```sql
%%sql
SELECT users.gender,
       COUNT(users.gender) AS total
FROM songplays
INNER JOIN users
    ON songplays.user_id = users.user_id
GROUP BY users.gender
ORDER BY total DESC;
```

Results:

| gender | total |
| --- | --- |
| F | 390 |
| M | 142 |
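Outside a notebook, the same kind of GROUP BY/COUNT aggregation can be sketched in plain Python, which is handy for sanity-checking query logic before running it against the database. The rows below are hypothetical stand-ins for songplays records, not actual data from the database.

```python
from collections import Counter

# Hypothetical songplays rows: (user_id, level) pairs standing in
# for the real records loaded by etl.py.
songplays = [("80", "paid"), ("80", "paid"), ("44", "paid"),
             ("49", "free"), ("80", "free"), ("44", "paid")]

# Equivalent of: SELECT user_id, COUNT(user_id) AS total_user
#                FROM songplays GROUP BY user_id
#                ORDER BY total_user DESC LIMIT 5;
top_users = Counter(user_id for user_id, _level in songplays).most_common(5)
print(top_users)  # → [('80', 3), ('44', 2), ('49', 1)]
```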