Sparkify is a digital music application that lets users stream music online. The application aims to offer users new content based on their song preferences. However, the data collected from users, including song metadata and user activity logs, is stored in JSON files, which makes it difficult to query the data and therefore to run analytics on it.
Example of a JSON file containing a song's information:

```json
{
    "num_songs": 1,
    "artist_id": "ARD7TVE1187B99BFB1",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "California - LA",
    "artist_name": "Casual",
    "song_id": "SOMZWCG12A8C13C480",
    "title": "I Didn't Mean To",
    "duration": 218.93179,
    "year": 0
}
```
Example of a JSON file containing user log information:

```json
{
    "artist": null,
    "auth": "Logged In",
    "firstName": "Walter",
    "gender": "M",
    "itemInSession": 0,
    "lastName": "Frye",
    "length": null,
    "level": "free",
    "location": "San Francisco-Oakland-Hayward, CA",
    "method": "GET",
    "page": "Home",
    "registration": 1540919166796.0,
    "sessionId": 38,
    "song": null,
    "status": 200,
    "ts": 1541105830796,
    "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
    "userId": "39"
}
```
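As a quick illustration of how these records map to tables, the sketch below parses a song record like the one above and converts the log file's `ts` field, which stores the event time in milliseconds since the epoch, into a datetime. The tuple layout is an assumption for illustration; the actual extraction lives in `etl.py`.

```python
import json
from datetime import datetime, timezone

# A song record with the same shape as the data/song_data example above.
song_json = '''{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1",
"artist_latitude": null, "artist_longitude": null,
"artist_location": "California - LA", "artist_name": "Casual",
"song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To",
"duration": 218.93179, "year": 0}'''

song = json.loads(song_json)

# Fields destined for the songs dimension table (illustrative ordering).
song_row = (song["song_id"], song["title"], song["artist_id"],
            song["year"], song["duration"])

# The log files store "ts" in milliseconds since the epoch; dividing by
# 1000 yields seconds, which datetime can convert for the time table.
ts_ms = 1541105830796
event_time = datetime.fromtimestamp(ts_ms / 1000.0, tz=timezone.utc)

print(song_row)
print(event_time.isoformat())
```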
The solution proposed in this repository consists of migrating the source data from JSON format into a PostgreSQL database named Sparkify that follows a star schema design, and of implementing an ETL pipeline to populate it.
The star schema that the Sparkify database follows is described in Figure 1. It consists of a FACT table surrounded by DIMENSION tables that hold descriptive attributes. A star schema is simple to implement because the data in the fact table is denormalized. It also speeds up data retrieval, since queries require only direct joins between the fact table and its dimension tables.
Figure 1 Sparkify star schema design.
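To make the schema concrete, the following is a sketch of what two of the `CREATE TABLE` statements in `sql_queries.py` might look like, one for the fact table and one for a dimension table. Column names and types here are illustrative assumptions; consult `sql_queries.py` for the authoritative definitions.

```python
# Illustrative DDL only -- the authoritative statements live in sql_queries.py.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,  -- surrogate key for each play event
    start_time  TIMESTAMP NOT NULL,  -- matches time(start_time)
    user_id     INT NOT NULL,        -- matches users(user_id)
    level       VARCHAR,             -- free or paid
    song_id     VARCHAR,             -- matches songs(song_id)
    artist_id   VARCHAR,             -- matches artists(artist_id)
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

song_table_create = """
CREATE TABLE IF NOT EXISTS songs (
    song_id   VARCHAR PRIMARY KEY,
    title     VARCHAR,
    artist_id VARCHAR,
    year      INT,
    duration  FLOAT
);
"""
```

Queries on this schema join the dimension tables directly to `songplays`, as the sample queries at the end of this README show.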
The ETL pipeline that migrates the source data into the Sparkify database performs the following tasks after the database and required tables have been created.
- Connect to the Sparkify database and get a cursor.
- Process the song_data files:
  - Collect the paths of all JSON files.
  - Get the total number of files to be processed.
  - Process each JSON file:
    - Read the JSON file.
    - Drop records that contain any NULL fields.
    - Extract the fields for the artists table.
    - Insert the records into the artists table.
    - Extract the fields for the songs table.
    - Insert the records into the songs table.
- Process the log_data files:
  - Collect the paths of all JSON files.
  - Get the total number of files to be processed.
  - Process each JSON file:
    - Read the JSON file.
    - Drop records that contain any NULL fields.
    - Extract the timestamp field of the records filtered by the NextSong page.
    - Transform the timestamp fields into datetime format.
    - Insert the datetime records into the time table.
    - Extract the fields for the users table.
    - Insert the records into the users table.
    - Extract the records for the songplays table, including the song_id and artist_id fields.
    - Transform the timestamp and user_id fields.
    - Insert the records into the songplays table.
- Close the connection.
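The file-gathering and per-file loop described above can be sketched as follows. The function and variable names are assumptions for illustration; the actual implementation lives in `etl.py`.

```python
import glob
import json
import os


def get_files(filepath):
    """Collect the paths of all JSON files under filepath, recursively."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    return all_files


def process_data(filepath, func):
    """Apply func to every JSON file found under filepath, with progress output."""
    all_files = get_files(filepath)
    num_files = len(all_files)
    print(f"{num_files} files found in {filepath}")
    for i, datafile in enumerate(all_files, 1):
        func(datafile)
        print(f"{i}/{num_files} files processed.")


def process_song_file(datafile):
    # Placeholder: a real version would extract the artist and song
    # fields and insert them through a psycopg cursor.
    with open(datafile) as f:
        return json.load(f)
```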
The following table describes the content of this repository.
| File | Description |
| --- | --- |
| create_tables.py | Python script that drops any existing database and tables, and creates the required Sparkify database and tables. |
| etl.ipynb | Jupyter notebook that performs a step-by-step extraction, transformation, and loading of single records into the Sparkify database. |
| etl.py | Python script that performs the extraction, transformation, and loading of all records into the Sparkify database. |
| README.md | File that contains the main information and instructions on how to use this repository. |
| sample_queries.ipynb | Jupyter notebook that includes example queries for analyzing data from the Sparkify database. |
| sql_queries.py | Python script that contains the queries for dropping, creating, and inserting data into the Sparkify tables. |
| test.ipynb | Jupyter notebook that helps validate the data model by querying loaded data from the Sparkify database. |
| data/log_data | Source folder that contains the user log files in JSON format. |
| data/song_data | Source folder that contains the song files in JSON format. |
Before using this repository, you must comply with the following prerequisites:

- Install PostgreSQL on your local machine. For macOS, follow this tutorial to install PostgreSQL.
- Install Psycopg, a PostgreSQL adapter for the Python programming language.
- Clone this repository.
After you clone this repository:

1. Go to the root folder of this repository.

2. Run the following command on your terminal to create the Sparkify database and required tables:

   ```
   python create_tables.py
   ```

3. Open and run the commands in the test.ipynb notebook to validate that the required tables were created.

4. Restart the kernel of the test.ipynb notebook to close the psycopg connection.

5. Run the following command on your terminal to perform the extraction, transformation, and loading of records from the source files into the Sparkify database:

   ```
   python etl.py
   ```
This is an excerpt of the output you should get:

```
71 files found in data/song_data
1/71 files processed.
2/71 files processed.
3/71 files processed.
4/71 files processed.
5/71 files processed.
…
30 files found in data/log_data
1/30 files processed.
2/30 files processed.
3/30 files processed.
4/30 files processed.
5/30 files processed.
…
```
6. Validate that your data was correctly loaded by running the commands in the test.ipynb notebook again, and then restarting its kernel.
The following are sample queries that run some analytics on the Sparkify database. Consult the sample_queries.ipynb notebook for more details.
```sql
%%sql
SELECT user_id,
       COUNT(user_id) AS total_user
FROM songplays
GROUP BY user_id
ORDER BY total_user DESC
LIMIT 5;
```

Results:

| user_id | total_user |
| --- | --- |
| 80 | 74 |
| 44 | 61 |
| 49 | 57 |
| 29 | 39 |
| 88 | 36 |
```sql
%%sql
SELECT location,
       COUNT(location) AS total_number
FROM songplays
GROUP BY location
ORDER BY total_number DESC
LIMIT 5;
```

Results:

| location | total_number |
| --- | --- |
| Portland-South Portland, ME | 74 |
| Waterloo-Cedar Falls, IA | 61 |
| San Francisco-Oakland-Hayward, CA | 57 |
| San Jose-Sunnyvale-Santa Clara, CA | 41 |
| Atlanta-Sandy Springs-Roswell, GA | 39 |
```sql
%%sql
SELECT user_agent,
       COUNT(user_agent) AS total_agent
FROM songplays
GROUP BY user_agent
ORDER BY total_agent DESC
LIMIT 5;
```

Results:

| user_agent | total_agent |
| --- | --- |
| "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36" | 111 |
| Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Firefox/31.0 | 70 |
| Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0 | 57 |
| "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2" | 52 |
| "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36" | 36 |
```sql
%%sql
SELECT level,
       COUNT(level) AS total_level
FROM songplays
GROUP BY level
ORDER BY total_level DESC;
```

Results:

| level | total_level |
| --- | --- |
| paid | 410 |
| free | 122 |
```sql
%%sql
SELECT users.gender,
       COUNT(users.gender) AS total
FROM songplays
INNER JOIN users
    ON songplays.user_id = users.user_id
GROUP BY users.gender
ORDER BY total DESC;
```

Results:

| gender | total |
| --- | --- |
| F | 390 |
| M | 142 |
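Outside a notebook, the same kind of GROUP BY/COUNT aggregation can be sketched in plain Python, which is handy for sanity-checking query logic before running it against the database. The rows below are hypothetical stand-ins for songplays records, not actual data from the database.

```python
from collections import Counter

# Hypothetical songplays rows: (user_id, level) pairs standing in
# for the real records loaded by etl.py.
songplays = [("80", "paid"), ("80", "paid"), ("44", "paid"),
             ("49", "free"), ("80", "free"), ("44", "paid")]

# Equivalent of: SELECT user_id, COUNT(user_id) AS total_user
#                FROM songplays GROUP BY user_id
#                ORDER BY total_user DESC LIMIT 5;
top_users = Counter(user_id for user_id, _level in songplays).most_common(5)
print(top_users)  # → [('80', 3), ('44', 2), ('49', 1)]
```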