Project: Data Lake

A music streaming startup, Sparkify, has grown their user base and song database even more and wants to move their data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

As their data engineer, you are tasked with building an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights into what songs their users are listening to.

You'll be able to test your database and ETL pipeline by running queries given to you by the analytics team from Sparkify and comparing your results with their expected results.

Table of contents

  • Structure
  • Requirements
  • Preparing AWS
  • Cloning the repository
  • Creating the virtual environment
  • How to use

Structure

This tree shows the repository structure. Only the project's main files are described.

.
├── images
│   ├── iam-create-user-01.png
│   ├── iam-create-user-02.png
│   ├── iam-create-user-03.png
│   ├── iam-create-user-04.png
│   ├── iam-create-user-05.png
│   ├── s3-artists-parquet.png
│   ├── s3-create-bucket-01.png
│   ├── s3-create-bucket-02.png
│   └── s3-create-bucket-03.png
├── src
│   ├── pipelines
│   │   ├── __init__.py
│   │   ├── artists.py             # Artists table pipeline
│   │   ├── base.py                # Pipeline common code
│   │   ├── cache.py               # Dataframes cache
│   │   ├── songplays.py           # Songplays table pipeline
│   │   ├── songs.py               # Songs table pipeline
│   │   ├── time.py                # Time table pipeline
│   │   └── users.py               # Users table pipeline
│   ├── config.py                  # Application config manager
│   ├── etl.py                     # ETL pipeline script
│   └── sparkify.cfg               # Application config file
├── .editorconfig
├── .gitignore
├── README.md
└── requirements.txt

Requirements

It is assumed that the tools below are properly installed locally:

  • Python is a programming language that lets you work quickly and integrate systems more effectively. Version 3.x is required.
  • pip is the package installer for Python.
  • Virtualenv is a tool to create isolated Python environments.
  • Amazon Web Services is the world's most comprehensive and broadly adopted cloud platform, offering over 165 fully featured services from data centers globally.

Preparing AWS

Before running the pipelines that will load the source data into our data lake, we must set up a few things in our AWS account.

Creating a bucket

We are using Amazon S3 as the data lake, so we must first create the bucket in which the data will be stored.

Open the AWS console, log into your account and go to S3. Choose Buckets from the left side menu, and then click the button Create bucket.

Create bucket

Then choose a name for the bucket; remember to check the naming rules here. For the region we will use US West (Oregon), the same as the source data bucket.

Bucket name and region

For the remaining options and permissions we will keep the default values. Just follow the steps and confirm the bucket creation.

Once the bucket is created, select it from the main list. A panel will appear on the right, showing some properties of the selected bucket. Click the button Copy the Bucket ARN; we will need the ARN in a couple of minutes.

Bucket ARN
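If you prefer to script this step, the sketch below creates the bucket and derives its ARN with boto3. Note that boto3 is not among the project requirements and the bucket name is a placeholder, so treat this as an optional, illustrative alternative to the console.

# Optional: create the data lake bucket programmatically (boto3 is not listed in
# requirements.txt; install it separately if you want to use this sketch).
import boto3

BUCKET_NAME = "your-data-lake-bucket"  # placeholder: choose your own, globally unique name

s3 = boto3.client("s3", region_name="us-west-2")
s3.create_bucket(
    Bucket=BUCKET_NAME,
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# S3 bucket ARNs follow a fixed pattern, so the ARN can be derived instead of copied
# from the console.
bucket_arn = f"arn:aws:s3:::{BUCKET_NAME}"
print(bucket_arn)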

Creating an administrator user

Now go to IAM. Choose Users from the left side menu, and then click the button Add user.

Add user

Fill in the User name box, and also mark the Programmatic access checkbox.

User details

Do not add any permissions to the user yet; just go to the end of the wizard, where the credentials are shown.

User credentials

Copy the Access key ID and the Secret access key directly from the site and paste them into the application configuration file src/sparkify.cfg, under the corresponding configuration keys:

...
ACCESS_KEY_ID = the-access-key-id
SECRET_ACCESS_KEY = the-secret-access-key
...

Important: never share these credentials with anyone; your AWS account could be compromised.
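For reference, this is roughly how a configuration manager such as src/config.py can read those keys using Python's standard configparser module. The [AWS] section name below is an assumption; match it to the actual layout of sparkify.cfg.

# Minimal sketch of reading the credentials from sparkify.cfg with configparser.
# The section name "AWS" is an assumption; use the section actually present in the file.
import configparser

config = configparser.ConfigParser()
config.read("sparkify.cfg")

access_key_id = config["AWS"]["ACCESS_KEY_ID"]
secret_access_key = config["AWS"]["SECRET_ACCESS_KEY"]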

It's time to add some permissions to the user. Go to the Users menu and open the user's Summary page. Select the tab Permissions and click the button Add inline policy.

User summary

The inline policy can be created easily using a visual editor, but we will write the JSON document corresponding to the policy instead. Select the tab JSON.

Inline policy

And now copy and paste the JSON below into the policy editor.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::udacity-dend"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::udacity-dend/log-data/*",
                "arn:aws:s3:::udacity-dend/song-data/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": [
                "data-lake-bucket-arn"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObject"
            ],
            "Resource": [
                "data-lake-bucket-arn/*"
            ]
        }
    ]
}

The permissions are listed in the Statement collection:

  • The user can list the source bucket
  • The user can get all the objects stored in the log-data and song-data folders
  • The user can list the data lake bucket
  • The user can get, create and delete all the objects stored in the data lake bucket

In the last two statements, make the Resource point to the data lake: replace the text data-lake-bucket-arn with the actual bucket ARN we copied before. Once the policy is ready, click the button Review policy and then Save changes.
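Before moving on, you can sanity-check the new user's credentials and policy with a quick boto3 call. Again, boto3 is not a project dependency, and the prefix below is only an illustrative choice.

# Quick permission check: list a few objects from the source bucket as the new user.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="the-access-key-id",          # same values as in sparkify.cfg
    aws_secret_access_key="the-secret-access-key",
)

response = s3.list_objects_v2(Bucket="udacity-dend", Prefix="song-data/", MaxKeys=5)
for item in response.get("Contents", []):
    print(item["Key"])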

Cloning the repository

The first step is to clone this repository. Just type the following commands in your terminal:

# Clone the repository...
git clone https://github.com/vermicida/data-lake.git

# ...and move to its directory
cd data-lake

Creating the virtual environment

Once the project is cloned, you must create a new virtual environment for it. You may have multiple Python versions installed, so it's important to specify the correct one with the --python argument. The code in this repository has been successfully tested on Python 3.6, but it should work on any Python 3 version; set the version that applies to your case in the following commands:

# Create the virtual environment...
virtualenv --python python3.6 project-env

# ...and activate it
source project-env/bin/activate

It's time to install the project's dependencies. You will use pip to do this job:

pip install -r requirements.txt

It will be ready in a minute.


How to use

Below are the steps to follow to make the ETL pipeline work.

Running the ETL

This step will test your patience: it can take a few hours to complete. It performs the actions described below (a PySpark sketch of the core read/transform/write pattern follows the list):

  1. Create a Spark session.
  2. Read the song data, stored in S3, into a dataframe. Note: I decided to take just a slice of this data because of how slow the later writing process is. Instead of getting all the JSON files, it takes only those from partition A/A/A.
  3. Transform the song data to create the songs dimension table and store it back in S3 in Apache Parquet format.
  4. Transform the song data to create the artists dimension table and store it back in S3 in Apache Parquet format.
  5. Read the log data, stored in S3, into a dataframe.
  6. Transform the log data to create the users dimension table and store it back in S3 in Apache Parquet format.
  7. Transform the log data to create the time dimension table and store it back in S3 in Apache Parquet format.
  8. Transform the song and log data to create the songplays fact table and store it back in S3 in Apache Parquet format.
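As a reference, the sketch below shows the read/transform/write pattern that each pipeline applies, using the songs table as an example. The real implementations live in src/pipelines/*.py; the paths, column picks and partition columns shown here are illustrative assumptions, not necessarily the exact ones the project uses.

# Illustrative sketch of one pipeline: read JSON from S3, build a dimension table,
# write it back to S3 as partitioned Parquet.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sparkify-data-lake")
    .getOrCreate()
)

# Read a slice of the song data (partition A/A/A) from the source bucket.
song_data = spark.read.json("s3a://udacity-dend/song-data/A/A/A/*.json")

# Build the songs dimension table and store it in the data lake bucket.
songs = song_data.select("song_id", "title", "artist_id", "year", "duration").dropDuplicates()
(
    songs.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")          # partition columns are an assumption
    .parquet("s3a://your-data-lake-bucket/songs/")
)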

Run this command:

# Move to the src directory...
cd src

# ...and run the ETL
python etl.py

Now you'd better go eat something, take a nap, or watch the latest episode of Money Heist.

When the ETL process ends, go to your data lake bucket and verify that it worked as expected. The image below shows a sample of the artists table in Apache Parquet format.

Artists table in Apache Parquet format
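If you prefer to check the result from code instead of the console, a quick read of any table works too. The bucket path and the s3a:// scheme below are assumptions; adapt them to your own setup.

# Read the artists table back from the data lake and peek at a few rows.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify-data-lake").getOrCreate()

artists = spark.read.parquet("s3a://your-data-lake-bucket/artists/")
artists.show(5)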
