This Project simply demonstrates how to build a robust automated data pipeline for extracting, transforming and loading Spotify data using AWS Serverless Data architecture.

jaykay04/Spotify-ETL-Project-Using-AWS-Serverless-Architecture

Spotify ETL Project Using AWS Serverless Architecture

Introduction

This project demonstrates how to build a robust, automated, and scalable data pipeline that extracts, transforms, and loads Spotify data using an AWS serverless data architecture.

Data was extracted from the Spotify API and stored in an S3 bucket. The extraction was automated with CloudWatch Events and scheduled to run every 24 hours. After extraction, the data was transformed and written back to S3, with an S3 event trigger automating this cleaning step as well.

So that the analytics team can consume the data in a structured format, AWS Glue was used to crawl the data in S3 and register it as tables in the spotify database in the AWS Glue Data Catalog.

Finally, AWS Athena was used to query the data and run analytics, producing insights that support data-driven decisions.

AWS Services and Development Environment

  • Jupyter Notebook
  • AWS Lambda
  • S3
  • CloudWatch
  • AWS Glue
  • AWS Athena

Libraries

pip install numpy pandas spotipy
import numpy as np
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

Architecture Diagram

Work Flow

I first developed the code in a Jupyter notebook, where the required data was extracted from the Spotify API and transformed to meet business requirements, ready for deployment to AWS Lambda.
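The extraction step can be sketched roughly as follows. This is a minimal illustration, not the exact notebook code: it assumes the data comes from a Spotify playlist, and the environment variable names SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET as well as both helper functions are hypothetical.

```python
import os
from urllib.parse import urlparse


def playlist_id_from_url(playlist_url: str) -> str:
    """Return the playlist ID, i.e. the last path segment of a playlist URL."""
    path = urlparse(playlist_url).path  # drops any query string such as ?si=...
    return path.rstrip("/").split("/")[-1]


def fetch_playlist_tracks(playlist_url: str) -> dict:
    """Fetch the first page of playlist items from the Spotify Web API."""
    # Imported here so the helper above works without spotipy installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    auth = SpotifyClientCredentials(
        client_id=os.environ["SPOTIFY_CLIENT_ID"],          # hypothetical variable name
        client_secret=os.environ["SPOTIFY_CLIENT_SECRET"],  # hypothetical variable name
    )
    sp = spotipy.Spotify(client_credentials_manager=auth)
    return sp.playlist_tracks(playlist_id_from_url(playlist_url))
```

Reading the credentials from the environment keeps them out of the notebook and carries over directly to Lambda environment variables.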

Once the code worked in the development environment, I deployed the extraction code to AWS Lambda and added a CloudWatch Events trigger to schedule the extraction every 24 hours.
The extracted data was then stored in an S3 bucket so that it could be transformed to meet business requirements.
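A rough sketch of what the extraction Lambda and its schedule might look like is given below. The bucket name, the raw_data/to_process/ prefix, the rule name, and the fetch_spotify_data stub are all assumptions made for illustration; only the 24-hour schedule comes from the project itself.

```python
import json
from datetime import datetime, timezone


def raw_object_key(now: datetime, prefix: str = "raw_data/to_process/") -> str:
    """Build a timestamped S3 key so each daily run writes a new object."""
    return f"{prefix}spotify_raw_{now:%Y%m%dT%H%M%S}.json"


def fetch_spotify_data() -> dict:
    """Hypothetical stub standing in for the Spotify API call."""
    return {"items": []}


def lambda_handler(event, context):
    # boto3 ships with the Lambda Python runtime; it is imported lazily so
    # the key helper can be run locally without it.
    import boto3

    key = raw_object_key(datetime.now(timezone.utc))
    boto3.client("s3").put_object(
        Bucket="my-spotify-etl-bucket",  # hypothetical bucket name
        Key=key,
        Body=json.dumps(fetch_spotify_data()),
    )
    return {"statusCode": 200, "key": key}


def schedule_daily(function_arn: str, rule_name: str = "spotify-extract-daily") -> None:
    """Create a CloudWatch Events rule that targets the function every 24 hours."""
    import boto3

    events = boto3.client("events")
    events.put_rule(Name=rule_name, ScheduleExpression="rate(24 hours)", State="ENABLED")
    events.put_targets(Rule=rule_name, Targets=[{"Id": "spotify-extract", "Arn": function_arn}])
```

The function also needs a resource-based permission that lets the events service invoke it. Adding the trigger through the Lambda console sets this up automatically.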

Next, I deployed the transformation code to Lambda. It takes the raw data from the S3 bucket, transforms it, and stores the result in another folder in the same bucket. This step was also automated by adding an S3 event trigger to the Lambda function.
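The transformation Lambda could look something like the sketch below. It assumes the raw file has the shape of a Spotify playlist response, and the selected fields, the transformed_data/songs/ prefix, and the function names are hypothetical. It uses only the standard library rather than pandas so that it stays self-contained.

```python
import csv
import io
import json
from urllib.parse import unquote_plus

SONG_FIELDS = ["song_id", "song_name", "duration_ms", "popularity",
               "album_id", "artist_id", "added_at"]


def songs_from_raw(raw: dict) -> list:
    """Flatten the nested playlist items into one row per song."""
    rows = []
    for item in raw.get("items", []):
        track = item.get("track")
        if not track:  # skip items whose track is missing or null
            continue
        rows.append({
            "song_id": track["id"],
            "song_name": track["name"],
            "duration_ms": track["duration_ms"],
            "popularity": track["popularity"],
            "album_id": track["album"]["id"],
            "artist_id": track["artists"][0]["id"],  # first listed artist only
            "added_at": item["added_at"],
        })
    return rows


def rows_to_csv(rows: list, fields: list = SONG_FIELDS) -> str:
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def lambda_handler(event, context):
    import boto3  # available in the Lambda Python runtime

    s3 = boto3.client("s3")
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # keys arrive URL-encoded
    raw = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    out_key = "transformed_data/songs/" + key.split("/")[-1].replace(".json", ".csv")
    s3.put_object(Bucket=bucket, Key=out_key, Body=rows_to_csv(songs_from_raw(raw)))
    return {"statusCode": 200, "key": out_key}
```

Because the output lands in the same bucket that fires the trigger, the S3 event should be filtered to the raw-data prefix. Otherwise each transformed file would invoke the function again.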

The data is now clean and ready for the analytics team to consume. To make it easy to run analytics on the data in the S3 bucket, I deployed a Glue crawler, which infers the schema from the data in S3 and automatically registers it as tables in the spotify database in the Glue Data Catalog.
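For illustration, a crawler like this could be created with boto3 along the following lines. The crawler name, IAM role ARN, and S3 path are placeholders, and the two helper functions are hypothetical. The crawler can equally be set up in the Glue console.

```python
def crawler_config(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build the arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role that Glue assumes to read the bucket
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_run_crawler(config: dict) -> None:
    """Create the crawler and start a first run."""
    import boto3  # imported lazily so crawler_config works without it

    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])


config = crawler_config(
    name="spotify-songs-crawler",                                   # hypothetical
    role_arn="arn:aws:iam::123456789012:role/example-glue-role",    # placeholder
    database="spotify",
    s3_path="s3://my-spotify-etl-bucket/transformed_data/songs/",   # hypothetical
)
```

Pointing the crawler at one folder per dataset generally results in one catalog table per folder.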

Finally, AWS Athena was used to run analytics on the data using SQL queries.
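As an example of the kind of query this enables, the sketch below builds a "most popular songs" query and submits it to Athena with boto3. The songs table, its columns, and the results location are assumptions. Only the spotify database name comes from the project.

```python
def top_songs_query(table: str = "songs", limit: int = 10) -> str:
    """Build a SQL query for the most popular songs (table and columns assumed)."""
    return (
        f"SELECT song_name, popularity FROM {table} "
        f"ORDER BY popularity DESC LIMIT {int(limit)}"
    )


def run_query(sql: str, database: str = "spotify",
              output: str = "s3://my-spotify-etl-bucket/athena-results/") -> str:
    """Submit the query to Athena and return its execution ID."""
    import boto3  # imported lazily so the query builder works without it

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},  # hypothetical results path
    )
    return response["QueryExecutionId"]
```

Athena runs queries asynchronously, so the returned ID is what a caller would use to check the query status and fetch the results.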

Conclusion

This project demonstrated how to build a robust, automated, and scalable end-to-end data pipeline using an AWS serverless data architecture.
