This project demonstrates how to build a robust, automated, and scalable data pipeline for extracting, transforming, and loading Spotify data using AWS serverless data architecture.
Data was extracted from the Spotify API and stored in an S3 bucket. The extraction was automated with a CloudWatch Events rule scheduled to run every 24 hours. After extraction, the data was transformed and stored back in S3, with an S3 event trigger automating this cleaning step as well.
To let the analytics team consume the data in a structured format, an AWS Glue crawler was used to crawl the data in S3 and save it as a table in the spotify database in the AWS Glue Data Catalog.
Finally, AWS Athena was used to query the data and run analytics to derive actionable insights that support data-driven decisions.
- Jupyter Notebook
- AWS Lambda
- S3
- CloudWatch
- AWS Glue
- AWS Athena
pip install numpy
pip install pandas
pip install spotipy
import numpy as np
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
I first developed the code in a Jupyter notebook, where the required data was extracted from the Spotify API and transformed to meet business requirements, ready to be deployed to AWS Lambda.
Once the code worked in the dev environment, I deployed the extraction code to AWS Lambda and added a CloudWatch Events trigger to schedule the extraction every 24 hours.
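A minimal sketch of what the extraction function might look like. The source does not say which Spotify endpoint was used, so the playlist lookup here is an assumption, as are the environment variable names (`SPOTIFY_CLIENT_ID`, `SPOTIFY_CLIENT_SECRET`, `BUCKET_NAME`, `PLAYLIST_ID`) and the `raw_data/to_process/` key prefix. The Spotify and S3 clients are passed in as arguments so the logic can be tried locally with stand-ins.

```python
import json
import os
from datetime import datetime, timezone


def build_raw_key(now, prefix="raw_data/to_process/"):
    """Return a timestamped S3 key so each scheduled run writes a new object."""
    # The prefix and file name pattern are hypothetical.
    return f"{prefix}spotify_raw_{now.strftime('%Y-%m-%dT%H-%M-%S')}.json"


def extract_to_s3(sp, s3, bucket, playlist_id, now):
    """Fetch the playlist tracks and store the raw JSON response in S3."""
    data = sp.playlist_tracks(playlist_id)
    key = build_raw_key(now)
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(data))
    return key


def lambda_handler(event, context):
    # Imported here so the helpers above can be run without these packages.
    import boto3
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    # All environment variable names below are assumptions.
    auth = SpotifyClientCredentials(
        client_id=os.environ["SPOTIFY_CLIENT_ID"],
        client_secret=os.environ["SPOTIFY_CLIENT_SECRET"],
    )
    sp = spotipy.Spotify(client_credentials_manager=auth)
    key = extract_to_s3(
        sp,
        boto3.client("s3"),
        bucket=os.environ["BUCKET_NAME"],
        playlist_id=os.environ["PLAYLIST_ID"],
        now=datetime.now(timezone.utc),
    )
    return {"statusCode": 200, "body": key}
```

Writing each run to a new timestamped key keeps earlier extractions intact, which makes a failed or partial run easier to diagnose.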
The extracted data was then stored in an S3 bucket so that it could be transformed to meet business requirements.
Next, I deployed the transformation code to Lambda. It takes the raw data from the S3 bucket and transforms it before storing it in another folder in the same S3 bucket. This step was also automated by adding an S3 event trigger to the Lambda function.
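The transformation step could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: it assumes the raw file is a Spotify playlist-tracks response, picks a hypothetical set of output columns and a hypothetical `transformed_data/songs/` prefix, and uses the standard-library `csv` module in place of pandas to keep the sketch dependency-free.

```python
import csv
import io
import json
from urllib.parse import unquote_plus


def songs_from_playlist(data):
    """Flatten a playlist-tracks response into one dict per song."""
    rows = []
    for item in data.get("items", []):
        track = item.get("track")
        if not track:  # skip entries whose track is missing
            continue
        rows.append({
            "song_id": track.get("id"),
            "song_name": track.get("name"),
            "duration_ms": track.get("duration_ms"),
            "popularity": track.get("popularity"),
            "album_id": (track.get("album") or {}).get("id"),
            "artist_id": (track.get("artists") or [{}])[0].get("id"),
            "added_at": item.get("added_at"),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    if rows:
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
    return buffer.getvalue()


def transformed_key(raw_key, prefix="transformed_data/songs/"):
    """Map a raw object key to its output key (the prefix is hypothetical)."""
    name = raw_key.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return f"{prefix}{name}.csv"


def process_event(event, s3):
    """Transform every object named in an S3 event notification."""
    written = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        out_key = transformed_key(key)
        rows = songs_from_playlist(json.loads(body))
        s3.put_object(Bucket=bucket, Key=out_key, Body=rows_to_csv(rows))
        written.append(out_key)
    return written


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above run without AWS
    return {"written": process_event(event, boto3.client("s3"))}
```

Because the output lands in the same bucket as the input, the S3 trigger should be limited to the raw-data prefix. Otherwise each transformed file would fire the function again.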
The data is now clean and ready for the analytics team to consume. To make the data in the S3 bucket easy to analyze, I deployed a Glue crawler, which infers the schema from the data in S3 and automatically registers it as a table in the spotify database in the Glue Data Catalog.
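For readers who prefer to script this step rather than use the console, a crawler can also be created with boto3. The sketch below is only an illustration: the crawler name, IAM role ARN, and S3 path are placeholders, and only the `spotify` database name comes from the description above.

```python
def crawler_settings(name, role_arn, database, s3_path):
    """Build the keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue, settings):
    """Register the crawler, then start its first run."""
    glue.create_crawler(**settings)
    glue.start_crawler(Name=settings["Name"])


def main():
    import boto3  # imported here so the helpers above run without AWS

    settings = crawler_settings(
        name="spotify-songs-crawler",                            # hypothetical
        role_arn="arn:aws:iam::123456789012:role/GlueCrawler",   # placeholder
        database="spotify",
        s3_path="s3://my-spotify-bucket/transformed_data/songs/",  # placeholder
    )
    create_and_start_crawler(boto3.client("glue"), settings)
```

The crawler writes only table metadata to the Data Catalog. The files themselves stay in S3, where Athena reads them at query time.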
Finally, AWS Athena was used to run analytics on the data with SQL queries.
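Queries can be run from the Athena console, or programmatically as in the rough sketch below. The helper name `run_athena_query`, the `songs` table in the usage note, and the results location are assumptions made for illustration. The Athena client is passed in so the polling logic can be exercised without AWS.

```python
import time


def run_athena_query(athena, sql, database, output_location,
                     poll_seconds=1.0, max_polls=60):
    """Start a query, wait for it to finish, and return the first page of rows."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a final state is reached.
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Athena query {query_id} did not finish in time")

    # Only the first page of results is read here; for SELECT queries the
    # first row holds the column names.
    result = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

A call might look like `run_athena_query(boto3.client("athena"), "SELECT song_name, popularity FROM songs ORDER BY popularity DESC LIMIT 10", "spotify", "s3://my-athena-results/")`, where the table, columns, and results bucket are hypothetical.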
This project demonstrated how to build a robust, automated, scalable end-to-end data pipeline using AWS serverless data architecture.