
End-to-End FRED API Data Pipeline

Table of contents

  • Problem Statement
  • Technologies
  • Project Data flow
  • Reproducibility
  • Further Improvements

Problem Statement

This project implements concepts learned from the Data Engineering Zoomcamp course to build a pipeline that ingests data, stores it in a data lake, and processes it into a data warehouse.

The data used in this project comes from FRED Economic Data. A Python script orchestrated with Prefect calls the FRED API, stores the raw data in a Google Cloud Storage bucket, transforms it into BigQuery with dbt, and connects BigQuery to Looker Studio to find insights in the data.

FRED Economic Data contains many datasets. I want to create a pipeline that ingests these datasets based on the chosen topics and analyzes the data to find any interesting trends. The data is updated at different frequencies (monthly, quarterly, yearly, etc.), and some of the data goes back to the 1990s.

In this project, I aim to create a pipeline to ingest the above data and create reports that show:

  • The trend of the data
  • The breakdown of the data by country

Technologies

This project uses the tools below.

  • Prefect for data orchestration
  • dbt Core as the data transformation tool for the data warehouse
  • Terraform for infrastructure setup and management
  • Docker for hosting the Prefect agent
  • Google Cloud Storage bucket as the data lake
  • Google BigQuery as the data warehouse
  • Google Compute Engine as the VM hosting the pipeline
  • Looker Studio for reporting and visualization
  • Makefile for ease of reproducibility

Project Data flow

/other/image/dataflow.PNG

The data flow for this project.

(1) Ingest the Data via API

The data is retrieved from the FRED API.

We first need to get the category id for each topic. FRED does not have an endpoint that lists all category ids, so we need to scrape them from the Category Page using this Python script.

Checking robots.txt, FRED does not disallow scraping of this data.
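
A minimal sketch of what such a scraper could look like, assuming the requests and BeautifulSoup libraries and that category links on the page follow the pattern /categories/<id>; the actual script in this repository may differ.

import re
import requests
from bs4 import BeautifulSoup

# Hypothetical sketch: fetch the FRED categories page and collect the numeric
# category ids from links of the form /categories/<id>.
url = "https://fred.stlouisfed.org/categories"
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

category_ids = set()
for link in soup.find_all("a", href=True):
    match = re.match(r"^/categories/(\d+)", link["href"])
    if match:
        category_ids.add(int(match.group(1)))

print(sorted(category_ids))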

After running the script and getting the category ids, we can use each id to call the Category Series endpoint to get all the series associated with that category.
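
For reference, a call to that endpoint might look roughly like the sketch below, assuming the fred/category/series REST endpoint and an API key stored in a FRED_API_KEY environment variable; parameter and response field names should be checked against the FRED API documentation.

import os
import requests

# Sketch: list the series under one category id via the FRED API.
params = {
    "category_id": 32991,                  # example category id from the scrape
    "api_key": os.environ["FRED_API_KEY"],
    "file_type": "json",
}
resp = requests.get("https://api.stlouisfed.org/fred/category/series", params=params, timeout=30)
resp.raise_for_status()
series_ids = [s["id"] for s in resp.json()["seriess"]]  # FRED nests the series list under "seriess"
print(series_ids[:10])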

After we get the series ids, we can call the Maps API - Series Group Info endpoint to get the group id of each series. Not every series has a group id, and there is no easy way to tell which series do and which do not. So, to reduce the complexity, I have extracted the series with a group id and saved them in this CSV.

This file uses an Active column to indicate which series should be called daily by the script, along with the frequency at which the data is collected, the date the data starts, and the current data date.
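
As an illustration, the daily script could read that CSV and keep only the active series, along the lines of the sketch below; the file name and flag value here are assumptions based on the description above.

import pandas as pd

# Sketch: load the seed CSV and keep only the series flagged as active,
# so only those are called by the daily ingestion run.
series_df = pd.read_csv("fred_series_group.csv")   # hypothetical file name
active_df = series_df[series_df["Active"] == 1]    # flag value is an assumption

for row in active_df.itertuples():
    print(row)  # each row carries the series/group id and its update frequency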


Then we use the series group id to call the Maps API - Series Data endpoint, which returns the data by country for the requested id.
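
Roughly, the two Maps API calls chain together as in the sketch below, assuming GeoFRED-style maps endpoints and the same FRED_API_KEY environment variable; the endpoint paths and parameter names are assumptions and should be verified against the Maps API documentation.

import os
import requests

API_KEY = os.environ["FRED_API_KEY"]
BASE = "https://api.stlouisfed.org/geofred"   # assumed base URL for the Maps API

# Sketch step 1: Series Group Info - look up the group a series belongs to.
group_info = requests.get(
    f"{BASE}/series/group",
    params={"series_id": "EXAMPLE_SERIES_ID", "api_key": API_KEY, "file_type": "json"},
    timeout=30,
).json()

# Sketch step 2: Series Data - pull the values broken down by region/country.
series_data = requests.get(
    f"{BASE}/series/data",
    params={"series_id": "EXAMPLE_SERIES_ID", "api_key": API_KEY, "file_type": "json"},
    timeout=30,
).json()
print(group_info, series_data)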

(2) Load data via an external table

The most recent data is stored in Google Cloud Storage in a staging folder.

/other/image/bucket1.png

When new data is called, the previous data is moved to the archive folder to keep a log, and it is deleted after 30 days.

/other/image/bucket2.png
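
A sketch of how the staging-to-archive move could be done with the google-cloud-storage client is shown below; the bucket and folder names are placeholders, and the 30-day deletion can also be handled by a bucket lifecycle rule, so the repository's actual implementation may differ.

from google.cloud import storage

# Sketch: move every object from the staging/ prefix to archive/ before a new load.
client = storage.Client()
bucket = client.bucket("fred-data-lake")               # placeholder bucket name

for blob in client.list_blobs(bucket, prefix="staging/"):
    archived_name = blob.name.replace("staging/", "archive/", 1)
    bucket.copy_blob(blob, bucket, archived_name)      # copy into the archive folder
    blob.delete()                                      # then remove the staging copy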

We use Google BigQuery to connect to the data through external table data sources; this connection is defined in the Terraform file. The data is separated into two datasets, one for development and one for production.
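
For illustration, the external table that the Terraform file defines is roughly equivalent to the Python sketch below using the google-cloud-bigquery client; the project, dataset, bucket, and source format here are placeholders, not the values used by this repository.

from google.cloud import bigquery

# Sketch: define an external table in BigQuery that reads directly from the
# staging folder of the data lake bucket.
client = bigquery.Client()
table = bigquery.Table("my-project.fred_dev.fred_external")      # placeholder table id
external_config = bigquery.ExternalConfig("CSV")                 # assumed source format
external_config.source_uris = ["gs://fred-data-lake/staging/*"]  # placeholder bucket/prefix
external_config.autodetect = True
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)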

(3) Data transformation

This project uses dbt for data transformation. The models are separated into two stages, staging and core: the staging models cast the data into the correct types, and the core models join all the tables from staging and from the seeds into one table.

(4) Visualize data to find insight

Looker Studio connects to the BigQuery data warehouse to create reports that surface trends and insights.

/other/image/bucket1.png

Link to the Dashboard

Due to the trial account, the above link may no longer work.

If you can't access the link, please use this video to see the dashboard instead.

(5)(6) Schedule daily data ingest and transformation

Prefect is used as the orchestration tool to schedule the daily run of our ingestion script and the data transformation, using its deployment functionality.
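
As a rough illustration, a Prefect 2.x deployment with a cron schedule can be built along the lines below; the flow body, deployment name, cron expression, and the CronSchedule import path are assumptions that depend on the Prefect version, and the repository's own deployment files are the source of truth.

from prefect import flow
from prefect.deployments import Deployment
from prefect.server.schemas.schedules import CronSchedule  # prefect.orion... on older 2.x releases

@flow(log_prints=True)
def fred_ingest_flow():
    # Placeholder body: the real flow calls the FRED/Maps APIs and writes to GCS.
    print("ingesting FRED data")

# Sketch: register a deployment that the Prefect agent running in Docker will pick up.
deployment = Deployment.build_from_flow(
    flow=fred_ingest_flow,
    name="Fred-MapAPI",                       # example name matching the jobs listed below
    schedule=CronSchedule(cron="0 6 * * *"),  # example cron; adjust to the project's schedule
    work_queue_name="default",
)

if __name__ == "__main__":
    deployment.apply()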

Reproducibility

Prerequisites: to reproduce this project you will need the accounts below.

  1. Google Cloud Account And Service Account
  2. Prefect Cloud Account And API Key
  3. FRED Account And API Key

You also need the packages below:

  1. Makefile pip install make
  2. Gcloud CLI pip install gcloud
  3. Terraform
  4. DotEnv pip install python-dotenv

Step 1: Create infrastructure

Clone this project

git clone https://github.com/Chalermdej-l/Final_Project_FredETE

Access the cloned directory

cd Final_Project_FredETE

Input the credentials created in the prerequisite step into the .env file


Place the credential.json created for the Google Cloud service account into the cred folder


Run one of the following commands using the Makefile, depending on your system

make update-yml-window
make update-yml-linix

This command will populate the YAML file with the credentials from the .env file

!!Please note: if you ran this command before inputting the credentials, please re-clone the project, as the YAML file will have been populated with incorrect data

Next, let's set up the infrastructure

make infra-setup

This command will set up Terraform and run a plan to check for any errors

/other/image/repeoducesetup1.png

To create the resources, please run

make infra-create

This will create the BigQuery datasets, the Google Cloud Storage bucket, and the VM instance

Once it is done, please go to the VM instance and copy the external IP

/other/image/repeoducesetup3.png

Input the external IP into the .env file; we will need it to connect to the VM

Step 2: Connect to the VM and install the dependencies

Open a new terminal and navigate to the cloned directory; we will use this terminal to connect to the VM we created. Then run the command below

make vm-connect

This script will connect to the VM. You might be asked whether to save this host to the known hosts; please select yes.


After we are in the VM, please clone the repository again

git clone https://github.com/Chalermdej-l/Final_Project_FredETE

Then navigate to the cloned folder

cd Final_Project_FredETE

Run the commands below to install Python, Make, and dotenv

sudo apt-get update -y
sudo apt install python3-pip -y
sudo pip install make python-dotenv

Then go back to the local terminal and run

make vm-copycred

This will copy the credentials we input in .env and the credential.json we downloaded to the VM

Go back to the VM terminal and run the command below to set up the credentials

make update-yml-linix

!!Please note: if you ran this command before copying the credentials over, please re-clone the project in the VM and start from the previous step

Then run the commands below to install Docker and Docker Compose

make vm-setup
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version

If docker-compose installed correctly, the current version will be printed.

After that finishes, run the command below to build the Docker image

make docker-build

Then run this command to spin up the container, which will host our Prefect agent in the background

make docker-up

Step 3: Deploy code to Prefect

Once Docker is running, please run this command; it will deploy our Python script to Prefect with a monthly schedule

make deployment-create

This command will deploy the script that runs dbt to transform our database, with a monthly schedule, as the data is only updated monthly.

make deployment-dbtprod

/other/image/prefectschedule.png

Then please run the command below to set up the data needed to call the script

make dbt-ingest

If you run this command and receive the error [Errno 2] No such file or directory: 'cred/credential.json', then run the command below

nano profiles.yml

Then change the keyfile to ../cred/credential.json and re-run the above command


This will run the dbt seed and set up the data for the script to run

Step 4: Run the script

Please go to Prefect and run the jobs in the order below to start ingesting the data

  1. Fred-Category

  2. Fred-Series

  3. Fred-MapAPI

  4. DBT-Daily-prod

After all the jobs finish running, the data will be ingested into BigQuery

Step 5: Clean up

Once you are finished with the project, you can remove everything by running the command below in the local terminal

make infra-down

This command will run terraform destroy on all the resources we created in this project.

Further Improvements

  • Implement CI/CD
  • Explore other endpoints of the API
