This project is about ingesting a couple of Covid-19 datasets from the ECDC website, transforming them with various ADF components, and loading them into a SQL data warehouse so the analytics team can derive useful, actionable insights from them. The goal is to understand the impact of Covid-19 in the European region as a whole during 2020.
Azure Data Factory (ADF) is a fully managed, serverless data integration service for ingesting, preparing, and transforming data at scale. Organizations in every industry use it for a rich variety of use cases: data engineering, migrating on-premises SSIS packages to Azure, operational data integration, analytics, ingesting data into data warehouses, and more.
We have Covid-19 data from multiple sources. Because these datasets are scattered across different locations, it is difficult to extract useful information from them in one place.
The task of this project is to ingest the datasets from the various data sources, then clean and transform them to make them robust and fit for purpose. The cleaned data is then loaded into central repositories, a data warehouse and a data lake, so the analytics team can consume it with BI tools such as Power BI. We can also run ML models on the data lake.
- Azure Subscription
- Data Factory
- Azure Blob Storage Account
- Data Lake Storage Gen 2
- Azure SQL Database
- Azure Databricks Cluster
- HDInsight Cluster
- Azure DevOps
Four different datasets were ingested from both the ECDC website and Azure Blob Storage into Data Lake Storage Gen2. They are:
- Cases and Deaths Data
- Hospital Admissions Data
- Population Data
- Tests Conducted Data
We used various ADF pipeline activities to ingest the data from both the HTTP data source and the Azure Storage account into Azure Data Lake; a sketch of triggering such a pipeline programmatically follows the list. Some of those activities are:
- Validation Activity
- Get Metadata Activity
- Copy Activity
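The ingestion pipelines themselves are authored visually in ADF, but they can be triggered and monitored programmatically. Below is a minimal sketch using the `azure-mgmt-datafactory` Python SDK; the subscription, resource group, factory, pipeline name, and parameter are assumptions, not the project's actual names.

```python
# Minimal sketch: trigger and check an ADF ingestion pipeline run.
# Assumed names throughout; replace with your own resources.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"   # assumption: your Azure subscription
resource_group = "covid-reporting-rg"   # assumption: resource group name
factory_name = "covid-reporting-adf"    # assumption: data factory name

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Kick off the pipeline that wraps the Validation, Get Metadata and Copy activities.
run = client.pipelines.create_run(
    resource_group,
    factory_name,
    "pl_ingest_ecdc_data",  # assumption: pipeline name
    parameters={"sourceRelativeURL": "covid19/casedistribution/csv"},
)

# Poll the run status once (in practice you would loop or use a monitor).
status = client.pipeline_runs.get(resource_group, factory_name, run.run_id).status
print(f"Pipeline run {run.run_id}: {status}")
```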
The Cases and Deaths data, together with the Hospital Admissions data, was transformed using ADF Data Flows. The Data Flow transformations used on both datasets, sketched in PySpark below the list, include:
- Select transformation
- LookUp transformation
- Filter transformation
- Join transformation
- Sort transformation
- Conditional split transformation
- Derived columns transformation
- Sink transformation, etc.
This set of transformations was applied to make the datasets more robust and fit for purpose.
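Data Flows are built in ADF's visual designer rather than in code, but as a rough illustration, the equivalent logic expressed in PySpark might look like the sketch below; all file paths and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cases_deaths_flow").getOrCreate()

# Hypothetical paths and columns; the actual flow is configured visually in ADF.
raw = spark.read.csv("/mnt/covid/raw/cases_deaths.csv", header=True, inferSchema=True)
lookup = spark.read.csv("/mnt/covid/lookup/country_lookup.csv", header=True)

transformed = (
    raw.select("country", "continent", "indicator", "daily_count", "date")  # Select
       .filter(F.col("continent") == "Europe")                              # Filter
       .join(lookup, on="country", how="inner")                             # Lookup/Join
       .withColumn("reported_date", F.to_date("date"))                      # Derived column
       .orderBy("country", "reported_date")                                 # Sort
)

# Conditional split: route cases and deaths rows to separate branches.
cases = transformed.filter(F.col("indicator") == "confirmed cases")
deaths = transformed.filter(F.col("indicator") == "deaths")

# Sink: write each branch to the processed zone of the data lake.
cases.write.mode("overwrite").csv("/mnt/covid/processed/cases", header=True)
deaths.write.mode("overwrite").csv("/mnt/covid/processed/deaths", header=True)
```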
For the Population dataset, Azure Databricks alone was used for the transformation to make it robust and fit for purpose. The following steps were taken to transform the data on Azure Databricks (see the sketch after this list):
- Azure Data Lake Storage was mounted on Databricks to allow Databricks to read the data into the PySpark notebook easily.
- Next, the population data was read and a temp view created, followed by pivoting the dataset by age group, as shown below.
- The lookup data was read and joined, before the transformed data was written back to the mount point in ADLS.
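A hedged PySpark sketch of these three steps, meant to run in a Databricks notebook where `spark` and `dbutils` are predefined; the storage account, container, secret scope, tenant, file paths, and column names are all assumptions.

```python
# Step 1: mount the ADLS Gen2 container so the notebook can read/write it.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="covid-scope", key="client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="covid-scope", key="client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://raw@covidreportingsa.dfs.core.windows.net/",  # assumed account/container
    mount_point="/mnt/covid",
    extra_configs=configs,
)

# Step 2: read the population file, register a temp view, pivot by age group.
raw_df = spark.read.csv("/mnt/covid/population/population_by_age.tsv",
                        sep="\t", header=True, inferSchema=True)
raw_df.createOrReplaceTempView("raw_population")
pivoted_df = (spark.table("raw_population")
              .groupBy("country_code")
              .pivot("age_group")       # one column per age group
              .sum("population"))

# Step 3: join with the country lookup, then write back to the mount point.
lookup_df = spark.read.csv("/mnt/covid/lookup/country_lookup.csv", header=True)
final_df = pivoted_df.join(lookup_df, on="country_code", how="inner")
final_df.write.mode("overwrite").csv("/mnt/covid/processed/population", header=True)
```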
With the datasets fully prepared, we loaded them into the SQL data warehouse for the analytics team to consume with their BI tools with ease.
- The first step was to create the tables in the SQL database that would house each of the datasets, as sketched below.
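As an illustrative sketch, a table for the cases and deaths dataset could be created from Python with `pyodbc`; the server, database, credentials, and column schema shown here are assumptions, not the project's actual DDL.

```python
import pyodbc

# Hypothetical connection details; replace with your Azure SQL instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;"
    "DATABASE=covid_reporting;UID=<user>;PWD=<password>"
)
cursor = conn.cursor()

# Create the target table if it does not already exist (assumed schema).
cursor.execute("""
IF OBJECT_ID('dbo.cases_and_deaths') IS NULL
CREATE TABLE dbo.cases_and_deaths (
    country       VARCHAR(100),
    country_code  VARCHAR(3),
    population    BIGINT,
    cases_count   BIGINT,
    deaths_count  BIGINT,
    reported_date DATE,
    source        VARCHAR(500)
);
""")
conn.commit()
conn.close()
```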
- Covid-19 Trend in the EU/EEA & UK in 2020 by Cases, Deaths, Hospital Occupancy and ICU Occupancy
- Covid-19 Cases and Deaths Breakdown by Population in the UK, France and Germany
- Confirmed Cases vs. Total Deaths by Country
- Total Number of Covid Tests Carried Out vs. Confirmed Cases
- LinkedIn: https://www.linkedin.com/in/joshua-gbegudu-6a7a59a9/
- GitHub: github.com/jaykay04
- Instagram: @Official_jaykay04