This project is about ingesting a couple of Covid-19 datasets from the ECDC website, transforming them with various ADF components, and loading them into a SQL data warehouse so the analytics team can derive useful, actionable insights from them. The goal is to understand the impact of Covid-19 in the European region as a whole during 2020.
Azure Data Factory (ADF) is a fully managed, serverless data integration service for ingesting, preparing, and transforming data at scale. Organizations in every industry use it for a rich variety of use cases: data engineering, migrating on-premises SSIS packages to Azure, operational data integration, analytics, ingesting data into data warehouses, and more.
We have Covid-19 data from multiple sources. Because these datasets are scattered across different locations, it is difficult to extract useful information from them in one place.
The task of this project is to ingest the datasets from the various data sources, then clean and transform them to make them robust and fit for purpose. The cleaned data is then loaded into central repositories, a data warehouse and a data lake, so the analytics team can consume it with BI tools such as Power BI. We can also run ML models on the data lake.
- Azure Subscription
- Data Factory
- Azure Blob Storage Account
- Data Lake Storage Gen 2
- Azure SQL Database
- Azure Databricks Cluster
- HDInsight Cluster
- Azure DevOps
Four different datasets were ingested from both the ECDC website and Azure Blob Storage into Data Lake Storage Gen2. They are:
- Cases and Deaths Data
- Hospital Admissions Data
- Population Data
- Tests Conducted Data
We used various ADF pipeline activities to ingest the data from both the HTTP data source and the Azure Storage account into Azure Data Lake; a sketch of triggering such a pipeline programmatically follows the list. Some of those activities are:
- Validation Activity
- Get Metadata Activity
- Copy Activity
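The ingestion pipelines themselves are authored visually in ADF, but they can be triggered and monitored programmatically. Below is a minimal sketch using the `azure-mgmt-datafactory` Python SDK; the subscription, resource group, factory, pipeline name, and parameter are assumptions, not the project's actual names.

```python
# Minimal sketch: trigger and check an ADF ingestion pipeline run.
# Assumed names throughout; replace with your own resources.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"   # assumption: your Azure subscription
resource_group = "covid-reporting-rg"   # assumption: resource group name
factory_name = "covid-reporting-adf"    # assumption: data factory name

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Kick off the pipeline that wraps the Validation, Get Metadata and Copy activities.
run = client.pipelines.create_run(
    resource_group,
    factory_name,
    "pl_ingest_ecdc_data",  # assumption: pipeline name
    parameters={"sourceRelativeURL": "covid19/casedistribution/csv"},
)

# Poll the run status once (in practice you would loop or use a monitor).
status = client.pipeline_runs.get(resource_group, factory_name, run.run_id).status
print(f"Pipeline run {run.run_id}: {status}")
```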
The Cases and Deaths data, together with the Hospital Admissions data, was transformed using ADF Data Flows. The Data Flow transformations used on both datasets, sketched in PySpark below the list, include:
- Select transformation
- LookUp transformation
- Filter transformation
- Join transformation
- Sort transformation
- Conditional split transformation
- Derived columns transformation
- Sink transformation, etc.
This set of transformations was applied to make the datasets more robust and fit for purpose.
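Data Flows are built in ADF's visual designer rather than in code, but as a rough illustration, the equivalent logic expressed in PySpark might look like the sketch below; all file paths and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cases_deaths_flow").getOrCreate()

# Hypothetical paths and columns; the actual flow is configured visually in ADF.
raw = spark.read.csv("/mnt/covid/raw/cases_deaths.csv", header=True, inferSchema=True)
lookup = spark.read.csv("/mnt/covid/lookup/country_lookup.csv", header=True)

transformed = (
    raw.select("country", "continent", "indicator", "daily_count", "date")  # Select
       .filter(F.col("continent") == "Europe")                              # Filter
       .join(lookup, on="country", how="inner")                             # Lookup/Join
       .withColumn("reported_date", F.to_date("date"))                      # Derived column
       .orderBy("country", "reported_date")                                 # Sort
)

# Conditional split: route cases and deaths rows to separate branches.
cases = transformed.filter(F.col("indicator") == "confirmed cases")
deaths = transformed.filter(F.col("indicator") == "deaths")

# Sink: write each branch to the processed zone of the data lake.
cases.write.mode("overwrite").csv("/mnt/covid/processed/cases", header=True)
deaths.write.mode("overwrite").csv("/mnt/covid/processed/deaths", header=True)
```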
For the Population dataset, Azure Databricks alone was used for the transformation to make it robust and fit for purpose. The following steps were taken to transform the data on Azure Databricks (see the sketch after this list):
- Azure Data Lake Storage was mounted on Databricks to allow Databricks to read the data into the PySpark notebook easily.
- Next, the population data was read and a temp view created, followed by pivoting the dataset by age group, as shown below.
- The lookup data was read and joined, before the transformed data was written back to the mount point in ADLS.
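A hedged PySpark sketch of these three steps, meant to run in a Databricks notebook where `spark` and `dbutils` are predefined; the storage account, container, secret scope, tenant, file paths, and column names are all assumptions.

```python
# Step 1: mount the ADLS Gen2 container so the notebook can read/write it.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="covid-scope", key="client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="covid-scope", key="client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://raw@covidreportingsa.dfs.core.windows.net/",  # assumed account/container
    mount_point="/mnt/covid",
    extra_configs=configs,
)

# Step 2: read the population file, register a temp view, pivot by age group.
raw_df = spark.read.csv("/mnt/covid/population/population_by_age.tsv",
                        sep="\t", header=True, inferSchema=True)
raw_df.createOrReplaceTempView("raw_population")
pivoted_df = (spark.table("raw_population")
              .groupBy("country_code")
              .pivot("age_group")       # one column per age group
              .sum("population"))

# Step 3: join with the country lookup, then write back to the mount point.
lookup_df = spark.read.csv("/mnt/covid/lookup/country_lookup.csv", header=True)
final_df = pivoted_df.join(lookup_df, on="country_code", how="inner")
final_df.write.mode("overwrite").csv("/mnt/covid/processed/population", header=True)
```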
With the datasets fully prepared, we loaded them into the SQL data warehouse for the analytics team to consume with their BI tools with ease.
- The first step was to create the tables in the SQL database that would house each of the datasets, as sketched below.
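As an illustrative sketch, a table for the cases and deaths dataset could be created from Python with `pyodbc`; the server, database, credentials, and column schema shown here are assumptions, not the project's actual DDL.

```python
import pyodbc

# Hypothetical connection details; replace with your Azure SQL instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;"
    "DATABASE=covid_reporting;UID=<user>;PWD=<password>"
)
cursor = conn.cursor()

# Create the target table if it does not already exist (assumed schema).
cursor.execute("""
IF OBJECT_ID('dbo.cases_and_deaths') IS NULL
CREATE TABLE dbo.cases_and_deaths (
    country       VARCHAR(100),
    country_code  VARCHAR(3),
    population    BIGINT,
    cases_count   BIGINT,
    deaths_count  BIGINT,
    reported_date DATE,
    source        VARCHAR(500)
);
""")
conn.commit()
conn.close()
```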
- Covid-19 Trend in the EU/EEA & UK in 2020 by Cases, Deaths, Hospital Occupancy and ICU Occupancy
- Covid-19 Cases and Deaths Breakdown by Population in the UK, France and Germany
- Confirmed Cases vs. Total Deaths by Country
- Total Number of Covid Tests Carried Out vs. Confirmed Cases
- LinkedIn: https://www.linkedin.com/in/joshua-gbegudu-6a7a59a9/
- GitHub: github.com/jaykay04
- Instagram: @Official_jaykay04