Starter project for the open source approach to a datawarehouse

Fredehagelund92/starter_data_platform

Starter Data Platform

An awesome project that will help you take the first step on your journey toward data-driven decisions.

About

This repository provides a starting point for implementing a modern open-source data platform. Data has become an essential part of organizations and their decision-making process. Having a central repository with all the data enables analysts to provide meaningful analytics for the rest of the organization.

This project will be used to explain and show how a data platform can be established with only open-source products.

Need info or have feedback? Do not hesitate to create an issue.


Components

Storage

Extract

Airbyte is a new open-source application that aims to set a new standard for data integration. Its CDK (Connector Development Kit) makes it easy to integrate data from various data sources. There are other tools such as Meltano and Kiba, but Airbyte's community is growing rapidly and it is easy to get started. Its approach to data integration is heavily influenced by Singer.

When starting a new data project, I always begin by integrating the following data:

  • Holidays (used to create a calendar/date dimension, which makes it much easier to support multiple countries)
  • Addresses (source systems often do not provide geospatial or detailed address information)
  • Exchange rates (when supporting multiple countries, metrics may need to be converted between currencies)

This lets me get started quickly, and loading the data up front is generally not a problem since storage is cheap.
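As a rough illustration of why holiday data is useful, the sketch below builds a small per-country date dimension in plain Python. The `HOLIDAYS` mapping and the `build_date_dimension` helper are hypothetical names made up for this example, and the holiday entries are illustrative sample data rather than a complete calendar. In the platform itself this logic would more likely live in a dbt model on top of the integrated holiday source.

```python
from datetime import date, timedelta

# Hypothetical sample data keyed by country code. In a real project these
# rows would come from the integrated holiday source, not a literal dict.
HOLIDAYS = {
    "DK": {
        date(2021, 12, 25): "Christmas Day",
        date(2021, 12, 26): "Second Day of Christmas",
    },
    "US": {
        date(2021, 12, 25): "Christmas Day",
    },
}


def build_date_dimension(start, end, country):
    """Return one row per day in [start, end] for the given country."""
    holidays = HOLIDAYS.get(country, {})
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date": day.isoformat(),
            "country": country,
            "weekday": day.isoweekday(),       # 1 = Monday ... 7 = Sunday
            "is_weekend": day.isoweekday() >= 6,
            "is_holiday": day in holidays,
            "holiday_name": holidays.get(day),  # None on ordinary days
        })
        day += timedelta(days=1)
    return rows


dim_dk = build_date_dimension(date(2021, 12, 23), date(2021, 12, 26), "DK")
dim_us = build_date_dimension(date(2021, 12, 23), date(2021, 12, 26), "US")
```

Because the country is part of every row, the same dimension can serve several countries side by side, which is the main reason for loading holidays early.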


Transformation

dbt introduces new standards to the data engineering and analytics space. For years, people have been using GUI tools like Talend, SSIS, Pentaho and Alteryx. If you are familiar with one of these tools, developing jobs can be rather quick. Personally, I always end up banging my head against the wall because of the following problems:

  • Components are not very flexible and workarounds are often needed.
  • Version control can be a mess due to project files.
  • CI/CD for your data pipelines is limited or time-consuming.

With dbt, engineers and analysts write their transformations as select statements and Jinja macros. These statements are called "models" and can be built using the dbt CLI. dbt then compiles the models and executes them based on the desired behaviour:

  • Materialize a full table - full refresh
  • Materialize a view - overwrite
  • Append to existing table - incremental
  • Ephemeral - data will not be stored in your database, but can instead be used as a common table expression in other models
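The four behaviours above can be sketched in a few lines of Python. This is not how dbt is implemented: it is a minimal illustration that uses SQLite (from the standard library) as a stand-in for the warehouse, and the table names, `MODEL_SQL` and the `materialize_*` helpers are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table raw_orders (id integer, amount real, loaded_at text)")
conn.executemany(
    "insert into raw_orders values (?, ?, ?)",
    [(1, 10.0, "2021-01-01"), (2, 20.0, "2021-01-02")],
)

# The "model": a plain select statement.
MODEL_SQL = "select id, amount, loaded_at from raw_orders"


def materialize_table(name):
    """Full refresh: drop the table and rebuild it from the select."""
    conn.execute(f"drop table if exists {name}")
    conn.execute(f"create table {name} as {MODEL_SQL}")


def materialize_view(name):
    """View: store only the query; rows are computed when it is read."""
    conn.execute(f"drop view if exists {name}")
    conn.execute(f"create view {name} as {MODEL_SQL}")


def materialize_incremental(name):
    """Incremental: append only rows newer than those already loaded."""
    exists = conn.execute(
        "select 1 from sqlite_master where type = 'table' and name = ?", (name,)
    ).fetchone()
    if not exists:
        materialize_table(name)  # first run behaves like a full refresh
        return
    conn.execute(
        f"insert into {name} {MODEL_SQL} "
        f"where loaded_at > (select coalesce(max(loaded_at), '') from {name})"
    )


def with_ephemeral(final_sql):
    """Ephemeral: inline the model as a CTE instead of storing it."""
    return f"with orders as ({MODEL_SQL}) {final_sql}"


materialize_table("orders_tbl")
materialize_view("orders_view")
materialize_incremental("orders_inc")

# A new source row arrives after the first run.
conn.execute("insert into raw_orders values (3, 30.0, '2021-01-03')")
materialize_incremental("orders_inc")  # appends only the new row
```

After the second run, `orders_tbl` still holds the two rows from its last full refresh, `orders_view` reflects all three source rows, and `orders_inc` has had only the new row appended.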

The dbt framework also provides functionality such as:

  • Documentation of models, data sources, etc.
  • A data lineage graph of the relationships between models
  • Schema and data tests using assertions (unique, not_null, accepted_values, etc.)
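To make the assertion idea concrete, here is a minimal sketch that assumes a test is simply a query which passes when it returns no rows. It uses SQLite in place of the warehouse; the `customers` table, its sample rows and the helper functions are hypothetical, and the generated SQL is a simplification, not the SQL dbt itself produces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table customers (id integer, status text)")
conn.executemany(
    "insert into customers values (?, ?)",
    [(1, "active"), (2, "inactive"), (2, "active"), (None, "deleted"), (3, "archived")],
)


def unique(table, column):
    """Failing rows: values that occur more than once."""
    return (
        f"select {column} from {table} where {column} is not null "
        f"group by {column} having count(*) > 1"
    )


def not_null(table, column):
    """Failing rows: rows where the column is null."""
    return f"select * from {table} where {column} is null"


def accepted_values(table, column, values):
    """Failing rows: rows whose value is outside the allowed set."""
    quoted = ", ".join(f"'{v}'" for v in values)
    return f"select * from {table} where {column} not in ({quoted})"


def run_tests(tests):
    """Return {test_name: number_of_failing_rows}; 0 means the test passed."""
    return {name: len(conn.execute(sql).fetchall()) for name, sql in tests.items()}


results = run_tests({
    "unique_customers_id": unique("customers", "id"),
    "not_null_customers_id": not_null("customers", "id"),
    "accepted_values_customers_status": accepted_values(
        "customers", "status", ["active", "inactive"]
    ),
})
failed = sorted(name for name, failures in results.items() if failures > 0)
```

With the sample rows above, all three tests fail, which is the signal a CI/CD pipeline would use to stop a deployment.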

Orchestration

Prefect


Visualization

Redash is one of the two big open-source applications for visualization. The other is Metabase, which also has a large following. Which one you pick depends on the organization and its competencies.

Redash reports are built directly on top of SQL queries and offer many different types of visualizations. Metabase is a more point-and-click style of reporting, which makes it easier to set up self-service analytics.


Requirements

Servers

The minimum server requirements are as follows:

  • Materialize server
  • Prefect server
  • Airbyte server

The specific technical requirements depend heavily on factors such as data volume and the number of analysts. If you want more details, please check the documentation of the individual applications.

For CI/CD, I prefer a local GitHub Actions runner. In most cases it should be sufficient to install it on the Prefect server.


Getting Started

Setup Materialize

Start by installing Materialize on the first machine using these instructions.


  1. Log in to your instance

    psql -U materialize -h localhost -p 6875 materialize
    
  2. Create databases

    create database raw;
    create database production;
    create database development;
    

Setup Airbyte

Start by installing Airbyte on the first machine using these instructions.


  1. Log in to your instance

    psql -U materialize -h localhost -p 6875 materialize
    

Setup dbt project

Setup Prefect

Clone project

Folder structure

Install prefect

Deploy dags

Development
