Starter Data Platform

An awesome project that will help you take the first step on your journey toward data-driven decisions.

About

This repository provides a starting point for implementing a modern open-source data platform. Data has become an essential part of organizations and their decision-making processes. Having a central repository with all the data enables analysts to provide meaningful analytics for the rest of the organization.

This project is used to explain and show how a data platform can be established with only open-source products.

Need info or have feedback? Do not hesitate to create an issue.

Components

Storage

Extract

Airbyte is a new open-source application that is creating the new standard for data integration. Its CDK (Connector Development Kit) makes it easy to integrate data from various data sources. There are other tools such as Meltano and Kiba; however, Airbyte's community is growing rapidly and it is easy to get started. Its approach to data integration is heavily influenced by Singer.

When starting a new data project, I always begin by integrating the following data:

Holidays (used to create a calendar/date dimension, which makes it much easier to support multiple countries)
Addresses (source systems often do not provide geospatial or detailed address information)
Exchange rates (again, when supporting multiple countries, metrics may need to be converted between currencies)

This lets me get started quickly, and loading this extra data is generally not a problem since storage is cheap.
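As an illustration of the calendar/date dimension mentioned above, here is a minimal sketch in plain Python. The `HOLIDAYS` mapping, the `build_date_dimension` function, and the column names are all hypothetical; in a real setup the holiday rows would come from the integrated holiday source, and the dimension would more likely be built as a model in the warehouse.

```python
from datetime import date, timedelta

# Hypothetical sample of an integrated holiday feed:
# (country_code, date) -> holiday name.
HOLIDAYS = {
    ("DK", date(2024, 12, 25)): "Christmas Day",
    ("DK", date(2024, 12, 26)): "Second Day of Christmas",
    ("US", date(2024, 12, 25)): "Christmas Day",
}


def build_date_dimension(start, end, countries):
    """Return one row per (calendar day, country) from start to end inclusive."""
    rows = []
    day = start
    while day <= end:
        for country in countries:
            holiday = HOLIDAYS.get((country, day))
            rows.append({
                "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20241224
                "date": day,
                "country": country,
                "is_weekend": day.isoweekday() >= 6,  # Saturday or Sunday
                "is_holiday": holiday is not None,
                "holiday_name": holiday,
            })
        day += timedelta(days=1)
    return rows


dim_date = build_date_dimension(date(2024, 12, 24), date(2024, 12, 27), ["DK", "US"])
```

Keeping one row per country and day means a report can join on both keys and get the correct holiday flag for each country without special cases.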

Transformation

dbt introduces new standards to the data engineering and analyst space. For years people have been using GUI tools like Talend, SSIS, Pentaho, Alteryx etc. If you are familiar with one of these tools, it can be rather quick to develop jobs. Personally, I always end up banging my head against the wall due to the following problems:

Components are not very flexible and workarounds are often needed.
Version control can be a mess due to project files.
CI/CD for your data pipelines is limited or time-consuming.

With dbt, engineers and analysts are able to do their transformations using select statements and Jinja macros. These statements are called "models" and can be built using the dbt CLI. dbt will then compile the models and execute them based on the desired behaviour:

Materialize a full table - full refresh
Materialize a view - overwrite
Append to existing table - incremental
Ephemeral - data will not be stored in your database, but can instead be used as a common table expression in other models
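To make the difference between these behaviours concrete, the sketch below emulates them by hand against an in-memory SQLite database. This is not how dbt is implemented and it does not use dbt at all; the table names and the `MODEL_SQL` select are hypothetical, and the incremental filter on `loaded_at` is just one possible way to pick out new rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (id INTEGER, amount REAL, loaded_at TEXT);
    INSERT INTO raw_orders VALUES (1, 10.0, '2024-01-01'), (2, 25.0, '2024-01-02');
""")

# The "model" is just a select statement (hypothetical example).
MODEL_SQL = "SELECT id, amount, loaded_at FROM raw_orders WHERE amount > 0"

# Table: rebuild the whole relation on every run (full refresh).
conn.execute("DROP TABLE IF EXISTS orders_table")
conn.execute(f"CREATE TABLE orders_table AS {MODEL_SQL}")

# View: store only the query; it is re-evaluated on every read.
conn.execute("DROP VIEW IF EXISTS orders_view")
conn.execute(f"CREATE VIEW orders_view AS {MODEL_SQL}")

# Incremental: build once, then append only rows newer than what is stored.
conn.execute(f"CREATE TABLE orders_incremental AS {MODEL_SQL}")
conn.execute("INSERT INTO raw_orders VALUES (3, 40.0, '2024-01-03')")
conn.execute(f"""
    INSERT INTO orders_incremental
    SELECT * FROM ({MODEL_SQL})
    WHERE loaded_at > (SELECT MAX(loaded_at) FROM orders_incremental)
""")

# Ephemeral: nothing is stored; the select is inlined as a CTE downstream.
total = conn.execute(f"""
    WITH orders_ephemeral AS ({MODEL_SQL})
    SELECT SUM(amount) FROM orders_ephemeral
""").fetchone()[0]

table_rows = conn.execute("SELECT COUNT(*) FROM orders_table").fetchone()[0]
view_rows = conn.execute("SELECT COUNT(*) FROM orders_view").fetchone()[0]
incremental_rows = conn.execute("SELECT COUNT(*) FROM orders_incremental").fetchone()[0]
```

After the third raw row arrives, the table built earlier is stale until the next full refresh, while the view and the incremental table both reflect the new row. That trade-off between freshness and rebuild cost is the main reason to pick one behaviour over another.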

The dbt framework also provides functionality such as:

Documentation of models, data sources, etc.
Data lineage graph of relationships between models
Schema and data tests using assertions (unique, not_null, accepted_values etc.)
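The assertions named above can be thought of as queries that select the rows violating a rule, so that a test passes when no rows come back. The sketch below shows that idea with hand-written queries against an in-memory SQLite database. The `customers` table and the `failing_rows` helper are hypothetical, and the SQL is an approximation for illustration, not the SQL dbt actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, email TEXT, status TEXT);
    INSERT INTO customers VALUES
        (1, 'a@example.com', 'active'),
        (2, NULL,            'inactive'),
        (2, 'c@example.com', 'unknown');
""")


def failing_rows(sql):
    """Run a test query; an empty result means the assertion holds."""
    return conn.execute(sql).fetchall()


# unique: every id should appear exactly once.
unique_failures = failing_rows(
    "SELECT id FROM customers GROUP BY id HAVING COUNT(*) > 1"
)

# not_null: email must always be populated.
not_null_failures = failing_rows(
    "SELECT id FROM customers WHERE email IS NULL"
)

# accepted_values: status must be one of a known set.
accepted_values_failures = failing_rows(
    "SELECT id, status FROM customers WHERE status NOT IN ('active', 'inactive')"
)

results = {
    "unique": not unique_failures,
    "not_null": not not_null_failures,
    "accepted_values": not accepted_values_failures,
}
```

With the sample rows above, all three checks fail, which is the kind of signal a CI job can use to stop a broken model from reaching production.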

Orchestration

Prefect

Visualization

Redash is one of the two big open-source applications for visualization. The other one is Metabase, which also has a large following. Which one you pick depends on the organization and its competencies.

Redash reports are built directly on top of SQL queries and offer many different types of visualizations. Metabase is more of a point-and-click reporting tool, which makes it easier to set up for self-service.

Requirements

The minimum server requirements are as follows:

Materialize server
Prefect server
Airbyte server

The specific technical requirements depend heavily on factors such as data volume and the number of analysts. If you want more details, please check the documentation of the individual applications.

For CI/CD, I prefer a local GitHub Actions runner. In most cases, it should be sufficient to install it on the Prefect server.

Getting Started

Setup Materialize

Start by installing Materialize on the first machine using these instructions.

Log in to your instance

psql -U materialize -h localhost -p 6875 materialize

Create databases

create database raw;
create database production;
create database development;

Setup Airbyte

Start by installing Airbyte on the first machine using these instructions.

Log in to your instance

psql -U materialize -h localhost -p 6875 materialize

Fredehagelund92/starter_data_platform

Folders and files

Latest commit

History

Repository files navigation

Starter Data Platform

About

Table of contents

Components

Storage

Extract

Transformation

Orchestration

Visualization

Requirements

Getting Started

Setup Materialize

Setup Airbyte

Setup dbt project

Setup Prefect

Clone project

Folder structure

Install prefect

Deploy dags

Development

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages