An awesome project that helps you take the first step on your journey toward data-driven decisions.
This repository provides a starting point for implementing a modern open-source data platform. Data has become an essential part of organizations and their decision-making processes. Having a central repository with all the data enables analysts to provide meaningful analytics for the rest of the organization.
This project explains and shows how a data platform can be established with only open-source products.
Need info or have feedback? Do not hesitate to create an issue.
Airbyte is a new open-source application that is creating a new standard for data integration. Its CDK (Connector Development Kit) makes it easy to integrate data from various data sources. There are other tools, such as Meltano and Kiba, but Airbyte's community is growing rapidly and it is easy to get started. Its approach to data integration is heavily influenced by Singer.
When starting a new data project, I always begin by integrating the following data:
- Holidays (used to create a calendar/date dimension, which makes supporting multiple countries much easier)
- Addresses (source systems often do not provide geospatial and detailed address information)
- Exchange rates (when supporting multiple countries, metrics may need to be converted between currencies)
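
To make the first item concrete, here is a minimal sketch of how holiday data can feed a calendar/date dimension. The `HOLIDAYS` sample data, the `build_date_dimension` function, and the column names are all hypothetical and only for illustration; in a real setup the holidays would come from the integrated source rather than a hard-coded dictionary.

```python
from datetime import date, timedelta

# Hypothetical sample data: in practice this would be loaded from the
# integrated holiday source, keyed by (country code, date).
HOLIDAYS = {
    ("DK", date(2024, 12, 25)): "Christmas Day",
    ("DK", date(2024, 12, 26)): "Second Day of Christmas",
    ("US", date(2024, 12, 25)): "Christmas Day",
}


def build_date_dimension(start, end, countries):
    """Return one row per (country, day) from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        for country in countries:
            holiday_name = HOLIDAYS.get((country, day))
            rows.append(
                {
                    "date": day,
                    "country": country,
                    # Monday is 0, so 5 and 6 are Saturday and Sunday.
                    "is_weekend": day.weekday() >= 5,
                    "is_holiday": holiday_name is not None,
                    "holiday_name": holiday_name,
                }
            )
        day += timedelta(days=1)
    return rows


dim = build_date_dimension(date(2024, 12, 24), date(2024, 12, 27), ["DK", "US"])
for row in dim:
    print(row["date"], row["country"], row["is_holiday"], row["holiday_name"])
```

Keeping one row per country and day means a report can join on both keys and get the right holiday flag for each market.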
This lets me get started quickly, and loading this data up front is generally not a problem, since storage is cheap.
dbt introduces new standards to the data engineering and analyst space. For years people have been using GUI tools like Talend, SSIS, Pentaho, Alteryx etc. If you are familiar with one of these tools, developing jobs can be rather quick. Personally, I always end up banging my head against the wall because of the following problems:
- Components are not very flexible and workarounds are often needed.
- Version control can be a mess due to project files.
- CI/CD for your data pipelines is limited or time consuming.
With dbt, engineers and analysts write their transformations as select statements and Jinja macros. These statements are called "models" and are built using the dbt CLI. dbt compiles the models and executes them according to the desired behaviour:
- Materialize a full table - full refresh
- Materialize a view - overwrite
- Append to existing table - incremental
- Ephemeral - data will not be stored in your database, but can instead be used as a common table expression in other models
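
The four behaviours above can be illustrated with a toy sketch. This is not how dbt is implemented, and it does not use dbt at all: it uses Python's built-in `sqlite3` module to show what each behaviour means for a single select statement. The table names, the `loaded_at` column, and the helper functions are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, loaded_at TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02")],
)

# The "model": a plain select statement (hypothetical example).
MODEL_SQL = "SELECT id, amount, loaded_at FROM raw_orders"


def materialize_table(name):
    # Full refresh: drop the table and rebuild it from the select.
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} AS {MODEL_SQL}")


def materialize_view(name):
    # View: only the query is stored, so results always reflect the source.
    conn.execute(f"DROP VIEW IF EXISTS {name}")
    conn.execute(f"CREATE VIEW {name} AS {MODEL_SQL}")


def materialize_incremental(name):
    # Incremental: build once, then append only rows newer than what exists.
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    if not exists:
        conn.execute(f"CREATE TABLE {name} AS {MODEL_SQL}")
        return
    conn.execute(
        f"INSERT INTO {name} {MODEL_SQL} "
        f"WHERE loaded_at > (SELECT COALESCE(MAX(loaded_at), '') FROM {name})"
    )


def total_via_ephemeral():
    # Ephemeral: nothing is stored; the select is inlined as a CTE.
    sql = f"WITH orders AS ({MODEL_SQL}) SELECT SUM(amount) FROM orders"
    return conn.execute(sql).fetchone()[0]


def count(name):
    return conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]


materialize_table("orders_table")
materialize_view("orders_view")
materialize_incremental("orders_incremental")

# A new source row arrives; only the incremental model is run again.
conn.execute("INSERT INTO raw_orders VALUES (3, 30.0, '2024-01-03')")
materialize_incremental("orders_incremental")

print(count("orders_table"), count("orders_view"), count("orders_incremental"))
print(total_via_ephemeral())
```

After the new row arrives, the table stays stale until it is rebuilt, the view picks up the change immediately, and the incremental run appends only the new row.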
The dbt framework also provides functionality such as:
- Documentation of models, data sources etc.
- Data lineage graph of relationships between models
- Schema and data tests using assertions (unique, not_null, accepted_values etc.)
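
As a rough illustration of what the three named assertions check, here is a small Python sketch. It is not dbt code (in dbt these tests are declared in YAML and compiled to SQL); the function names simply mirror the test names, and the sample rows are hypothetical. Each check returns the offending values or rows, and an empty result means the check passes.

```python
def unique(rows, column):
    """Return non-null values that appear more than once in the column."""
    seen, duplicates = set(), []
    for row in rows:
        value = row[column]
        if value is None:
            continue  # nulls are left to the not_null check
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def not_null(rows, column):
    """Return the rows where the column is missing a value."""
    return [row for row in rows if row[column] is None]


def accepted_values(rows, column, values):
    """Return non-null values that are outside the allowed set."""
    allowed = set(values)
    return [
        row[column]
        for row in rows
        if row[column] is not None and row[column] not in allowed
    ]


# Hypothetical sample data with one problem per check.
orders = [
    {"id": 1, "status": "placed"},
    {"id": 2, "status": "shipped"},
    {"id": 2, "status": "lost"},
    {"id": None, "status": "placed"},
]

print(unique(orders, "id"))
print(not_null(orders, "id"))
print(accepted_values(orders, "status", ["placed", "shipped", "returned"]))
```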
Redash is one of the two big open-source applications for visualization. The other is Metabase, which also has a large following. Which one you pick depends on the organization and its competencies.
Redash reports are built directly on top of SQL queries and offer many different types of visualizations. Metabase is more of a point-and-click reporting tool, which makes it easier to set up for self-service.
The minimum server requirements are as follows:
- Materialize server
- Dagster server
- Airbyte server
The specific technical requirements depend heavily on factors like data volume, number of analysts etc. If you want more details, please check the documentation of each application.
For CI/CD I prefer a local GitHub Actions runner. In most cases it should be sufficient to install it on the Prefect server.
Start by installing Materialize on the first machine using these instructions.
- Log in to your instance:

      psql -U materialize -h localhost -p 6875 materialize

- Create the databases:

      create database raw;
      create database production;
      create database development;
Start by installing Airbyte on the first machine using these instructions.
- Log in to your instance:

      psql -U materialize -h localhost -p 6875 materialize