Run dbt on a directory of Parquet files, with duckdb as the computation engine.
This repo is still in development. Until it's on PyPI, you can install it with

    pip3 install git+https://github.com/AlexanderVR/dbt-parquet.git#egg=dbt-parquet

Use dbt init to create a new project with this adapter, or manually add something like the following to ~/.dbt/profiles.yml:

    jaffle_shop:
      target: dev
      outputs:
        dev:
          type: parquet
          threads: 4
          database: ./data

Note that the database option sets the directory in which the Parquet files are stored.
Data is assumed to be laid out as follows:
- {database}/{table_name}.parquet if no schema is provided
- {database}/{schema}/{table_name}.parquet otherwise

See dbt/adapters/parquet/relation.py for details.
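
As a rough illustration of the layout rule above, here is a minimal sketch of how a relation could be mapped to a file path. The function name `parquet_path` and its signature are hypothetical and are not taken from relation.py; the actual adapter code may differ.

```python
from pathlib import Path
from typing import Optional


def parquet_path(database: str, table_name: str, schema: Optional[str] = None) -> Path:
    """Hypothetical helper: map a relation to the Parquet file that backs it."""
    base = Path(database)
    if schema:
        # A schema, when given, becomes a subdirectory of the database directory.
        base = base / schema
    return base / f"{table_name}.parquet"


# Example: the same table with and without a schema.
print(parquet_path("./data", "customers"))
print(parquet_path("./data", "customers", schema="staging"))
```

Treating the schema as a plain subdirectory keeps the mapping reversible: the table name is the file stem and the schema, if any, is the parent directory name.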
- dbt provides solid DAG-based abstractions for managing collections of related data transformations.
- I don't always need a costly data warehouse for my data problems. I have used dbt-duckdb and dbt-sqlite very successfully.
- When data resides elsewhere, loading it into duckdb or sqlite just to run dbt, then exporting the desired output tables, is not ideal, e.g. when refreshing only parts of the dbt graph.

More generally, thinking of each dbt "model" as generating a data "asset", which can carry a wide variety of metadata and be an input to other computations (dbt or otherwise), can lead in very fruitful directions, as illustrated by the dagster + dbt integration.
The hope with dbt-parquet is that by breaking out assets from a monolithic data "warehouse" or database file, the semantics become as clean and portable as, well, the humble file.
- Note that only table materializations are supported, as views do not make sense with Parquet files.
- With the httpfs extension, duckdb can run queries over files stored in S3. The necessary changes to dbt-parquet would involve abstracting out any calls involving the file path (e.g. listing, removal, rename, creation, and the get_catalog macro) to work against S3.
- For "huge" data, it might be nice to support partitioned files.
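
To make the S3 idea a little more concrete, here is one possible shape for the file-path abstraction mentioned above. This is only a sketch under stated assumptions: `FileStore`, `LocalFileStore`, and their method names are hypothetical and do not exist in dbt-parquet. An S3-backed class would implement the same methods.

```python
import tempfile
from pathlib import Path
from typing import List, Protocol


class FileStore(Protocol):
    """Hypothetical interface for the file operations the adapter needs."""

    def list_files(self, prefix: str = "") -> List[str]: ...
    def rename(self, src: str, dst: str) -> None: ...
    def remove(self, path: str) -> None: ...


class LocalFileStore:
    """Local-disk implementation; an S3 version would expose the same methods."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def list_files(self, prefix: str = "") -> List[str]:
        base = self.root / prefix
        if not base.is_dir():
            return []
        # Return paths relative to the root, with forward slashes.
        return sorted(p.relative_to(self.root).as_posix() for p in base.rglob("*.parquet"))

    def rename(self, src: str, dst: str) -> None:
        target = self.root / dst
        target.parent.mkdir(parents=True, exist_ok=True)
        (self.root / src).replace(target)

    def remove(self, path: str) -> None:
        (self.root / path).unlink()


with tempfile.TemporaryDirectory() as tmp:
    store: FileStore = LocalFileStore(tmp)
    # Empty placeholder file standing in for a freshly written table.
    (Path(tmp) / "orders__tmp.parquet").write_bytes(b"")
    store.rename("orders__tmp.parquet", "staging/orders.parquet")
    listed = store.list_files()
    store.remove("staging/orders.parquet")
    remaining = store.list_files()

print(listed, remaining)
```

Keeping the interface this small means the rest of the adapter would only ever see relative paths, so swapping local disk for object storage would not touch the materialization logic.
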
Inspired by dbt-duckdb, dagster and of course duckdb