`dbt-parquet`

Run dbt on a directory of Parquet files, with duckdb as the computation engine.

Installation

This repo is still in development. Until it's on PyPI, you can install it with `pip3 install git+https://github.com/AlexanderVR/dbt-parquet.git#egg=dbt-parquet`

Usage

Use dbt init to create a new project with this adapter.

Or manually add something like the following to ~/.dbt/profiles.yml:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: parquet
      threads: 4
      database: ./data

Note that the database option is the path of the directory where the Parquet files are stored.

Data is assumed to be laid out as follows:

{database}/{table_name}.parquet if no schema is provided
{database}/{schema}/{table_name}.parquet otherwise

See dbt/adapters/parquet/relation.py for details.
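
As a rough illustration of the layout above, the mapping from a relation to a file path could be sketched as follows. The function name `parquet_path` and its signature are hypothetical, chosen only for this example; the adapter's actual logic lives in dbt/adapters/parquet/relation.py and may differ.

```python
from pathlib import Path
from typing import Optional


def parquet_path(database: str, table_name: str, schema: Optional[str] = None) -> Path:
    """Hypothetical helper: build the Parquet file path for a relation.

    Follows the layout described above:
      {database}/{table_name}.parquet            if no schema is provided
      {database}/{schema}/{table_name}.parquet   otherwise
    """
    base = Path(database)
    if schema:
        # A schema becomes a subdirectory of the database directory.
        base = base / schema
    return base / f"{table_name}.parquet"


# Example: one relation without a schema and one with a schema.
print(parquet_path("./data", "customers"))
print(parquet_path("./data", "customers", schema="staging"))
```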

Why

dbt provides solid DAG-based abstractions for managing collections of related data transformations.
I don't always need a costly data warehouse for my data problems. I have used dbt-duckdb and dbt-sqlite very successfully.
When data resides elsewhere, loading it into duckdb or sqlite just to run dbt, then exporting the desired output tables, is not ideal, for example when refreshing only parts of the dbt graph.

More generally, thinking of each dbt "model" as generating a data "asset", which can carry a wide variety of metadata and be an input to other computations (dbt or otherwise), can lead in very fruitful directions, as illustrated by the dagster + dbt integration.

The hope with dbt-parquet is that by breaking out assets from a monolithic data "warehouse" or database file, the semantics become as clean and portable as, well, the humble file.

Current deficiencies

Note that only table materializations are supported, as views do not make sense with parquet files.
With the httpfs extension, duckdb can run queries over files stored in S3. The necessary changes to dbt-parquet would involve abstracting out any calls involving the file path (e.g. listing, removal, rename, creation, and the get_catalog macro) so that they work against S3.
For "huge" data, it might be nice to support partitioned files.
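
One possible shape for the file-path abstraction mentioned in the S3 point above is a small interface covering listing, removal, and rename, with a local-disk implementation behind it. This is only a sketch: `FileStore`, `LocalFileStore`, and their method names are hypothetical and are not part of dbt-parquet. An S3-backed class would need to expose the same methods.

```python
import os
from pathlib import Path
from typing import List, Protocol


class FileStore(Protocol):
    """Hypothetical interface for the file operations the adapter needs."""

    def list_files(self, prefix: str) -> List[str]: ...
    def exists(self, path: str) -> bool: ...
    def remove(self, path: str) -> None: ...
    def rename(self, src: str, dst: str) -> None: ...


class LocalFileStore:
    """Local-disk implementation of the hypothetical FileStore interface."""

    def list_files(self, prefix: str) -> List[str]:
        # Recursively collect every Parquet file under the given directory.
        return sorted(str(p) for p in Path(prefix).rglob("*.parquet"))

    def exists(self, path: str) -> bool:
        return Path(path).exists()

    def remove(self, path: str) -> None:
        # missing_ok requires Python 3.8+; removing an absent file is a no-op.
        Path(path).unlink(missing_ok=True)

    def rename(self, src: str, dst: str) -> None:
        # Create the destination directory if needed, then replace any
        # existing file at dst with src.
        Path(dst).parent.mkdir(parents=True, exist_ok=True)
        os.replace(src, dst)
```

With an interface like this, the rest of the adapter would call the store rather than touching paths directly, so swapping local disk for S3 would be confined to one class.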

Acknowledgements

Inspired by dbt-duckdb, dagster, and of course duckdb.
