adding data quality examples #38

Draft: wants to merge 3 commits into base: main

wisemuffin (Owner) commented Nov 7, 2024

data quality Approach

We do our data ingestion via dlt and pandera.

Because dlt and pandera can't provide a dead-letter queue or quarantine failing rows, we wait for the data to be loaded into the raw layer and then run our data quality checks on the staging layer. See the notes on other options below.
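
To make the quarantine idea concrete, here is a minimal sketch of what a dead-letter split looks like in plain Python. It is not dlt or pandera code; the rule names, the column names (`school_code`, `enrolments`) and the `split_rows` helper are all hypothetical and only illustrate the behaviour we would want from a loader.

```python
from typing import Any, Callable

Row = dict[str, Any]

# Hypothetical rules for illustration: rule name -> predicate that must hold.
RULES: dict[str, Callable[[Row], bool]] = {
    "school_code_not_null": lambda r: r.get("school_code") is not None,
    "enrolments_non_negative": lambda r: isinstance(r.get("enrolments"), int)
    and r["enrolments"] >= 0,
}


def split_rows(rows: list[Row], rules: dict[str, Callable[[Row], bool]] = RULES):
    """Split rows into (valid, quarantined) instead of silently dropping bad rows."""
    valid, quarantined = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            # Keep the bad row and record why it was rejected.
            quarantined.append({**row, "_failed_rules": failed})
        else:
            valid.append(row)
    return valid, quarantined


rows = [
    {"school_code": 1001, "enrolments": 250},
    {"school_code": None, "enrolments": 90},
    {"school_code": 1003, "enrolments": -5},
]
valid, quarantined = split_rows(rows)
```

The point of the sketch is that rejected rows stay inspectable together with the reason they failed, which is the part missing at ingestion time today.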

I have chosen to do our data quality tests in dbt, as it has a simple-to-use store_failures feature (see store_failures | dbt Developer Hub).

dbt data tests also integrate nicely with dagster asset checks :)

examples of data quality

dbt store failures (quarantine / dead letter)

store_failures | dbt Developer Hub

from the data test in transformation/transformation_nsw_doe/data_tests/assert_example_master_dataset.sql

example of me storing a failure

image

This gets reported as a dagster asset check against the ref() in the test. If the test has more than one ref(), you can use meta config to choose which asset the test belongs to.
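
For readers new to the pattern, the sketch below imitates it with Python's built-in sqlite3: a data test is a query that selects the failing rows, and storing failures means persisting that result in a table that is replaced on every run. This is not how dbt is implemented and it is not the content of the data test above; the `stg_example` table, the `audit__` prefix and the `run_data_test` helper are assumptions made up for the illustration.

```python
import sqlite3


def run_data_test(conn: sqlite3.Connection, test_name: str, failing_rows_sql: str) -> int:
    """Run a query that selects failing rows and persist them, replacing the previous run."""
    audit_table = f"audit__{test_name}"  # hypothetical naming convention
    conn.execute(f"DROP TABLE IF EXISTS {audit_table}")
    conn.execute(f"CREATE TABLE {audit_table} AS {failing_rows_sql}")
    (n_failures,) = conn.execute(f"SELECT COUNT(*) FROM {audit_table}").fetchone()
    return n_failures  # zero rows means the test passed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_example (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO stg_example VALUES (?, ?)",
    [(1, 10.0), (2, -3.5), (3, None)],
)

test_sql = "SELECT * FROM stg_example WHERE amount IS NULL OR amount < 0"

# First run: two rows fail and are kept for inspection.
failures = run_data_test(conn, "assert_amount_valid", test_sql)
quarantined_ids = [
    r[0] for r in conn.execute("SELECT id FROM audit__assert_amount_valid ORDER BY id")
]

# After the data is fixed, the next run overwrites the stored failures.
conn.execute("UPDATE stg_example SET amount = 0 WHERE amount IS NULL OR amount < 0")
failures_after = run_data_test(conn, "assert_amount_valid", test_sql)
```

Because the audit table is rebuilt each time, only the most recent run's failures are available; keeping history would need a separate snapshot step.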

model quality

We run dbt's project evaluator during CI/CD, which highlights areas of a dbt project that are misaligned with dbt Labs' best practices. Specifically, this package tests for:

Modeling - your dbt DAG for modeling best practices
Testing - your models for testing best practices
Documentation - your models for documentation best practices
Structure - your dbt project for file structure and naming best practices
Performance - your model materializations for performance best practices
Governance - your best practices for model governance features.

In addition to tests, this package creates the model int_all_dag_relationships which holds information about your DAG in a tabular format and can be queried using SQL in your Warehouse.

source: https://github.com/dbt-labs/dbt-project-evaluator

Notes on other DQ tools for Dead letter queue / data quarantine

The recommendation right now is to just load the data in, then catch issues with dbt tests. Dagster will then run the dbt tests as asset checks, and you can set failed rows to be stored.

Note: dbt data tests can't be done on source(); they need to use ref(). Dagster will then put the asset check against that ref(). If there are multiple, you can configure which asset the test should belong to.

dbt has store failures for tests: store_failures | dbt Developer Hub. dbt tests already integrate well with dagster. Note that it only stores the last run (whether it failed or not).

duckdb csv reject feature: Reading Faulty CSV Files
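
As a rough illustration of what a CSV reject feature gives you, the sketch below uses only Python's standard csv module: rows that parse go into one list, and faulty rows are kept with their line number and error instead of aborting the load. It does not use DuckDB or mirror its API; the `EXPECTED_COLUMNS` schema and the `read_csv_with_rejects` helper are hypothetical.

```python
import csv
import io

# Hypothetical schema: column name -> type used to cast the value.
EXPECTED_COLUMNS = {"id": int, "score": float}


def read_csv_with_rejects(text: str):
    """Parse CSV text, returning (good_rows, rejects) rather than failing on bad rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    good, rejects = [], []
    for row in reader:
        line_no = reader.line_num  # line number in the source text, header is line 1
        if len(row) != len(header):
            rejects.append({"line": line_no, "error": "wrong column count", "raw": row})
            continue
        try:
            good.append(
                {name: EXPECTED_COLUMNS[name](value) for name, value in zip(header, row)}
            )
        except ValueError as exc:
            # The cast failed, so keep the raw row and the reason.
            rejects.append({"line": line_no, "error": str(exc), "raw": row})
    return good, rejects


sample = "id,score\n1,9.5\ntwo,7.0\n3,8.0,extra\n4,6.25\n"
good, rejects = read_csv_with_rejects(sample)
```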

Here is an example of how databricks handles bad files or records with its badRecordsPath option: Handle bad records and files

There is also a dlt issue raised for a dead-letter queue, i.e. don't just silently remove rows: Add dead-letter queue functionality when contract mode == discard_row · Issue #1980 · dlt-hub/dlt

Another option I was thinking of is to just use dlt to load all the data into your raw layer and then do your data quality checks on top of the raw layer before moving into staging. This is what Soda SQL have proposed here: Test data quality in a Dagster pipeline

vercel bot commented Nov 7, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
nsw-doe-data-stack-in-a-box ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 7, 2024 6:39am

github-actions bot commented Nov 7, 2024

Your pull request is automatically being deployed to Dagster Cloud.

Location Status Link Updated
nsw-doe-data-stack-in-a-box Deploy failed Nov 07, 2024 at 06:59 AM (UTC)

github-actions bot commented Nov 7, 2024

Your pull request is automatically being deployed to Dagster Cloud.

Location Status Link Updated
demo-pipeline-scaling-tpch Deploy failed Nov 07, 2024 at 06:59 AM (UTC)
