adding data quality examples #38

wisemuffin · 2024-11-07T06:03:20Z

data quality Approach

We do our data ingestion via dlt and pandera.

Because dlt and pandera can't quarantine rows (send them to a dead-letter queue), we have to wait for the data to be loaded into the raw layer and then run our data quality checks on the staging layer. See the notes on other options below.
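
To make the quarantine idea concrete, here is a minimal plain-Python sketch of the pattern: valid rows continue downstream, failing rows are set aside with a reason instead of being silently dropped. The row shape, the rules, and the names `check_row` and `split_rows` are hypothetical and only for illustration; none of them come from dlt, pandera, or this repo.

```python
from typing import Any, Optional

Row = dict[str, Any]


def check_row(row: Row) -> Optional[str]:
    """Return a failure reason, or None if the row passes (hypothetical rules)."""
    if row.get("school_code") is None:
        return "school_code is null"
    if not isinstance(row.get("enrolments"), int) or row["enrolments"] < 0:
        return "enrolments must be a non-negative integer"
    return None


def split_rows(rows: list[Row]) -> tuple[list[Row], list[Row]]:
    """Split rows into (valid, quarantined); quarantined rows keep a reason."""
    valid: list[Row] = []
    quarantined: list[Row] = []
    for row in rows:
        reason = check_row(row)
        if reason is None:
            valid.append(row)
        else:
            # Keep the original row and record why it was rejected.
            quarantined.append({**row, "_dq_reason": reason})
    return valid, quarantined


raw_rows = [
    {"school_code": 1001, "enrolments": 250},
    {"school_code": None, "enrolments": 80},
    {"school_code": 1003, "enrolments": -5},
]
valid_rows, quarantined_rows = split_rows(raw_rows)
```

In this PR the same separation happens later, in dbt on the staging layer, rather than at ingestion time.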

I have chosen to run our data quality tests in dbt because it has a simple-to-use store_failures feature: store_failures | dbt Developer Hub.

dbt data tests also integrate nicely with Dagster asset checks :)

examples of data quality

dbt store failures (quarantine / dead letter)

store_failures | dbt Developer Hub

from the data test in transformation/transformation_nsw_doe/data_tests/assert_example_master_dataset.sql

example of me storing a failure

This gets reported as a Dagster asset check against the model referenced by ref() in the test. If the test has more than one ref(), you can use meta config to choose which asset the test belongs to.
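
As a rough illustration of what a data test with stored failures boils down to, here is a sketch that simulates it with Python's built-in sqlite3 module. This is not how dbt is invoked and not the SQL from this repo: the table names `stg_master_dataset` and `dq_failures_assert_example`, the columns, and the rule are all hypothetical. The idea it shows is that a data test is a SELECT returning the failing rows, and storing failures means persisting that result set as a table you can query afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_master_dataset (school_code INTEGER, enrolments INTEGER)")
conn.executemany(
    "INSERT INTO stg_master_dataset VALUES (?, ?)",
    [(1001, 250), (1002, -3), (None, 40)],
)

# A singular data test is a SELECT that returns the failing rows;
# zero rows returned means the test passes.
FAILING_ROWS_SQL = """
    SELECT school_code, enrolments
    FROM stg_master_dataset
    WHERE school_code IS NULL OR enrolments < 0
"""

# Roughly what storing failures amounts to: persist the test's result set
# as a table, replacing whatever the previous run left behind.
conn.execute("DROP TABLE IF EXISTS dq_failures_assert_example")
conn.execute(f"CREATE TABLE dq_failures_assert_example AS {FAILING_ROWS_SQL}")

failed_rows = conn.execute(
    "SELECT school_code, enrolments FROM dq_failures_assert_example ORDER BY enrolments"
).fetchall()
test_passed = len(failed_rows) == 0
```

The stored table acts as the quarantine: the failing rows stay available for inspection after the run.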

model quality

We run dbt's project evaluator during CI/CD, which highlights areas of a dbt project that are misaligned with dbt Labs' best practices. Specifically, this package tests for:

Modeling - your dbt DAG for modeling best practices
Testing - your models for testing best practices
Documentation - your models for documentation best practices
Structure - your dbt project for file structure and naming best practices
Performance - your model materializations for performance best practices
Governance - your best practices for model governance features.

In addition to tests, this package creates the model int_all_dag_relationships which holds information about your DAG in a tabular format and can be queried using SQL in your Warehouse.

source: https://github.com/dbt-labs/dbt-project-evaluator

Notes on other DQ tools for Dead letter queue / data quarantine

The recommendation right now is to just load the data in, then catch issues with dbt tests. Dagster will then run the dbt tests as asset checks, and you can configure failed rows to be stored.

Note: the dbt data tests can't be written against a source(); they need to reference a ref(). Dagster will then put the asset check against that one ref(). If there are multiple, you can configure which asset the test should belong to.

dbt has store failures for tests: store_failures | dbt Developer Hub. dbt tests already integrate well with Dagster. Note: it only stores the results of the last run (whether it failed or not).

DuckDB CSV reject feature: Reading Faulty CSV Files

Here is an example of how Databricks handles bad files or records with its badRecordsPath option: Handle bad records and files

There is also a dlt issue raised for a dead-letter queue, i.e. don't just silently remove rows: Add dead-letter queue functionality when contract mode == discard_row · Issue #1980 · dlt-hub/dlt

Another option I was considering is to use dlt to load all the data into the raw layer and then run the data quality checks on top of the raw layer before moving the data into staging. This is what Soda SQL has proposed here: Test data quality in a Dagster pipeline
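
A minimal sketch of that "check the raw layer before promoting to staging" option, again using sqlite3 as a stand-in warehouse. It does not use Soda's or dlt's APIs; the tables `raw_enrolments` and `stg_enrolments`, the null check, and the function `promote_if_clean` are hypothetical names chosen for illustration.

```python
import sqlite3


def promote_if_clean(conn: sqlite3.Connection) -> bool:
    """Build the staging table only when the raw-layer check finds no bad rows."""
    bad_count = conn.execute(
        "SELECT COUNT(*) FROM raw_enrolments WHERE school_code IS NULL"
    ).fetchone()[0]
    if bad_count > 0:
        return False  # leave staging untouched so bad data does not propagate
    conn.execute("DROP TABLE IF EXISTS stg_enrolments")
    conn.execute("CREATE TABLE stg_enrolments AS SELECT * FROM raw_enrolments")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_enrolments (school_code INTEGER, enrolments INTEGER)")
conn.executemany("INSERT INTO raw_enrolments VALUES (?, ?)", [(1001, 250), (None, 40)])
first_attempt = promote_if_clean(conn)  # blocked by the null school_code

# After the bad row is fixed or removed upstream, promotion goes through.
conn.execute("DELETE FROM raw_enrolments WHERE school_code IS NULL")
second_attempt = promote_if_clean(conn)
staged_count = conn.execute("SELECT COUNT(*) FROM stg_enrolments").fetchone()[0]
```

The trade-off compared with the dbt approach above is that a single bad row blocks the whole load rather than being quarantined on its own.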

vercel · 2024-11-07T06:03:24Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
nsw-doe-data-stack-in-a-box	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Nov 7, 2024 6:39am

github-actions · 2024-11-07T06:05:05Z

Your pull request is automatically being deployed to Dagster Cloud.

Location	Status	Link	Updated
`nsw-doe-data-stack-in-a-box`		Deploy failed	Nov 07, 2024 at 06:59 AM (UTC)

github-actions · 2024-11-07T06:05:08Z

Your pull request is automatically being deployed to Dagster Cloud.

Location	Status	Link	Updated
`demo-pipeline-scaling-tpch`		Deploy failed	Nov 07, 2024 at 06:59 AM (UTC)

adding data quality examples

fe36457

wisemuffin temporarily deployed to dev November 7, 2024 06:03 — with GitHub Actions Inactive

wisemuffin had a problem deploying to test November 7, 2024 06:03 — with GitHub Actions Failure

wisemuffin temporarily deployed to test November 7, 2024 06:03 — with GitHub Actions Inactive

vercel bot deployed to Preview November 7, 2024 06:05 View deployment

adding data quality examples

24f29b3

wisemuffin had a problem deploying to test November 7, 2024 06:23 — with GitHub Actions Failure

wisemuffin temporarily deployed to dev November 7, 2024 06:23 — with GitHub Actions Inactive

wisemuffin had a problem deploying to test November 7, 2024 06:23 — with GitHub Actions Error

vercel bot deployed to Preview November 7, 2024 06:25 View deployment

adding data quality examples

1c9a4b0

wisemuffin deployed to dev November 7, 2024 06:37 — with GitHub Actions Active

wisemuffin deployed to test November 7, 2024 06:37 — with GitHub Actions Active

wisemuffin had a problem deploying to test November 7, 2024 06:38 — with GitHub Actions Failure

vercel bot deployed to Preview November 7, 2024 06:39 View deployment
