[Feature] Passthrough data tests #4451

Closed
chapmanjacobd opened this issue Dec 7, 2021 · 5 comments
Labels
enhancement New feature or request stale Issues that have gone stale

Comments

chapmanjacobd commented Dec 7, 2021

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

from https://getdbt.slack.com/archives/C2JRTGLLS/p1638891747103000

Similar to https://docs.getdbt.com/docs/building-a-dbt-project/tests#storing-test-failures, but there are some issues that are fundamental to the interface that was chosen (e.g. not being able to reference bifurcated pass/fail data in downstream models).

dbt currently captures application testing of the data pipeline fairly well--i.e. testing that pipeline changes are conceptually sound before deploying them to production (not only schema tests but pipeline-specific tests; not data tests).

But it seems like some of the test features currently in dbt conflate application testing and data testing. While conflating the two may be harmless in some cases, it does not make sense in every case.

Instead of pass/fail data tests at the dbt application layer, it might be useful for folks to have the option to bifurcate and expose passing and failing data through views--at the same granularity as the existing severity option.

For example, this might look like:

version: 2

models:
  - name: large_table
    columns:
      - name: slightly_unreliable_column
        tests:
          - unique:
              passthrough:
                pass: "large_table_unique_pass"
                fail: "large_table_unique_fail"
  • The view names could be generated if not explicitly specified
  • All failing data could be unioned into a view with all rows which fail tests of the same source table (large_table_fail); likewise for all data passing all tests (large_table_pass)
  • passthrough: could also be set at the project level to be the default behavior of tests, similar to severity
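For illustration, here is a minimal sketch of the SQL such passthrough views could compile to for the unique test above (hypothetical--the feature does not exist, and the generated view names and exact SQL are assumptions):

-- large_table_unique_fail (hypothetical generated view)
select *
from {{ ref('large_table') }}
where slightly_unreliable_column in (
    select slightly_unreliable_column
    from {{ ref('large_table') }}
    group by slightly_unreliable_column
    having count(*) > 1
)

-- large_table_unique_pass (hypothetical generated view)
-- note: rows with a NULL column value would need separate handling
select *
from {{ ref('large_table') }}
where slightly_unreliable_column in (
    select slightly_unreliable_column
    from {{ ref('large_table') }}
    group by slightly_unreliable_column
    having count(*) = 1
)

Downstream models would then reference the passing view (see the complication below).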

Some complications:

For tables that have passthrough: in any test, the downstream model would need to reference the passing table/view (large_table_pass). This dynamic behavior might not be desired, although it should be fairly clear to the end user what is going on in the DAG visualization--and users likely wouldn't set this without knowing what it does (hopefully).

Describe alternatives you've considered

The easiest alternative will likely be to store my data tests as models. This is what I plan on doing if this idea isn't accepted or until it is in a beta release.

Who will this benefit?

People who use dbt-expectations and sometimes have messy data, but who also need to push data into production frequently, with multi-table data dependencies where automated filtering might not matter as much, and whose favorite food is CHEETOS® Crunchy FLAMIN' HOT® Cheese Flavored Snacks.

Are you interested in contributing this feature?

No response

Anything else?

Potentially related to:

Railroad-oriented programming
#3066
#2968
#2959
#517

@chapmanjacobd chapmanjacobd added enhancement New feature or request triage labels Dec 7, 2021
@chapmanjacobd (Author) commented:

Maybe instead of large_table_pass it should just be large_table. Then there is no dynamic renaming necessary--only union/except (or programmatically combining all the where clauses and conditions of tests)

chapmanjacobd commented Dec 9, 2021

Here is an example of what this might look like compared to how tests are currently handled (from an etic perspective)

-- mart_large_table_failed.sql

WITH tests AS (
    SELECT
        'nodata_null' AS reason,
        *
    FROM
        {{ ref('stg_tif_postprocessed_fail_nodata_null') }}
    UNION ALL
    SELECT
        'dtype_is_not_min_scalar' AS reason,
        *
    FROM
        {{ ref('stg_tif_postprocessed_fail_dtype_is_not_min_scalar') }}
    UNION ALL
    SELECT
        'processing failed' AS reason,
        *
    FROM
        {{ ref('stg_tif_postprocessed_fail_processing') }}
    UNION ALL
    SELECT
        'preprocessed metadata reading failed' AS reason,
        *
    FROM
        {{ ref('stg_tif_preprocessed_fail_metadata') }}
)
SELECT
    string_agg(
        reason,
        '; '
    ) reason,
    *
FROM
    tests
GROUP BY
    {list_of_all_columns}
-- mart_large_table.sql

SELECT
    *
FROM
    {{ ref('stg_tif_postprocessed') }}
EXCEPT
SELECT
    *
FROM
    {{ ref('mart_large_table_failed') }}

Maybe this would be too hard to implement with tests in a generic way. Using models is not too bad--just a little repetitive.
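One hypothetical way to reduce that repetition would be a small macro that unions the failure models (the macro name, file path, and the idea of passing reason/model pairs are made up for illustration; dbt should still resolve the refs since the list is passed literally in the model file):

-- macros/union_failures.sql (hypothetical helper)
{% macro union_failures(failures) %}
    {%- for reason, model in failures %}
    select '{{ reason }}' as reason, * from {{ ref(model) }}
    {%- if not loop.last %} union all {% endif -%}
    {% endfor %}
{% endmacro %}

-- mart_large_table_failed.sql could then begin with:
with tests as (
    {{ union_failures([
        ('nodata_null', 'stg_tif_postprocessed_fail_nodata_null'),
        ('dtype_is_not_min_scalar', 'stg_tif_postprocessed_fail_dtype_is_not_min_scalar')
    ]) }}
)
select ...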

@ChenyuLInx (Contributor) commented:

Hey @chapmanjacobd, thank you so much for providing all this information and linking potential issues related to this! CHEETOS is one of my favorite snacks too!! And sorry for the delayed response!

I can totally relate to why you want to do this as a test instead of writing it as a model. When I did my dbt onboarding project, I wrote probably 10 staging models that are roughly the same, and I wish there was some template that automatically does it.

Our approach to testing is better summed up as: tests are things about your models that should pass (or at least, should not fail too much). If they fail, you have a problem; maybe you should even interrupt the DAG mid-build. So it sort of feels like you are approaching tests with a different idea: that they should be a way of quarantining or flagging data. There is another discussion about this if you are interested: #4306.

After talking to @jtcohen6, here are a few tools we think can help with this situation (tests that frequently fail because the underlying data is messy, but you still want to keep the test around):

  • Use severity + thresholds (error_if, warn_if)

  • Create a staging model to fix the underlying data, if you can (if you have a lot of use cases like this, I can see that it gets repetitive)

  • Use --store-failures to save that test's failing rows as a database table, to make it easier to query the latest failures. You can also override the test materialization, to actually create tables as the output of test queries (like what "store failures" does).

  • At that point, it might make sense to think about the "failing records" as another model in the DAG, such that you can have (which is pretty much exactly what you wrote):

model_a --> model_a_failures --> model_b
|                                 ^
|_________________________________|

where model_b is like:

select * from {{ ref('model_a') }}
where id not in (
  select id from {{ ref('model_a_failures') }}
)

There might be some sort of dynamic way to template that out so that model_b actually looks like:

select * from {{ ref('model_a') }}
where id is not null -- condition dynamically inferred from test on model_a

That would mean using the graph variable, a.k.a. the source of many, many shenanigans in dbt projects in 2022, since it gives users programmatic access (in the jinja context) to other things defined in their project. This is not what we recommend, and we think it might not serve you well, but it's always good to know what's possible.
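For completeness, a rough sketch of what that graph-based templating could look like (illustration only, and again not recommended; the manifest fields referenced here--test_metadata, depends_on, kwargs.column_name--and the 'model.my_project.model_a' unique id are assumptions that may differ across dbt versions):

-- model_b.sql (speculative sketch using the graph variable)
select * from {{ ref('model_a') }}
where 1 = 1
{% if execute %}
{%- for node in graph.nodes.values()
    if node.resource_type == 'test'
    and node.get('test_metadata')
    and node.test_metadata.name == 'not_null'
    and 'model.my_project.model_a' in node.depends_on.nodes %}
  -- one predicate per not_null test declared on model_a
  and {{ node.test_metadata.kwargs.column_name }} is not null
{%- endfor %}
{% endif %}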

@github-actions bot

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Oct 17, 2022
@github-actions bot

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.
