[Feature] Passthrough data tests #4451

Closed
chapmanjacobd opened this issue Dec 7, 2021 · 5 comments
Labels
enhancement New feature or request stale Issues that have gone stale

Comments

chapmanjacobd commented Dec 7, 2021

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

from https://getdbt.slack.com/archives/C2JRTGLLS/p1638891747103000

Similar to https://docs.getdbt.com/docs/building-a-dbt-project/tests#storing-test-failures, but there are some issues that are fundamental to the interface that was chosen (e.g. not being able to reference bifurcated pass/fail data in downstream models).

dbt currently captures application testing of the data pipeline fairly well--i.e. testing that pipeline changes are conceptually sound before deploying them to production (not only schema tests but pipeline-specific tests; not data tests).

But it seems like some of the test features currently in dbt conflate application testing and data testing. While conflating the two may be harmless in some cases, it does not make sense in every case.

Instead of pass/fail data tests at the dbt application layer, it might be useful for folks to have the option to bifurcate and expose passing and failing data through views--at the same granularity as the existing severity option.

For example, this might look like:

version: 2

models:
  - name: large_table
    columns:
      - name: slightly_unreliable_column
        tests:
          - unique:
              passthrough:
                pass: "large_table_unique_pass"
                fail: "large_table_unique_fail"
  • The view names could be generated if not explicitly specified
  • All failing data could be unioned into a view with all rows which fail tests of the same source table (large_table_fail); likewise for all data passing all tests (large_table_pass)
  • passthrough: could also be set at the project level to be the default behavior of tests, similar to severity
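For illustration, here is a minimal sketch of the SQL such passthrough views could compile to for the unique test above (hypothetical--the feature does not exist, and the generated view names and exact SQL are assumptions):

-- large_table_unique_fail (hypothetical generated view)
select *
from {{ ref('large_table') }}
where slightly_unreliable_column in (
    select slightly_unreliable_column
    from {{ ref('large_table') }}
    group by slightly_unreliable_column
    having count(*) > 1
)

-- large_table_unique_pass (hypothetical generated view)
-- note: rows with a NULL column value would need separate handling
select *
from {{ ref('large_table') }}
where slightly_unreliable_column in (
    select slightly_unreliable_column
    from {{ ref('large_table') }}
    group by slightly_unreliable_column
    having count(*) = 1
)

Downstream models would then reference the passing view (see the complication below).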

Some complications:

For tables that have passthrough: in any test, the downstream model would need to reference the passing table/view (large_table_pass). This dynamic behavior might not be desired, although it should be fairly clear to the end user what is going on in the DAG visualization--and users likely wouldn't set this without knowing what it does (hopefully).

Describe alternatives you've considered

The easiest alternative will likely be to store my data tests as models. This is what I plan on doing if this idea isn't accepted or until it is in a beta release.

Who will this benefit?

People who use dbt-expectations and sometimes have messy data, but who also need to push data into production frequently, with multi-table data dependencies where automated filtering might not matter as much, and whose favorite food is CHEETOS® Crunchy FLAMIN' HOT® Cheese Flavored Snacks.

Are you interested in contributing this feature?

No response

Anything else?

Potentially related to:

Railroad-oriented programming
#3066
#2968
#2959
#517

@chapmanjacobd chapmanjacobd added enhancement New feature or request triage labels Dec 7, 2021
@chapmanjacobd (Author) commented:

Maybe instead of large_table_pass it should just be large_table. Then there is no dynamic renaming necessary--only union/except (or programmatically combining all the where clauses and conditions of tests)

chapmanjacobd commented Dec 9, 2021

Here is an example of what this might look like compared to how tests are currently handled (from an etic perspective)

-- mart_large_table_failed.sql

WITH tests AS (
    SELECT
        'nodata_null' AS reason,
        *
    FROM
        {{ ref('stg_tif_postprocessed_fail_nodata_null') }}
    UNION ALL
    SELECT
        'dtype_is_not_min_scalar' AS reason,
        *
    FROM
        {{ ref('stg_tif_postprocessed_fail_dtype_is_not_min_scalar') }}
    UNION ALL
    SELECT
        'processing failed' AS reason,
        *
    FROM
        {{ ref('stg_tif_postprocessed_fail_processing') }}
    UNION ALL
    SELECT
        'preprocessed metadata reading failed' AS reason,
        *
    FROM
        {{ ref('stg_tif_preprocessed_fail_metadata') }}
)
SELECT
    string_agg(
        reason,
        '; '
    ) reason,
    *
FROM
    tests
GROUP BY
    {list_of_all_columns}
-- mart_large_table.sql

SELECT
    *
FROM
    {{ ref('stg_tif_postprocessed') }}
EXCEPT
SELECT
    *
FROM
    {{ ref('mart_large_table_failed') }}

Maybe this would be too hard to implement with tests in a generic way. Using models is not too bad--just a little repetitive.
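One hypothetical way to reduce that repetition would be a small macro that unions the failure models (the macro name, file path, and the idea of passing reason/model pairs are made up for illustration; dbt should still resolve the refs since the list is passed literally in the model file):

-- macros/union_failures.sql (hypothetical helper)
{% macro union_failures(failures) %}
    {%- for reason, model in failures %}
    select '{{ reason }}' as reason, * from {{ ref(model) }}
    {%- if not loop.last %} union all {% endif -%}
    {% endfor %}
{% endmacro %}

-- mart_large_table_failed.sql could then begin with:
with tests as (
    {{ union_failures([
        ('nodata_null', 'stg_tif_postprocessed_fail_nodata_null'),
        ('dtype_is_not_min_scalar', 'stg_tif_postprocessed_fail_dtype_is_not_min_scalar')
    ]) }}
)
select ...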

@ChenyuLInx (Contributor) commented:

Hey @chapmanjacobd, thank you so much for providing all this information and linking potential issues related to this! CHEETOS is one of my favorite snacks too!! And sorry for the delayed response!

I can totally relate to why you want to do this as a test instead of writing it as a model. When I did my dbt onboarding project, I wrote probably 10 staging models that are roughly the same, and I wish there was some template that automatically does it.

Our approach to testing is better summed up as: tests are things about your models that should pass (or at least, should not fail too much). If they fail, you have a problem; maybe you should even interrupt the DAG mid-build. So it sort of feels like you are approaching tests with a different idea: that they should be a way of quarantining or flagging data. There is another discussion about this if you are interested: #4306.

After talking to @jtcohen6, here are a few tools we think can help with this situation (tests that frequently fail because the underlying data is messy, but you still want to keep the test around):

  • Use severity + thresholds (error_if, warn_if)

  • Create a staging model to fix the underlying data, if you can (if you have a lot of use cases like this, I can see that it gets repetitive)

  • Use --store-failures to save that test's failing rows as a database table, to make it easier to query the latest failures. You can also override the test materialization, to actually create tables as the output of test queries (like what "store failures" does).

  • At that point, it might make sense to think about the "failing records" as another model in the DAG, such that you can have (which is pretty much exactly what you wrote):

model_a --> model_a_failures --> model_b
|                                 ^
|_________________________________|

where model_b is like:

select * from {{ ref('model_a') }}
where id not in (
  select id from {{ ref('model_a_failures') }}
)

There might be some sort of dynamic way to template that out so that model_b actually looks like:

select * from {{ ref('model_a') }}
where id is not null -- condition dynamically inferred from test on model_a

That would mean using the graph variable, a.k.a. the source of many, many shenanigans in dbt projects in 2022, since it gives users programmatic access (in the jinja context) to other things defined in their project. This is not what we recommend, and we think it might not serve you well, but it's always good to know what's possible.
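For completeness, a rough sketch of what that graph-based templating could look like (illustration only, and again not recommended; the manifest fields referenced here--test_metadata, depends_on, kwargs.column_name--and the 'model.my_project.model_a' unique id are assumptions that may differ across dbt versions):

-- model_b.sql (speculative sketch using the graph variable)
select * from {{ ref('model_a') }}
where 1 = 1
{% if execute %}
{%- for node in graph.nodes.values()
    if node.resource_type == 'test'
    and node.get('test_metadata')
    and node.test_metadata.name == 'not_null'
    and 'model.my_project.model_a' in node.depends_on.nodes %}
  -- one predicate per not_null test declared on model_a
  and {{ node.test_metadata.kwargs.column_name }} is not null
{%- endfor %}
{% endif %}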

@github-actions bot

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Oct 17, 2022
@github-actions bot

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.
