Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-3417] [Feature] more usable compiled code output from uniqueness tests #9139

Closed
3 tasks done
Mikael-Thorup opened this issue Nov 23, 2023 · 2 comments
Closed
3 tasks done
Labels
enhancement New feature or request wontfix Not a bug or out of scope for dbt-core

Comments

@Mikael-Thorup
Copy link

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

whenever I have a failing uniqueness test, I end up spending too long rewriting the compiled output from dbt test, e.g.

select
    id as unique_field,
    count(*) as n_records

from some_model
where id is not null
group by id
having count(*) > 1

into

with
    src as (select * from some_model),
    dupes as (
        select 
            id as unique_field, 
            count(*) as n_records

        from src
        where id is not null
        group by id
        having count(*) > 1
    )
select *
from src
where id in (select unique_field from dupes)

in order to find out e.g. which of the upstream models caused the duplicates. I'd propose the compiled code from dbt test uniqueness tests follows the following format instead:

with
    src as (select * from some_model),
    dupes as (
        select 
            id as unique_field, 
            count(*) as n_records

        from src
        where id is not null
        group by id
        having count(*) > 1
    )
select * from dupes
--select * from src where id in (select unique_field from dupes)

this would make it easier rewrite the output: comment out select * from dupes and uncomment the line below.

Describe alternatives you've considered

my current alternative is to have the abovementioned query saved, but then I need to rewrite both the unique column, and the table name every time.

Who will this benefit?

debugging models with failing uniqueness tests.

Are you interested in contributing this feature?

I'd love to but I'd need some help as I mostly know sql and jinja

Anything else?

wondering if this would also be useful for not_null tests and potentially other tests, but it's a lot easier rewriting

select some_column
from some_table
where some_column is null

into

select *
from some_table
where some_column is null

so this wouldn't benefit as much from the approach I'm suggesting.

@Mikael-Thorup Mikael-Thorup added enhancement New feature or request triage labels Nov 23, 2023
@github-actions github-actions bot changed the title [Feature] more usable compiled code output from uniqueness tests [CT-3417] [Feature] more usable compiled code output from uniqueness tests Nov 23, 2023
@dbeatty10
Copy link
Contributor

Thanks for proposing this idea @Mikael-Thorup 💡

Downsides

This would change the schema and number of rows for anyone that has store_failures_as or store_failures configured.

But the primary reason for the way it is can be explained by #4777 -- including all columns is more expensive in some databases.

Upside

Your proposal would allow seeing the full context of the failed test by showing the whole row(s).

You can see how the not_null test tries to balance both considerations here by including all columns when using store_failures, but only the affected column otherwise.

Workaround

It is possible for you to override this behavior on your end.

For example, you can implement your desired output by creating a file in your macros directory with the following content:

{% macro default__test_unique(model, column_name) %}
with
    src as (select * from {{ model }}),
    dupes as (
        select 
            {{ column_name }} as unique_field, 
            count(*) as n_records

        from src
        where {{ column_name }} is not null
        group by {{ column_name }}
        having count(*) > 1
    )
select *
from src
where {{ column_name }} in (select unique_field from dupes)
{% endmacro %}

You can also override the other built-in generic data tests listed here in a similar fashion.

I didn't fully test the example above, so your mileage may vary.

Summary

For the reasons above, we aren't keen to change this. We might reconsider if we hear a lot of folks clamoring for this or if we decide the positive impact would be worth it.

In the meantime, I'm going to close this as "not planned", but we'll stay open to feedback if anyone wants to comment on this issue (or creates a similar issue) in the future.

@dbeatty10 dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Nov 28, 2023
@dbeatty10 dbeatty10 added wontfix Not a bug or out of scope for dbt-core and removed triage labels Nov 28, 2023
@Mikael-Thorup
Copy link
Author

Hi @dbeatty10 , thanks for the writeup! I think I have a workaround to avoid the downsides:

with
    dupes as (
        select 
            id as unique_field, 
            count(*) as n_records

        from {{ref('some_model')}}
        where id is not null
        group by id
        having count(*) > 1
    )
select unique_field, n_records from dupes
--select * from {{ref('some_model')}} where id in (select unique_field from dupes)

But also very OK if you don't wanna re-open the PR; I can just implement this through the workaround you mention 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request wontfix Not a bug or out of scope for dbt-core
Projects
None yet
Development

No branches or pull requests

2 participants