
Show unexpected values when accepted_values test fails #3265

Closed
brylie opened this issue Apr 15, 2021 · 5 comments

Labels: dbt tests (Issues related to built-in dbt testing functionality), enhancement (New feature or request), stale (Issues that have gone stale)

Comments

brylie commented Apr 15, 2021

When an accepted_values test fails, print the unexpected values.

I am currently writing a test against a column with many possible values, where we want to know if any new values appear in the column that might affect our analytics.

When the accepted_values test fails, it shows a message like:

Got N results, expected 0.

However, it doesn't display any information about what those unexpected results are.

Showing the unexpected results would make it easier to investigate and modify the test.

I can help contribute to this feature.

brylie added the enhancement and triage labels Apr 15, 2021
jtcohen6 added the dbt tests label and removed the triage label Apr 15, 2021
jtcohen6 (Contributor) commented Apr 15, 2021

@brylie What a cool idea! I think that the requisite pieces may just be falling into place, so that we could make this happen.

Back in #2593, I made a detailed proposal for how tests could store their (failing) results in database tables. To date, each schema test query has returned a single numeric value—count(*) or whatever else—so it would have been impossible to save and present an actual table of failing values. In order to make that happen, we've been working on creating a standard test "materialization" (#3181), and implementing it for schema generic tests as well (#3192). Those generic tests will now return a set of rows—whether that's simply rows from the table where a column is null, or rows from an aggregate query that rolls up counts by unacceptable values.

All of which is to say, if we're doing all that and can save those results in a database table, in addition to grabbing the total failure count/value, we could also print a few of those results to stdout using good old results.print_table():

Failure in test accepted_values__orders__order_status (models/resources.yml)
  Found 7 unacceptable values of orders.order_status, erroring because >0.
  
    | VALUE_FIELD |      COUNT |
    | ----------- | ---------- |
    | blue        | 17,452,443 |
    | red         |  6,105,277 |
    | green       |  1,111,547 |
  
    For full results:
    --------------------------------------------------------------------------------
    select * from dev_jcohen__dbt_test_results.accepted_values__orders__order_status
    --------------------------------------------------------------------------------

  compiled SQL at target/compiled/models/resources.yml/schema_test/accepted_values__orders__order_status.sql

I think we'd put in limitations like:

  • Print top 5 rows, and up to 5 columns
  • An optional property of generic test blocks that defines order_by (in the case above, order by 2 desc; see the sketch after this list)
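
As a rough illustration of those two limitations taken together (this is only a sketch reusing the table and column names from the example output above, not the actual test SQL; the accepted-values list here is made up), the query behind the printed table might end up shaped like:

select
    order_status as value_field,
    count(*) as count
from orders
where order_status not in ('placed', 'shipped', 'completed', 'returned')  -- hypothetical accepted list
group by 1
order by 2 desc  -- supplied via the optional order_by property
limit 5          -- print top 5 rows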

Let me know what you think about the above, and let's see if we can sneak in this (or something like it) alongside the other test functionality coming in v0.20.0 :)

brylie (Author) commented Apr 16, 2021

Yeah, that seems like a good idea.

How does the error message "Got N results, expected 0." get the row count? Would it be possible to run a select distinct(column) from result when the error message is generated?
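
For illustration, assuming the failing rows do get stored in a table as proposed above, a distinct-values query along those lines might look like the following sketch (the schema and table names are the hypothetical ones from the earlier example):

select distinct value_field
from dev_jcohen__dbt_test_results.accepted_values__orders__order_status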

jtcohen6 (Contributor) commented

I think this imagines us rewriting the default accepted_values test to be:

with all_values as (

    select
        {{ column_name }} as value_field,
        count(*) as count

    from {{ model }}
    group by 1

),

validation_errors as (

    select
        value_field,
        count

    from all_values
    where value_field not in (
        {% for value in values -%}
            {% if quote_values -%}
            '{{ value }}'
            {%- else -%}
            {{ value }}
            {%- endif -%}
            {%- if not loop.last -%},{%- endif %}
        {%- endfor %}
    )
)

select * from validation_errors

jtcohen6 mentioned this issue May 8, 2021
jtcohen6 (Contributor) commented

I was able to get this working (sorta) with some janky code that leverages the existing functionality around dbt seed --show. That functionality is quite old and not well maintained, and the steps I took below are not ones I'd recommend following if we tried to do this for real. But, fun to play around with!

Steps:

  • core/dbt/main.py: Add --show flag to test subparser
  • core/dbt/flags.py: Add SHOW, set from args
  • core/dbt/include/global_project/macros/materializations/test.sql: Add logic to the if should_store_failures() conditional branch that, if flags.SHOW, runs an extra statement named results that wraps {{ main_sql }} and fetches results into the context, as sketched after this list
  • core/dbt/task/test.py: In the execute_test method, load the bonus results statement from the context, and pass its agate table form into the agate_table property of the RunResult returned by the task's execute method
  • core/dbt/task/printer.py: Include logic in print_run_result_error that checks for result.agate_table and, if available, prints it using the same logic as in the seed task
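
A very rough sketch of that test.sql change (not the actual diff; flags.SHOW is the hypothetical flag described above, and the rest of the materialization is elided):

{% if should_store_failures() %}

    {# existing store_failures logic stays as-is #}

    {% if flags.SHOW %}
        {# hypothetical: fetch the failing rows into the context so the test task can print them #}
        {% call statement('results', fetch_result=True) %}
            {{ main_sql }}
        {% endcall %}
    {% endif %}

{% endif %}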

All of that for:

$ dbt test --show --store-failures
Running with dbt=0.21.0-rc1
Found 1 model, 1 test, 1 snapshot, 0 analyses, 350 macros, 0 operations, 1 seed file, 0 sources, 0 exposures

20:06:50 | Concurrency: 5 threads (target='dev')
20:06:50 |
20:06:50 | 1 of 1 START test unique_country_codes_two_letter_iso_code........... [RUN]
20:06:50 | 1 of 1 FAIL 247 unique_country_codes_two_letter_iso_code............. [FAIL 247 in 0.06s]
20:06:50 |
20:06:50 | Finished running 1 test in 0.16s.

Completed with 1 error and 0 warnings:

Failure in test unique_country_codes_two_letter_iso_code (data/whatever.yml)
  Got 247 results, configured to fail if != 0

  compiled SQL at target/compiled/my_new_project/data/whatever.yml/unique_country_codes_two_letter_iso_code.sql

See test failures:
---------------------------------------------------------------------------------------------
select * from "jerco"."dbt_jcohen_dbt_test__audit"."unique_country_codes_two_letter_iso_code"
---------------------------------------------------------------------------------------------
Random sample:
--------------
| unique_field | n_records |
| ------------ | --------- |
| PA           |         2 |
| HM           |         2 |
| NC           |         2 |
| PE           |         2 |
| NR           |         2 |
| KY           |         2 |
| RO           |         2 |
| FO           |         2 |
| UZ           |         2 |
| GA           |         2 |
| ...          |       ... |


Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

github-actions bot commented

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions bot added the stale label Mar 30, 2022
github-actions bot closed this as completed Apr 7, 2022