
Schema test blocks #1173

Closed

drewbanin opened this issue Dec 5, 2018 · 4 comments

drewbanin commented Dec 5, 2018

Describe the feature

dbt should support blocks that represent schema test definitions. A rough sketch of what these could look like:

{% test schema_test_name(....arguments) %}

{{ config(...test_configs) }}

select ...

{% endtest %}

Implementation ideas:

  • The name of the test block should dictate how the calling schema.yml file calls this test
  • The config block should help provide a rendered name and description for the resulting test node
  • The test should return a set of failing rows for the test, rather than a count(*) directly

Important considerations:

  • databases like BigQuery charge based on data scanned, so a naive translation of existing count(*) style tests to a select * style test could be really expensive. Let's keep that in mind as we build out this functionality
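
One way to reconcile those two points (purely a sketch, using a hypothetical model and column): let the test author write a query that returns the failing rows, and have dbt wrap that body in an aggregate when it executes the test, so the query that actually runs is still a count:

-- what the test author writes: return the failing rows
select *
from {{ ref('orders') }}
where order_id is null

-- what dbt might actually execute: wrap the body and count the failures
select count(*) as failures
from (
    select *
    from {{ ref('orders') }}
    where order_id is null
) dbt_internal_test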

Configuration ideas:

To support dynamic test node names and descriptions, the config block should accept two configs:

  • name
  • description

A quick and imperfect example:

{% test unique(model, column_name) %}

{{ config(
  name="test_unique_{{ model.name }}.{{ column_name }}",
  description="Assert that {{ column_name }} is unique in {{ model }}"
) }}
....

I know, I know, curlies inside of curlies. Let's figure out a sensible way to make these names and descriptions easy to write.
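
One option (purely a sketch, not a settled syntax) would be to lean on Jinja's ~ string-concatenation operator inside config() instead of nesting curlies:

{% test unique(model, column_name) %}

{{ config(
  name="test_unique_" ~ model.name ~ "." ~ column_name,
  description="Assert that " ~ column_name ~ " is unique in " ~ model.name
) }}
....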

When this feature is live, we should deprecate the old-style schema test macros and mark them for removal in a future release.

Additional context

  • The compiled SQL would be more inspectable in a query runner
    • We should think hard about how to 1) generate a test query and 2) generate a diagnosis/debugging query from the original select statement
  • We could build in support for things like a max error count, severity, or enable/disable flags at the config level
  • We could add support for recording failing rows in a table for inspection (Record failing rows for tests into an auditable table #903)
@drewbanin added the enhancement, schema tests, and help_wanted labels on Dec 5, 2018
@drewbanin added the 1.0.0 label and removed the help_wanted label on Mar 31, 2020
@drewbanin changed the title from "New schema test macro syntax" to "Schema test blocks" on Mar 31, 2020
@drewbanin mentioned this issue on Mar 31, 2020
@jtcohen6 added this to the Oh-Twenty [TBD] milestone on Nov 13, 2020
@jtcohen6 added the dbt tests label on Dec 31, 2020

jtcohen6 commented Mar 8, 2021

I wanted to share some of our latest thinking on schema test blocks, which we're hoping to include in v0.20.0. Necessary caveats: We're still working through the details and various implications; this is just the tip of the iceberg; anything below is still subject to change.

Semantics

First off, I'd like to ditch the names "schema" and "data" tests, along with many of the current distinctions between them. The two should simply offer the dbt developer two angles into the same functionality:

  1. It's easy to define a one-off test as a SQL file. Write a query that returns the rows you don't want from the specific model(s). It's executed by dbt test every time.
  2. If you want to reuse that test query for multiple models/columns, rather than copying and pasting, you can refactor the SQL to accept arguments. This is the reusable/generic/special-sort-of-macro test block proposed by this issue, and it would take the place of the test__ macro currently used.

This issue is about the latter.
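
For comparison, the former (a one-off test) would just be a SQL file that selects the rows you don't want; a hypothetical example, with made-up file and model names:

# tests/assert_no_negative_order_totals.sql

select *
from {{ ref('orders') }}
where order_total < 0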

Proposed syntax

# macros/reusable_tests.sql

{% test not_null(model, column_name, adapter = 'default') %}

    select * from {{ model }}
    where {{ column_name }} is null

{% endtest %}

# macros/schema.yml

tests:
 - name: not_null
   adapter: default
   description: "Assert that {{ column_name }} is not null in {{ model }}"
   fail_msg: "Found {{ result }} null values of {{ model }}.{{ column_name }}, which is {{ warn_or_error }}"
   fail_calc: "count(*)"   # this is default
   args:
     - name: model
       description: "The model to test"
     - name: column_name
       description: "The column or expression to test"

The YAML properties are optional (just the SQL definition is sufficient to define this test on models/columns/etc.), but they're really what makes the test block shine and takes it a step beyond a standard macro. In this world, packages of custom tests offer more than just a reusable bit of SQL: they offer a complete, packaged asset with richer behavior in docs/artifacts (a la #2959).
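
For context, applying a block-defined test from a model's schema.yml would presumably look just like applying a schema test today (model and column names here are hypothetical):

# models/schema.yml

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null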

Most custom schema tests existing today could upgrade easily (see the before/after sketch below) by:

  • Switching {% macro test__test_name(...) %} ... {% endmacro %} to {% test test_name(...) %} ... {% endtest %}
  • Switching the final select count(*) to select *. If the test's failure calculation should be something other than count(*), it will also need to define fail_calc.
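
A rough before-and-after sketch of that upgrade, using a hypothetical is_positive test and the macro naming convention shown above:

# before: macros/test_is_positive.sql

{% macro test__is_positive(model, column_name) %}

    select count(*)
    from {{ model }}
    where {{ column_name }} <= 0

{% endmacro %}

# after: macros/reusable_tests.sql

{% test is_positive(model, column_name) %}

    select *
    from {{ model }}
    where {{ column_name }} <= 0

{% endtest %}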

Tactical questions

  • How much should we disambiguate between the two types of tests? Should they have different resource types, or both just be called test (in which case, names must be unique between them)?
    • Should the generic test blocks be defined in the macros/ path, or the tests/ path alongside one-off tests?
    • Each type would be defining different YAML resource properties: reusable tests are more like macros, and one-off tests ought to be more like other node types (Allow instances of generic data tests to be documented #2578). Would this be too confusing, to have a single type of YAML block (tests:), wherein some entries define a macro-esque thing while other entries define an analysis-y thing? I imagine we'd say it's best practice to keep them separate.
  • How much standard macro functionality should test blocks still support?
    • Should the test block appear in the documentation site? As a functional resource (similar to macros)? Each specific instantiation, as a node in the project / DAG?
    • Should a test block still be "callable" as a macro? E.g. should a test block named my_test return its SQL if called as test__my_test(model, column_name) (say, by another macro)? I don't feel this is critical—users can find straightforward ways to work around it if we remove this functionality, i.e. by using standard macros—but we should come up with a documented and consistent answer here. (E.g., today, the incremental materialization is technically callable as a macro named dbt.materialization_incremental.)

@clausherther
Contributor

Love this idea. A couple of initial thoughts, more to come:

  • Tests really make a lot of sense as a first-level resource type. Personally, I don't see a difference between macro-style tests and the older data tests; as you point out, it's just a matter of parameterization.
  • I like the idea of a separate tests folder for these to better organize them away from macros. Would that also allow me to have a macro and a test with the same name? (e.g. expression_is_true)?
  • I don't think it's necessary for tests to be callable. If I needed I would revert back to a macro that gets called from the test.


jtcohen6 commented Mar 9, 2021

Would that also allow me to have a macro and a test with the same name? (e.g. expression_is_true)?

A macro ({% macro expression_is_true() %}) and a test ({% test expression_is_true() %}), yes, I think so!

As for a one-off test (tests/expression_is_true.sql) and a generic test ({% test expression_is_true() %}) with the same name: I'm not sure, leaning no.

@jtcohen6
Contributor

Would this be too confusing, to have a single type of YAML block (tests:), wherein some entries define a macro-esque thing while other entries define an analysis-y thing?

I've thought about this more: No! I think many of these properties would be shared between both types of tests. I totally see a one-off test (tests/*.sql) wanting to define description, fail_msg, and fail_calc.
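
A hypothetical sketch of what that could look like for a one-off test under the same tests: block (names made up):

# tests/schema.yml

tests:
  - name: assert_no_negative_order_totals
    description: "Orders should never have a negative total"
    fail_msg: "Found {{ result }} orders with a negative total"
    fail_calc: "count(*)"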

Of course, some properties (adapter, args, and the special variables model and column_name) would be specific to the generic test block, reflecting the fact that it is more macro-like.

I think both test types could also define docs and columns, the way analyses do today, which impacts their appearance in the documentation site. By default, one-off tests should appear more like nodes, and generic tests should appear more like macros/materializations.

The fact that both test types would have the majority of their resource properties in common feels like a strong argument in favor of centralizing them under a single tests: YAML block. Disallowing one-off tests from being called tests/unique.sql or tests/not_null.sql feels like a small price to pay for that improved consistency.

I imagine we'd say it's best practice to keep them separate.

Having said all of the above, I think this is definitely still true.
