Implement group_by_columns argument for relevant tests (#633)

* Extend testing macros with group_by_columns arg
* Add integration tests for group_by_columns macro args
* Seed tests for group_by_columns arg in test macros
* Describe group_by_columns enhancement in CHANGELOG
* Add integration tests for fewer_rows_than macro
* change fake data in test_recency to numeric
* fix changelog typo
* whitespace after commas
* remove id column added just for join
* remove outer keyword from full join
* use explicit fake join keys for equal_rowcount and fewer_rows_than
* document grouping feature in README
* fix join key name for consistency with macro name (cosmetic change)
* Use more descriptive group_by_columns README example

Co-authored-by: Joel Labes <[email protected]>

* Fix code comment typo in fewer_rows_than

Co-authored-by: Joel Labes <[email protected]>

Co-authored-by: Joel Labes <[email protected]>
dbt-labs · Aug 26, 2022 · ed47585
1 parent a976cdf
commit ed47585

Showing 17 changed files with 244 additions and 50 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,10 +11,12 @@
 # Unreleased
 
 ## New features
+- Implemented an optional `group_by_columns` argument across many of the generic testing macros to test for properties that only pertain to the group level or can be checked more rigorously at the group level. The argument is available in the `recency`, `at_least_one`, `equal_rowcount`, `fewer_rows_than`, `not_constant`, `not_null_proportion`, and `sequential_values` tests ([#633](https://github.com/dbt-labs/dbt-utils/pull/633))
 - New feature to omit the `source_column_name` column on the `union_relations` macro ([#331](https://github.com/dbt-labs/dbt-utils/issues/331), [#624](https://github.com/dbt-labs/dbt-utils/pull/624))
 
 ## Contributors:
 - [@christineberger](https://github.com/christineberger) (#624)
+- [@emilyriederer](https://github.com/emilyriederer) 
 
 # dbt-utils v0.8.6
 

diff --git a/README.md b/README.md
@@ -94,6 +94,8 @@ models:
 
 ```
 
+This test supports the `group_by_columns` parameter; see [Grouping in tests](#grouping-in-tests) for details.
+
 #### fewer_rows_than ([source](macros/generic_tests/fewer_rows_than.sql))
 Asserts that the respective model has fewer rows than the model being compared.
 
@@ -108,6 +110,8 @@ models:
           compare_model: ref('other_table_name')
 ```
 
+This test supports the `group_by_columns` parameter; see [Grouping in tests](#grouping-in-tests) for details.
+
 #### equality ([source](macros/generic_tests/equality.sql))
 Asserts the equality of two relations. Optionally specify a subset of columns to compare.
 
@@ -193,6 +197,7 @@ models:
           field: created_at
           interval: 1
 ```
+This test supports the `group_by_columns` parameter; see [Grouping in tests](#grouping-in-tests) for details.
 
 #### at_least_one ([source](macros/generic_tests/at_least_one.sql))
 Asserts that a column has at least one value.
@@ -209,6 +214,8 @@ models:
           - dbt_utils.at_least_one
 ```
 
+This test supports the `group_by_columns` parameter; see [Grouping in tests](#grouping-in-tests) for details.
+
 #### not_constant ([source](macros/generic_tests/not_constant.sql))
 Asserts that a column does not have the same value in all rows.
 
@@ -224,6 +231,8 @@ models:
           - dbt_utils.not_constant
 ```
 
+This test supports the `group_by_columns` parameter; see [Grouping in tests](#grouping-in-tests) for details.
+
 #### cardinality_equality ([source](macros/generic_tests/cardinality_equality.sql))
 Asserts that values in a given column have exactly the same cardinality as values from a different column in a different model.
 
@@ -293,6 +302,8 @@ models:
               at_least: 0.95
 ```
 
+This test supports the `group_by_columns` parameter; see [Grouping in tests](#grouping-in-tests) for details.
+
 #### not_accepted_values ([source](macros/generic_tests/not_accepted_values.sql))
 Asserts that there are no rows that match the given values.
 
@@ -472,6 +483,8 @@ seeds:
 * `interval` (default=1): The gap between two sequential values
 * `datepart` (default=None): Used when the gaps are a unit of time. If omitted, the test will check for a numeric gap.
 
+This test supports the `group_by_columns` parameter; see [Grouping in tests](#grouping-in-tests) for details.
+
 #### unique_combination_of_columns ([source](macros/generic_tests/unique_combination_of_columns.sql))
 Asserts that the combination of columns is unique. For example, the
 combination of month and product is unique, however neither column is unique
@@ -548,6 +561,34 @@ models:
 
 ----
 
+#### Grouping in tests
+
+Certain tests support the optional `group_by_columns` argument to provide more granularity in performing tests. This can be useful when:
+
+- Some data checks can only be expressed within a group (e.g. ID values should be unique within a group but can be repeated between groups)
+- Some data checks are more precise when done by group (e.g. not only should table rowcounts be equal but the counts within each group should be equal)
+
+This feature is currently available for the following tests:
+
+- equal_rowcount()
+- fewer_rows_than()
+- recency()
+- at_least_one()
+- not_constant()
+- sequential_values()
+- not_null_proportion()
+
+To use this feature, the names of grouping variables can be passed as a list. For example, to test for at least one valid value by group, the `group_by_columns` argument could be used as follows:
+
+```
+  - name: data_test_at_least_one
+    columns:
+      - name: field
+        tests:
+          - dbt_utils.at_least_one:
+              group_by_columns: ['customer_segment']
+```
+
 ## Macros
 
 ### Introspective macros

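To make the grouped behaviour described in the README hunk concrete, the sketch below runs an `at_least_one`-style check against an in-memory SQLite table, first without and then with a grouping column. The table `orders`, its rows, and the helper `failing_groups` are hypothetical, SQLite stands in for the warehouse dbt would target, and the query is a simplified rendition of the idea rather than the macro's compiled output.

```python
import sqlite3


def failing_groups(conn, table, column, group_by_columns=()):
    """Return one row per group in which `column` has no non-null value.

    Identifiers are interpolated directly into the SQL, so this helper is
    only suitable for trusted, hard-coded names as used in this sketch.
    """
    select_cols = "".join(f"{c}, " for c in group_by_columns)
    group_by = f"group by {', '.join(group_by_columns)}" if group_by_columns else ""
    sql = f"""
        select *
        from (
            select {select_cols}count({column}) as filler_column
            from {table}
            {group_by}
        ) validation_errors
        where filler_column = 0
    """
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("create table orders (customer_segment text, field integer)")
conn.executemany(
    "insert into orders values (?, ?)",
    [("retail", 1), ("retail", None), ("wholesale", None), ("wholesale", None)],
)

# Table-wide, `field` has one non-null value, so nothing fails.
ungrouped = failing_groups(conn, "orders", "field")
# Per segment, 'wholesale' has no non-null value and is reported.
grouped = failing_groups(conn, "orders", "field", ["customer_segment"])

print(ungrouped)
print(grouped)
```

The ungrouped check passes because a single non-null value anywhere in the table is enough, while the grouped check surfaces the segment that has none. This is the kind of group-level property the README text refers to.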
diff --git a/integration_tests/data/schema_tests/data_test_fewer_rows_than_table_1.csv b/integration_tests/data/schema_tests/data_test_fewer_rows_than_table_1.csv
@@ -1,4 +1,4 @@
-field
-1
-2
-3
+col_a,field
+1,1
+1,2
+1,3
diff --git a/integration_tests/data/schema_tests/data_test_fewer_rows_than_table_2.csv b/integration_tests/data/schema_tests/data_test_fewer_rows_than_table_2.csv
@@ -1,5 +1,5 @@
-field
-1
-2
-3
-4
+col_a,field
+1,1
+1,2
+1,3
+1,4
diff --git a/integration_tests/data/schema_tests/data_test_not_constant.csv b/integration_tests/data/schema_tests/data_test_not_constant.csv
@@ -1,4 +1,4 @@
-field
-1
-1
-2
+col_a,field
+1,1
+1,1
+1,2
diff --git a/integration_tests/data/schema_tests/data_test_sequential_values.csv b/integration_tests/data/schema_tests/data_test_sequential_values.csv
@@ -1,6 +1,6 @@
-my_even_sequence
-2
-4
-6
-8
-10
+col_a,my_even_sequence
+1,2
+1,4
+1,6
+2,8
+2,10
diff --git a/integration_tests/data/schema_tests/schema.yml b/integration_tests/data/schema_tests/schema.yml
@@ -7,6 +7,9 @@ seeds:
         tests:
           - dbt_utils.sequential_values:
               interval: 2
+          - dbt_utils.sequential_values:
+              interval: 2
+              group_by_columns: ['col_a']
 
 
   - name: data_test_sequential_timestamps

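The `sequential_values` macro itself is not part of this section of the diff, so the following is only a sketch of what a grouped sequence check is assumed to mean: within each group, every value exceeds the previous one by exactly `interval`. The function `non_sequential_rows` and the `restarting` data set are hypothetical; `seed` copies the rows of `data_test_sequential_values.csv` shown above.

```python
from collections import defaultdict


def non_sequential_rows(rows, interval=1, grouped=False):
    """Return (group, previous, current) triples whose gap differs from `interval`.

    `rows` is an iterable of (group_key, value) pairs. With grouped=False
    every row is treated as belonging to a single group keyed by None.
    """
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key if grouped else None].append(value)

    errors = []
    for key, values in buckets.items():
        values.sort()
        for prev, cur in zip(values, values[1:]):
            if cur - prev != interval:
                errors.append((key, prev, cur))
    return errors


# Rows from the seed file: (col_a, my_even_sequence)
seed = [(1, 2), (1, 4), (1, 6), (2, 8), (2, 10)]
seed_ungrouped = non_sequential_rows(seed, interval=2)
seed_grouped = non_sequential_rows(seed, interval=2, grouped=True)

# Hypothetical data where the sequence restarts in each group
restarting = [(1, 2), (1, 4), (2, 2), (2, 4)]
restart_ungrouped = non_sequential_rows(restarting, interval=2)
restart_grouped = non_sequential_rows(restarting, interval=2, grouped=True)

print(seed_ungrouped, seed_grouped)
print(restart_ungrouped, restart_grouped)
```

The seed data passes either way. The restarting data passes only when grouped, because the duplicated values make the ungrouped sequence non-sequential.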
diff --git a/integration_tests/models/generic_tests/schema.yml b/integration_tests/models/generic_tests/schema.yml
@@ -6,12 +6,16 @@ seeds:
       - name: field
         tests:
           - dbt_utils.not_constant
+          - dbt_utils.not_constant:
+              group_by_columns: ['col_a']
 
   - name: data_test_at_least_one
     columns:
       - name: field
         tests:
           - dbt_utils.at_least_one
+          - dbt_utils.at_least_one:
+              group_by_columns: ['field']
 
   - name: data_test_expression_is_true
     tests:
@@ -142,6 +146,9 @@ seeds:
           - dbt_utils.not_null_proportion:
               at_least: 0.5
               at_most: 0.5
+          - dbt_utils.not_null_proportion:
+              at_least: 0
+              group_by_columns: ['point_9']
       - name: point_9
         tests:
           - dbt_utils.not_null_proportion:
@@ -154,11 +161,24 @@ models:
           datepart: day
           field: today
           interval: 1
+      - dbt_utils.recency:
+          datepart: day
+          field: today
+          interval: 1
+          group_by_columns: ['col1']
+      - dbt_utils.recency:
+          datepart: day
+          field: today
+          interval: 1
+          group_by_columns: ['col1', 'col2']
 
   - name: test_equal_rowcount
     tests:
       - dbt_utils.equal_rowcount:
           compare_model: ref('test_equal_rowcount')
+      - dbt_utils.equal_rowcount:
+          compare_model: ref('test_equal_rowcount')
+          group_by_columns: ['field']
 
   - name: test_equal_column_subset
     tests:
@@ -168,3 +188,11 @@ models:
             - first_name
             - last_name
             - email
+
+  - name: test_fewer_rows_than
+    tests:
+      - dbt_utils.fewer_rows_than:
+          compare_model: ref('data_test_fewer_rows_than_table_2')
+      - dbt_utils.fewer_rows_than:
+          compare_model: ref('data_test_fewer_rows_than_table_2')
+          group_by_columns: ['col_a']
diff --git a/integration_tests/models/generic_tests/test_fewer_rows_than.sql b/integration_tests/models/generic_tests/test_fewer_rows_than.sql
@@ -5,5 +5,5 @@ with data as (
 )
 
 select
-    field
+   col_a, field
 from data
diff --git a/integration_tests/models/generic_tests/test_recency.sql b/integration_tests/models/generic_tests/test_recency.sql
@@ -2,11 +2,15 @@
 {% if target.type == 'postgres' %}
 
 select
+    1 as col1,
+    2 as col2,
     {{ dbt_utils.date_trunc('day', dbt_utils.current_timestamp()) }} as today
 
 {% else %}
 
 select
+    1 as col1,
+    2 as col2,
     cast({{ dbt_utils.date_trunc('day', dbt_utils.current_timestamp()) }} as datetime) as today
 
 {% endif %}
diff --git a/macros/generic_tests/at_least_one.sql b/macros/generic_tests/at_least_one.sql
@@ -1,18 +1,26 @@
-{% test at_least_one(model, column_name) %}
-  {{ return(adapter.dispatch('test_at_least_one', 'dbt_utils')(model, column_name)) }}
+{% test at_least_one(model, column_name, group_by_columns = []) %}
+  {{ return(adapter.dispatch('test_at_least_one', 'dbt_utils')(model, column_name, group_by_columns)) }}
 {% endtest %}
 
-{% macro default__test_at_least_one(model, column_name) %}
+{% macro default__test_at_least_one(model, column_name, group_by_columns) %}
+
+{% if group_by_columns|length() > 0 %}
+  {% set select_gb_cols = group_by_columns|join(' ,') + ', ' %}
+  {% set groupby_gb_cols = 'group by ' + group_by_columns|join(',') %}
+{% endif %}
 
 select *
 from (
     select
         {# In TSQL, subquery aggregate columns need aliases #}
         {# thus: a filler col name, 'filler_column' #}
+      {{select_gb_cols}}
       count({{ column_name }}) as filler_column
 
     from {{ model }}
 
+    {{groupby_gb_cols}}
+
     having count({{ column_name }}) = 0
 
 ) validation_errors

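The two `{% set %}` statements added to `at_least_one.sql` only build string fragments, so their effect can be illustrated without dbt. The function below is a plain-Python stand-in (the name `render_fragments` is hypothetical) that mirrors the same joins, including the `' ,'` separator used in the select list. It assumes that an unset Jinja variable renders as an empty string, which is what leaves the ungrouped query unchanged.

```python
def render_fragments(group_by_columns):
    """Mimic the select-list and group-by fragments built in the macro."""
    if len(group_by_columns) > 0:
        # Same separators as the macro: ' ,' between columns, ', ' as the trailer
        select_gb_cols = " ,".join(group_by_columns) + ", "
        groupby_gb_cols = "group by " + ",".join(group_by_columns)
    else:
        # Assumption: undefined Jinja variables render as empty strings
        select_gb_cols = ""
        groupby_gb_cols = ""
    return select_gb_cols, groupby_gb_cols


no_groups = render_fragments([])
two_groups = render_fragments(["col1", "col2"])

print(no_groups)
print(two_groups)
```

With two grouping columns the select list is prefixed with `col1 ,col2, ` and the query gains `group by col1,col2`. With an empty list both fragments vanish and the original single-row aggregate is produced.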
diff --git a/macros/generic_tests/equal_rowcount.sql b/macros/generic_tests/equal_rowcount.sql
@@ -1,35 +1,73 @@
-{% test equal_rowcount(model, compare_model) %}
-  {{ return(adapter.dispatch('test_equal_rowcount', 'dbt_utils')(model, compare_model)) }}
+{% test equal_rowcount(model, compare_model, group_by_columns = []) %}
+  {{ return(adapter.dispatch('test_equal_rowcount', 'dbt_utils')(model, compare_model, group_by_columns)) }}
 {% endtest %}
 
-{% macro default__test_equal_rowcount(model, compare_model) %}
+{% macro default__test_equal_rowcount(model, compare_model, group_by_columns) %}
 
 {#-- Needs to be set at parse time, before we return '' below --#}
-{{ config(fail_calc = 'coalesce(diff_count, 0)') }}
+{{ config(fail_calc = 'sum(coalesce(diff_count, 0))') }}
 
 {#-- Prevent querying of db in parsing mode. This works because this macro does not create any new refs. #}
 {%- if not execute -%}
     {{ return('') }}
 {% endif %}
 
+{% if group_by_columns|length() > 0 %}
+  {% set select_gb_cols = group_by_columns|join(', ') + ', ' %}
+  {% set join_gb_cols %}
+    {% for c in group_by_columns %}
+      and a.{{c}} = b.{{c}}
+    {% endfor %}
+  {% endset %}
+  {% set groupby_gb_cols = 'group by ' + group_by_columns|join(',') %}
+{% endif %}
+
+{#-- We must add a fake join key in case additional grouping variables are not provided --#}
+{#-- Redshift does not allow for dynamically created join conditions (e.g. full join on 1 = 1) --#}
+{#-- The same logic is used in fewer_rows_than. In case of changes, maintain consistent logic --#}
+{% set group_by_columns = ['id_dbtutils_test_equal_rowcount'] + group_by_columns %}
+{% set groupby_gb_cols = 'group by ' + group_by_columns|join(',') %}
+
 with a as (
 
-    select count(*) as count_a from {{ model }}
+    select 
+      {{select_gb_cols}}
+      1 as id_dbtutils_test_equal_rowcount,
+      count(*) as count_a 
+    from {{ model }}
+    {{groupby_gb_cols}}
+
 
 ),
 b as (
 
-    select count(*) as count_b from {{ compare_model }}
+    select 
+      {{select_gb_cols}}
+      1 as id_dbtutils_test_equal_rowcount,
+      count(*) as count_b 
+    from {{ compare_model }}
+    {{groupby_gb_cols}}
 
 ),
 final as (
 
     select
+
+        {% for c in group_by_columns -%}
+          a.{{c}} as {{c}}_a,
+          b.{{c}} as {{c}}_b,
+        {% endfor %}
+
         count_a,
         count_b,
         abs(count_a - count_b) as diff_count
+
     from a
-    cross join b
+    full join b
+    on
+    a.id_dbtutils_test_equal_rowcount = b.id_dbtutils_test_equal_rowcount
+    {{join_gb_cols}}
+
 
 )