
[CT-1022] [Bug] snapshot always creates a new slice for all records using check_cols and strategy "check" #5636

Closed
lefthanded78 opened this issue Aug 10, 2022 · 3 comments
Labels: bigquery, bug, snapshots, stale, Team:Adapters

Comments


lefthanded78 commented Aug 10, 2022

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When using a snapshot configured with strategy=check and specified check_cols, where one or more of the listed columns do not exist in the source data, dbt creates a new slice for all source records every time the snapshot is run.

Expected Behavior

If a column is specified that does not exist in either the source or the destination, we need different handling for each case:

  • column missing in the source: raise an error (regardless of capitalization)
  • column missing in the target: add the column to the target => new slice for all records

Steps To Reproduce

  • Create a snapshot like this:
{% snapshot test_bids_2413_snapshot %}
{{ config(
    strategy='check',
    check_cols=['COLUMN_A','COLUMN_B'],
    unique_key='TABLE_KEY',
    target_schema='dbt_maretz'
)}}
   
SELECT 
   TABLE_KEY,
   MOD(TABLE_KEY,2)               COLUMN_A,
   TABLE_KEY * TABLE_KEY          COLUMN_B,
   MD5(CAST(TABLE_KEY AS STRING)) COLUMN_MD5
FROM UNNEST(GENERATE_ARRAY(1, 10)) TABLE_KEY
{% endsnapshot %} 
  • Execute the snapshot and you will get 10 records. Run it again and no records will be touched in the target. Fine.
  • Now add a new column to the check_cols (with a wrong capitalization):
{% snapshot test_bids_2413_snapshot %}
{{ config(
    strategy='check',
    check_cols=['COLUMN_A','COLUMN_B','column_MD5'],
    unique_key='TABLE_KEY',
    target_schema='dbt_maretz'
)}}
   
SELECT 
   TABLE_KEY,
   MOD(TABLE_KEY,2)               COLUMN_A,
   TABLE_KEY * TABLE_KEY          COLUMN_B,
   MD5(CAST(TABLE_KEY AS STRING)) COLUMN_MD5
FROM UNNEST(GENERATE_ARRAY(1, 10)) TABLE_KEY
{% endsnapshot %} 

  • Execute the snapshot and you will get 10 more records. dbt does not recognize that the new column does not exist: it follows the code path for adding new columns to the target, without actually adding the new column. From this point on, every snapshot run snapshots ALL records, which leads to a more or less unnoticed explosion of row counts.
  • The good thing: the data won't be wrong, but the row counts will keep rising.
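
The failure mode above can be sketched in Python (a minimal, hypothetical model, not dbt's actual code): any check_cols entry that never matches a target column flips a "column added" flag, and that flag forces the change predicate to TRUE on every run.

```python
# Minimal sketch (hypothetical names, not dbt's real implementation) of why a
# check_cols entry that never matches the target re-slices every record.

def rows_changed(check_cols, existing_cols, source_row, target_row):
    # Columns from the config that are actually present in the target table.
    intersection = [c for c in check_cols if c in existing_cols]
    # dbt treats any config column it cannot find as a newly added column...
    column_added = len(intersection) < len(check_cols)
    # ...and a newly added column makes the change predicate unconditionally TRUE.
    if column_added:
        return True
    return any(source_row[c] != target_row[c] for c in intersection)

row = {"COLUMN_A": 1, "COLUMN_B": 4, "COLUMN_MD5": "c4ca..."}
existing = ["TABLE_KEY", "COLUMN_A", "COLUMN_B", "COLUMN_MD5"]

# Identical source and target rows, but 'column_MD5' (wrong casing) never
# matches, so every record counts as changed on every run.
print(rows_changed(["COLUMN_A", "COLUMN_B", "column_MD5"], existing, row, row))  # True
```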

Relevant log output

2022-08-10T09:23:22.684208Z: 09:23:22  Began running node snapshot.dbt_project.test_bids_2413_snapshot
2022-08-10T09:23:22.684510Z: 09:23:22  1 of 1 START snapshot dbt_maretz.test_bids_2413_snapshot ....................... [RUN]
2022-08-10T09:23:22.684781Z: 09:23:22  Acquiring new bigquery connection "snapshot.dbt_project.test_bids_2413_snapshot"
2022-08-10T09:23:22.684889Z: 09:23:22  Began compiling node snapshot.dbt_project.test_bids_2413_snapshot
2022-08-10T09:23:22.684984Z: 09:23:22  Compiling snapshot.dbt_project.test_bids_2413_snapshot
2022-08-10T09:23:22.688545Z: 09:23:22  finished collecting timing info
2022-08-10T09:23:22.688691Z: 09:23:22  Began executing node snapshot.dbt_project.test_bids_2413_snapshot
2022-08-10T09:23:22.747962Z: 09:23:22  Opening a new connection, currently in state closed
2022-08-10T09:23:22.753196Z: 09:23:22  On snapshot.dbt_project.test_bids_2413_snapshot: /* {"app": "dbt", "dbt_version": "1.2.0", "profile_name": "user", "target_name": "dev", "node_id": "snapshot.dbt_project.test_bids_2413_snapshot"} */
select * from (
            select COLUMN_A, COLUMN_B, column_MD5 from (

   
SELECT 
   TABLE_KEY,
   MOD(TABLE_KEY,2)               COLUMN_A,
   TABLE_KEY * TABLE_KEY          COLUMN_B,
   MD5(CAST(TABLE_KEY AS STRING)) COLUMN_MD5
FROM UNNEST(GENERATE_ARRAY(1, 10)) TABLE_KEY) subq
        ) as __dbt_sbq
        where false
        limit 0
    
2022-08-10T09:23:23.837961Z: 09:23:23  On snapshot.dbt_project.test_bids_2413_snapshot: /* {"app": "dbt", "dbt_version": "1.2.0", "profile_name": "user", "target_name": "dev", "node_id": "snapshot.dbt_project.test_bids_2413_snapshot"} */

        

  create or replace table `google_project`.`dbt_maretz`.`test_bids_2413_snapshot__dbt_tmp`
      expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour)
    )
  as (
    with snapshot_query as (

        

   
SELECT 
   TABLE_KEY,
   MOD(TABLE_KEY,2)               COLUMN_A,
   TABLE_KEY * TABLE_KEY          COLUMN_B,
   MD5(CAST(TABLE_KEY AS STRING)) COLUMN_MD5
FROM UNNEST(GENERATE_ARRAY(1, 10)) TABLE_KEY

    ),

    snapshotted_data as (

        select *,
            TABLE_KEY as dbt_unique_key

        from `google_project`.`dbt_maretz`.`test_bids_2413_snapshot`
        where dbt_valid_to is null

    ),

    insertions_source_data as (

        select
            *,
            TABLE_KEY as dbt_unique_key,
            CURRENT_TIMESTAMP() as dbt_updated_at,
            CURRENT_TIMESTAMP() as dbt_valid_from,
            nullif(CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP()) as dbt_valid_to,
            to_hex(md5(concat(coalesce(cast(TABLE_KEY as string), ''), '|',coalesce(cast(CURRENT_TIMESTAMP() as string), '')))) as dbt_scd_id

        from snapshot_query
    ),

    updates_source_data as (

        select
            *,
            TABLE_KEY as dbt_unique_key,
            CURRENT_TIMESTAMP() as dbt_updated_at,
            CURRENT_TIMESTAMP() as dbt_valid_from,
            CURRENT_TIMESTAMP() as dbt_valid_to

        from snapshot_query
    ),

    insertions as (

        select
            'insert' as dbt_change_type,
            source_data.*

        from insertions_source_data as source_data
        left outer join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where snapshotted_data.dbt_unique_key is null
           or (
                snapshotted_data.dbt_unique_key is not null
            and (
                (
  TRUE
)
            )
        )

    ),

    updates as (

        select
            'update' as dbt_change_type,
            source_data.*,
            snapshotted_data.dbt_scd_id

        from updates_source_data as source_data
        join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where (
            (
  TRUE
)
        )
    )

    select * from insertions
    union all
    select * from updates

  );
    
2022-08-10T09:23:26.327342Z: 09:23:26  BigQuery adapter: Adding columns ([]) to table `google_project`.`dbt_maretz`.`test_bids_2413_snapshot`".
2022-08-10T09:23:26.693756Z: 09:23:26  Writing runtime SQL for node "snapshot.dbt_project.test_bids_2413_snapshot"
2022-08-10T09:23:26.711076Z: 09:23:26  On snapshot.dbt_project.test_bids_2413_snapshot: /* {"app": "dbt", "dbt_version": "1.2.0", "profile_name": "user", "target_name": "dev", "node_id": "snapshot.dbt_project.test_bids_2413_snapshot"} */

      merge into `google_project`.`dbt_maretz`.`test_bids_2413_snapshot` as DBT_INTERNAL_DEST
    using `google_project`.`dbt_maretz`.`test_bids_2413_snapshot__dbt_tmp` as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_SOURCE.dbt_scd_id = DBT_INTERNAL_DEST.dbt_scd_id

    when matched
     and DBT_INTERNAL_DEST.dbt_valid_to is null
     and DBT_INTERNAL_SOURCE.dbt_change_type in ('update', 'delete')
        then update
        set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to

    when not matched
     and DBT_INTERNAL_SOURCE.dbt_change_type = 'insert'
        then insert (`TABLE_KEY`, `COLUMN_A`, `COLUMN_B`, `COLUMN_MD5`, `dbt_updated_at`, `dbt_valid_from`, `dbt_valid_to`, `dbt_scd_id`)
        values (`TABLE_KEY`, `COLUMN_A`, `COLUMN_B`, `COLUMN_MD5`, `dbt_updated_at`, `dbt_valid_from`, `dbt_valid_to`, `dbt_scd_id`)


  
2022-08-10T09:23:29.537574Z: 09:23:29  On snapshot.dbt_project.test_bids_2413_snapshot: /* {"app": "dbt", "dbt_version": "1.2.0", "profile_name": "user", "target_name": "dev", "node_id": "snapshot.dbt_project.test_bids_2413_snapshot"} */
drop table if exists `google_project`.`dbt_maretz`.`test_bids_2413_snapshot__dbt_tmp`
2022-08-10T09:23:30.228303Z: 09:23:30  finished collecting timing info
2022-08-10T09:23:30.228793Z: 09:23:30  Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '827d6f8b-6639-4dd5-a6d8-1386330af18c', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fcc5b2d6ac0>]}
2022-08-10T09:23:30.229192Z: 09:23:30  1 of 1 OK snapshotted dbt_maretz.test_bids_2413_snapshot ....................... [MERGE (20.0 rows, 2.9 KB processed) in 7.54s]
2022-08-10T09:23:30.229408Z: 09:23:30  Finished running node snapshot.dbt_project.test_bids_2413_snapshot

Environment

- OS: ? (dbt_cloud)
- Python: ? (dbt_cloud)
- dbt: 1.2.0

Which database adapter are you using with dbt?

bigquery

Additional Context

@lefthanded78 lefthanded78 added bug Something isn't working triage labels Aug 10, 2022
@github-actions github-actions bot changed the title [Bug] snapshot always creates a new slice for all records using check_cols and strategy "check" [CT-1022] [Bug] snapshot always creates a new slice for all records using check_cols and strategy "check" Aug 10, 2022
@jtcohen6 jtcohen6 added snapshots Issues related to dbt's snapshot functionality Team:Adapters Issues designated for the adapter area of the code labels Aug 10, 2022
@dbeatty10 dbeatty10 self-assigned this Aug 11, 2022
@dbeatty10 (Contributor)

@lefthanded78 Thanks for opening and providing great context!

I suspect this might only affect BigQuery due to the unique way it handles column identifiers. Oh the joys of case (in)sensitivity handling 😅

✅ I think it will correctly raise an error if the column is truly missing with a message like:

  Runtime Error in snapshot test_bids_2413_snapshot (snapshots/test_bids_2413_snapshot.sql)

e.g., if you change the column name from column_MD5 to column_MD4.

❌ But it won't raise an error if there's a case-insensitive match (for BigQuery specifically).

Suspected root cause

Here's what I think is going on:

This code is supposed to get the actual (case-sensitive) names of each of the columns (from the database's point of view). It basically says to the database, "I'm gonna give you a list of column names specified by the user, but the casing might not match yours -- could you tell me your casing so that we can be on the same page going forward?"

{#-- query for proper casing/quoting, to support comparison below --#}
{%- set select_check_cols_from_target -%}
select {{ check_cols_config | join(', ') }} from ({{ node['compiled_code'] }}) subq
{%- endset -%}
{% set query_columns = get_columns_in_query(select_check_cols_from_target) %}

From your log output above, this is the query that code rendered:

select * from (
            select COLUMN_A, COLUMN_B, column_MD5 from (

   
SELECT 
   TABLE_KEY,
   MOD(TABLE_KEY,2)               COLUMN_A,
   TABLE_KEY * TABLE_KEY          COLUMN_B,
   MD5(CAST(TABLE_KEY AS STRING)) COLUMN_MD5
FROM UNNEST(GENERATE_ARRAY(1, 10)) TABLE_KEY) subq
        ) as __dbt_sbq
        where false
        limit 0

The expectation was that we could ask the database for column_MD5 and it would give back COLUMN_MD5.

For unquoted identifiers, most(?) databases will return column names with the casing that matches the internal representation of the database.

But not BigQuery! It will go along with the casing in the select statement, even if it's unquoted and mixed case 😢 . It's trying to play nice, but we really want and need it to give us canonical strings that we can compare.
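
The mismatch boils down to a case-sensitive membership test (illustrative sketch only): because BigQuery echoes back the casing used in the SELECT list rather than a canonical casing, a plain string comparison against the target's columns never matches.

```python
# Illustrative sketch of the comparison that goes wrong on BigQuery.
query_columns = ["COLUMN_A", "COLUMN_B", "column_MD5"]  # casing echoed back by BigQuery
existing_cols = ["TABLE_KEY", "COLUMN_A", "COLUMN_B", "COLUMN_MD5"]  # target table

# Case-sensitive membership test: no match, so dbt thinks the column is new.
print("column_MD5" in existing_cols)  # False

# A case-insensitive comparison would have found it.
print("column_MD5".upper() in {c.upper() for c in existing_cols})  # True
```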

Where to go from here

IFF the suspected cause is correct, then the simplest "solution" would be for the end user to use the same casing in the select statement of their snapshot and the check_cols. Admittedly, this could be very tough for the end user to see, especially if using a select *.

I'd imagine any code change in dbt-core (or dbt-bigquery) to be potentially tricky due to how complicated casing can be.

One tricky idea:

If the following code is reached without an error, we know there's at least a match between the config and the query used by the snapshot. Then it becomes a matter of determining whether any of the columns in the config are brand new and not yet in the SCD table that is the result of the snapshot.

{%- set intersection = [] -%}
{%- for col in query_columns -%}
{%- if col in existing_cols -%}
{%- do intersection.append(adapter.quote(col)) -%}
{%- else -%}
{% set ns.column_added = true %}
{%- endif -%}
{%- endfor -%}
{{ return((ns.column_added, intersection)) }}
{%- endmacro %}

Here's some pseudocode to replace the above block:

  1. Turn existing_cols into a dictionary named existing_cols_dict with the key being an uppercase version and the value staying as-is
  2. If col uppercased is in the dictionary then:
    1. {%- do intersection.append(adapter.quote(col)) -%}
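
The pseudocode above could look like this in Python (a hypothetical sketch of the idea only; the real change would live in dbt's Jinja macro, and `adapter.quote` is omitted here for brevity):

```python
# Sketch of the proposed case-insensitive intersection (hypothetical, not dbt's
# actual macro). Keys existing columns by uppercased name so that config/query
# columns match the target regardless of casing.

def intersect_case_insensitive(query_columns, existing_cols):
    # 1. Uppercase key, original casing as value.
    existing_cols_dict = {c.upper(): c for c in existing_cols}
    intersection, column_added = [], False
    for col in query_columns:
        # 2. Match case-insensitively; keep the target's own casing.
        if col.upper() in existing_cols_dict:
            intersection.append(existing_cols_dict[col.upper()])
        else:
            column_added = True
    return column_added, intersection

# 'column_MD5' now matches 'COLUMN_MD5', so no spurious "column added" flag.
print(intersect_case_insensitive(
    ["COLUMN_A", "COLUMN_B", "column_MD5"],
    ["TABLE_KEY", "COLUMN_A", "COLUMN_B", "COLUMN_MD5"],
))  # (False, ['COLUMN_A', 'COLUMN_B', 'COLUMN_MD5'])
```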

If the above is implemented, subsequent testing should cover at least the following two cases:

  • adding a column to the check_cols config that already exists within the resulting SCD table
  • adding a column to the check_cols config (and snapshot query) that doesn't yet exist within the resulting SCD table

@dbeatty10 dbeatty10 added bigquery and removed triage labels Aug 12, 2022
@dbeatty10 dbeatty10 removed their assignment Aug 12, 2022

github-actions bot commented Feb 9, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Feb 9, 2023
@github-actions

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 17, 2023