Methods to achieve null safety for `deduplicate` #815

dbeatty10 · 2023-07-26T17:14:25Z

resolves #814
resolves #621

This is a bug fix with no breaking changes.

It also adds two new features:

- optional `row_alias` keyword argument (type: string, default: none)
- optional `columns` keyword argument (type: list, default: none)

Description & motivation

This PR is still in draft status, and more description will be added at a later date.

In the meantime, see #814 (and everything it links to, in particular #713) for background motivation and discussion to date.

As a summary, this PR gives the user multiple options to achieve null safety for `deduplicate`:

1. The user passes a relation that has been materialized in the database (i.e., it isn't an ephemeral model)
    - fe03f43 -- as long as the relation is not a CTE, its columns can be fetched via the `get_filtered_columns_in_relation` macro
2. The user passes the `row_alias` keyword argument
    - 3eced4d -- when the `row_alias` keyword argument is set, we can deduplicate via the `row_number()` window function (at the cost of the `row_alias` being an extra column that wasn't in the original data set)
3. The user passes the `columns` keyword argument
    - d46676e -- when the `columns` keyword argument is set, we can deduplicate via the `row_number()` window function and return only the requested columns

Outside of those options, the deduplication is not guaranteed to be null-safe (it depends on the adapter).
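For readers who want to see the shape of the `row_number()` approach, here is a minimal hand-written sketch in plain SQL. It is not the macro's compiled output; the CTE names `my_cte` and `numbered` are made up for illustration, and the sample rows are the ones used in the options below.

```sql
-- Sketch only: hand-written SQL, not what the macro actually compiles to.
with my_cte as (
    select 1 as user_id, cast(null as date) as created_at, 1 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version
),

numbered as (
    select
        _inner.*,
        -- partition by groups null created_at values together,
        -- so rows with nulls are not lost
        row_number() over (
            partition by user_id, created_at
            order by version desc
        ) as rn
    from my_cte as _inner
)

-- Shape of the row_alias='rn' case: select * keeps the helper column.
select *
from numbered
where rn = 1

-- Shape of the columns=[...] case: name the columns so rn is left out.
-- select user_id, created_at, version from numbered where rn = 1
```

The trade-off is visible in the final select: without a column list, the only portable way to return the row is `select *`, which carries `rn` along.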

Option 1

models/my_model_1.sql

    {{ config(materialized="table") }}

    select 1 as user_id, cast(null as date) as created_at, 1 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version

models/deduped_1.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_1'),
            partition_by='user_id, created_at',
            order_by='version desc'
        ) | indent
    }}

    dbt build -s +deduped_1
    dbt show -s deduped_1

| user_id | created_at | version |
| --- | --- | --- |
| 1 |  | 2 |

Option 2

models/my_model_2.sql

    {{ config(materialized="ephemeral") }}

    select 1 as user_id, cast(null as date) as created_at, 1 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version

models/deduped_2.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_2'),
            partition_by='user_id, created_at',
            order_by='version desc',
            row_alias='rn'
        ) | indent
    }}

    dbt build -s +deduped_2
    dbt show -s deduped_2

| user_id | created_at | version | rn |
| --- | --- | --- | --- |
| 1 |  | 2 | 1 |

Option 3

models/my_model_3.sql

    {{ config(materialized="ephemeral") }}

    select 1 as user_id, cast(null as date) as created_at, 1 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version

models/deduped_3.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_3'),
            partition_by='user_id, created_at',
            order_by='version desc',
            columns=['user_id', 'created_at', 'version']
        ) | indent
    }}

    dbt build -s +deduped_3
    dbt show -s deduped_3

| user_id | created_at | version |
| --- | --- | --- |
| 1 |  | 2 |

Option 4

models/my_model_4.sql

    {{ config(materialized="ephemeral") }}

    select 1 as user_id, cast(null as date) as created_at, 1 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version

Warning

This is the one case that is not guaranteed to be null-safe (it depends on the adapter).

models/deduped_4.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_4'),
            partition_by='user_id, created_at',
            order_by='version desc',
        ) | indent
    }}

    dbt build -s +deduped_4
    dbt show -s deduped_4

Here's the warning that will be logged:

    Warning: the implementation of the `deduplicate` macro for the `postgres` adapter is not null safe.
    Set `row_alias` within calls to `deduplicate` to achieve null safety (which will also add it as an extra column to the output).

    e.g.,
        {
            dbt_utils.deduplicate(
                'my_cte',
                partition_by='user_id',
                order_by='version desc',
                row_alias='rn'
            ) | indent
        }

    Warning triggered by model: my_project.deduped_4
    dbt project / package: my_project
    path: models/deduped_4.sql

| user_id | created_at | version |
| --- | --- | --- |
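One way a deduplication can lose every row like this is a natural join back to the source, which is the style of implementation that #824 proposes deprecating. The following is a hand-written sketch under that assumption, not necessarily the exact SQL the `postgres` adapter ran here; `my_cte` and `row_numbered` are illustrative names.

```sql
-- Sketch only: a natural-join style deduplication written out by hand.
with my_cte as (
    select 1 as user_id, cast(null as date) as created_at, 1 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version union all
    select 1 as user_id, cast(null as date) as created_at, 2 as version
),

row_numbered as (
    select
        _inner.*,
        row_number() over (
            partition by user_id, created_at
            order by version desc
        ) as rn
    from my_cte as _inner
)

select distinct data.*
from my_cte as data
-- natural join compares every shared column with "=", including created_at;
-- null = null is never true, so no row finds a match
natural join row_numbered
where row_numbered.rn = 1
```

Because every sample row has a null `created_at`, the join condition is never satisfied and the query returns no rows, even though `row_numbered` itself contains the row that should have been kept.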

Key history of `deduplicate` macro

Checklist

I followed guidelines to ensure that my changes will work on "non-core" adapters
👈 I have updated the README.md (if applicable)
👈 I have added tests & descriptions to my models (and macros if applicable)
👈 I have added an entry to CHANGELOG.md

graciegoheen · 2023-07-26T22:09:20Z

macros/sql/deduplicate.sql

+        from {{ relation }} as _inner
+    )
+
+    select *


What databases allow for `minus` or `except` syntax? I know Snowflake does - that could be an option for removing the extra column. Though maybe in that case you'd just use `qualify`.

dbeatty10 · Jul 27, 2023

How would `minus` or `except` work to remove extra column(s)? Do you mean `select * exclude ( <col_name>, <col_name>, ... )`?

This would be the perfect solution if we could rely on it! 💡

But it is not in the SQL standard, and the databases that don't have `qualify` are probably missing `select * exclude (...)` as well. So I don't think we'll be able to reliably use it as part of the default implementation 😢.

`select * exclude (...)`

Snowflake has `select * exclude`, and so does DuckDB.

`select * except (...)`

Because it's not in the standard, other databases use `except` instead of `exclude`: BigQuery uses `except`, as does Databricks.
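For comparison, the two spellings look roughly like this when used to drop a helper column. This is a sketch: `numbered` is a hypothetical CTE or table that already carries an `rn` column from `row_number()`, and each statement only runs on the databases named in its comment.

```sql
-- Snowflake / DuckDB spelling
select * exclude (rn)
from numbered
where rn = 1;

-- BigQuery / Databricks spelling
select * except (rn)
from numbered
where rn = 1;
```

Neither form is standard SQL, which is why a default implementation can't lean on either of them.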

graciegoheen · Jul 27, 2023

Sorry, yes I meant `exclude`. What about using the `star` macro with the `except` argument?

dbeatty10 · Jul 28, 2023

The initial implementation in #512 used the `star` macro, but it was removed in #548.

I haven't considered the details of how we might be able to bring it back or what those implications would be.

I think we'd still need to handle the case where the relation is a CTE name instead of a `Relation`. That's the case this draft PR covers with the `row_alias` parameter. An alternative way to cover it would be a `columns` parameter, as suggested here. Letting the end user choose between `row_alias` and `columns` would provide the most flexibility.

dbeatty10 · Aug 1, 2023

@graciegoheen your idea about using the `star` macro inspired fe03f43.

It retrieves columns similarly to `dbt_utils.star` if and only if:

- `relation` is a `Relation`
- `relation` is not an ephemeral CTE

Otherwise, a user can pass a list of columns manually (d46676e). Or they can specify a row_alias that is acceptable to them.
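To make the `star`-based idea from this thread concrete, here is a rough model-level sketch. It is not the code in fe03f43; it assumes `my_model_1` is materialized as a table or view so that `dbt_utils.star` can look up its columns at compile time, and `numbered` is an illustrative CTE name.

```sql
-- Sketch only: illustrates the idea, not the macro's implementation.
with numbered as (
    select
        _inner.*,
        row_number() over (
            partition by user_id, created_at
            order by version desc
        ) as rn
    from {{ ref('my_model_1') }} as _inner
)

select
    -- star expands to the model's own column list, so rn is left out
    {{ dbt_utils.star(from=ref('my_model_1')) }}
from numbered
where rn = 1
```

If `my_model_1` were ephemeral there would be no database object to inspect, which is the case the `row_alias` and `columns` arguments are meant to cover.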

dbeatty10 · 2023-07-31T23:41:54Z

macros/sql/deduplicate.sql

@@ -104,7 +104,10 @@ path: {}
    {% set row_alias = kwargs.get('row_alias') %}
    {% set columns = kwargs.get('columns') %}

-    {% if row_alias != None or columns != None %}
+    {% if relation.is_cte is defined and not relation.is_cte %}


This is a simplified alternative to this and this to determine if a Relation is backed by a real database object whose columns can be fetched via the star macro.

i.e., the columns of tables and views can be retrieved via `information_schema.columns` (or an equivalent), but the columns of a CTE can't.
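As a rough illustration of that lookup, a query like the one below lists the columns of a materialized relation on databases that expose the standard `information_schema`. The schema name `analytics` is hypothetical, and this is not the query the macro itself issues; adapters may use their own catalog equivalents.

```sql
-- Sketch only: schema and table names are placeholders.
select column_name
from information_schema.columns
where table_schema = 'analytics'
  and table_name = 'my_model_1'
order by ordinal_position;
```

A CTE exists only inside the statement that defines it, so it never has rows in this catalog view.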

github-actions · 2024-01-29T01:43:58Z

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

Null safety for deduplicate via row_alias keyword argument

3eced4d

graciegoheen reviewed Jul 26, 2023

dbeatty10 added 3 commits July 31, 2023 16:37

Null safety for deduplicate via columns keyword argument

d46676e

Null safety for deduplicate when relation is not a CTE

fe03f43

Update caveats

e13d72d

dbeatty10 commented Jul 31, 2023

dbeatty10 changed the title ~~Null safety for deduplicate via row_alias keyword argument~~ Methods to achieve null safety for deduplicate Aug 1, 2023

dbeatty10 mentioned this pull request Aug 11, 2023

Deprecate the implementation of deduplicate that uses a natural join #824

Closed

5 tasks

github-actions bot added the Stale label Jan 29, 2024

dbeatty10 removed the Stale label Jan 29, 2024

dbeatty10 added the bug Something isn't working label Apr 18, 2024
