Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Methods to achieve null safety for deduplicate #815

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
93 changes: 91 additions & 2 deletions macros/sql/deduplicate.sql
Original file line number Diff line number Diff line change
@@ -1,8 +1,40 @@
{%- macro deduplicate(relation, partition_by, order_by) -%}
{{ return(adapter.dispatch('deduplicate', 'dbt_utils')(relation, partition_by, order_by)) }}
{{ return(adapter.dispatch('deduplicate', 'dbt_utils')(relation, partition_by, order_by, **kwargs)) }}
{% endmacro %}

{%- macro default__deduplicate(relation, partition_by, order_by) -%}
{#
-- ⚠️ This macro drops rows that contain NULL values ⚠️

-- The implementation below uses a natural join which avoids returning an
-- extra column at the cost of not being null safe.

-- dbt_utils._safe_deduplicate is an alternative that avoids dropping rows
-- that contain NULL values at the cost of adding an extra column.
#}
{%- macro _unsafe_deduplicate(relation, partition_by, order_by) -%}

{%- set error_message = "
Warning: the implementation of the `deduplicate` macro for the `{}` adapter is not null safe. \

Set `row_alias` within calls to `deduplicate` to achieve null safety (which will also add it \
as an extra column to the output).

e.g.,
{{
dbt_utils.deduplicate(
'my_cte',
partition_by='user_id',
order_by='version desc',
row_alias='rn'
) | indent
}}

Warning triggered by model: {}.{}
dbt project / package: {}
path: {}
".format(target.type, model.package_name, model.name, model.package_name, model.original_file_path) -%}

{%- do exceptions.warn(error_message) -%}

with row_numbered as (
select
Expand All @@ -29,6 +61,63 @@

{%- endmacro -%}

{#
-- For data platforms that don't support QUALIFY or an equivalent, the
-- best we can do to ensure null safety is to use a window function +
-- filter (which returns an extra column):
-- https://modern-sql.com/caniuse/qualify
#}
{%- macro _safe_deduplicate(relation, partition_by, order_by, row_alias="rn", columns=none) -%}

{% if not row_alias %}
{% set row_alias = "rn" %}
{% endif %}

with row_numbered as (
select

{% if columns != None %}
{% for column in columns %}
{{ column }},
{% endfor %}
{% else %}
_inner.*,
{% endif %}

row_number() over (
partition by {{ partition_by }}
order by {{ order_by }}
) as {{ row_alias }}
from {{ relation }} as _inner
)

select *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What databases allow for minus or except syntax? I know snowflake does - that could be an option for removing the extra column. Though maybe in that case you'd just use qualify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would minus or except work to remove extra column(s)? Do you mean select * exclude ( <col_name>, <col_name>, ... )?

This would be the perfect solution if we could rely on it! 💡

But it is not in the SQL standard, and the databases that don't have qualify are probably missing select * exclude (...) as well. So I don't think we'll be able to reliably use it as part of the default implementation 😢.


select * exclude (...)

Snowflake has select * exclude:

image

And so does DuckDB:

image

select * except (...)

And because it's not in the standard, other databases use except instead of exclude.

BigQuery uses except:

image

As does Databricks:

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, yes I meant exclude. What about using the star macro with the except argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial implementation in #512 used the star macro but it was removed in #548.

I haven't considered the details of how we might be able to bring it back or what those implications would be.

I think we'd still need to handle the case where the relation is a CTE name instead of a Relation. That's the case that this draft PR is covering with the row_alias parameter. An alternative way to cover it would be a columns parameter like suggested here. Allowing the end user to choose between either row_alias or columns would provide the most optionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@graciegoheen your idea about using the star macro inspired fe03f43.

It retrieves columns similarly to dbt_utils.star IFF:

  • relation is a Relation
  • relation is not an ephemeral CTE

Otherwise, a user can pass a list of columns manually (d46676e). Or they can specify a row_alias that is acceptable to them.

from row_numbered
where {{ row_alias }} = 1

{%- endmacro -%}

{#
-- ⚠️ This macro drops rows that contain NULL values unless one of the following is true:
-- - `relation` parameter is a non-CTE dbt Relation
-- - `row_alias` parameter is included
-- - `columns` parameter is included
#}
{%- macro default__deduplicate(relation, partition_by, order_by) -%}
{% set row_alias = kwargs.get('row_alias') %}
{% set columns = kwargs.get('columns') %}

{% if relation.is_cte is defined and not relation.is_cte %}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a simplified alternative to this and this to determine if a Relation is backed by a real database object whose columns can be fetched via the star macro.

i.e., the columns for tables and views can be retrieved via information_schema.columns (or an equivalent), but can't for CTE.

{% set columns = dbt_utils.get_filtered_columns_in_relation(relation) %}
{{ dbt_utils._safe_deduplicate(relation, partition_by, order_by, columns=columns) }}
{% elif row_alias != None or columns != None %}
{{ dbt_utils._safe_deduplicate(relation, partition_by, order_by, row_alias=row_alias, columns=columns) }}
{% else %}
{{ dbt_utils._unsafe_deduplicate(relation, partition_by, order_by) }}
{% endif %}

{%- endmacro -%}

-- Redshift has the `QUALIFY` syntax:
-- https://docs.aws.amazon.com/redshift/latest/dg/r_QUALIFY_clause.html
{% macro redshift__deduplicate(relation, partition_by, order_by) -%}
Expand Down