Default implementation of `deduplicate` is not null-safe #621

dbeatty10 · 2022-07-16T15:36:14Z

Describe the bug

@belasobral93 discovered that `deduplicate` doesn't work on Spark when any of the `order_by` columns contain nulls.

Root cause

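The root cause is that Spark defaults to NULLS FIRST as the null sort order for an ascending ORDER BY clause. A row whose order_by value is NULL therefore receives row number 1, and the natural join back to the original relation cannot match it, because NULL never compares equal to NULL.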

Why `dbt_utils` rather than `spark_utils`?

dbt_utils doesn't officially support Spark, so why is this being reported here (instead of in spark_utils)?

dbt_utils provides five implementations of `deduplicate`:

- default
- redshift
- postgres
- snowflake
- bigquery

dbt_utils only tests against the four databases listed above, and each of them has its own override. As a result, nothing tests the default implementation, which is the one that other dbt adapters (like Spark) inherit.

Steps to reproduce

The current implementation essentially acts like the following example:

with relation as (

    select 1 as partition_by, 'a' as order_by
    union all
    select 1 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'b' as order_by
    union all
    select 2 as partition_by, NULL as order_by

),

row_numbered as (

    select
        _inner.*,
        row_number() over (
            partition by partition_by
            order by order_by
        ) as rn
    from relation as _inner

)

select
    distinct relation.*
from relation
natural join row_numbered
where row_numbered.rn = 1
;

Expected results

| partition_by | order_by |
|---|---|
| 1 | a |
| 2 | a |

Actual results

| partition_by | order_by |
|---|---|
| 1 | a |
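
To see where partition 2 goes, it can help to inspect the intermediate row numbers before the join. The query below is only a sketch: it reuses the `relation` CTE from the reproduction and selects straight from `row_numbered`. The comments describe what is expected on an engine that sorts NULLs first in ascending order, as Spark does.

```sql
with relation as (

    select 1 as partition_by, 'a' as order_by
    union all
    select 1 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'b' as order_by
    union all
    select 2 as partition_by, NULL as order_by

),

row_numbered as (

    select
        _inner.*,
        row_number() over (
            partition by partition_by
            order by order_by
        ) as rn
    from relation as _inner

)

-- With NULLs sorted first, the (2, NULL) row is expected to get rn = 1.
-- The natural join then compares NULL = NULL on order_by, which is never
-- true, so no row from partition 2 survives the rn = 1 filter.
select *
from row_numbered
order by partition_by, rn
;
```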

Screenshots and log output

Not provided.

System information

Not provided.

Which database are you using dbt with?

The output of dbt --version:
Not provided.

Additional context

To make the fix, it might be as simple as:

- adding `nulls last` after `order_by` (when an order by clause exists, of course)
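
A minimal sketch of that idea, applied to the reproduction query from this issue. The only change is the explicit `nulls last` in the window's `order by`; everything else is copied from the example above. It assumes an engine that accepts `nulls last` inside a window specification, which Spark does.

```sql
with relation as (

    select 1 as partition_by, 'a' as order_by
    union all
    select 1 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'b' as order_by
    union all
    select 2 as partition_by, NULL as order_by

),

row_numbered as (

    select
        _inner.*,
        row_number() over (
            partition by partition_by
            -- Explicit null ordering: a NULL order_by value can no longer
            -- take rn = 1 while the partition has a non-NULL value.
            order by order_by nulls last
        ) as rn
    from relation as _inner

)

select
    distinct relation.*
from relation
natural join row_numbered
where row_numbered.rn = 1
;
```

With this change the query should return the two rows listed under "Expected results". It is not a complete fix: if every `order_by` value in a partition is NULL, the row with `rn = 1` still has a NULL join key, so the natural join would still drop that partition.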

To confirm the bug and establish sufficient test cases, we could:

1. comment out all adapter-specific overrides for `deduplicate`
2. re-run the test suite
3. add test cases as needed
4. re-run the test suite
5. make fixes to the implementation(s)
6. re-run the test suite

Are you interested in contributing the fix?

I will contribute the fix.

joellabes · 2022-09-14T04:16:54Z

> To make the fix, it might be as simple as:
>
> adding nulls last after order_by (when an order by clause exists, of course)

Is `nulls [last|first]` a well-known part of the SQL spec? I've never heard of it, and a bit of quick Googling implies that it's specific-ish.

A couple of things that are only in my head/true by convention but not documented:

1. Deviations from the norm (default__ implementations) should be handled in an override. (Strongly held)
2. "The norm" is vanilla-flavoured things that can reasonably be expected of pretty much everything. ¹ (Pretty strongly held)
3. I tend to assume that Postgres is vanilla-flavoured, as the longest-lived and only-OSS DB in the original core four. (Not at all strongly held, but don't have a better idea)

¹ Lookin' at you, BigQuery types

The approach that would be truest to points 1 and 3 would be:

- The default implementation changes to Postgres'
- Anyone who can't do that builds their own

Which is right if we want a globally coherent and understandable set of principles. It's not super pragmatic though; it appears that 10 data platforms support our current implementation, and as far as I can see most of them would not support Postgres' because they don't have distinct on.

In my opinion, "Vanilla as default" is more important than "every default implementation is what would work best in Postgres". (In fact, this might be a case where the default implementation works on PGSQL, but there is a special optimisation that led to the custom version? I haven't checked).

So that's a lot of words to say that if MySQL, Redshift, Materialize, or SingleStore don't understand `nulls last`, then we are inconveniencing some new adapters by throwing it in, and maybe we're better off just specifying `nulls last` in the Spark version.
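
For illustration only, a Spark-specific override along those lines might look like the sketch below. This is an assumption, not code from dbt_utils or spark_utils: the macro name follows dbt's `<adapter>__<macro>` dispatch convention, the `(relation, partition_by, order_by)` signature is assumed to match the dispatched `deduplicate` macro, and the body mirrors the query shown in the reproduction.

```sql
{# Hypothetical override; not part of dbt_utils or spark_utils. #}
{# Assumes the dispatched signature is (relation, partition_by, order_by). #}
{%- macro spark__deduplicate(relation, partition_by, order_by) -%}

    with row_numbered as (
        select
            _inner.*,
            row_number() over (
                partition by {{ partition_by }}
                {# nulls last binds only to the final expression in order_by #}
                order by {{ order_by }} nulls last
            ) as rn
        from {{ relation }} as _inner
    )

    select
        distinct data.*
    from {{ relation }} as data
    natural join row_numbered
    where row_numbered.rn = 1

{%- endmacro -%}
```

If a macro like this lived in a project rather than in a package, dbt's dispatch search order for the `dbt_utils` namespace would also need to include that project. Because `nulls last` only attaches to the last expression, a multi-column `order_by` would need the null ordering written per column by the caller. A partition whose `order_by` values are all NULL would still be dropped by the natural join.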

joellabes · 2022-12-05T06:28:46Z

Related (in that it's also a deduplicate issue): #713

github-actions · 2023-07-29T01:44:22Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions · 2023-08-05T01:47:22Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

nitindatta · 2024-04-27T15:21:23Z

Does anyone have any workarounds for this?

github-actions · 2024-10-25T01:58:56Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions · 2024-11-01T02:03:58Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

dbeatty10 added the bug and triage labels and removed the triage label Jul 16, 2022

dbeatty10 mentioned this issue Jul 16, 2022

Test the default implementation against all supported adapters #622

Closed

16 tasks

dbeatty10 changed the title ~~Default implementation of deduplicate is not covered by testing~~ Default implementation of deduplicate is not null-safe Jul 16, 2022

graciegoheen mentioned this issue Apr 26, 2023

deduplicate macro for Databricks now uses the QUALIFY clause, which fixes NULL columns issues from the default natural join logic #786

Merged

17 tasks

dbeatty10 mentioned this issue Jul 26, 2023

Methods to achieve null safety for deduplicate #815

Draft

4 tasks

github-actions bot added the Stale label Jul 29, 2023

github-actions bot closed this as not planned Aug 5, 2023

dbeatty10 removed the Stale label Apr 18, 2024

dbeatty10 reopened this Apr 18, 2024

github-actions bot added the Stale label Oct 25, 2024

github-actions bot closed this as not planned Nov 1, 2024
