deduplicate bug in Spark in case of a null column. #814

clintf1982 · 2023-07-20T12:54:40Z

Describe the bug

Same bug as #713, but for Spark.
When a column is null (even one unrelated to the partition by or order by columns), Spark doesn't return the expected rows from the default implementation of deduplicate.

Steps to reproduce

The Python code below reproduces the bug; the SQL in it was copied from the deduplication code.

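  from pyspark.sql import SparkSession

  # Assumption: any active SparkSession works here; the original snippet used an existing `session`.
  session = SparkSession.builder.getOrCreate()

  # col3 is null in every row and is not used in partition by / order by.
  sql = """
  with relation as (
  select * from ( values (1, 3, null), (1, 2, null), (1, 1, null))
  ), row_numbered as (
      select
          _inner.*,
          row_number() over (
              partition by col1
              order by col2
          ) as rn
      from relation as _inner
  )

  select
      distinct data.*
  from relation as data
  natural join row_numbered
  where row_numbered.rn = 1
  """
  df = session.sql(sql)
  df.show()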

Expected results

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   1|null|
+----+----+----+

Actual results

+----+----+----+
|col1|col2|col3|
+----+----+----+
+----+----+----+
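
The empty result is consistent with how a natural join compares columns: it joins on equality of every shared column, and `null = null` is never true in SQL, so the all-null col3 rules out every match. The sketch below illustrates this with Python's built-in sqlite3 module as a stand-in engine (an assumption made only so the example runs without a Spark cluster; SQLite 3.25 or newer is needed for `row_number()`). The second query swaps the natural join for an explicit null-safe join using SQLite's `is` operator; in Spark SQL the equivalent null-safe operator is `<=>`. This is one possible approach for illustration, not necessarily the fix chosen for the package.

```python
import sqlite3

# In-memory SQLite database standing in for Spark (assumption, see above).
conn = sqlite3.connect(":memory:")
conn.execute("create table relation (col1 integer, col2 integer, col3 integer)")
conn.executemany(
    "insert into relation values (?, ?, ?)",
    [(1, 3, None), (1, 2, None), (1, 1, None)],
)

# Same row-numbering step as in the reproduction above.
ROW_NUMBERED = """
with row_numbered as (
    select
        _inner.*,
        row_number() over (partition by col1 order by col2) as rn
    from relation as _inner
)
"""

# Natural join: col3 is compared with "=", so null never matches null.
natural_join_rows = conn.execute(ROW_NUMBERED + """
select distinct data.*
from relation as data
natural join row_numbered
where row_numbered.rn = 1
""").fetchall()

# Explicit null-safe join: "is" treats two nulls as equal in SQLite.
null_safe_rows = conn.execute(ROW_NUMBERED + """
select distinct data.*
from relation as data
join row_numbered
  on data.col1 is row_numbered.col1
 and data.col2 is row_numbered.col2
 and data.col3 is row_numbered.col3
where row_numbered.rn = 1
""").fetchall()

print(natural_join_rows)  # []
print(null_safe_rows)     # [(1, 1, None)]
```

The null-safe variant returns the single row shown under "Expected results", while the natural join returns nothing, matching "Actual results".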

Screenshots and log output

System information

The contents of your packages.yml file:

Which database are you using dbt with?

The output of dbt --version:

Core:
  - installed: 1.6.0-b8
  - latest:    1.5.3    - Ahead of latest version!

Plugins:

Additional context

Are you interested in contributing the fix?


github-actions · 2024-01-17T01:46:18Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions · 2024-01-24T01:47:00Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions · 2024-10-16T01:59:23Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions · 2024-10-24T01:58:31Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

clintf1982 added the bug and triage labels Jul 20, 2023

dbeatty10 mentioned this issue Jul 26, 2023

Methods to achieve null safety for deduplicate #815

Draft

4 tasks

github-actions bot added the Stale label Jan 17, 2024

github-actions bot closed this as not planned Jan 24, 2024

dbeatty10 removed the Stale label Apr 18, 2024

dbeatty10 reopened this Apr 18, 2024

github-actions bot added the Stale label Oct 16, 2024

github-actions bot closed this as not planned Oct 24, 2024
