Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-3373] [Feature] Incremental Strategy: 'insert_unmatched' #9056

Closed
3 tasks done
dbernett-amplify opened this issue Nov 11, 2023 · 3 comments
Closed
3 tasks done

[CT-3373] [Feature] Incremental Strategy: 'insert_unmatched' #9056

dbernett-amplify opened this issue Nov 11, 2023 · 3 comments
Labels
enhancement New feature or request incremental Incremental modeling with dbt wontfix Not a bug or out of scope for dbt-core

Comments

@dbernett-amplify
Copy link

dbernett-amplify commented Nov 11, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Overview

A new incremental strategy should be defined that has the following behavior:

  • When the row in the source table has no match in the destination table, it is inserted
  • When the row in the source table has a match in the destination table, nothing happens (i.e. the row in the destination table is left unchanged).

We might call this strategy insert_unmatched because it inserts only the rows that are not matched.

Implementation

This should be as simple as creating a new strategy that is identical to default__get_merge_sql but has the following chunk of code removed:

    {% if unique_key %}
    when matched then update set
        {% for column_name in update_columns -%}
            {{ column_name }} = DBT_INTERNAL_SOURCE.{{ column_name }}
            {%- if not loop.last %}, {%- endif %}
        {%- endfor %}
    {% endif %}

This results in a merge statement that includes a when not matched clause only.

(Note that I've tested that this works for Snowflake. Not sure about other platforms.)

Describe alternatives you've considered

Rather than making this a separate incremental strategy, we could also have a config argument passed to the existing merge strategy. Maybe something like update_matches then the logic in default__get_merge_sql could be updaed to:

    {% if unique_key and update_matches %}
    when matched then update set
        {% for column_name in update_columns -%}
            {{ column_name }} = DBT_INTERNAL_SOURCE.{{ column_name }}
            {%- if not loop.last %}, {%- endif %}
        {%- endfor %}
    {% endif %}

Who will this benefit?

Suppose that:

  1. You are certain that your source table contains no duplicates.
  2. You have late-arriving data in your source table (i.e. rows can arrive in an order different from the one specified by their timestamp)
    (This is actually quite common!)

If you are building an incremental model, you need to:

  1. Set a look back period so that you don't miss the late-arriving data (e.g., where session_start > dateadd(day, -3, current_date))
  2. Define a unique key. Because now you have an artificial duplicates problem that you created by re-processing the same data more than once (due to 1 above).

In this case, any matches that are detected when the merge statement is run can safely be assumed to be full duplicates of the row already existing in the destination table and can be discarded. (These are just your artificially created duplicates due to the lookback period.)

However, the current "merge" strategy overwrites the duplicated rows, which is a waste of compute. This can be a significant savings if the lookback period is fairly long relative to the time between job runs, resulting in re-processing a large number of duplicate rows.

Are you interested in contributing this feature?

Sure

Anything else?

No response

@dbernett-amplify dbernett-amplify added enhancement New feature or request triage labels Nov 11, 2023
@github-actions github-actions bot changed the title [Feature] Incremental Strategy: 'insert_unmatched' [CT-3373] [Feature] Incremental Strategy: 'insert_unmatched' Nov 11, 2023
@dbeatty10
Copy link
Contributor

dbeatty10 commented Jan 10, 2024

Thanks for publishing the logic you've developed -- I bet it will come in handy for some folks 🧠

Rather than add new incremental strategies in dbt-core, we'd rather encourage the ecosystem through:

  1. Enabling custom incremental strategies, and
  2. Foster sharing that custom logic through the dbt Package Hub

There's probably still some room to make these easier to author for custom incremental strategies that make use of MERGE statements, and we've opened #9223 as a result. Would definitely welcome your insights and opinions in that issue 🙏

In particular, #9223 (comment) does a demo of your proposed insert_unmatched logic using a custom incremental materialization.

@dbeatty10 dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Jan 10, 2024
@dbeatty10 dbeatty10 added wontfix Not a bug or out of scope for dbt-core and removed triage labels Jan 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request incremental Incremental modeling with dbt wontfix Not a bug or out of scope for dbt-core
Projects
None yet
Development

No branches or pull requests

2 participants