
Snowflake create or replace #1409

Merged (34 commits) on May 15, 2019

Conversation

@bastienboutonnet (Contributor) commented on Apr 22, 2019

I had a few chats with @drewbanin about solving an issue caused by Snowflake's lack of proper transactions: tables would end up truncated or dropped, and therefore unavailable, while full-refreshing incremental models or re-generating regular tables.

I originally suggested doing table swaps, but @drewbanin suggested we use create or replace instead, which makes a lot more sense and is neater to implement (no temporary tables to create and clean up, etc.).

Regarding incremental logic for Snowflake, @drewbanin pointed out that work had already started on using merge instead of inserts in #1307, so it made sense to build on top of that PR to address the on false issue (well, work around it) and rework the materialisation logic for incremental runs and tables.

Aims:

Incremental Materialisation/Merge:

  • When no unique_key is provided, we revert to a regular insert ..., since this case seemed to cause issues with on false (an illustrative statement follows this list).
  • Use merge for incremental models when a unique_key is provided (this part of the code remains pretty much unchanged from the PR referred to above).
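As a rough illustration only (the table and column names below are invented, not taken from this PR), the no-unique_key path boils down to a plain insert of the new rows:

insert into analytics.my_incremental_model (id, value)
select id, value
from analytics.my_incremental_model__dbt_tmp;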

Full-refresh and table materialisations

Leverage create or replace in Snowflake for full refreshes and table materialisations (an illustrative statement follows this list):

  • it's atomic
  • no downtime, no empty or missing tables
  • no need to worry about destructive vs non-destructive (makes it possible to remove --non-destructive in future versions)
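To illustrate (the schema and model names below are made up for the example), a table materialisation can then be reduced to a single atomic statement:

create or replace table analytics.my_model as (
    select *
    from analytics.stg_orders
);

Because Snowflake only swaps in the new table once the statement completes, downstream queries never see a dropped, empty, or half-built table.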

Relates to the following issues:

#525
#1101

@drewbanin (Contributor)

Thanks for opening this PR @bastienboutonnet - will give this a look today :)

drewbanin self-requested a review on April 24, 2019 at 00:14
@drewbanin (Contributor) left a comment

Some cosmetic comments here, and a couple of areas to simplify these materializations even further. I really like your approach for reconciling that issue with the on false clause in the incremental materialization.

This is really stellar! Happy to discuss any of the comments I dropped in here, otherwise, let me know when this is ready for another look. At that point, I'll kick off the integration tests and we can hopefully get this merged :D

@drewbanin (Contributor)

This PR closes #1379, #1101, #1414

@bastienboutonnet (Contributor, Author)

@drewbanin thanks a lot for reviewing this. I implemented most of your feedback. I still have a question regarding the --non-destructive block. But other than that I think we could merge pretty soon


{%- if unique_key is none -%}
{# -- if no unique_key is provided run regular insert as Snowflake may complain #}
insert into {{ target_relation }} ({{ dest_cols_csv }})
Contributor (commenting on the snippet above):

This is a really good fix for the on false issue with Snowflake's merge statements. Do you think it makes sense to put this logic here, or should we move it into the Snowflake implementation of get_merge_sql?

I like the idea of making materializations represent business logic instead of database logic, as they become a lot more generalizable. Curious what you think!

Contributor (Author):

I think that makes total sense! I was actually feeling a bit "awkward" about having this logic sit there, but didn't think too much about where else it could live. This is a very good suggestion, so I'm going to go ahead and change it as you suggest.

Contributor:

Great! I think this would be the place to implement it. If unique_key is provided, we can proceed with common_get_merge_sql; otherwise, we should return the insert statement you've built here.
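As a minimal sketch of the override being described here, assuming the Snowflake-specific macro is named snowflake__get_merge_sql and takes the same arguments as the common implementation (both assumptions, not the final code from this PR):

{% macro snowflake__get_merge_sql(target, source, unique_key, dest_columns) -%}
    {%- set dest_cols_csv = dest_columns | map(attribute='name') | join(', ') -%}
    {%- if unique_key is none -%}
        {#-- no unique_key: plain insert, since merge ... on false trips up Snowflake #}
        insert into {{ target }} ({{ dest_cols_csv }})
        select {{ dest_cols_csv }} from {{ source }}
    {%- else -%}
        {#-- unique_key provided: defer to the shared merge implementation #}
        {{ common_get_merge_sql(target, source, unique_key, dest_columns) }}
    {%- endif -%}
{%- endmacro %}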

Contributor (Author):

Yep, it's exactly what I just started doing!

Contributor (Author):

One thing: I realised there are no incremental deletes anymore, and the merge statement doesn't call a delete. Do you think we need it here?

Contributor:

The previous implementation of incremental models on Snowflake used delete statements to approximate an upsert. Before, we did:

create temp table as (select * from model code)
delete from destination table where destination unique key = temp table unique key
insert into destination table (select * from temp table)

So, records were only deleted if they were going to be immediately re-inserted. We'd actually prefer not to call a delete, and instead use the merge to update these rows in-place. This should be handled by the when matched clause in the merge statement.

I do think there's a conversation to be had about performance. I wonder if there's any difference between:

  1. Deleting existing records and reinserting them (with new values)
  2. Updating existing records in place

An example:

Destination table:

  unique_key | value
  ---------- | -----
  1          | abc
  2          | def

Temp table (generated from the model select):

  unique_key | value
  ---------- | -----
  2          | ghi
  3          | xyz

Desired destination table state:

  unique_key | value
  ---------- | -----
  1          | abc
  2          | ghi
  3          | xyz

So, there are two ways to accomplish this desired end-state. We can either (pseudocode):

1. delete + insert

delete from destination table where id = 2
insert into destination table (id, value) values (2, ghi), (3, xyz)

2. update + insert (via merge)

merge into destination table
from temp table
when matched update -- updates row with id = 2
when not matched insert -- adds rows with id = 3
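To make option 2 concrete, here is roughly what that merge looks like as actual Snowflake SQL for the example above (the table names and aliases are placeholders, not what dbt generates):

merge into destination_table as dest
using temp_table as src
    on dest.unique_key = src.unique_key
when matched then
    update set value = src.value                                   -- updates the row with unique_key = 2
when not matched then
    insert (unique_key, value) values (src.unique_key, src.value); -- adds the row with unique_key = 3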

This does raise an interesting question about edge-case behavior with merge. What happens if there are duplicate unique_ids in either 1) the destination table or 2) the staging table?

Previously, it was straightforward to understand how the delete + insert pattern behaved. While having a duplicated unique_key would probably lead to undesirable results, the insert and delete queries would execute successfully.

With the merge implementation, I think users will see an error about non-deterministic results if their unique_key is not unique! All told, I think this will actually be a good thing, as it should help alert users to bugs in their model code.

Contributor (Author):

Good catch. Based on what you say, here's what I think: merge is definitely the preferable option, and unless there's a really good reason for it, you should get an error if you're trying to insert dupes. There is probably something broken with the source in that case.

Alternatively, we could add support for the ERROR_ON_NONDETERMINISTIC_MERGE session parameter (when FALSE, Snowflake picks one of the duplicated rows and applies it), but there doesn't seem to be a clear way to control which row gets picked, and I think this is just bad anyway. I don't really see the point of inserting a dupe row, so I agree with your last point in that comment, and I think the current implementation is fine.
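For reference only (to illustrate the alternative being set aside here), that session parameter would be toggled like this in Snowflake:

alter session set ERROR_ON_NONDETERMINISTIC_MERGE = false;

With it set to false, Snowflake applies one arbitrary matching row instead of raising an error, so keeping the default of true and letting the error surface seems like the better way to catch duplicated unique_keys.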

@drewbanin (Contributor)

This PR is in really good shape! Just one comment about non-destructive mode, and maybe an interesting discussion to have about the job of the get_merge_sql statement, but otherwise I really like all of this!

Can you take a pass through and remove/update any "todo" comments in here? Definitely let me know if you still have outstanding questions about these things :)

@drewbanin (Contributor)

@bastienboutonnet just fixed a merge conflict (we updated dev/wilt-chamberlain) and the tests should be running now!

@bastienboutonnet (Contributor, Author) commented on Apr 30, 2019

Awesome! Should I be worried that it looks like many tests are failing?

@drewbanin (Contributor) left a comment

Approved! Thanks for your hard work here @bastienboutonnet - this is going to be a really wonderful addition to dbt on Snowflake ❄️ 🎉 💯
