Add non-destructive functionality to Snowflake table materializations #1972

Closed
wants to merge 18 commits

Conversation

liveandletbri

Problem

Snowflake's time travel functionality is most easily usable when a table is not dropped and re-created every time it is refreshed. For incrementally-loaded models, this is fine: the table is only dropped when we perform a full refresh. But for materialized: table models, this is an issue, since these tables are dropped regularly (for us, many of them hourly).

Time travel is very useful for troubleshooting, as it allows you to see exactly what data a table contained at an exact time. However, each time the table is dropped, the "time travel history" starts over at that point. I cannot query a table's contents from 12:15 PM if it was dropped and re-built any time between then and now. Well, not easily.

Snowflake can show you a table's history using the show tables history command. For a table refreshed hourly, you might see something like this:

table_name    date_dropped
my_table      NULL
my_table      12/3/2019 3:00 PM
my_table      12/3/2019 2:00 PM
my_table      12/3/2019 1:00 PM
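
For reference, the command that produces output like this looks roughly like the following (a sketch; the like pattern is an assumption, and the history window you can see depends on your account's time travel retention settings):

show tables history like 'my_table';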

It's currently 3:45 PM. To find what the data looked like for my_table at 12:15 PM, I would have to do this:

alter table my_table rename to my_table_current; --set aside production version of table so nothing is using the name "my_table" anymore
undrop table my_table; --allows me to uncover the version of the table dropped at 3:00, which now takes up the unique name "my_table"
alter table my_table rename to my_table_3PM; --move the 3PM version out of the way, freeing up the name again
undrop table my_table;
alter table my_table rename to my_table_2PM;
undrop table my_table;
alter table my_table rename to my_table_1PM; --now I finally have the version of the table that was dropped at 1:00, so I can restore the current version back to its original name
alter table my_table_current rename to my_table;
drop table my_table_3PM;
drop table my_table_2PM;

select *
from my_table_1PM 
...

Solution

By adding the non-destructive functionality back, tables can be preserved so investigation is as easy as this:

SELECT * 
FROM my_table at (timestamp => '2019-12-03 12:15:00'::timestamp_ltz)
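
Incidentally, Snowflake's time travel clause also accepts a relative offset in seconds, which is handy when you don't have an exact timestamp on hand (a small aside, not part of this change):

SELECT * 
FROM my_table at (offset => -60*60*3) -- table state as of three hours ago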

I know that the --non-destructive flag was removed in 0.14.0 (#1419), but I'm hoping this version of its functionality will be easier to maintain for the following reasons:

  • It only affects table materializations, so you won't have any issues with view or incremental models
  • It's a config argument, not a flag, so you won't have to account for its functionality in any files other than the table materialization file(s)
  • On that note, I've set it up to only apply to Snowflake's table materializations, since according to your docs it isn't useful on any other platform.
  • It uses delete instead of truncate to avoid auto-committing the transaction (see the sketch after this list)
  • It should fail the same way an incremental load would fail if you changed the columns on your table, so there's nothing "tricky" or "pernicious" when columns change (but shoutout to Adding Full Refresh on schema change option for model config #1850 which would be a cool fix for all of that)
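
Concretely, the refresh this materialization issues boils down to something like the following (a minimal sketch of the pattern rather than the actual Jinja in the PR; my_table and my_source are hypothetical names):

-- first run only: create the table if it doesn't exist yet (CREATE is DDL and auto-commits)
create table if not exists my_table as
select * from my_source where false;

begin;
delete from my_table; -- delete, not truncate: truncate is DDL in Snowflake and would auto-commit
insert into my_table select * from my_source;
commit;

Because the delete and insert share one transaction, readers never see a half-refreshed table, and since the table is never dropped, its time travel history stays intact.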

Questions

As a first-time contributor to dbt, I have a few questions:

  1. Is it alright that I use delete explicitly, rather than adding functionality to the adapter so I could write {{ adapter.delete_relation }}? I saw the existing file already used insert explicitly, and here's another file that uses delete explicitly.
  2. When I started, the Snowflake table materialization file did not have any of the intermediate or backup relation logic that populates, renames, and drops tables. All it had was a create_table_as call. I can see that logic in the core table file, though. How does dbt know, when I run my refreshes against Snowflake, to use this logic?
  3. As a follow-up to question 2, is there any way I can reduce the code in the Snowflake table file, given that a lot of it already exists in the core table file?

@cla-bot

cla-bot bot commented Dec 4, 2019

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @liveandletbri

@liveandletbri
Author

I just signed the CLA

@liveandletbri
Author

I spoke with @beckjake and he recommended that I make this custom materialization. I'm all for that! Closing this PR then.

@drewbanin
Contributor

hey @liveandletbri - thanks for opening this PR! Glad you got in touch with Jake - this is a neat PR, but it's not something I anticipate adding support for in dbt-core. I think a custom materialization is a really good idea here :)

I did want to tell you that this PR compelled me to add some more info to our contributing guide for dbt. We want to be supportive of folks contributing code back to dbt, and I felt that our policies around how new features are contributed to the project were poorly specified. No action is required on your part here - just wanted to give you a heads up that this exists now :D
