Support "cluster by" on Snowflake #634
Docs: https://docs.snowflake.net/manuals/sql-reference/sql/create-table.html

The `cluster by` keyword also requires a column definition list in the `create table as` statement (see the examples in the comments below).

This can be implemented with a custom table materialization override for Snowflake. This implementation would mirror the existing sort/dist configs on Redshift.
Issue dbt-labs#634 add cluster_by support for snowflake data warehouses.
I'm kicking this out of the 0.10.2 milestone. The code to add `cluster by` support is simple enough, but as noted above, the keyword requires a column definition list in the `create table as` statement.

So, this is not valid in Snowflake:

    create table "dbt_dbanin".test_cluster
    cluster by (id)
    as (
        select 1 as id
    );

Whereas this is valid:

    create table "dbt_dbanin".test_cluster (
        id int
    ) cluster by (id)
    as (
        select 1 as id
    );

dbt doesn't have a good mechanism for adding a column list to the `create table as` statement it generates. dbt could run the clustering as a sort of post-hook after the table is created. Conceivably, dbt could hide this implementation detail behind a `cluster_by` config.

Keen to discuss if anyone has any ideas or comments!
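For illustration, a minimal sketch of what that post-hook idea could look like in raw Snowflake SQL, reusing the example table above (this is an assumption about the mechanics, not a statement of how dbt implements it):

```sql
-- Sketch of the "cluster after create" idea, using the thread's example table.
-- 1. Build the table without any clustering clause, as a plain CTAS.
create or replace table "dbt_dbanin".test_cluster as (
    select 1 as id
);

-- 2. Declare the clustering key afterwards, e.g. from a post-hook.
alter table "dbt_dbanin".test_cluster cluster by (id);
```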
It's been a while since the last comment, but this would be a very nice feature to have. Clustering keys have made a real difference for us. It's hard to do this with a straight CTAS, but I think we can do it via a view plus a separate `create table` step.

Does this seem workable as a Snowflake-specific extension to the table materialization? Supporting changing clustering keys for incremental or non-destructive tables could be tricky - maybe that is a better use of automatic reclustering, or we could only set clustering keys when the table is first created.
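A rough sketch of how the view-plus-table idea might look in Snowflake SQL; the schema, model, and column names are invented for illustration, and this is one plausible reading of the workaround rather than the exact mechanism proposed:

```sql
-- Hypothetical sketch: stage the model's query in a view, then build the
-- clustered table with an explicit column list (satisfying the requirement
-- that `cluster by` in a CTAS comes with a column definition list).
create or replace view analytics.my_model__tmp as
    select 1 as id, current_date as created_at;

create or replace table analytics.my_model (
    id int,
    created_at date
) cluster by (created_at)
as (
    select id, created_at
    from analytics.my_model__tmp
);

-- Clean up the intermediate view.
drop view if exists analytics.my_model__tmp;
```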
@gkushida thanks for the note -- I did see the Snowflake update about automatic clustering. Your solution is pretty clever! I do think that would work, but I don't love the idea of rearchitecting dbt's approach to model building on Snowflake solely to support clustering. My hope would be that Snowflake could drop the constraint that a column list is provided to add clustering to a table, since your example makes it pretty clear that that's not strictly necessary.

I'd say: if you desperately want to include clustering in your dbt models, you can do that by overriding dbt's implementation of `create_table_as`. You could do this by copying the `snowflake__create_table_as` macro into your own project and modifying it there.

There are some drawbacks to this approach - we sometimes need to make changes to these macros in dbt-core, and they may not be compatible with your own version of the macro. It is an option that's available to you if you want to use it, though! Let me know what you end up doing, or if you have other ideas here :)
I suppose it depends on the size of your data - if your tables are small, then it doesn't really matter whether you optimize clustering keys. But for large tables - especially time-series tables where users generally query a subset - we have found appropriate clustering keys to give a dramatic improvement in partitions scanned and response time. IMO the improvements to response time and cost efficiency outweigh the benefits of adapter consistency. Not having clustering keys in Snowflake is kind of like not having distribution and sort keys in Redshift.

Yes, Snowflake should do some auto-magic optimization, but I think relying on that violates the dbt spirit of trivially re-creatable models. Also, the BQ table materialization is already significantly different, and Snowflake is subtly different enough from the base table materialization - so dbt has already incurred some overhead for changes to the table materialization.

I also ran into a similar issue with another Snowflake-specific option. I was going to go down the 'forked-macro' route, but that seemed like a last resort (a better resort than waiting for Snowflake, though...).
Hey @gkushida - thanks for following up! I buy the argument that clustering keys are important and compelling on large Snowflake tables. I've given this some more thought, and I think there is a simple way to accomplish effective table clustering on Snowflake. Curious to hear your thoughts about this:

1. Wrap the model's query in an `order by` on the configured cluster keys when the table is created, so the data lands roughly pre-clustered.
2. Set the clustering keys on the table afterwards with an `alter table ... cluster by (...)` statement, rather than in the `create table as` itself.
When 1 & 2 are combined, I think this accomplishes effective clustering for Snowflake tables for both dbt's table and incremental materializations.
This also necessitates that we add a config to enable automatic clustering for Snowflake tables, which I think would be a good idea. Given that Snowflake only needs to recluster a table when its clustering degrades [1], enabling automatic clustering on tables that are built in sorted order should be relatively cheap.

[1] I unfortunately can't find a source for this, but I learned it from a Snowflake sales engineer on a call, and it empirically appears to be true!

Last: our next release of dbt will actually include the ability to mark Snowflake tables as transient (docs, code).

Just threw a lot at you here -- this is a super neat discussion! Excited to hear what you think
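For concreteness, a sketch of what the two steps above could look like in raw Snowflake SQL; the schema, table, and column names are invented, and the statements are illustrative rather than what dbt ultimately generates:

```sql
-- 1. Build the table with the data pre-sorted by the intended cluster key.
create or replace table analytics.events as (
    select id, event_type, created_at
    from raw.events
    order by created_at
);

-- 2. Declare the clustering key on the finished table.
alter table analytics.events cluster by (created_at);

-- Optionally, let Snowflake maintain the clustering over time
-- (automatic clustering / reclustering, where enabled on the account).
alter table analytics.events resume recluster;
```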
@drewbanin Thank you for the thorough description! This definitely seems like a workable approach - we have heard similar things from our SA and observed similar results in casual testing.

Re: automatic clustering - my understanding is that this is enabled by default for clustered tables if you are in the preview, so presumably it will also be enabled by default whenever it goes to GA. If you want to add a flag to disable automatic reclustering, I think it would have to run an `alter table ... suspend recluster` statement. Off the top of my head that flag doesn't seem terribly useful, though - if you know the table isn't going to change, then you don't really need to suspend reclustering, and if it is going to change, then you probably want it to be fully clustered. Maybe you would want to suspend in a data-migration or large-update scenario, but that kind of work seems to be outside the usual dbt scope.

Another plus to your proposal is that we can do this today with an explicit `alter table` statement in a post-hook. Longer term, I would suggest putting the cluster-key setup inside the materialization itself.

I'll try and give the ad-hoc approach a shot soon -- also, apologies for the delays in responding, things are pretty busy here (as I assume they are over there)!
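As a sketch of the "do it today" route, assuming a hypothetical model with `created_at` as the intended cluster key (the model, upstream relation, and column names are invented), a post-hook along these lines should work:

```sql
-- models/events.sql (illustrative)
{{
    config(
        materialized='table',
        post_hook=[
            "alter table {{ this }} cluster by (created_at)"
        ]
    )
}}

select id, event_type, created_at
from {{ ref('stg_events') }}  -- hypothetical upstream model
order by created_at           -- insert the data roughly pre-clustered
```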
Hey @gkushida - all of this sounds really good to me. If you're indeed able to try this, I'd love to hear about how it goes! I think this is super doable in dbt, and I'd agree with you that generating the clustering statements inside the materialization is the right place for this.

If this is something you'd be interested in implementing eventually, I'd be super happy to help out however I can. Otherwise, let me know about your findings from the ad-hoc approach and I can prioritize this from there. Thanks again for moving this along!
In lieu of having a Snowflake-specific clustering config, I think we're going to update our best practices to include an `order by` on the intended cluster keys in model SQL.
Sounds good! Might want to add the bit about automatic clustering as well.
I think I might wanna have a stab at the solution you proposed @drewbanin to wrap the query with an `order by` on the cluster keys.
Cool @bastienboutonnet - I think that sounds great! Definitely feel free to tag me in a PR (or follow up here) if there's anything I can help out with.
This was fixed in #1689.
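For reference, a hedged sketch of how the shipped `cluster_by` config is used in a model, replacing the post-hook workaround sketched earlier (the model and column names are invented; check the dbt docs for your version for the exact options, e.g. whether an automatic clustering flag is also available):

```sql
-- models/events.sql (illustrative)
{{
    config(
        materialized='table',
        cluster_by=['created_at']
    )
}}

select id, event_type, created_at
from {{ ref('stg_events') }}  -- hypothetical upstream model
```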