
Manage data warehouse objects (other than tables and views) #3391

Closed
boxysean opened this issue May 25, 2021 · 8 comments

Comments

@boxysean
Contributor

Describe the feature

Add support for managing non-table, non-view data warehouse objects.

Objects that I am thinking of (non-exhaustive):

  • BigQuery Datasets
  • Snowflake Warehouses, File Formats, Streams
  • Databricks SQL endpoints

Related:

Describe alternatives you've considered

This Slack community thread discusses ways community members are currently managing their data warehouse objects (primarily Terraform for Snowflake, with some interest in Permifrost).

This post from Jeremiah Hansen suggests using schemachange for Snowflake database migrations.

Additional context

Terraform providers:

Who will this benefit?

Folks managing data warehouses.

Are you interested in contributing this feature?

Yes :-)

@boxysean added the enhancement (New feature or request) and triage labels on May 25, 2021
@codigo-ergo-sum

There is definitely a general interest in being able to further simplify at least some level of object management with dbt. I know it would be useful for us in our projects.

One issue I see with schemachange in particular is that it's intentionally not idempotent - it creates a database change history table and uses that to manage the deployment of specific objects and changes to the database. This makes sense for managing a transaction-processing database like Oracle or SQL Server or Postgres, where the state of the database is gigantic - it's all the schema plus however many rows across all the tables - and it can't just be dropped and recreated from an external reference source, because the database is, itself, the most compact reference source for all of its state.

However, data warehouses are a totally different beast. For the most part (with some notable exceptions if you're using things like dbt snapshots) they are not systems of record that contain the representation of all the state of your business. They're pulling in data from external sources and doing a series of complicated transformations to produce useful output. And the power of dbt is that you can get rid of all that nasty state tracking by literally dropping and recreating all of your models every time in the database - no state tracking of what dimension or fact table or transformation table got updated when or not. All the state is coming in from other systems. Now, for performance reasons, there are compromises like incremental models. Also, in practice, a lot of places with large amounts of data aren't necessarily dropping and recreating everything every day.

Yes, there are friction points with the other objects like users, warehouses, file formats, etc. above, and that's the reason for this ticket. It's just that, in my humble opinion (and I am definitely interested in a good debate on this), non-idempotent solutions set us back and also make things even more confusing when interacting with dbt, which has such a strong pro-idempotency philosophy.

@jtcohen6
Contributor

jtcohen6 commented May 29, 2021

@boxysean Such a good topic! Thanks for opening.

Let's operate from the premise that every object we're talking about here can be managed via SQL. I don't think that's true today of Databricks SQL endpoints—these require a separate API request—but I hope that will be true, someday. In general, we like to take the view that if it's in a database/warehouse/query engine, and it's important, it ought to be SQL-addressable. The last year of BigQuery development has certainly vindicated that point of view.

So: If an object can be created, altered, and dropped by SQL, then dbt can manage it. The question is: What's the best dbt construct to use?

To paraphrase Count Leo, relational objects—tables, views, even materialized views—are all quite alike; every non-relational data warehouse object is strange in its own way. This is a distinction I find productive, because I think about these categories quite differently:

  1. Relational objects that encode logic
  2. Objects that encode no real logic on their own, but which serve as shells, conduits, or pointers for other objects which do

1. Logic-bearing objects

The first category includes relational objects (tables, views, materialized views), but I'd argue it might also include UDFs. (More on that in a second.) The best dbt construct for these is the SQL (or CSV) file: models, snapshots, seeds. I very much agree with @codigo-ergo-sum's point that these are the things dbt expects to operate on, and to do so with idempotence, and that this assumption powers so much of dbt's intuition today.

UDFs

Let's talk for a moment about UDFs, since there are a lot of upvotes (but not a lot of comments) over in #136. I've seen two compelling treatments of UDFs in the wild:

  • Macro-based approach: A UDF's full definition is wrapped in a macro, perhaps wrapped again by a more generic create_function macro. The latter macro can be deployed as a pre-hook on the model that needs it, or an on-run-start hook for the run could create all UDFs at once, as recommended by one of Claire's fantastic discourse posts. Since UDFs are custom database functions, and macros are the most functional-feeling thing in dbt projects, the mapping between them feels right.
  • Resource-based approach: A UDF can be a model, with configs for (e.g.) language and arguments, and a custom materialization that wraps its logic in the appropriate create function DDL. It's an odd sort of model, to be sure, not to be queried directly, but certainly to be used (referenced!) in other downstream models. By defining the UDF as a model, it can participate explicitly in the DAG, and in features like state:modified; macros participate in some senses, today, but much more implicitly. If this is the direction we think everyone ought to go in, it probably makes sense to create a new dbt resource type, function, since the model abstraction feels a little thin.

Some data warehouses, namely BigQuery, give the option of defining UDFs as temporary (defined in a query, used in that query, gone thereafter) or persistent (database object). That distinction roughly maps to the two options above. We're not the only ones who have some uncertainty here.
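For concreteness, a minimal sketch of the macro-based approach might look like the following (the function name, schema, and Snowflake SQL-UDF syntax here are illustrative assumptions, not a prescribed implementation):

-- macros/create_udfs.sql
{% macro create_udfs() %}

  create or replace function {{ target.schema }}.cents_to_dollars(amount_in_cents number)
  returns number(16, 2)
  as 'amount_in_cents / 100'

{% endmacro %}

Paired with an on-run-start hook in dbt_project.yml, so the function exists before any model that needs it:

on-run-start:
  - "{{ create_udfs() }}"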

2. Shells, conduits, pointers

In this category I might include things like sources, exposures (?), schemas, Snowflake databases, compute resources (Snowflake warehouses or Databricks endpoints). These are the constructs which "standard" dbt operations expect to already be in place. In the ideal case, dbt gets to pick up where some sort of Terraform-y process has left off. That setup process (e.g. Permifrost) promises to take a totally new, spotless, empty warehouse/project/account and put in place just the necessary scaffolding (e.g. our recommended Snowflake setup), just enough so that dbt can start building.

As far as dbt is concerned, the best way to define these objects is in YAML, not SQL: there's no business logic, just attributes. dbt gains a lot by knowing about these objects, and referencing them, but it doesn't expect to actually manage them. Yet here's the kicker: if dbt knows about them, it's 100% possible to write dbt macros that will manage them, as hooks and operations.

This is how I think about the dbt-external-tables package, which takes the opinionated view that external tables should be sources, not models. Sources don't define any transformation logic themselves, they simply serve as pointers to where raw data lives. In the case of external tables, those pointers have to be fairly detailed—there's some operational logic around accessing data in external storage—but still no business logic. The package takes one step beyond "standard" dbt behavior, by offering macros to manage those objects, with an explicit goal in mind: providing just enough scaffolding so that dbt can treat external data sources like any other sources, and get on with its idempotent business as usual.
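To make that concrete, a source definition in the dbt-external-tables style looks roughly like this (the source, stage, and column names are made up; the exact properties under external vary by adapter):

version: 2

sources:
  - name: raw_events
    tables:
      - name: page_views
        external:
          location: "@raw_events_stage/page_views/"
          file_format: "my_db.my_schema.tsv_handler"
        columns:
          - name: event_id
            data_type: varchar

The package's stage_external_sources operation then reads these properties and templates out the corresponding external table DDL.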

I'd want to take a similar tack for other object types that are more input than output for dbt operations. It seems Snowflake has the majority of them: databases, warehouses, stages, file formats, streams, changes. It's easy enough to write the exact SQL needed to create any of them, wrap it in a macro, and call it from a hook or operation. But that's a lot of boilerplate SQL to write and rewrite. It would feel much better to define macros, such as create_file_format or create_warehouse, that take only the essential values as arguments, and return the SQL to create them if they do not exist.
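As a hedged sketch of what one of those macros could look like, assuming Snowflake's CREATE FILE FORMAT syntax and glossing over per-option quoting rules:

{% macro create_file_format(name, options) %}

  create file format if not exists {{ target.database }}.{{ target.schema }}.{{ name }}
  {% for key, value in options.items() %}
    {{ key }} = '{{ value }}'
  {% endfor %}

{% endmacro %}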

So: where to define (or find) those essential values? dbt already knows about some of these (databases, warehouses) via configs on first-order objects (database, snowflake_warehouse). Even if dbt run takes the opinionated view that it ought to be creating schemas, and ought not to be creating databases, it would certainly be possible to write a macro/hook that loops over all the models in a project, extracts the list of used databases, and creates them if they do not exist.
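Sketching that out (untested, and it assumes a role with create database privileges), such a macro could lean on dbt's graph context variable and be invoked with dbt run-operation:

{% macro create_databases() %}

  {% if execute %}
    {# collect the distinct databases used by models in the project #}
    {% set databases = [] %}
    {% for node in graph.nodes.values() if node.resource_type == 'model' and node.database not in databases %}
      {% do databases.append(node.database) %}
    {% endfor %}

    {% for database in databases %}
      {% do run_query('create database if not exists ' ~ database) %}
    {% endfor %}
  {% endif %}

{% endmacro %}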

Even then, the snowflake_warehouse config doesn't define warehouse properties such as size or auto-suspend. Other objects, such as file formats or stages, may be referenced implicitly (e.g. by source external properties) but can't be pulled out of the project manifest with such ease. What's the right way to define this supplemental static data? I think the best answer is YAML, but this isn't possible today (#1318):

version: 2

asset:
  - name: tsv_handler
    asset_type: file_format
    options:
      type: csv
      compression: auto
      record_delimiter: "\t"
  - name: json_handler
    asset_type: file_format
    options:
      ...

An answer that is possible today is to define a macro that exists only to wrap and return static data, pythonic arrays and dictionaries. For instance:

{% macro get_my_file_formats() %}

  {% do return([
    {
      "name": "tsv_handler",
      "options": {
        "type": "csv",
        "compression": "auto",
        "record_delimiter": "\t"
    },
    {
      "name": "json_handler",
      "options": ...
    },
    ...
  ]) %}

{% endmacro %}

For what it's worth, this is how the dbt-external-tables package started out: as a series of macros, some which defined static data, others which pulled in that static data and templated out DDL statements, and a final macro (stage_external_sources) which orchestrated the whole affair, to be called as a hook or operation from the project. Once it showed initial success, I formalized it with an external yaml property and wrapped those macros in a package. It started as a hack, a workaround; today, it's a dbt-onic solution.
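To sketch the shape of that orchestration (hypothetical and untested): a small macro can combine get_my_file_formats() above with a create_file_format macro like the one sketched earlier, and be called as an operation or hook:

{% macro create_my_file_formats() %}

  {% for file_format in get_my_file_formats() %}
    {% do run_query(create_file_format(file_format.name, file_format.options)) %}
  {% endfor %}

{% endmacro %}

Invoked as, for instance, dbt run-operation create_my_file_formats.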

I think there's plenty we can do to improve that workflow, and it feels to me like the right thread to pull on. We should be opinionated about what dbt should do by default, which is managing logic-bearing objects, while also cognizant of all the things dbt can do to set up its own scaffolding, made up of all the other, stranger objects in the database.

@boxysean
Contributor Author

boxysean commented Jun 2, 2021

Thank you @jtcohen6 and @codigo-ergo-sum for the thoughtful responses.

Yes, from a 10,000-foot view, it's easy for me to group the non-table, non-view resources into one big bucket and suggest they all be managed with the same finesse as tables and external tables. Indeed, for each resource we may need to form a nuanced opinion about how to manage its lifecycle, and whether it fits the idempotent quality that dbt exhibits today. I recognize this and appreciate the patience in your explanations.

Zooming in, I am particularly interested in the capability to manage UDFs with dbt. I think dbt users and the community would benefit from a package similar to dbt-external-tables for storing UDFs in git and deploying them with dbt operations, for two main reasons and two smaller ones:

Main reason 1: Many data management professionals have successfully spent their careers writing procedural SQL (e.g., PL/SQL). I think it's safe to say there is an industry-wide re-platforming trend to cloud data warehouses, and many career data professionals expect that they can continue writing procedural SQL on their new platforms. (And they are able to! Albeit primarily with JavaScript, at the time of this comment.)

Main reason 2: People who manage data re-platforming projects AND who are adding dbt to their data platform would appreciate managing UDFs (and similar) with dbt. They are interested in making their team effective, and teams of career data management professionals want to use procedural SQL. (I am recounting my experience talking with such managers here.)

Small reason 3: UDFs can be complementary to the opinionated way that dbt transforms data. For example, BigQuery and Snowflake Javascript UDFs allow programmers to use Javascript to work with JSON naturally.

Small reason 4: UDFs can be shared at the database-level so that dbt programmers and non-dbt programmers can share business logic.
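As an illustration of small reason 3 (the function and field names are made up; this assumes Snowflake's JavaScript UDF syntax, where argument names are referenced in uppercase inside the function body):

create or replace function parse_event_tags(payload variant)
returns array
language javascript
as
$$
  // PAYLOAD is the uppercased name of the "payload" argument
  if (PAYLOAD && Array.isArray(PAYLOAD.tags)) {
    return PAYLOAD.tags.filter(function (tag) { return tag !== null; });
  }
  return [];
$$;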

I must admit both of my main reasons don't quite feel "good" as I personally think data management professionals should use dbt for most/all data transformation for general-purpose analysis use-cases. However, part of me wonders if adding UDF management could help the adoption of dbt in hesitant organizations and/or hesitant individuals; essentially dbt penetration increases and more people appreciate how dbt's opinionated data modeling process could benefit them.

Thanks for laying out the story of dbt-external-tables @jtcohen6 -- I think that's one worth retelling. It does indeed show that if there is demand for UDF management with dbt, the dbt-external-tables package lifecycle could be followed to create a first-class experience for UDFs within dbt.

@codigo-ergo-sum

This is probably a bit tangential to the original point of the topic, but @boxysean - have you (or anyone else you're aware of) tested how performant UDFs are at scale? Say you're calling a UDF inline in a SQL query that does some complicated string processing or a numerical calculation, the query runs against a billion-row underlying table, and the UDF has to be called once for each incoming row - a billion times. Is Snowflake then having to perform some sort of internal transition a billion times between the execution context of the SQL statement and the execution context of the UDF? In other words, is it similar to embedding a scalar subquery in the projection/SELECT portion of a SQL statement, where the scalar subquery needs to be executed once per row, which at scale is not performant? That is essentially a move back from set-based processing to row-based processing, whereas in my experience performant SQL and database usage in general need the opposite: moving from row-based logic to set-based logic.

I'm asking because the vision you describe above is one where UDFs allow elegant encapsulation of logic and are therefore, in some ways, superior to embedding that logic directly in SQL statements - which in turn makes the case that dbt should have better native handling of UDFs as first-class dbt objects. But if UDFs aren't performant at scale, then they become less attractive in general. Don't get me wrong - it doesn't mean we shouldn't continue having this discussion; it just means that UDFs might not be so useful in a wide range of scenarios.

This is probably something that could be verified by testing, I admit. I'm a bit swamped right now so don't have time to look at it but it would be interesting to see what Snowflake or BQ do if you call a UDF, say, a billion times in one SQL query.
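For what it's worth, one rough (untested) way to run that comparison on Snowflake would be to generate rows and call a scalar UDF once per row versus inlining the equivalent expression, e.g. using the hypothetical cents_to_dollars function sketched earlier in the thread:

-- once per row through the UDF
select count(cents_to_dollars(seq4()))
from table(generator(rowcount => 1000000000));

-- the same logic inlined in the SQL statement
select count(seq4() / 100)
from table(generator(rowcount => 1000000000));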

@boxysean
Contributor Author

boxysean commented Jun 9, 2021

@codigo-ergo-sum I have also not tested the performance of UDFs at scale, and I agree that UDFs could be significantly inferior to plain ol' SQL in many/most cases.

Thank you for highlighting some of the tension in the "managing UDFs" idea as I've written it out. Reflecting back, I also don't find my previous comment a compelling argument for prioritizing this issue. The audience who would be happy to see this feature isn't you and me; it's folks who have not used dbt.

Perhaps there's a better class of data warehouse objects to use as an example in this story? A class of objects that you, I, and folks who have not used dbt could all be excited about? Snowflake Streams, perhaps?

@lyschoening

lyschoening commented Feb 21, 2022

I feel there is an excellent case to be made for including management of row access policies (Snowflake, in our case) and associating them with models.

dbt already creates schemas and it would be useful to also create databases (and roles). There is a need for some metadata management to handle deltas with these objects, so that unused objects are removed, and it would be good to know whether dbt will manage schemas and databases or whether that should be left to a third party tool like Terraform.

@boxysean
Contributor Author

boxysean commented Feb 23, 2022

@lyschoening and I have been discussing how to manage complex Snowflake Users and Roles. Scenario: a security-conscious organization wants to tightly control Snowflake User access to data on an individual basis. Because of Snowflake's model of specifying a Role to execute a query under (docs), a USE ROLE command must be executed prior to a dbt command to ensure the proper permissions for the underlying SQL queries, especially CREATE statements.

However, a limitation of the Snowflake OAuth integration is that an authorization occurs for a single role, not allowing for switching roles in a database session (e.g., in dbt Cloud, docs).

One solution could be to have a 1:1 user-to-role mapping (e.g., sean_role maps to sean_user), so that switching roles is not necessary and permissions are managed at the role level. sean_user would always use sean_role to execute the Snowflake queries generated by dbt, at the permission level you'd expect.
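A hedged sketch of the Snowflake DDL behind that mapping (the names and grants are hypothetical and would vary by project):

create role if not exists sean_role;

create user if not exists sean_user
  default_role = sean_role
  default_warehouse = transforming;

grant role sean_role to user sean_user;

-- give the role just what the dbt project needs in this database
grant usage, create schema on database analytics to role sean_role;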

Such an organization could benefit from managing these roles within dbt itself! Alternatives to this are Terraform management, or Permifrost management.

@jtcohen6
Contributor

jtcohen6 commented Feb 23, 2022

@boxysean This is really helpful! Not the first time I've heard about the one-role-per-user pattern, and I'd like to develop a clear set of recommendations when using this pattern with dbt + Snowflake.

a USE ROLE command must be executed prior to a dbt command to ensure proper permissions to execute the underlying SQL queries, especially for CREATE.

This makes sense to me. It's very much like what we do with the snowflake_warehouse config, which runs a use warehouse command prior to the create (or insert, merge, etc) statement when materializing models. It allows object grants to still operate at the "group" level (i.e. secondary roles).
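For reference, that existing behavior is driven by a model-level config; a minimal example (the model, ref, and warehouse names are made up):

-- models/marts/fct_orders.sql
{{ config(
    materialized = 'table',
    snowflake_warehouse = 'transforming_xl'
) }}

select * from {{ ref('stg_orders') }}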

However, the OAuth limitation is a pretty serious one. It's worth looking more into external OAuth, which doesn't have this limitation. External OAuth has come up before, in other contexts, as a preferable alternative to the current OAuth implementation built into Snowflake / dbt-snowflake.

Such an organization could benefit from managing these roles within dbt itself!

Agreed. The naive version of this in dbt might look like implementing reusable "role groups," as in, reusable vars or macros that just store a list of user-roles. When each model runs, it grants access to the appropriate group of user-roles.

The better version of this might look like the orchestration tool (dbt Cloud) providing those "groups," as env vars that it sets using information about users it knows about, and the permissions/license type each one has, in a given account/project. For instance: env_var('DBT_DEVELOPERS') returns a list of users with the Developer license type, and so on for DBT_READ_ONLY, DBT_ADMINS, etc.
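A rough sketch of what that naive version might look like (the macro name, env var format, and grant scope are all assumptions; shown here at the schema level as a run-operation, though the same loop could run as a per-model post-hook):

{% macro grant_group_access(group_env_var, schema=target.schema) %}

  {# expects a comma-separated list of role names, e.g. DBT_DEVELOPERS="dev_alice,dev_bob" #}
  {% for role in env_var(group_env_var).split(',') %}
    {% do run_query('grant select on all tables in schema ' ~ schema ~ ' to role ' ~ (role | trim)) %}
  {% endfor %}

{% endmacro %}

Invoked as, for instance, dbt run-operation grant_group_access --args '{group_env_var: DBT_DEVELOPERS}'.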

I'm sure there are many limitations to this approach—it by no means represents a full permissioning system for the database—but something like that might get us 80% of what's needed, without spinning up a separate Terraform/Permifrost deployment. (@saimaddali I'd be curious to hear what you think about this, since we were discussing it the other week)

This issue was moved to a discussion.

You can continue the conversation there.
