Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: dbt compare subcommand #1135

Closed
mikekaminsky opened this issue Nov 15, 2018 · 14 comments
Closed

Feature: dbt compare subcommand #1135

mikekaminsky opened this issue Nov 15, 2018 · 14 comments
Labels
enhancement New feature or request stale Issues that have gone stale

Comments

@mikekaminsky
Copy link
Contributor

mikekaminsky commented Nov 15, 2018

Add a compare subcommand for identifying stale/unmanaged relations.

Feature description

Similar to #615, but maybe less objectionable!

dbt compare should inspect the code repository and determine which views and tables are described therein. DBT should compare those relations to the system tables, and produce some output identifying any discrepancies. Something like

$ dbt compare

Comparing local codebase to data warehouse.
...
Identified 2 relations in the warehouse that are not managed by DBT.
VIEW: schema_name.view_name
TABLE: schema_name.table_name

Who will this benefit?

This feature is useful for warehouse admins who want to identify and remove stale relations.

I frequently forget to clean up after myself if I've done some refactoring (e.g., renaming models or deleting old models that have been superseded by others). Since DBT has all of the requisite information to tell me about this, it seems like the right tool for the job.

Risks

DBT probably shouldn't be in the business of dropping relations, and there's some risk that this feature could be mis-used or the output misinterpreted (in particular, for people who have complex schema-renaming rules for different targets). @drewbanin can add some more detailed thoughts about where this might go wrong.

@mikekaminsky mikekaminsky changed the title Feature: dbt compare subcommand Feature: dbt compare subcommand Nov 15, 2018
@drewbanin
Copy link
Contributor

@mikekaminsky I think I'm into this idea. It would be convenient if the command would output a list of statements that could be run to drop the tables too. I think this would work really well as an operation. Maybe those should have access to the manifest so they can inspect the models in the project, and the state of the database. Then if users want to create an operation to actually drop the tables... that would be none of my business :)

For anyone else coming to this thread: there are heaps of ways that dbt could get this "wrong". If a user forgets to run dbt deps, or if they have configs to change the materialization setting of a model from table to ephemeral in dev, or if they aren't up to date with master, dbt could conceivably advise that current and in-use tables should be dropped. I want to enable a workflow for users, but I don't want dbt to make it easy to do something really bad by accident.

Thanks for the request @mikekaminsky!

@kmamykin
Copy link

kmamykin commented Jan 4, 2019

My 2 cents: would be useful to have a config setting per model/model folder how dbt should behave in regards to synching the list of models and the list of db objects. E.g. value safe - do not ever drop anything from target. except tables - drop views but not tables, sync - full sync dropping everything that is not in the list of current models. Something along those lines.

@elexisvenator
Copy link
Contributor

If dbt is going to provide features like this, then there needs to be something other than console output.
In the future dbt-as-api world then it should just be returning a model describing the changes, but in the meantime I thing a useful solution would be to be able to specify an output file (json?) to return results to.

@tayloramurphy
Copy link

Stumbled onto this after the tenth time of us deleting models but forgetting to drop them in the dbt-managed schema.

It would be helpful to have a command like this, maybe dbt diff, that shows what objects exist in dbt-managed schemas and spit out the commands that would need to be run to remove old relations.

@drewbanin
Copy link
Contributor

Thanks for the bump @tayloramurphy - I feel better about building this these days than I have in the past! The thing that's changed is that dbt's core constructs have coalesced around a handful of resources (seeds, models, snapshots). I actually think the obviously better version of this (one which both enumerates the relations, but also supports dropping the deleted ones) is within our reach too :)

My one hesitation is that adding a new subcommand (diff, compare, etc) is pretty heavy-weight! I feel ok about that in this particular case, but I wouldn't want to do that as a rule for future features similar to this one.

You buy all of that?

@tayloramurphy
Copy link

@drewbanin for sure I buy that! I don't have a strong opinion that this should be a separate command, I'm just not knowledgeable enough about the internal workings to know if it would make sense as a flag against run/test or via some other implementation.

The key thing for me on this one is, if we make the assumption that dbt is an infra as code tool where we define what the state of our warehouse should be, then there needs to be a way to sync the warehouse to what the code says should exist. Basically - here's what the information schema should say exists, here's what the information schema does say exists, and here's how to sync them.

@mikekaminsky
Copy link
Contributor Author

I'd be for pulling compare out of dbt-helper and into dbt-core since I don't have time to maintain dbt-helper any more if it's helpful to have it as reference:

https://github.com/mikekaminsky/dbt-helper/blob/master/core/compare.py

@jtcohen6
Copy link
Contributor

jtcohen6 commented May 13, 2020

In thinking about this for an upcoming sprint, I agree with the comments above that:

  • While we really hesitated about having dbt dropping database objects in the past, we feel better about its core constructs today, and we've spent years pressing the importance of having dbt's relational objects exist within a separate "dbt-managed" schema/database.
  • We should hesitate before adding a new dbt subcommand. dbt compare or dbt diff have some semantic overlap with other proposals (e.g. dbt run for new or modified models only #1677) to identify models that are new/changed since a previous run.

To my mind, one way we could accomplish the compare or diff operation would be as a flag to dbt ls. It can return console ouput, JSON, or (with extra specification) execute a series of drop statements. Here's what I could envision:

dbt ls --orphaned  # list database relations that do not map to current dbt models
dbt ls --orphaned --execute-drop  --dry-run  # list DDL dbt will run
dbt ls --orphaned --execute-drop  # execute drop statements

I admit that this is different behavior from current dbt ls commands, which returns a list of dbt resources, never database objects.

@beckjake
Copy link
Contributor

For what it's worth, dbt ls already requires a valid profile definition (though I don't think it requires that the profile actually be able to connect), so the break from current behavior is no big deal from a development perspective - it's not like we'd have to do anything along the lines of what deps does.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 5, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jan 5, 2022
@mlavoie-sm360
Copy link

I'm still interested, any chance this feature might get it's big break?

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jul 6, 2022
@github-actions
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.

@joostboonzajerflaes
Copy link

joostboonzajerflaes commented May 4, 2023

I have chatGPT suggest me dbt diff. According to chatGPT it was available before 0.18.0. However, I installed dbt v0.17.2 and couldn't find anything on it. Anyone know where this is coming from? Just mostly curious at this point.

Screenshot 2023-05-04 at 09 18 09
Screenshot 2023-05-04 at 09 18 25
Screenshot 2023-05-04 at 09 47 19

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request stale Issues that have gone stale
Projects
None yet
Development

No branches or pull requests

10 participants