
Generalized dbt build command #2743

Closed
jtcohen6 opened this issue Sep 9, 2020 · 13 comments · Fixed by #3490
Labels: 1.0.0 (Issues related to the 1.0.0 release of dbt), discussion, enhancement (New feature or request)

jtcohen6 (Contributor) commented Sep 9, 2020

See also: #1054, #1227, #2234, this comment

Describe the feature

Each dbt node-resource type has a task-command associated with it:

  • models = dbt run
  • tests = dbt test
  • seeds = dbt seed
  • snapshots = dbt snapshot
  • sources = dbt source snapshot-freshness

Additionally, there could be a generalized command dbt build¹ that would step through a DAG of multiple resource types and "build" them accordingly.

What would this look like? I imagine an argument syntax similar to dbt ls, i.e.

dbt build --select ... --exclude ... --resource-type ...

¹ Name subject to change, though for the ultimate command of the data build tool, it'd be hard to think of one more apropos...

Example

Let's imagine we had model_a that depends on a source (my_source.table) and a seed (my_seed), a snapshot (my_snapshot) of model_a, and then model_b which selected from my_snapshot. Of course, we also have tests on many of them. Roughly:

my_source.table --> my_seed --> model_a --> my_snapshot --> model_b

Within a single invocation, dbt build would go through motions analogous to running the following dbt commands. It would only proceed to the next numbered step if all upstream steps succeed:

1a. dbt seed my_seed
1b. dbt source snapshot-freshness --select my_source.table
2a. dbt test --models my_seed
2b. dbt test --models source:my_source.table
3. dbt run --models model_a
4. dbt test --models model_a
5. dbt snapshot --select my_snapshot
6. dbt test --models my_snapshot
7. dbt run --models model_b
8. dbt test --models model_b
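The stepwise behavior above can be sketched as a walk over the DAG in dependency order, where a failure marks every transitive child as skipped. This is an illustrative Python sketch, not dbt's implementation; the DEPS and COMMAND tables and the run_node callback are hypothetical stand-ins:

```python
# Hypothetical sketch of the ordered build walk described above.
# The DAG, command mapping, and run_node callback are illustrative
# stand-ins, not dbt's actual scheduler.

DEPS = {
    "my_seed": [],
    "source:my_source.table": [],
    "model_a": ["my_seed", "source:my_source.table"],
    "my_snapshot": ["model_a"],
    "model_b": ["my_snapshot"],
}

COMMAND = {
    "my_seed": "dbt seed",
    "source:my_source.table": "dbt source snapshot-freshness",
    "model_a": "dbt run",
    "my_snapshot": "dbt snapshot",
    "model_b": "dbt run",
}

def build(run_node):
    """Walk nodes in dependency order; skip children of failures.

    run_node(command, node) -> True on success. Testing each node
    right after building it is modeled as part of run_node here.
    """
    status = {}
    remaining = set(DEPS)
    while remaining:
        # Nodes whose dependencies have all been resolved this pass.
        ready = [n for n in remaining if all(d in status for d in DEPS[n])]
        for node in sorted(ready):
            if all(status.get(d) == "success" for d in DEPS[node]):
                ok = run_node(COMMAND[node], node)
                status[node] = "success" if ok else "error"
            else:
                status[node] = "skipped"
            remaining.remove(node)
    return status
```

With a run_node callback that fails on model_a, the seed and the source freshness check would still succeed, while my_snapshot and model_b would be marked skipped.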

Complexities

  • Some of these tasks are already DAG aware (run, test, snapshot), some are not (seed, snapshot-freshness)
  • Commands support several different flags
    • How to expose when a flag is being used, and when it isn't?
    • What about same-named flags that do subtly different things across commands? e.g. dbt run --full-refresh vs. dbt seed --full-refresh
  • Node types are just about 1:1 with task types, though dbt test almost feels like an exception. Technically, dbt test operates on test nodes, but other node types can be passed into its selection syntax, with selector expansion as the last step, so it "feels" like you're testing a model or a snapshot. (Edit: this behavior may someday change.)
  • This risks a lot of our existing intuitions that come from having resource types nicely delineated. Put differently: what if it all just falls apart?
    • What if it works so well that 90% of dbt deployments are just dbt build? Should we be wary of creating one command to rule them all?

Describe alternatives you've considered

  • Doing a more particularized version of this, e.g. dbt run+test (as outlined in linked issues)
  • Not doing this at all, and leaving the federation of one resource type = one command/invocation. Is this a good abstraction that we should fight to keep?

Who will this benefit?

  • Bigger, more complex projects that want to run subsets of different resource types. Today, that can only be accomplished through complex selection syntax leveraging tags. YAML selectors improve this somewhat, but they're not the answer.
  • Projects with snapshots that participate in the middle of the DAG
  • Deployments that want to test upstream models before running downstream models, so as to alert earlier and save compute time/$$ in the event of failure
@bashyroger

For me to verify, the goal of this is: for the part of the DAG that you select (using the model syntax, tags, the state flag, etc.), run that whole chain, starting with the required source freshness checks and seeds, then running the models and, directly after each model, its tests, if they exist?

@jtcohen6 (Contributor, Author)

That's the idea. And since snapshots can also participate in the middle of a DAG (model_a --> my_snapshot --> model_b), this command should have the ability to run, snapshot, run all in one invocation.

drewbanin (Contributor) commented Sep 17, 2020

@jtcohen6 are we thinking about making tests participate more naturally in the DAG here? I.e.:

model_a ---> test_model_a_unique_id ---> model_b

If test_model_a_unique_id failed, that would conceivably skip running model_b. I think we have at least one issue... somewhere... that talks about this, and I'm curious if it is in scope for a ticket like this one

Edit: basically just curious if #2234 is a part of this or a separate issue

jtcohen6 (Contributor, Author) commented Sep 17, 2020

@drewbanin I'm thinking that's in scope here, yes. This command would be a more generalized version of the dbt run+test command proposed in #2234.

Both test_model_a_unique_id and model_b are first-order children of model_a. We'd want this command to run tests first and, if they fail, skip all other children of model_a.
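That skip semantics can be sketched as follows: with the test wired into the DAG as a node between model_a and model_b, a failure transitively blocks everything downstream of it (illustrative Python only; the node names and dict shape are hypothetical, not dbt's manifest format):

```python
# Illustrative sketch: a test participating in the DAG blocks its
# transitive descendants when it fails.

def downstream_blocked(deps, failed):
    """Return every node transitively downstream of a failed node.

    deps maps each node to the list of nodes it depends on.
    """
    blocked = set()
    changed = True
    while changed:
        changed = False
        for node, parents in deps.items():
            if node in blocked:
                continue
            if any(p in failed or p in blocked for p in parents):
                blocked.add(node)
                changed = True
    return blocked

# The test wired in as a node between model_a and model_b:
DEPS_WITH_TEST = {
    "model_a": [],
    "test_model_a_unique_id": ["model_a"],
    "model_b": ["test_model_a_unique_id"],
}
```

A failed test_model_a_unique_id blocks model_b; a failed model_a blocks both the test and model_b.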

@bashyroger

> @jtcohen6 are we thinking about making tests participate more naturally in the DAG here? I.e.:
>
> model_a ---> test_model_a_unique_id ---> model_b
>
> If test_model_a_unique_id failed, that would conceivably skip running model_b. I think we have at least one issue... somewhere... that talks about this, and I'm curious if it is in scope for a ticket like this one
>
> Edit: basically just curious if #2234 is a part of this or a separate issue

This is exactly the reason I was asking for clarification on the WHAT of this ticket.
Basically, this would enable processing part of a complete DAG 'horizontally' (in logical sequence per DAG thread) vs. 'vertically' (per 'layer').
And it is good to have that freedom in loading patterns, IMO!

fabrice-etanchaud commented Apr 6, 2021

Hello, just to mention this post in 2234: #2234 (comment).
IMHO this subject should distinguish between precondition tests and postcondition tests: which tests are mandatory for the model to execute correctly (a dumb example: upstream_model.column >= 0, because the current model computes its square root), and which tests are awaited on the current model's new data (upstream_model.column ~= current_model.column * current_model.column) and can then be used as preconditions for downstream models.

drewbanin (Contributor) commented May 14, 2021

@jtcohen6 I have been stuck on this idea that I just cannot shake! Wanted to mention it here.

IF:

  • A project has sources configured AND
  • dbt is configured to run dbt source snapshot-freshness AND
  • dbt has a way to compare 1) freshness information and 2) model logic across invocations AND
  • dbt has knowledge of which materializations map to views vs. tables

THEN:

  • a generalized dbt build command would be well-positioned to skip running models where a rebuild would result in exactly the same database object that already exists in the database

I think there's some more formality / rigor to apply here, and I'm actually not 100% sure that this requires the existence of a dbt build command, but wanted to throw it out there for consideration.

To get more concrete, here are some of the examples I'm considering:
A view model only needs to be built when:

  • its logic has changed
  • it does not already exist in the database

A table/incremental model only needs to be built when:

  • Its logic has changed OR
  • it does not already exist in the database OR
  • An upstream model's logic has changed OR
  • An upstream source's data has changed
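The conditions above could be collapsed into a simple predicate, sketched here in Python purely for illustration (the field names are hypothetical, and dbt does not implement this check today):

```python
def needs_rebuild(node):
    """Decide whether a model must be (re)built, per the rules above.

    `node` is a hypothetical dict describing one model; the keys are
    illustrative, not dbt's manifest schema.
    """
    if not node["exists_in_database"]:
        return True
    if node["logic_changed"]:
        return True
    if node["materialized"] == "view":
        # A view re-reads upstream data at query time, so upstream
        # changes alone never require rebuilding it.
        return False
    # Tables and incremental models capture upstream data at build
    # time, so upstream changes force a rebuild.
    return node["upstream_logic_changed"] or node["upstream_data_changed"]
```

Under these rules, an unchanged view is never rebuilt even when its upstream data moves, while an unchanged table is rebuilt as soon as anything upstream changes.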

I think that we can get at a lot of this stuff with the state: or config.materialized selectors, so really my thinking boils down to:

  • Something like a new freshness/source selector?
  • Maybe some sort of packaging that makes these types of selectors more concise?
  • A way to specify this logic as a part of the generalized dbt build command?

@jtcohen6 (Contributor, Author)

@drewbanin That's a really neat thought. We've talked about some kind of freshness:new selector in #2465 (comment), but in order to persist that information across tasks, dbt would need to be handling and comparing multiple source.json files. A pretty slick way around that would be to freshness-check sources in the same task/invocation as running their downstream resources. The trick is, I'm not sure if that could be handled via node selection, which happens before execution begins. Perhaps models downstream of an unchanged source could just be... SKIPPED?

Node selection also doesn't have a clear conception of "already exists in the database," since that tends to live in the materialization logic, but the --defer flag does take this into account. We may want to unbundle these complexities into explicit selection methods (state:, config:, freshness:, existing:) and then re-bundle them for the purposes of an opinionated workflow: dbt build -s what_needs_building.
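The "unbundle, then re-bundle" idea might look like composing small selection predicates into one named selector. A hypothetical Python sketch (the method names mirror the ones proposed above; none of this is dbt's actual selector implementation, and the node attributes are invented for illustration):

```python
# Hypothetical composition of selection methods into a named selector.
# Each method is a predicate over an illustrative node dict.
METHODS = {
    "state:modified": lambda n: n["logic_changed"],
    "existing:false": lambda n: not n["exists_in_database"],
    "freshness:new": lambda n: n.get("source_has_new_data", False),
}

def select(nodes, method_names):
    """Return the union of node names matched by any listed method."""
    return {name for name, node in nodes.items()
            if any(METHODS[m](node) for m in method_names)}

# "dbt build -s what_needs_building" could then expand to:
WHAT_NEEDS_BUILDING = ["state:modified", "existing:false", "freshness:new"]
```

The opinionated workflow is then just `select(nodes, WHAT_NEEDS_BUILDING)`: each complexity stays an explicit, inspectable method, and the bundle is a named union of them.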

StephanGoergen commented May 14, 2021

@drewbanin, teams that manage grants on objects via post-hooks may also want to re-build a model when a config (as opposed to just sql logic) changes. Our team has recently moved post-hook definitions from dbt_project.yml into the model files, so that will hopefully mark the model as state:modified after a permissions change. But that won't apply to any (default) grants governed by dbt_project.yml, so there would be a risk of drift between the database grants and the dbt config.

EDIT: it looks like the state method accounts for configs in general, but I'm not 100% sure that post-hooks are among them.

@boxysean (Contributor)

This Slack thread from Nadya Hrebinka asked "is there a way to [...] combine commands into one to save a couple of minutes?". My mind went to this GitHub issue.

Could Nadya and others experiencing long dbt project parse times use the dbt build command to parse their project once and eliminate duplicate parse time from their successively-executed dbt commands? (e.g., dbt seed, dbt run, dbt test)

@jtcohen6 (Contributor, Author)

Yes! By operating over multiple resource types in a single invocation, dbt build would only need to parse the project once. Another way to speed this up is by using partial parsing, which persists parse state between runs—and it's even better in v0.20.

@kosti-hokkanen-supermetrics

I just discovered that this command would mostly answer the problems we have had with minor issues in base models blocking the whole project from running. I would imagine that in a perfect situation I would have:

  • A build command, where I can define the operations I want to run and in which order, like run+test
  • Separate node selection syntaxes for each command, like --exclude-run config.materialized:view --exclude-test models/staging, since, for example, we would want to run everything except views but test everything (including views)
  • The command would work the DAG node by node, executing actions in the defined order: first checking whether the model matches the node selection for running, running it if it matches, then checking whether it matches the node selection for testing, running tests if it matches, and so on

This way an error in a node, either in running or testing, would only block the models that are downstream from it.

jtcohen6 (Contributor, Author) commented Jul 1, 2021

@kosti-hokkanen-supermetrics Cool to hear what you're hoping to do with it! The first cut of dbt build won't allow much configuration, and its behavior will be defined by some opinionated rules, including:

  • run a model before testing it
  • failures in those tests block downstream models from running

That said, I believe all the right constructs are there. I bet you can combine several dbt build invocations, paired with test severity and thoughtful node selection, to accomplish the thing you're after.
