
Generalized dbt build command #2743

Closed
jtcohen6 opened this issue Sep 9, 2020 · 13 comments · Fixed by #3490
Labels: 1.0.0 (Issues related to the 1.0.0 release of dbt), discussion, enhancement (New feature or request)

jtcohen6 (Contributor) commented Sep 9, 2020

See also: #1054, #1227, #2234, this comment

Describe the feature

Each dbt node-resource type has a task-command associated with it:

  • models = dbt run
  • tests = dbt test
  • seeds = dbt seed
  • snapshots = dbt snapshot
  • sources = dbt source snapshot-freshness

Additionally, there could be a generalized command dbt build¹ that would step through a DAG of multiple resource types and "build" them accordingly.

What would this look like? I imagine an argument syntax similar to dbt ls, i.e.

dbt build --select ... --exclude ... --resource-type ...

¹ Name subject to change, though for the ultimate command of the data build tool, it'd be hard to think of one more apropos...

Example

Let's imagine we had model_a that depends on a source (my_source.table) and a seed (my_seed), a snapshot (my_snapshot) of model_a, and then model_b which selected from my_snapshot. Of course, we also have tests on many of them. Roughly:

my_source.table --> my_seed --> model_a --> my_snapshot --> model_b

Within a single invocation, dbt build would go through motions analogous to running the following dbt commands. It would only proceed to the next numbered step if all upstream steps succeed:

1a. dbt seed my_seed
1b. dbt source snapshot-freshness --select my_source.table
2a. dbt test --models my_seed
2b. dbt test --models source:my_source.table
3. dbt run --models model_a
4. dbt test --models model_a
5. dbt snapshot --select my_snapshot
6. dbt test --models my_snapshot
7. dbt run --models model_b
8. dbt test --models model_b
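The stepwise behavior above can be sketched as a walk over the DAG in dependency order, where a failure marks every transitive child as skipped. This is an illustrative Python sketch, not dbt's implementation; the DEPS and COMMAND tables and the run_node callback are hypothetical stand-ins:

```python
# Hypothetical sketch of the ordered build walk described above.
# The DAG, command mapping, and run_node callback are illustrative
# stand-ins, not dbt's actual scheduler.

DEPS = {
    "my_seed": [],
    "source:my_source.table": [],
    "model_a": ["my_seed", "source:my_source.table"],
    "my_snapshot": ["model_a"],
    "model_b": ["my_snapshot"],
}

COMMAND = {
    "my_seed": "dbt seed",
    "source:my_source.table": "dbt source snapshot-freshness",
    "model_a": "dbt run",
    "my_snapshot": "dbt snapshot",
    "model_b": "dbt run",
}

def build(run_node):
    """Walk nodes in dependency order; skip children of failures.

    run_node(command, node) -> True on success. Testing each node
    right after building it is modeled as part of run_node here.
    """
    status = {}
    remaining = set(DEPS)
    while remaining:
        # Nodes whose dependencies have all been resolved this pass.
        ready = [n for n in remaining if all(d in status for d in DEPS[n])]
        for node in sorted(ready):
            if all(status.get(d) == "success" for d in DEPS[node]):
                ok = run_node(COMMAND[node], node)
                status[node] = "success" if ok else "error"
            else:
                status[node] = "skipped"
            remaining.remove(node)
    return status
```

With a run_node callback that fails on model_a, the seed and the source freshness check would still succeed, while my_snapshot and model_b would be marked skipped.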

Complexities

  • Some of these tasks are already DAG aware (run, test, snapshot), some are not (seed, snapshot-freshness)
  • Commands support several different flags
    • How to expose when a flag is being used, and when it isn't?
    • What about same-named flags that do subtly different things across commands? e.g. dbt run --full-refresh vs. dbt seed --full-refresh
  • Node types are just about 1:1 with task types, though dbt test almost feels like an exception. Technically, dbt test operates on test nodes, but other node types can be passed into its selection syntax, with selector expansion as the last step, so it "feels" like you're testing a model or a snapshot. (Edit: this behavior may someday change.)
  • This risks a lot of our existing intuitions that come from having resource types nicely delineated. Put differently: what if it all just falls apart?
    • What if it works so well that 90% of dbt deployments are just dbt build? Should we be wary of creating one command to rule them all?

Describe alternatives you've considered

  • Doing a more particularized version of this, e.g. dbt run+test (as outlined in linked issues)
  • Not doing this at all, and leaving the federation of one resource type = one command/invocation. Is this a good abstraction that we should fight to keep?

Who will this benefit?

  • Bigger, more complex projects that want to run subsets of different resource types. Today, that can only be accomplished through complex selection syntax leveraging tags. YAML selectors improve this somewhat, but they're not the answer.
  • Projects with snapshots that participate in the middle of the DAG
  • Deployments that want to test upstream models before running downstream models, so as to alert earlier and save compute time/$$ in the event of failure
@bashyroger

For me to verify, the goal of this is: for the part of the DAG that you select (using the model syntax, tags, the state flag, etc.), run that whole chain, starting with the required source freshness checks and seeds, then running the models and, directly after each model, its tests, if they exist?

@jtcohen6 (Contributor, Author)

That's the idea. And since snapshots can also participate in the middle of a DAG (model_a --> my_snapshot --> model_b), this command should have the ability to run, snapshot, run all in one invocation.

drewbanin (Contributor) commented Sep 17, 2020

@jtcohen6 are we thinking about making tests participate more naturally in the DAG here? I.e.:

model_a ---> test_model_a_unique_id ---> model_b

If test_model_a_unique_id failed, that would conceivably skip running model_b. I think we have at least one issue... somewhere... that talks about this, and I'm curious if it is in scope for a ticket like this one

Edit: basically just curious if #2234 is a part of this or a separate issue

jtcohen6 (Contributor, Author) commented Sep 17, 2020

@drewbanin I'm thinking that's in scope here, yes. This command would be a more generalized version of the dbt run+test command proposed in #2234.

Both test_model_a_unique_id and model_b are first-order children of model_a. We'd want this command to run tests first and, if they fail, skip all other children of model_a.
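That skip semantics can be sketched as follows: with the test wired into the DAG as a node between model_a and model_b, a failure transitively blocks everything downstream of it (illustrative Python only; the node names and dict shape are hypothetical, not dbt's manifest format):

```python
# Illustrative sketch: a test participating in the DAG blocks its
# transitive descendants when it fails.

def downstream_blocked(deps, failed):
    """Return every node transitively downstream of a failed node.

    deps maps each node to the list of nodes it depends on.
    """
    blocked = set()
    changed = True
    while changed:
        changed = False
        for node, parents in deps.items():
            if node in blocked:
                continue
            if any(p in failed or p in blocked for p in parents):
                blocked.add(node)
                changed = True
    return blocked

# The test wired in as a node between model_a and model_b:
DEPS_WITH_TEST = {
    "model_a": [],
    "test_model_a_unique_id": ["model_a"],
    "model_b": ["test_model_a_unique_id"],
}
```

A failed test_model_a_unique_id blocks model_b; a failed model_a blocks both the test and model_b.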

@bashyroger

> @jtcohen6 are we thinking about making tests participate more naturally in the DAG here? I.e.:
>
> model_a ---> test_model_a_unique_id ---> model_b
>
> If test_model_a_unique_id failed, that would conceivably skip running model_b. I think we have at least one issue... somewhere... that talks about this, and I'm curious if it is in scope for a ticket like this one
>
> Edit: basically just curious if #2234 is a part of this or a separate issue

This is exactly the reason I was asking for clarification on the WHAT of this ticket.
Basically, this would enable processing part of a complete DAG 'horizontally' (in logical sequence per DAG thread) vs. 'vertically' (per 'layer').
And it is good to have that freedom in loading patterns, IMO!

fabrice-etanchaud commented Apr 6, 2021

Hello, just to mention this post in 2234: #2234 (comment).
IMHO this subject should distinguish between precondition tests and postcondition tests: which tests are mandatory for the model to execute correctly (a dumb example: upstream_model.column >= 0, because the current model computes its square root), and which tests are awaited on the current model's new data (upstream_model.column ~= current_model.column * current_model.column) and can then be used as preconditions for downstream models.

drewbanin (Contributor) commented May 14, 2021

@jtcohen6 I have been stuck on this idea that I just cannot shake! Wanted to mention it here.

IF:

  • A project has sources configured AND
  • dbt is configured to run dbt source snapshot-freshness AND
  • dbt has a way to compare 1) freshness information and 2) model logic across invocations AND
  • dbt has knowledge of which materializations map to views vs. tables

THEN:

  • a generalized dbt build command would be well-positioned to skip running models where a rebuild would result in exactly the same database object that already exists in the database

I think there's some more formality / rigor to apply here, and I'm actually not 100% sure that this requires the existence of a dbt build command, but wanted to throw it out there for consideration.

To get more concrete, here are some of the examples I'm considering:
A view model only needs to be built when:

  • its logic has changed
  • it does not already exist in the database

A table/incremental model only needs to be built when:

  • Its logic has changed OR
  • it does not already exist in the database OR
  • An upstream model's logic has changed OR
  • An upstream source's data has changed
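The conditions above could be collapsed into a simple predicate, sketched here in Python purely for illustration (the field names are hypothetical, and dbt does not implement this check today):

```python
def needs_rebuild(node):
    """Decide whether a model must be (re)built, per the rules above.

    `node` is a hypothetical dict describing one model; the keys are
    illustrative, not dbt's manifest schema.
    """
    if not node["exists_in_database"]:
        return True
    if node["logic_changed"]:
        return True
    if node["materialized"] == "view":
        # A view re-reads upstream data at query time, so upstream
        # changes alone never require rebuilding it.
        return False
    # Tables and incremental models capture upstream data at build
    # time, so upstream changes force a rebuild.
    return node["upstream_logic_changed"] or node["upstream_data_changed"]
```

Under these rules, an unchanged view is never rebuilt even when its upstream data moves, while an unchanged table is rebuilt as soon as anything upstream changes.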

I think that we can get at a lot of this stuff with the state: or config.materialized selectors, so really my thinking boils down to:

  • Something like a new freshness/source selector?
  • Maybe some sort of packaging that makes these types of selectors more concise?
  • A way to specify this logic as a part of the generalized dbt build command?

@jtcohen6 (Contributor, Author)

@drewbanin That's a really neat thought. We've talked about some kind of freshness:new selector in #2465 (comment), but in order to persist that information across tasks, dbt would need to be handling and comparing multiple source.json files. A pretty slick way around that would be to freshness-check sources in the same task/invocation as running their downstream resources. The trick is, I'm not sure if that could be handled via node selection, which happens before execution begins. Perhaps models downstream of an unchanged source could just be... SKIPPED?

Node selection also doesn't have a clear conception of "already exists in the database," since that tends to live in the materialization logic, but the --defer flag does take this into account. We may want to unbundle these complexities into explicit selection methods (state:, config:, freshness:, existing:) and then re-bundle them for the purposes of an opinionated workflow: dbt build -s what_needs_building.
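The "unbundle, then re-bundle" idea might look like composing small selection predicates into one named selector. A hypothetical Python sketch (the method names mirror the ones proposed above; none of this is dbt's actual selector implementation, and the node attributes are invented for illustration):

```python
# Hypothetical composition of selection methods into a named selector.
# Each method is a predicate over an illustrative node dict.
METHODS = {
    "state:modified": lambda n: n["logic_changed"],
    "existing:false": lambda n: not n["exists_in_database"],
    "freshness:new": lambda n: n.get("source_has_new_data", False),
}

def select(nodes, method_names):
    """Return the union of node names matched by any listed method."""
    return {name for name, node in nodes.items()
            if any(METHODS[m](node) for m in method_names)}

# "dbt build -s what_needs_building" could then expand to:
WHAT_NEEDS_BUILDING = ["state:modified", "existing:false", "freshness:new"]
```

The opinionated workflow is then just `select(nodes, WHAT_NEEDS_BUILDING)`: each complexity stays an explicit, inspectable method, and the bundle is a named union of them.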

StephanGoergen commented May 14, 2021

@drewbanin, teams that manage grants on objects via post-hooks may also want to re-build a model when a config (as opposed to just sql logic) changes. Our team has recently moved post-hook definitions from dbt_project.yml into the model files, so that will hopefully mark the model as state:modified after a permissions change. But that won't apply to any (default) grants governed by dbt_project.yml, so there would be a risk of drift between the database grants and the dbt config.

EDIT: it looks like the state method accounts for configs in general, but I'm not 100% sure that post-hooks are among them.

@boxysean (Contributor)

This Slack thread from Nadya Hrebinka asked "is there a way to [...] combine commands into one to save a couple of minutes?". My mind went to this GitHub issue.

Could Nadya and others experiencing long dbt project parse times use the dbt build command to parse their project once and eliminate duplicate parse time from their successively-executed dbt commands? (e.g., dbt seed, dbt run, dbt test)

@jtcohen6 (Contributor, Author)

Yes! By operating over multiple resource types in a single invocation, dbt build would only need to parse the project once. Another way to speed this up is by using partial parsing, which persists parse state between runs—and it's even better in v0.20.

@kosti-hokkanen-supermetrics

I just discovered that this command would mostly answer the problems we have had with minor issues in base models blocking the whole project from running. I would imagine that in a perfect situation I would have:

  • A build command, where I can define the operations I want to run and in which order, like run+test
  • Separate node selection syntaxes for each command, like --exclude-run config.materialized:view --exclude-test models/staging, since, for example, we would want to run everything except views but test everything (including views)
  • The command would work the DAG node by node, executing actions in the defined order: first checking whether the model matches the node selection for running, running it if it matches, then checking whether it matches the node selection for testing, running tests if it matches, and so on

This way an error in a node, either in running or testing, would only block the models that are downstream from it.

jtcohen6 (Contributor, Author) commented Jul 1, 2021

@kosti-hokkanen-supermetrics Cool to hear what you're hoping to do with it! The first cut of dbt build won't allow much configuration, and its behavior will be defined by some opinionated rules, including:

  • run a model before testing it
  • failures in those tests block downstream models from running

That said, I believe all the right constructs are there. I bet you can combine several dbt build invocations, paired with test severity and thoughtful node selection, to accomplish the thing you're after.
