Add pseudo selectors that select models based on artifact states #2465

Closed
drewbanin opened this issue May 18, 2020 · 5 comments
Labels
enhancement New feature or request node selection Functionality and syntax for selecting DAG nodes state Stateful selection (state:modified, defer)

Comments

@drewbanin
Contributor

drewbanin commented May 18, 2020

See also #2172, #2425

Describe the feature

This change would be in support of:

  1. Improved dev experiences
  2. Slimmer CI builds

If dbt is provided artifacts (manifest, run_results) produced from a previous run of dbt, then dbt will be able to determine:

  1. New nodes
  2. Changed nodes
  3. Nodes that failed to build in a previous invocation

Here are some high-level example usage scenarios:

# Run new and changed models (and their descendants) in a CI build
$ dbt --state prod-target/ run --models @state:modified

# Re-run failed models and their children in development (or, re-run a prod job that failed)
$ dbt --state target/ run --models build:error+

# Re-run failed models and their children in development
# Note: --state is implied to be target/ here
$ dbt run --models build:error+

Implementation details

dbt will need access to the artifacts from a previous invocation in order to compare manifests or determine build statuses from a previous run. To accomplish this, we could add a flag like --state, which should point to a folder containing the manifest and run_results from a previous invocation of dbt. It will be the user's responsibility to make sure these artifacts are present in their environment.

--state flag:

  • This flag probably makes the most sense as a flag to dbt, as it will apply to many subcommands (eg. compile, run, test, seed, snapshot, and ls). It can definitely be a flag to subcommands (or both) if that makes sense
  • The default value should be target/
  • If the expected state files are not present, dbt should run successfully, but selectors based on this state information should fail if used.
    • eg. dbt run --models build:error will fail with an appropriate error if the target/ dir does not exist
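
A minimal sketch of the "run fine without state, fail only when a state selector is used" behaviour described above. This is illustrative only and not dbt's implementation: the `PreviousState` class, the `MissingStateError` exception, and the `manifest.json` / `run_results.json` file names inside the state folder are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Optional


class MissingStateError(Exception):
    """Raised when a state-based selector is used without state artifacts."""


class PreviousState:
    """Hypothetical holder for artifacts found in the --state directory."""

    def __init__(self, state_dir: str = "target") -> None:
        # Loading never fails: a missing file simply becomes None, so
        # invocations that use no state selectors are unaffected.
        self.state_dir = Path(state_dir)
        self.manifest = self._load("manifest.json")
        self.run_results = self._load("run_results.json")

    def _load(self, filename: str) -> Optional[dict]:
        path = self.state_dir / filename
        if not path.is_file():
            return None
        with path.open() as fh:
            return json.load(fh)

    def require(self, artifact: str, selector: str) -> dict:
        """Return the named artifact, or fail with a clear error if absent."""
        data = getattr(self, artifact)
        if data is None:
            raise MissingStateError(
                f"Selector '{selector}' needs {artifact} in "
                f"'{self.state_dir}', but it was not found"
            )
        return data
```

With this shape, a selector such as build:error would call `state.require("run_results", "build:error")` only at selection time, so the error surfaces exactly when the missing artifact is actually needed.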

Selectors:

  • state:modified: Will select any nodes whose hashes have changed compared to the value present in the manifest artifact
  • state:new: Will select any nodes which are present in the project but are not present in the manifest artifact
  • We'll probably want to provide some shorthand that selects new & changed files for local dev
  • build:error: Will select any nodes which errored or were skipped in the run_results state artifact
  • build:success: I don't know that there's a concrete use-case for something like this, but it seems sensible to implement selectors for different states
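
To make the selector semantics above concrete, here is a rough sketch under simplifying assumptions: each manifest is reduced to a mapping of node id to content hash, and run_results is reduced to a list of records with `unique_id` and `status` keys. Those shapes and the function names are hypothetical and chosen for illustration; the real artifacts carry much more information.

```python
from typing import Dict, Iterable, List, Set


def select_new(current: Dict[str, str], previous: Dict[str, str]) -> Set[str]:
    """state:new -> nodes in the project that the old manifest lacks."""
    return set(current) - set(previous)


def select_modified(current: Dict[str, str], previous: Dict[str, str]) -> Set[str]:
    """state:modified -> nodes present in both whose hash differs."""
    return {
        node_id
        for node_id, checksum in current.items()
        if node_id in previous and previous[node_id] != checksum
    }


def select_by_status(run_results: List[dict], statuses: Iterable[str]) -> Set[str]:
    """build:error -> nodes whose recorded status is in `statuses`."""
    wanted = set(statuses)
    return {r["unique_id"] for r in run_results if r["status"] in wanted}


# Example inputs (made up for illustration).
previous = {"model.a": "h1", "model.b": "h2"}
current = {"model.a": "h1", "model.b": "h2-changed", "model.c": "h3"}
results = [
    {"unique_id": "model.a", "status": "success"},
    {"unique_id": "model.b", "status": "error"},
    {"unique_id": "model.c", "status": "skipped"},
]

new_nodes = select_new(current, previous)
modified_nodes = select_modified(current, previous)
failed_nodes = select_by_status(results, ["error", "skipped"])
```

The "new & changed" shorthand mentioned above would then just be the union `new_nodes | modified_nodes`.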

Determining nodes that have changed
This is a tricky problem! A very simple version of this functionality can be implemented with a git diff --name-only. That will get you pretty far, but it will not account for:

  • models that should be considered changed because they reference a macro that has changed
  • schema.yml files (it's tough to correlate .yml file changes to dbt nodes, at least as far as git is concerned)
  • the global impacts of changes to specific macros (eg. generate_schema_name) or the dbt_project.yml file
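
One way to think about the macro case that git diff misses: if we know which macros each node calls and which macros call other macros, a changed macro can be propagated to every node that depends on it, directly or indirectly. The sketch below assumes those two dependency maps are already available; the names `node_macros` and `macro_deps` are hypothetical, and this is not a claim about how dbt computes changes.

```python
from typing import Dict, Set


def nodes_affected_by_macros(
    node_macros: Dict[str, Set[str]],
    macro_deps: Dict[str, Set[str]],
    changed_macros: Set[str],
) -> Set[str]:
    """Return nodes that use a changed macro directly or transitively.

    node_macros: node id -> macros the node calls directly
    macro_deps:  macro name -> macros that macro calls directly
    """
    # Grow the set of affected macros until it stops changing, so that a
    # macro wrapping a changed macro is itself treated as changed.
    affected = set(changed_macros)
    grew = True
    while grew:
        grew = False
        for macro, deps in macro_deps.items():
            if macro not in affected and deps & affected:
                affected.add(macro)
                grew = True
    return {node for node, macros in node_macros.items() if macros & affected}


# Example inputs (made up for illustration).
node_macros = {
    "model.a": {"cents_to_dollars"},
    "model.b": {"wrapper"},
    "model.c": set(),
}
macro_deps = {"wrapper": {"cents_to_dollars"}}
affected_nodes = nodes_affected_by_macros(
    node_macros, macro_deps, {"cents_to_dollars"}
)
```

Globally scoped macros such as generate_schema_name would need separate handling, since they affect nodes that never reference them explicitly.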

Describe alternatives you've considered

  • Git trickery: This is an incomplete solution and won't fare super well in CI envs, but might be hackable in local dev work

Who will this benefit?

  • People who run dbt jobs in their CI envs
  • People who are making iterative changes in development
  • We could add a "Rerun from failed" button in dbt Cloud, and folks running dbt in their own prod envs could do something similar (eg. in an Airflow error handler) for intermittent build failures
@bashyroger

bashyroger commented Jun 19, 2020

This would indeed be great to have.
Especially when running a large pool of connected models, having the option to resume from the point of failure would save a lot of reprocessing overhead, and thus time and cost.

On the implementation side, I think this really requires tracking model run state in a database.
That would bring the additional benefit of being able to report on run state and track runs over time. In other words, make what is currently part of the optional dbt logging package part of dbt core...

@jtcohen6 jtcohen6 added this to the Marian Anderson milestone Jul 6, 2020
@jtcohen6 jtcohen6 changed the title Add psuedo selectors that select models based on artifact states Add pseudo selectors that select models based on artifact states Aug 9, 2020
@jtcohen6 jtcohen6 modified the milestones: Marian Anderson, 0.19.0 Aug 18, 2020
@jtcohen6
Contributor

The state: selectors outlined in this issue were added in #2695. More specific subselectors are proposed in #2704.

The remainder of this issue is the addition of another selector method, build:, which consumes run_results.json and allows users to select models based on their status there.

@jtcohen6 jtcohen6 added the state Stateful selection (state:modified, defer) label Sep 9, 2020
@jtcohen6
Contributor

jtcohen6 commented Sep 9, 2020

Coming back to this just to say: I think we should pick a different name for the selection method that leverages run_results.json. I'm thinking result: instead of build:. (The latter word feels too vague, and in the spirit of "all the good words are already taken," it might be better used elsewhere.)

dbt run -m result:error --state target/

Up for debate:

  • Should we set the value of target-path to be the default value of --state across the board? This is a question we punted on previously
  • Should we broaden the possibilities of the result method? result.status:error, result.timing:>600, ...? I only really see a compelling use case for status-based selection, but if there's another good one, it's worth keeping the door open
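
If the broader form were adopted, the selector string could be split into a method, an optional field, and a value, with the field defaulting to status so that the short form keeps working. This is only a sketch of that idea: the `ResultSelector` type, the `parse_result_selector` function, and the defaulting rule are assumptions for illustration, not a proposal for dbt's actual selector grammar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultSelector:
    method: str  # e.g. "result"
    field: str   # e.g. "status" or "timing"
    value: str   # e.g. "error" or ">600"


def parse_result_selector(spec: str) -> ResultSelector:
    """Parse 'method:value' or 'method.field:value'."""
    head, sep, value = spec.partition(":")
    if not sep or not head or not value:
        raise ValueError(f"Expected 'method[.field]:value', got '{spec}'")
    method, _, field = head.partition(".")
    # Assumption: an omitted field means status-based selection.
    return ResultSelector(method=method, field=field or "status", value=value)


short_form = parse_result_selector("result:error")
timing_form = parse_result_selector("result.timing:>600")
```

Keeping the field optional means status-based selection stays the simple default while leaving the door open for other fields later.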

@jtcohen6 jtcohen6 removed this from the Kiyoshi Kuromiya milestone Sep 21, 2020
@jtcohen6
Contributor

Another idea that we've thrown around in some Slack discussions: Using the sources.json artifact to determine which sources have new data since the last dbt run, thereby enabling a pseudoselector like:

$ dbt run -m sources:fresh+

I think this would require using multiple sources.json—one from a previous job, one from the current job (i.e. a preceding dbt source snapshot-freshness step)—to figure out which sources have a different value of max_loaded_at.
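
A rough sketch of that comparison, assuming each sources.json has already been reduced to a mapping of source id to its max_loaded_at value (the reduction step and the function name are hypothetical). A source counts as fresh when its max_loaded_at differs from the previous artifact; treating sources that are absent from the previous artifact as fresh is an additional assumption.

```python
from typing import Dict, Set


def fresh_sources(previous: Dict[str, str], current: Dict[str, str]) -> Set[str]:
    """Return sources whose max_loaded_at differs between two artifacts.

    previous / current: source id -> max_loaded_at (as recorded strings)
    """
    return {
        source_id
        for source_id, loaded_at in current.items()
        # .get() returns None for unseen sources, so they count as fresh.
        if previous.get(source_id) != loaded_at
    }


# Example inputs (made up for illustration).
previous = {
    "source.shop.orders": "2020-09-20T06:00:00Z",
    "source.shop.customers": "2020-09-20T06:00:00Z",
}
current = {
    "source.shop.orders": "2020-09-21T06:00:00Z",
    "source.shop.customers": "2020-09-20T06:00:00Z",
    "source.shop.payments": "2020-09-21T06:00:00Z",
}
fresh = fresh_sources(previous, current)
```

The resulting set would be the starting point for sources:fresh+, with the trailing + expanding to downstream models.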

@jtcohen6
Contributor

This issue saw us through some good times :) closing in favor of the more-specific proposals in #3891 and #4050
