Advanced node selection syntax #2172

jtcohen6 · 2020-02-28T19:33:46Z

We want to enable a mechanism of node selection that is:

More powerful, composable, extensible
In a structured data format
Possible to check into version control

We think that this is best implemented as YML. It should be similar to CLI --models and --select syntaxes, but it will also allow us to move beyond what's possible with CLI flags + arguments.

Selectors

resource name
resource type
model materialization type
tags
project/package
subdirectory
- file path literals
Node dependencies. We can make these more verbose in YML than the current selectors on the CLI:
- parents: +my_model
- children: my_model+
- children and all their parents: @my_model
- proposed: models that depend on macros (and their children?): my_macro+
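
As a rough illustration of the graph operators listed above, here is a minimal Python sketch. The toy DAG and the helper names (`ancestors`, `descendants`, `at_operator`) are hypothetical and are not dbt internals; the sketch only assumes that `+my_model` means "the model plus everything upstream", `my_model+` means "the model plus everything downstream", and `@my_model` means "the model, its children, and all parents of those children".

```python
from collections import deque

# Hypothetical toy DAG: each key maps to the set of its direct children.
CHILDREN = {
    "raw_events": {"stg_events"},
    "raw_users": {"stg_users"},
    "stg_events": {"fct_sessions"},
    "stg_users": {"fct_sessions"},
    "fct_sessions": set(),
}


def _invert(edges):
    """Build a child -> parents mapping from a parent -> children mapping."""
    parents = {node: set() for node in edges}
    for parent, kids in edges.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(parent)
    return parents


PARENTS = _invert(CHILDREN)


def _walk(start, edges):
    """Return every node reachable from `start` by following `edges`."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in edges.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def ancestors(node):
    """Rough equivalent of `+my_model`: the node and all of its parents."""
    return {node} | _walk(node, PARENTS)


def descendants(node):
    """Rough equivalent of `my_model+`: the node and all of its children."""
    return {node} | _walk(node, CHILDREN)


def at_operator(node):
    """Rough equivalent of `@my_model`: children, plus all of their parents."""
    selected = descendants(node)
    for child in list(selected):
        selected |= ancestors(child)
    return selected
```

In this toy graph, `at_operator("stg_events")` also pulls in `stg_users` and `raw_users`, because they are parents of the downstream `fct_sessions`.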

Set logic

unions (inclusive OR): current default behavior
exclusion: possible on the CLI with --exclude
intersections (AND): not yet possible, proposed in Model intersection syntax #2167
exclusive OR will be possible as a combination of the three above (union(A,B) --exclude intersect(A,B))
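
To make the set logic concrete, here is a small sketch that treats each selector result as a plain Python set of node names. The function names and the example node names are made up for illustration; the point is only that exclusive OR falls out of union, intersection, and exclusion.

```python
def union(*selections):
    """Inclusive OR: nodes matched by any selection."""
    return set().union(*selections)


def intersect(first, *rest):
    """AND: nodes matched by every selection."""
    return set(first).intersection(*rest)


def exclude(selected, excluded):
    """Drop the excluded nodes from the selected ones."""
    return set(selected) - set(excluded)


def exclusive_or(a, b):
    """union(A, B) --exclude intersect(A, B)."""
    return exclude(union(a, b), intersect(a, b))


# Hypothetical selector results.
nightly = {"orders", "sessions", "page_views"}
marketing = {"sessions", "page_views", "campaigns"}

only_one_tag = exclusive_or(nightly, marketing)
```

Here `only_one_tag` contains the nodes that carry exactly one of the two tags, which matches Python's built-in symmetric difference `nightly ^ marketing`.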

Well defined "pseudo-selectors"

We can encode a dynamic selector that returns resources based on a set of conditions, which dbt uses to pick specific nodes at build time. I'm including a couple possibilities of varying complexity, mainly to spur the imagination:

this_package_only
- Only execute models + tests that are defined in the current "home" project
- Dynamic based on the project directory from which it's run.
build_if_missing
- Exclude model nodes that already exist as relations in the target database + schema
build_if_changed
- point to manifest.json from a different dbt build, and dbt can compare to infer changed resources
- sensible pattern: select all nodes with changes + their children
build_if_updated
- point to a manifest.json from a different dbt build, and the result of a more recent dbt source snapshot-freshness. dbt can determine whether
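
A minimal sketch of the `build_if_changed` idea, under a strong simplifying assumption: each build is reduced to a flat mapping of node name to its source text. The real manifest.json has a richer structure that is not modelled here, and the helper names are hypothetical.

```python
def changed_nodes(previous, current):
    """Names whose definition is new or differs from the previous build."""
    return {name for name, sql in current.items() if previous.get(name) != sql}


def with_children(selected, children):
    """Add every downstream node of the selected ones."""
    result, stack = set(selected), list(selected)
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


# Hypothetical stand-ins for two builds of the same project.
previous = {
    "stg_users": "select 1",
    "dim_users": "select * from stg_users",
    "stg_orders": "select 2",
}
current = {
    "stg_users": "select 1 as id",          # edited since the previous build
    "dim_users": "select * from stg_users",
    "stg_orders": "select 2",
}
children = {"stg_users": ["dim_users"], "stg_orders": []}

# "select all nodes with changes + their children"
to_build = with_children(changed_nodes(previous, current), children)
```

With these inputs `to_build` holds `stg_users` (changed) and `dim_users` (its child), while `stg_orders` is skipped.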

(Very) hypothetical spec

version: 2

selectors:

  - name: snowplow_marketing_nightly    # human-friendly name for this custom node grouping
    definition:
      - union:          # include nodes for which ANY of the selectors below is true
        - intersect:    # include nodes for which ALL of the selectors below are true
          - tag: nightly
          - tag: marketing
          - package: snowplow
          - materialized: incremental
        - union:
          - resource_name: snowplow_marketing_custom_events
          - file_path: "models/snowplow/marketing/custom_events.sql"
          - model_dir: "snowplow/marketing"
        - intersect:
          - resource_type: seed
          - package: snowplow
          - exclude:
              resource_name: country_codes
  
  - name: ci    # a different custom node grouping
    definition:
      - dynamic: build_if_changed
        parents: false
        children: true

dbt run --selector snowplow_marketing_nightly

dbt run --selector ci
dbt test --selector ci
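
One way to read the nested `union` / `intersect` / `exclude` structure is as a small recursive evaluator over sets of node names. The sketch below is only an interpretation of the hypothetical spec, not an implementation from dbt: the node records are invented, the definition is a trimmed copy written as a Python literal (so no YAML parser is needed), `exclude` is read as "everything except the match", and the `dynamic` selectors are not handled.

```python
# Hypothetical node records; field names mirror the selector keys in the spec.
NODES = [
    {"resource_name": "snowplow_sessions", "resource_type": "model",
     "package": "snowplow", "tags": {"nightly", "marketing"},
     "materialized": "incremental"},
    {"resource_name": "snowplow_marketing_custom_events", "resource_type": "model",
     "package": "my_project", "tags": set(), "materialized": "table"},
    {"resource_name": "country_codes", "resource_type": "seed",
     "package": "snowplow", "tags": set(), "materialized": "seed"},
    {"resource_name": "referer_mapping", "resource_type": "seed",
     "package": "snowplow", "tags": set(), "materialized": "seed"},
    {"resource_name": "orders", "resource_type": "model",
     "package": "my_project", "tags": {"nightly"}, "materialized": "table"},
]
ALL = {node["resource_name"] for node in NODES}


def match(key, value):
    """Names of nodes matching a single leaf selector such as `tag: nightly`."""
    if key == "tag":
        return {n["resource_name"] for n in NODES if value in n["tags"]}
    return {n["resource_name"] for n in NODES if n.get(key) == value}


def evaluate(spec):
    """Recursively evaluate a selector definition into a set of node names."""
    if isinstance(spec, list):            # a bare list is treated as a union
        return set().union(*(evaluate(item) for item in spec))
    [(key, value)] = spec.items()         # exactly one key per mapping here
    if key == "union":
        return evaluate(value)
    if key == "intersect":
        result = set(ALL)
        for item in value:
            result &= evaluate(item)
        return result
    if key == "exclude":
        return ALL - evaluate(value)
    return match(key, value)


definition = [{"union": [
    {"intersect": [{"tag": "nightly"}, {"tag": "marketing"},
                   {"package": "snowplow"}, {"materialized": "incremental"}]},
    {"union": [{"resource_name": "snowplow_marketing_custom_events"}]},
    {"intersect": [{"resource_type": "seed"}, {"package": "snowplow"},
                   {"exclude": {"resource_name": "country_codes"}}]},
]}]

selected = evaluate(definition)
```

Against these invented nodes, `selected` contains `snowplow_sessions`, `snowplow_marketing_custom_events`, and `referer_mapping`; `country_codes` is dropped by the `exclude`, and `orders` matches none of the branches.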

Prior art

This carries on the legacy of several past issues (going back to #550, if not earlier). It's something we've been thinking about for some time.

Looking ahead, I believe that a good approach here will form the basis for features we're very interested in supporting:

complex workflows
smarter CI
dev/prod environments in larger/multi-package projects


alanmcruickshank · 2020-04-14T21:49:41Z

@drewbanin @jtcohen6 - I'm very invested in this feature. I think it could meaningfully improve the incremental run times of our production DAG, especially the ability to skip any view-materialised models and just run a pruned DAG of incremental and table models. I'm really pleased to find such a well thought-out approach detailed here and in the linked issues.

It looks like this depends on #2203, so I'm assuming there's nothing I can do to help right now, but I'm very keen to help out if I can, even if that's just constructing a bank of potential test cases. Please let me know if I can help. 😁

drewbanin · 2020-04-29T17:41:26Z

@beckjake to review and advise. Sounds like PowerShell and jq have good syntaxes for arbitrary selection over a list -- what do those look like, and can we be inspired by them?

aaronsteers · 2020-06-24T17:29:31Z

I'd like to propose a possible implementation for the "diff-only" (build_if_changed) feature which is based upon my own prior learnings with similar architectures. I'm not sure if this is already the plan but I wanted to document here in case it would be helpful.

During DBT run, the source code of each node is hashed, and the result is hashed together with the hashes of all upstream models and the dbt version number. The output of this process is a unique hash that can be stored for each model - either on the database itself, and/or in manifest.json as a uniqueness key hash for that run.
During subsequent executions of DBT run, the hashes are calculated again and compared. If --diff-only (or --skip-unchanged or similar) is specified, any object with an exactly matching hash is skipped.
Objects which are missing would always fail the comparison and would therefore be built.
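
The three steps above could be sketched roughly as follows. This is an illustration of the proposal, not existing dbt behaviour: the function names, the `DBT_VERSION` placeholder, and the in-memory `sources` / `parents` / `stored_hashes` mappings are all assumptions made for the example.

```python
import hashlib

DBT_VERSION = "0.0.0-example"  # placeholder, not a real release number


def node_hash(name, sources, parents, cache=None):
    """Hash a node's source together with its upstream hashes and the version."""
    cache = {} if cache is None else cache
    if name not in cache:
        digest = hashlib.sha256()
        digest.update(DBT_VERSION.encode() + b"\0")
        digest.update(sources[name].encode() + b"\0")
        # Sort parents so the hash does not depend on declaration order.
        for parent in sorted(parents.get(name, ())):
            digest.update(node_hash(parent, sources, parents, cache).encode())
        cache[name] = digest.hexdigest()
    return cache[name]


def nodes_to_build(sources, parents, stored_hashes):
    """Nodes whose hash is missing from, or differs from, the stored hashes."""
    cache = {}
    return {
        name
        for name in sources
        if stored_hashes.get(name) != node_hash(name, sources, parents, cache)
    }


# Hypothetical project: `b` selects from `a`; `c` is independent.
sources = {"a": "select 1", "b": "select * from a", "c": "select 3"}
parents = {"b": ["a"]}

# Hashes recorded by a previous run.
stored = {name: node_hash(name, sources, parents) for name in sources}

# Editing `a` invalidates `a` and its downstream model `b`, but not `c`.
edited = dict(sources, a="select 2")
rebuild = nodes_to_build(edited, parents, stored)
```

Because each hash folds in the parents' hashes, a change to `a` propagates to `b` without any explicit graph diffing, and a node with no stored hash is always rebuilt.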

Importantly, this can be performed using static code analysis and is sensitive to upstream model changes. The use cases supported here are:

I'm a developer (using dbt-cloud or similar incremental development process) and I don't want to wait for things to rebuild when we already know what their outputs will be.
- Developers could (and probably would), leave this as the default setting and only disable/override it when changes are applied outside of the DBT environment (e.g. new data loaded or raw table schemas updated).
I'm in production and I just released a bugfix to the main branch. Without rebuilding my entire environment, I want to automatically rebuild only objects whose source code definition has changed (along with their downstream models) - without having to manually identify which those objects are.

Would this type of "smart rebuild" be feasible and is this similar perhaps to what is already being planned?

ucg8j · 2020-07-13T15:21:04Z

This could also improve the data lineage usability in dbt docs.

I don't think this is covered above. When working with massive DAGs, I don't want all children/parents recursively; I want to traverse the tree a level at a time, or specify the depth I want to traverse.

Much like the *nix command tree takes an argument to list X many levels deep OR recursive. This might look something like:

dbt model_name^1 # only immediate children
dbt model_name^2 # immediate children and grandchildren
dbt 1^model_name # immediate parents
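
For what the depth argument might mean in practice, here is a small sketch of a depth-limited traversal. The `within_depth` helper and the example graph are hypothetical; passing a child mapping gives the `model_name^N` direction, and passing a parent mapping gives `N^model_name`.

```python
def within_depth(start, edges, depth):
    """Nodes reachable from `start` in at most `depth` hops, excluding `start`."""
    frontier, found = {start}, set()
    for _ in range(depth):
        # Step one level further, ignoring nodes that were already collected.
        frontier = {
            nxt for node in frontier for nxt in edges.get(node, ())
        } - found - {start}
        if not frontier:
            break
        found |= frontier
    return found


# Hypothetical linear DAG: a -> b -> c -> d
children = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
parents = {"b": ["a"], "c": ["b"], "d": ["c"]}

immediate_children = within_depth("a", children, 1)        # like a^1
children_and_grandchildren = within_depth("a", children, 2)  # like a^2
immediate_parents = within_depth("c", parents, 1)          # like 1^c
```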

Raalsky · 2020-07-13T15:36:46Z

@ucg8j Check the direct child model selector syntax added here: #2485. It should be released in the next feature release (maybe 0.18.0 or something).

jtcohen6 added the enhancement New feature or request label Feb 28, 2020

drewbanin mentioned this issue Mar 16, 2020

Rationalize test selection #2203

Closed

drewbanin added this to the Octavius Catto milestone Mar 24, 2020

drewbanin self-assigned this Mar 24, 2020

drewbanin modified the milestones: Octavius Catto, dbt-next Apr 29, 2020

drewbanin removed their assignment May 5, 2020

drewbanin mentioned this issue May 18, 2020

Add pseudo selectors that select models based on artifact states #2465

Closed

beckjake mentioned this issue Jul 22, 2020

Feature/yaml selections #2640

Merged

4 tasks

beckjake closed this as completed in #2640 Jul 24, 2020
