Rationalize test selection #2203
Comments
How does test selection work?
Somewhat surprisingly, test selection is basically identical to model/snapshot/seed/etc. selection! When selecting tests, dbt will:
Step two here is really surprising in the general case! dbt will do extra work to find all tests that depend directly on selected nodes only to either:
Forgetting the implementation details for a moment, the big takeaway here is that test selection occurs fundamentally at a node level, and then dbt does one extra hop to find the tests that depend on those selected nodes directly. This is reflected in the name of the
How could test selection work
Part of this is a naming/UX problem. The
Beyond that, there are some substantive and wonky things about test selection that we should address:
Tags
Selecting schema vs. data tests by tags is a bad idea. We should add a first-class property to test nodes indicating whether they are schema tests or data tests, and perform selection based on that property. Test nodes should no longer be supplied with these auto-tags. We could also make these proper selectors instead of CLI args. Coupled with #2167, I think we could replicate the existing functionality in
Selectors
dbt should support selecting tests by:
Ideally, these would all be supported via the same system. Take this as a vignette and not a spec, but I'm picturing selectors like:
If no selector is provided, dbt would default to the
One big consideration is that the graph modifiers
would, semantically, select all of the tests that reference
So, what do we do about this? One "fix" could be to inject tests inline into the DAG. Picture something like:
A key benefit here is that a selector like
would "just work". In this example, tests applied to
I think that our ultimate answer to test selection must include at least one of the following:
An example of a user-unfriendly test selector which is unambiguous and consistent:
I think this syntax is decidedly worse than the current implementation
There's certainly more thinking to do here, but I wanted to open up this line of inquiry to discussion before going further down the rabbit hole. @beckjake, @jtcohen6, what say you? |
On its face, without a full appreciation for how hard it would be to implement, I support the change to how tests participate in the DAG. Most implications of this feel intuitive:
A hypothetical
If the schema test failed, it would skip all downstream nodes, including
Some implications feel a little less intuitive. In the same scenario above, let's say a data test references both
But the build order for a DAG that includes only model nodes could run
|
@jtcohen6 I think your last example actually reflects the desired behavior!
If
We do this by building a transitive closure over the DAG and then pruning nodes that we want to exclude from the run.
Some other thoughts around edge cases:
@beckjake curious what you think about all of this |
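For readers following along, the "transitive closure, then prune" idea mentioned in the comment above can be sketched roughly as follows. This is a plain-Python illustration under assumed names (`transitive_closure`, `prune`, and the toy graph are hypothetical), not dbt's actual implementation.

```python
def transitive_closure(edges):
    """Map every node to the set of all of its descendants.

    `edges` maps a node to the set of its direct children and is
    assumed to describe a DAG (no cycles).
    """
    closure = {}

    def descendants(node):
        if node not in closure:
            result = set()
            for child in edges.get(node, ()):
                result.add(child)
                result |= descendants(child)
            closure[node] = result
        return closure[node]

    for node in edges:
        descendants(node)
    return closure


def prune(edges, keep):
    """Drop nodes outside `keep` while preserving ordering among the rest.

    Because the closure already records indirect dependencies, removing
    an intermediate node does not lose the ordering constraint between
    its ancestors and its descendants.
    """
    closure = transitive_closure(edges)
    return {node: closure[node] & keep for node in closure if node in keep}
```

With a chain `a -> b -> c` and `b` excluded from the run, the pruned graph still records that `c` comes after `a`. The result keeps every transitive edge rather than a minimal set, which is sufficient for ordering purposes in this sketch.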
I really don't love the idea of tests being inline in the DAG. I agree that it solves a number of problems around
I think a more compelling thing would be to note tests as "special children", and in the
That's a bit more convoluted internally, as we'd have to rework some of the "node done" completeness stuff, but I think it will result in a more intuitive/consistent outcome. This would be an opportunity to revisit how dbt handles model/job scheduling, and potentially make it less opaque. |
Ok, I think that's fair. Direct child selectors are a really good consideration! I'm happy for tests to remain special; it just sounds like we need to make them more special than they currently are.
So, is the goal for this issue to change the current implementation of how tests are selected out of the DAG? While I'm in favor of the change you're describing, I don't think this will necessarily change any semantics of the command line interface for selecting tests! |
Sorry for the essay, this took me a long time to think through and write about "intelligently". I guess I've become convinced that tests are "special", but I do believe their specialness should get changed. Let me lay out my reasoning here. In broad strokes, these are what I see as a representative sample of the key considerations:
Thinking about all that, it seems to me that node selection should work similarly to how it currently does, except it should be more context-dependent. I'd like to conceptually split it into two parts:
"Node selection" would behave exactly like node selection today (barring any changes in #2172 and #2167), except:
"task-based graph creation" would be implemented per-task.
The wrinkle in my eyes is ephemerals. I know this is pretty vague and hand-wavy, but we should either lazily compile them or compute them up-front and rip through compiling them all immediately before we even start running. Either way, ephemeral nodes shouldn't be special-cased in node selection.
implementing
|
Thanks for this writeup @beckjake - it's very helpful!!
Yep, I totally buy this. Let's call these things:
They are very related, but I agree, they're different tasks and we should not conflate the two!
I agree it would be great to not special case ephemerals in node selection. I'm open to either of the approaches you've outlined here. Last, the |
@beckjake to open a new issue to split out node selection from DAG creation |
great discussion! Closing in favor of #2328 |
Describe the challenge
Test selection is "novel and exciting" whereas it should instead be boring and straightforward. Let's fix that.
@drewbanin to complete this issue w/ info about what to change and how to change.
See also #2172, #2167