
[Feature] Unit Tests Should Support ref & source statements when specifying rows with sql #10227

Open
ernestoongaro opened this issue May 27, 2024 · 3 comments
Labels: enhancement, triage, unit tests

Comments

@ernestoongaro

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

I was speaking to Roberto Zagni. He has a table with some sample rows he'd like to reference easily, something like this:

```yaml
given:
  - input: ref('stg_tpch_customers')
    format: sql
    rows: |
      select customer_key from {{ target.database ~ '.' ~ target.schema }}.stg_tpch_customers where is_testing_row = true
```

He would like to use a ref() or perhaps a source() statement instead.

Describe alternatives you've considered

{{ target.database ~ '.' ~ target.schema }}.<table_name>
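As a sketch of this workaround (the model name `dim_customers` and the expectation table name are illustrative; only `stg_tpch_customers` and the target-qualified naming come from the example above), a complete unit test could spell out the fully qualified table names with `target` variables inside the SQL fixtures:

```yaml
unit_tests:
  - name: test_customer_filtering        # hypothetical test name
    model: dim_customers                 # hypothetical model under test
    given:
      - input: ref('stg_tpch_customers')
        format: sql
        rows: |
          select customer_key
          from {{ target.database ~ '.' ~ target.schema }}.stg_tpch_customers
          where is_testing_row = true
    expect:
      format: sql
      rows: |
        select customer_key
        from {{ target.database ~ '.' ~ target.schema }}.dim_customers__expectation
```

This works because the table name is assembled as a plain string, so dbt never needs to resolve a `ref()` or `source()` dependency inside the fixture — which is exactly why it is less maintainable: renames and cross-environment differences are no longer tracked for you.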

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

@ernestoongaro added the enhancement, triage and unit tests labels May 27, 2024
@RobMcZagBDS

Thank you @ernestoongaro

The current unit-test functionality is perfectly suited to very specific and narrow testing, where you pick just the few columns you need, mock the macro calls to the return values you expect, and check that you get the desired output.
It is a lot like unit tests that mock everything except the one tiny bit of complex logic you want to test.
That is great power, but it is like painting a whole building with the same brush you would use for the fine details of an art painting.

The typical case I had in mind is a wider validation of a model. You have some sample data, you have isolated a few rows that cover the different use cases, and you have received or manually verified the expected output for them. You would then keep these input and expected rows in a table and use them, adding more rows as new use cases are found. One great example is adding the use-case data when you find and fix a bug.

In this situation it is also useful to note that the output / expectation of one model easily becomes the input for the next model to test, so you could almost visualize the sequence as a chain of known inputs and their expected outputs down the lineage.

I would suggest a format of ref or source, and then as rows we could put the query that references the ref or source.
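One way the suggestion could look (purely hypothetical syntax illustrating the request — dbt does not support resolving `source()` inside a SQL fixture today, and the `fixtures` source and table name are made up):

```yaml
given:
  - input: ref('stg_tpch_customers')
    format: sql
    rows: |
      select customer_key
      from {{ source('fixtures', 'stg_tpch_customers__fixture') }}
      where is_testing_row = true
```

The appeal is that the fixture table then participates in dbt's normal dependency resolution, instead of being a hardcoded string that silently breaks on renames or environment changes.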

@dbeatty10 (Contributor)

@RobMcZagBDS and @ernestoongaro thank you both for raising this issue 🤩

After discussing with @graciegoheen, this isn't something we’d prioritize anytime soon, but we will continue to listen for how many folks are asking for this.

A large reason for our prioritization is the complexity that would be involved in implementing this. Here is a summary of some of the obstacles identified by @gshank:

  • SQL fixtures: Each one would need to be compiled, requiring additional code and refactoring.
  • Extra fields: New fields like compiled_sql would need to be added.
  • Dependency handling: Handling dependencies (depends_on) would be difficult because fixture nodes are created dynamically and don’t exist during the initial parsing stage. The unit testing manifest might need separate depends_on structures for each fixture.
  • Additional unknowns: There may be other things that would be difficult to handle or that we'd have a hard time detecting.

@RobMcZag

Thank you Doug, Grace, Gerda and Ernesto for looking into it.
I understand the technical difficulties, and we can somehow cope with the current limitations, even if it is not as elegant and maintainable as it would be if we could use a source() reference.

Maybe it is just my feeling, but I would prefer to have a "FIXTURES" schema containing tables like "TABLE_XXX" and "TABLE_XXX__EXPECTATION" (when the expectation is not the same as the next input in the pipeline, "TABLE_YYY"), and to select the rows and columns with SQL, rather than keeping a similar collection of CSV or SQL files in a folder inside the repository.
We can do that with the current SQL feature by hardcoding the DB & SCHEMA. Do we have variables in the context?

BTW, it would be nice if the docs gave a better description of what is expected from a SQL fixture.
My gut feeling is that it's a piece of SQL that, when run, returns the desired rows and columns, but I'm not 100% sure and have not yet experimented with it.
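(For what it's worth, that gut feeling matches my understanding of the documented examples: a `format: sql` fixture is any query whose result set supplies the mocked rows and columns, typically a union of literal selects. A minimal sketch, with column names taken from the example earlier in this thread:

```sql
-- a SQL fixture is just a query whose result set becomes the mocked input
select 1 as customer_key, true as is_testing_row
union all
select 2 as customer_key, false as is_testing_row
```

Since it is an arbitrary SELECT, that is also what makes the target-qualified-table workaround above possible.)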
