
Implement schema tests by group/partition (WIP - not ready for review) #451

Closed · wants to merge 3 commits

Conversation

emilyriederer (Contributor)

This is a:

  • bug fix PR with no breaking changes — please ensure the base branch is master
  • new functionality — please ensure the base branch is the latest dev/ branch
  • a breaking change — please ensure the base branch is the latest dev/ branch

Description & motivation

This PR adds schema tests by group, as discussed in #450. In short, the motivation is that some checks cannot be expressed at all without subgrouping, while other checks can be made more rigorous at the group level.

Checklist

I will check off checklist items before asking for formal approval of this PR. This PR will consist of many small, similar pieces, and this currently contains only one such iteration.

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I followed guidelines to ensure that my changes will work on "non-core" adapters by:
    • dispatching any new macro(s) so non-core adapters can also use them (e.g. the star() source)
    • using the limit_zero() macro in place of the literal string: limit 0
    • using dbt_utils.type_* macros instead of explicit datatypes (e.g. dbt_utils.type_timestamp() instead of TIMESTAMP); see the sketch after this list
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md
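For illustration, here is a minimal sketch of what those adapter-compatibility conventions look like in practice. The macro name and columns are hypothetical, not code from this PR:

```sql
{# hypothetical macro, sketched only to illustrate the conventions above #}
{% macro example_macro(relation) %}
    {{ return(adapter.dispatch('example_macro', 'dbt_utils')(relation)) }}
{% endmacro %}

{% macro default__example_macro(relation) %}
    select
        -- cross-database type macro instead of a literal TIMESTAMP
        cast(some_column as {{ dbt_utils.type_timestamp() }}) as some_timestamp
    from {{ relation }}
    -- cross-database replacement for the literal string `limit 0`
    {{ dbt_utils.limit_zero() }}
{% endmacro %}
```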

@emilyriederer (Contributor, Author) commented Nov 29, 2021

@joellabes - I will leave this PR marked as a Draft for the foreseeable future until I've added the argument to all the in-scope tests, updated the docs, etc. In the meantime, I've pushed an example of the recency test. At your convenience, I'd appreciate any feedback on the general approach / style since the others will likely be similar.

Currently I have added:

  • an updated recency test macro with an optional group_vars argument
  • a new model in integration_tests which makes a simple dataset to test by group
  • an updated schema.yml to run a groupwise recency check on the new "model" dataset (a rough sketch of that entry follows this list)
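Roughly, the new entry looks something like the following; the field name here is illustrative, and only the group_vars argument is new:

```yaml
# rough sketch of the new schema.yml entry; the field name is illustrative
models:
  - name: test_recency_group
    tests:
      - dbt_utils.recency:
          datepart: day
          field: created_at
          interval: 1
          group_vars: ['col_a', 'col_b']
```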

Some areas where I'd specifically like input (in addition to anything that comes to mind):

UI

  • Are there internal "style" rules somewhere that I should reference when making a decision that affects the API?
  • Is there a better name than group_vars or a better way to provide grouping columns than a list of names (['col_a', 'col_b'])?

Tests

  • I see some tests-of-the-tests are set up using seeds and others using models. Are there guidelines when I should use which? (The recency check requires a current timestamp, so this one has to be a model. This is more of a general question for the others.)
  • I could have changed the test_recency model instead of making a new test_recency_group model. Is there a preference between "avoid duplication in tests" versus "map each model to exactly one test"?
  • Looking at the tests for test_recency, I see integration tests for schema tests that should pass. Is it standard to only test the "positive" case or is there a framework to also test for correct detection of failures?

@joellabes (Contributor)

Hi @emilyriederer, super late for me to come back to you but I have just spent some time looking at this. I love it! Big fan 🎉 I thought this was going to be much fiddlier than it has turned out to be, which is always a win.

Some answers to your questions:

  • Are there internal "style" rules somewhere that I should reference when making a decision that affects the API?

Not written down anywhere, but in general it's good to have common arguments named the same way and in the same order where possible. As an example, the metrics package has some secondary calculations which have arguments in common - they're all named the same way and in the same order, optimising globally as opposed to picking the perfect word to get each one's local maximum.

  • Is there a better name than group_vars or a better way to provide grouping columns than a list of names (['col_a', 'col_b'])?

I'm leaning towards something like group_by or group_by_columns, since vars has an existing meaning in dbt. If there's an existing pattern on other tests then feel free to adopt that though. I think providing the list of column names is the best way to handle it 👍
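To make that concrete, here is a minimal sketch of how a group_by_columns list might flow into the generated SQL. This is illustrative only, not the code in this branch, and the macro name is made up:

```sql
{# illustrative fragment: threading an optional group_by_columns list through a test macro #}
{% macro example_test_by_group(model, field, group_by_columns=[]) %}

select
    {% for col in group_by_columns %}{{ col }},{% endfor %}
    max({{ field }}) as most_recent
from {{ model }}
{% if group_by_columns %}
group by {{ group_by_columns | join(', ') }}
{% endif %}

{% endmacro %}
```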

  • I see some tests-of-the-tests are set up using seeds and others using models. Are there guidelines when I should use which? (The recency check requires a current timestamp, so this one has to be a model. This is more of a general question for the others.)

You've nailed the difference - seeds are nicer than hand-crafted select statements, where the data is static. But when you need the current timestamp, it has to be a proper model.
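For instance, the recency fixture ends up as a model along these lines; the column names are assumptions, not necessarily what this branch uses:

```sql
-- illustrative fixture: the rows must be anchored to the current timestamp,
-- which is why this has to be a model rather than a static seed
select 'a' as group_col, {{ dbt_utils.current_timestamp() }} as created_at
union all
select 'b' as group_col, {{ dbt_utils.current_timestamp() }} as created_at
```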

  • I could have changed the test_recency model instead of making a new test_recency_group model. Is there a preference between "avoid duplication in tests" versus "map each model to exactly one test"?

Good question! I'm generally OK with the same input file being used for multiple output files, if it can be achieved without needing to contort the tests or the models too much. E.g. I was just looking at PR #507 which has multiple tests built on top of the same seed file, by using the where conditions. But the window functions vs standard functions made more sense to be in different seed files. No hard and fast rule, but if you're working too hard to keep it in one file, or doing huge amounts of copy-paste, you're probably leaning too far in one direction or the other.
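For context, that "same seed, multiple tests" pattern looks roughly like this, using dbt's test-level where config; the seed and column names below are made up for illustration:

```yaml
# illustrative: two tests scoped to different slices of the same seed via `where`
seeds:
  - name: data_shared_cases
    tests:
      - dbt_utils.expression_is_true:
          expression: "col_a <= col_b"
          config:
            where: "case_group = 'standard'"
      - dbt_utils.expression_is_true:
          expression: "col_a = col_b"
          config:
            where: "case_group = 'window'"
```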

  • Looking at the tests for test_recency, I see integration tests for schema tests that should pass. Is it standard to only test the "positive" case or is there a framework to also test for correct detection of failures?

We don't currently have a way to test for detection of failures 😢 The Core team are looking into improving testability of adapters at the moment, which will hopefully bleed over to packages as well (cc @jtcohen6)

And a question from me:

  • Why is your new model restricted to just postgres? Did you only have access to a Postgres database?

Thanks for your work on this so far 🤩

@joellabes (Contributor)

Also! A heads up that #521 will change a bunch of file names, so it's worth waiting for that to be merged before moving this any further forward

@emilyriederer (Contributor, Author)

Thanks @joellabes for all of the helpful comments! I'll wait until #521 is merged then work on your updates and expanding to other tests. 🤓

To answer your question, this PR shouldn't be limited to any one database since the functionality is bound to be pretty basic changes to insert group by statements. I only checked BigQuery because that's what I currently have access to for testing.

@joellabes (Contributor)

  • To answer your question, this PR shouldn't be limited to any one database since the functionality is bound to be pretty basic changes to insert group by statements. I only checked BigQuery because that's what I currently have access to for testing.

I was meaning in relation to this line {% if target.type == 'postgres' %}, which would mean on any other target the model would be empty, right?

@emilyriederer (Contributor, Author)

Ah yes, sorry! I see what you mean now. I'll clean that up before submitting. I initially set out to write a more complicated test (to have failure cases along with successes) and thought I might need to do it differently for Postgres so I'd started the if/else skeleton. However, as things stand, it's superfluous because the if and else parts of the code are identical.
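In other words, the model currently amounts to something like the following (columns illustrative), so the conditional can simply be dropped:

```sql
{# illustrative: both branches are identical, so the conditional adds nothing #}
{% if target.type == 'postgres' %}
select 1 as col_a, current_timestamp as created_at
{% else %}
select 1 as col_a, current_timestamp as created_at
{% endif %}
```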
