
Implement schema tests by group/partition (WIP - not ready for review) #451

Closed · wants to merge 3 commits

Conversation

emilyriederer (Contributor)

This is a:

  • bug fix PR with no breaking changes — please ensure the base branch is master
  • new functionality — please ensure the base branch is the latest dev/ branch
  • a breaking change — please ensure the base branch is the latest dev/ branch

Description & motivation

This PR adds schema tests by group, as discussed in #450. In short, the motivation is that some checks cannot be expressed at all without subgrouping, while other checks can be made more rigorous at the group level.

Checklist

I will check off checklist items before asking for formal approval of this PR. This PR will consist of many small, similar pieces, and this currently contains only one such iteration.

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I followed guidelines to ensure that my changes will work on "non-core" adapters by:
    • dispatching any new macro(s) so non-core adapters can also use them (e.g. the star() source)
    • using the limit_zero() macro in place of the literal string: limit 0
    • using dbt_utils.type_* macros instead of explicit datatypes (e.g. dbt_utils.type_timestamp() instead of TIMESTAMP); see the sketch after this list
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md
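For illustration, here is a minimal sketch of what those adapter-compatibility conventions look like in practice. The macro name and columns are hypothetical, not code from this PR:

```sql
{# hypothetical macro, sketched only to illustrate the conventions above #}
{% macro example_macro(relation) %}
    {{ return(adapter.dispatch('example_macro', 'dbt_utils')(relation)) }}
{% endmacro %}

{% macro default__example_macro(relation) %}
    select
        -- cross-database type macro instead of a literal TIMESTAMP
        cast(some_column as {{ dbt_utils.type_timestamp() }}) as some_timestamp
    from {{ relation }}
    -- cross-database replacement for the literal string `limit 0`
    {{ dbt_utils.limit_zero() }}
{% endmacro %}
```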

@emilyriederer (Contributor, Author) commented Nov 29, 2021

@joellabes - I will leave this PR marked as a Draft for the foreseeable future until I've added the argument to all the in-scope tests, updated the docs, etc. In the meantime, I've pushed an example of the recency test. At your convenience, I'd appreciate any feedback on the general approach / style since the others will likely be similar.

Currently I have added:

  • an updated recency test macro with an optional group_vars argument
  • a new model in integration_tests which makes a simple dataset to test by group
  • an updated schema.yml to run a groupwise recency check on the new "model" dataset (a rough sketch of that entry follows this list)
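Roughly, the new entry looks something like the following; the field name here is illustrative, and only the group_vars argument is new:

```yaml
# rough sketch of the new schema.yml entry; the field name is illustrative
models:
  - name: test_recency_group
    tests:
      - dbt_utils.recency:
          datepart: day
          field: created_at
          interval: 1
          group_vars: ['col_a', 'col_b']
```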

Some areas where I'd specifically like input (in addition to anything that comes to mind):

UI

  • Are there internal "style" rules somewhere that I should reference when making a decision that affects the API?
  • Is there a better name than group_vars or a better way to provide grouping columns than a list of names (['col_a', 'col_b'])?

Tests

  • I see some tests-of-the-tests are set up using seeds and others using models. Are there guidelines when I should use which? (The recency check requires a current timestamp, so this one has to be a model. This is more of a general question for the others.)
  • I could have changed the test_recency model instead of making a new test_recency_group model. Is there a preference between "avoid duplication in tests" versus "map each model to exactly one test"?
  • Looking at the tests for test_recency, I see integration tests for schema tests that should pass. Is it standard to only test the "positive" case or is there a framework to also test for correct detection of failures?

@joellabes (Contributor)

Hi @emilyriederer, super late for me to come back to you but I have just spent some time looking at this. I love it! Big fan 🎉 I thought this was going to be much fiddlier than it has turned out to be, which is always a win.

Some answers to your questions:

  • Are there internal "style" rules somewhere that I should reference when making a decision that affects the API?

Not written down anywhere, but in general it's good to have common arguments named the same way and in the same order where possible. As an example, the metrics package has some secondary calculations which have arguments in common - they're all named the same way and in the same order, optimising globally as opposed to picking the perfect word to get each one's local maximum.

  • Is there a better name than group_vars or a better way to provide grouping columns than a list of names (['col_a', 'col_b'])?

I'm leaning towards something like group_by or group_by_columns, since vars has an existing meaning in dbt. If there's an existing pattern on other tests then feel free to adopt that though. I think providing the list of column names is the best way to handle it 👍
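To make that concrete, here is a minimal sketch of how a group_by_columns list might flow into the generated SQL. This is illustrative only, not the code in this branch, and the macro name is made up:

```sql
{# illustrative fragment: threading an optional group_by_columns list through a test macro #}
{% macro example_test_by_group(model, field, group_by_columns=[]) %}

select
    {% for col in group_by_columns %}{{ col }},{% endfor %}
    max({{ field }}) as most_recent
from {{ model }}
{% if group_by_columns %}
group by {{ group_by_columns | join(', ') }}
{% endif %}

{% endmacro %}
```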

  • I see some tests-of-the-tests are set up using seeds and others using models. Are there guidelines when I should use which? (The recency check requires a current timestamp, so this one has to be a model. This is more of a general question for the others.)

You've nailed the difference - seeds are nicer than hand-crafted select statements, where the data is static. But when you need the current timestamp, it has to be a proper model.
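For instance, the recency fixture ends up as a model along these lines; the column names are assumptions, not necessarily what this branch uses:

```sql
-- illustrative fixture: the rows must be anchored to the current timestamp,
-- which is why this has to be a model rather than a static seed
select 'a' as group_col, {{ dbt_utils.current_timestamp() }} as created_at
union all
select 'b' as group_col, {{ dbt_utils.current_timestamp() }} as created_at
```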

  • I could have changed the test_recency model instead of making a new test_recency_group model. Is there a preference between "avoid duplication in tests" versus "map each model to exactly one test"?

Good question! I'm generally OK with the same input file being used for multiple output files, if it can be achieved without needing to contort the tests or the models too much. E.g. I was just looking at PR #507 which has multiple tests built on top of the same seed file, by using the where conditions. But the window functions vs standard functions made more sense to be in different seed files. No hard and fast rule, but if you're working too hard to keep it in one file, or doing huge amounts of copy-paste, you're probably leaning too far in one direction or the other.
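For context, that "same seed, multiple tests" pattern looks roughly like this, using dbt's test-level where config; the seed and column names below are made up for illustration:

```yaml
# illustrative: two tests scoped to different slices of the same seed via `where`
seeds:
  - name: data_shared_cases
    tests:
      - dbt_utils.expression_is_true:
          expression: "col_a <= col_b"
          config:
            where: "case_group = 'standard'"
      - dbt_utils.expression_is_true:
          expression: "col_a = col_b"
          config:
            where: "case_group = 'window'"
```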

  • Looking at the tests for test_recency, I see integration tests for schema tests that should pass. Is it standard to only test the "positive" case or is there a framework to also test for correct detection of failures?

We don't currently have a way to test for detection of failures 😢 The Core team are looking into improving testability of adapters at the moment, which will hopefully bleed over to packages as well (cc @jtcohen6)

And a question from me:

  • Why is your new model restricted to just postgres? Did you only have access to a Postgres database?

Thanks for your work on this so far 🤩

@joellabes (Contributor)

Also! A heads up that #521 will change a bunch of file names, so it's worth waiting for that to be merged before moving this any further forward

@emilyriederer (Contributor, Author)

Thanks @joellabes for all of the helpful comments! I'll wait until #521 is merged then work on your updates and expanding to other tests. 🤓

To answer your question, this PR shouldn't be limited to any one database since the functionality is bound to be pretty basic changes to insert group by statements. I only checked BigQuery because that's what I currently have access to for testing.

@joellabes (Contributor)

  • To answer your question, this PR shouldn't be limited to any one database since the functionality is bound to be pretty basic changes to insert group by statements. I only checked BigQuery because that's what I currently have access to for testing.

I was meaning in relation to this line {% if target.type == 'postgres' %}, which would mean on any other target the model would be empty, right?

@emilyriederer (Contributor, Author)

Ah yes, sorry! I see what you mean now. I'll clean that up before submitting. I initially set out to write a more complicated test (to have failure cases along with successes) and thought I might need to do it differently for Postgres so I'd started the if/else skeleton. However, as things stand, it's superfluous because the if and else parts of the code are identical.
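In other words, the model currently amounts to something like the following (columns illustrative), so the conditional can simply be dropped:

```sql
{# illustrative: both branches are identical, so the conditional adds nothing #}
{% if target.type == 'postgres' %}
select 1 as col_a, current_timestamp as created_at
{% else %}
select 1 as col_a, current_timestamp as created_at
{% endif %}
```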
