Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore ability to utilize updated_at for check_cols snapshots #5077

Merged
merged 5 commits into from
Apr 21, 2022

Conversation

dbeatty10
Copy link
Contributor

@dbeatty10 dbeatty10 commented Apr 15, 2022

Restore ability to configure and utilize updated_at for snapshots using the check_cols strategy

resolves #5076

Description

#5076 describes a breaking change, and this PR converts it to a non-breaking change in the most minimal way possible.

Details

  • Verified that the functional test fails without the code change but passes after the code change
  • The check_relations_equal method excluded columns beginning with dbt_, which were the exact columns we want to compare, so modification was necessary
  • To make it a little easier to see the difference between where the snapshot dates are being sourced from, the timestamp_col column uses dates in January of 2016 and the updated_at configuration uses dates in July of 2016

Testing approach

  1. Create a table that represents the expected data after a series of snapshots
    • Use dbt seed to create the expected relation (snapshot_check_cols_updated_at_expected)
  2. Execute a series of snapshots to create the data
    • Use a series of (3) dbt snapshot commands to create the actual relation (snapshot_check_cols_updated_at_actual)
    • my_snapshot.sql contains logic that can switch between 3 different versions of the data
  3. Compare the two relations for equality

The functional test is designed to make sure dbt_updated_at matches the timestamp expression that we passed in through --vars.

The only change within the 3rd version of the data deletes a single row. Because this test doesn't cover the invalidate_hard_deletes=True option, the 3rd version of the data makes no updates. However, this version is included to make it easier to test for the expected behavior when invalidating hard deletes. See "future ideas to consider" below.

Pros

  • No (known) breaking changes
  • Uses the newer functional test conventions
  • Enables inclusion of dbt_* columns within the comparison of relations
  • Establishes the expected result for the default option when invalidate_hard_deletes=False

Cons

  • Not expected to work with adapters other than postgres
  • Still not possible to specify a subset of columns to exclude from comparison
  • Includes a 3rd version of the data which is appears not utilized

Future ideas to consider

  • The timestamp to utilize for hard deletes as an optional opt-in configuration for snapshots
  • For functional tests, allow the inclusion/exclusion of a subset of columns to compare between two relations
  • Make including all columns the default and allow selective exclusion of columns beginning with dbt_ only for those tests that need it.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have added information about my change to be included in the CHANGELOG.

@cla-bot
Copy link

cla-bot bot commented Apr 15, 2022

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Doug Beatty.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@jtcohen6
Copy link
Contributor

@cla-bot check

@cla-bot cla-bot bot added the cla:yes label Apr 15, 2022
@cla-bot
Copy link

cla-bot bot commented Apr 15, 2022

The cla-bot has been summoned, and re-checked this pull request!

@ChenyuLInx
Copy link
Contributor

The change looks good! Can we add a test for this case so we don't accidentally break this next time?

@dbeatty10
Copy link
Contributor Author

Will work on some test cases for this.

@dbeatty10 dbeatty10 marked this pull request as ready for review April 20, 2022 19:03
@dbeatty10 dbeatty10 requested a review from a team April 20, 2022 19:03
@dbeatty10 dbeatty10 requested review from a team as code owners April 20, 2022 19:03
Copy link
Contributor

@stu-k stu-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing stands out, though I'm not the most proficient in testing snapshots, so it might be a good idea to wait for one more approval 👍

@stu-k stu-k requested a review from a team April 20, 2022 19:12
Copy link
Contributor

@ChenyuLInx ChenyuLInx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks @dbeatty10 for the fast fix and test!
Left one question/comment to make sure I understood your test correctly.

}


def test_simple_snapshot(project):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure I understand the test case correctly here.
dbt seed would create thesnapshot_check_cols_updated_at_expected table. All data in snapshot_check_cols_updated_at_actual is purely generated by the logic in the snapshot.sql.

And for the command run_dbt(["snapshot", "--vars", "{version: 3, updated_at: 2016-07-03}"]) command there's no data in snapshot table got generated because no data got modified.

This test is also designed to make sure the time for dbt_updated_at actually has the time that we passed in through '--vars'. Which is the main change this PR is doing

This is a great test! I would recommend adding a few short comments to document the intention of each command.

Copy link
Contributor Author

@dbeatty10 dbeatty10 Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this excellent feedback @ChenyuLInx!

You are correct in your understanding of the test case. I've updated the description of the pull request to reflect your feedback. Or were you thinking of adding these as comments within the code itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments to explain the testing approach.

@ChenyuLInx ChenyuLInx merged commit d09459c into main Apr 21, 2022
@ChenyuLInx ChenyuLInx deleted the dbeatty/fix-check-snapshot branch April 21, 2022 12:56
github-actions bot pushed a commit that referenced this pull request Apr 21, 2022
* Restore ability to configure and utilize `updated_at` for snapshots using the check_cols strategy

* Changelog entry

* Optional comparison of column names starting with `dbt_`

* Functional test for check cols snapshots using `updated_at`

* Comments to explain the test implementation

(cherry picked from commit d09459c)
ChenyuLInx added a commit that referenced this pull request Apr 21, 2022
…) (#5126)

* Restore ability to configure and utilize `updated_at` for snapshots using the check_cols strategy

* Changelog entry

* Optional comparison of column names starting with `dbt_`

* Functional test for check cols snapshots using `updated_at`

* Comments to explain the test implementation

(cherry picked from commit d09459c)

Co-authored-by: Doug Beatty <[email protected]>
agoblet pushed a commit to BigDataRepublic/dbt-core that referenced this pull request May 20, 2022
…-labs#5077)

* Restore ability to configure and utilize `updated_at` for snapshots using the check_cols strategy

* Changelog entry

* Optional comparison of column names starting with `dbt_`

* Functional test for check cols snapshots using `updated_at`

* Comments to explain the test implementation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-498] [Bug] updated_at ignored for check_cols snapshots
4 participants