[CT-1196] [Bug] Inconsistency between dbt_valid_from and dbt_valid_to #5867
Comments
One thing I quickly caught:
The changes fixing that issue (#4513 + #5077) were included in v1.1. They were not included in v1.0. |
@mats-wetterlund-pierce Thanks for opening this! 📅 🕙 Thanks to your detailed examples, I was able to reproduce everything you described.
What is going on
When using the
Where to go from here
I don't think this is a bug, but rather just the result of the timestamp strategy and hard-deletes expecting (I don't think that the check strategy can even have something like this happen since all of its timestamps are guaranteed to use the same source timezone and be monotonically increasing.)
You proposed a few things that we can (and should!) take action on, namely:
There are also some things you can do (that you might have already discovered on your own):
What you could do, but we don't recommend
You asked about an attribute being added to the "snapshot config where one can specify what time zone should be used for the time stamps". Although not recommended, sharing the following for completeness' sake. Rather than transforming all timestamps into UTC, you can override the following macro in your local project:
{% macro snowflake__snapshot_get_time() -%}
    {{ log("snowflake__snapshot_get_time", info=True) }}
    to_timestamp_ntz({{ current_timestamp() }})
{%- endmacro %}
I didn't actually test this out, but I think you'd just replace
Why is this not recommended? One reason is that (to the best of my knowledge) we consider this a "private" macro that is an implementation detail rather than being an intentional interface for user-configurable behavior. As such, it's liable to change in a future release without us considering it a breaking change.
Next steps
I'm planning on closing this in favor of making some documentation updates instead. I'll create an issue in https://github.com/dbt-labs/dbt-docs/issues and tag you in it so you can track along. Thank you for highlighting where there are crucial details we should communicate better! |
Why not change the snapshot code so that updated_at, dbt_valid_from and dbt_valid_to are all stored in a time-zone-aware format with the UTC time zone, instead of each developer needing to take this into consideration and make sure the incoming data is transformed? Using dbt_valid_to and dbt_valid_from from snapshots also causes problems because their type depends on the input data used for the timestamps: some are in NTZ format and some in TZ format. As a developer you then need to remember which ones are in NTZ format and convert them when using data from snapshots, especially when combining data from multiple snapshots. All in all, I think developers would run into far fewer issues if all timestamps in snapshots were in UTC TZ format, and dbt core ensured that. |
Hi, I just ran into this issue too while making dbt snapshots with 'check' strategy. While the problem likely wouldn't result in a failing system for us, it was confusing to see snapshots' Therefore, I fully agree with @mats-wetterlund-pierce that it should also be stored as UTC TZ timestamp. |
I don't know if what I found now would be considered a separate bug or just a variation of what I submitted before. In short, when an attribute in timestamptz format is used for updated_at, the timestamp attributes in the snapshot are created as timestamptz, but the code doesn't take that into consideration when updating the data in the snapshots. In our case the default time zone set in Snowflake is "America/Los_Angeles" (-08:00), so on hard delete we get, for example, "2023-01-30 03:19:10.863000000 -08:00" instead of "2023-01-30 03:19:10.863000000 +00:00". This gives us overlapping data if the same key comes back in the data again within the 8 hours that dbt effectively adds to the valid_to. Currently the only workaround I see is to ensure that the default time zone in the DB is UTC. The snapshot model is:
The output for first time build:
Output on subsequent build:
|
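The 8-hour overlap described in the comment above can be illustrated with a small Python sketch. It only models the reported symptom (a UTC wall-clock value labeled with the session offset) and is not dbt's or Snowflake's actual code path. The fixed -08:00 offset standing in for "America/Los_Angeles" and the reappearing key two hours later are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed -08:00 offset stands in for the session time zone
# "America/Los_Angeles" in January.
SESSION_TZ = timezone(timedelta(hours=-8))

# Wall-clock value from the example in the comment (the time in UTC).
wall_clock = datetime(2023, 1, 30, 3, 19, 10, 863000)

# What the hard delete should record: the wall clock labeled +00:00.
intended_valid_to = wall_clock.replace(tzinfo=timezone.utc)
# What was reported: the same wall clock labeled -08:00.
stored_valid_to = wall_clock.replace(tzinfo=SESSION_TZ)

# The stored value is 8 hours later in absolute time.
shift = stored_valid_to - intended_valid_to
print(shift)  # 8:00:00

# Hypothetical: the same key reappears two hours after the delete.
new_valid_from = intended_valid_to + timedelta(hours=2)
# The new row starts before the old row's stored dbt_valid_to.
overlaps = new_valid_from < stored_valid_to
print(overlaps)
```

Any key that returns within that 8-hour window would produce two rows whose validity intervals overlap.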
@mats-wetterlund-pierce and @RobbertDM I understand where you are coming from with wanting to standardize on As you know, our current implementation is standardized on Changing that current default implementation would be a breaking change, so it's not something we are willing to do. But I can show you how to use the Using
|
The first "flaw" in your comment is:
If it were as stated, there wouldn't be any issues, as all timestamps would be converted with the same parameters and still be correct in relation to each other. The current implementation is a mix of TZ and NTZ attributes, the requirements are undocumented, the implementation is flawed because of this, and the approach is to not correct the flaws.
So if there are bugs, you prefer to keep them rather than implement breaking changes that will improve the product? |
I'm sorry @mats-wetterlund-pierce -- it was not my intent for you to feel like your time has been wasted! We do take your feedback seriously and I agree that there are things we can do to improve the experience of using snapshots in dbt. I did some writing offline and tried to outline three categories of things:
The 1st category can be implemented with improvements to the docs here, and I've opened a new issue to cover the key information: For the 2nd and 3rd categories, I created a new GitHub Discussion to allow for further open-ended dialogue and exploration: I tried to capture and summarize things you've run into and reported. Would definitely invite your thoughts in that discussion on how snapshots could behave and how to accomplish it. |
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days. |
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers. |
Is this a new bug in dbt-core?
Current Behavior
I found an issue, or several, with snapshots using the timestamp strategy. It causes inconsistent data, and the documentation is missing some information about the timestamps and how they work.
I have tried searching for bug reports but haven't found much in this area except issue #4347, which I think may be the reason for my problem, but as I haven't tested this before I don't know if this is a regression.
More specifically, I get a dbt_valid_to that is before dbt_valid_from on hard deletes.
The main reason, in my opinion, is that dbt_valid_from is taken from the source data as-is, with no regard to time zone, while dbt_valid_to is set in the UTC time zone on hard deletes, and neither timestamp in the snapshot table carries a time zone.
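To make the mechanism concrete, here is a minimal Python sketch of how the two values can end up inverted. The UTC+2 source offset and the specific times are hypothetical and chosen only for illustration. The sketch mirrors the behavior described above and is not dbt's actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the source system writes local wall-clock time at UTC+2
# into a column without a time zone (like TIMESTAMP_NTZ).
SOURCE_TZ = timezone(timedelta(hours=2))

# The row was modified at 14:30 local time, which is 12:30 UTC.
modified_at = datetime(2022, 9, 16, 14, 30, tzinfo=SOURCE_TZ)
# The hard delete is detected 15 minutes later.
deleted_at = modified_at + timedelta(minutes=15)

# dbt_valid_from: the source value as-is, with the offset dropped.
dbt_valid_from = modified_at.replace(tzinfo=None)
# dbt_valid_to: the current time in UTC, also stored without an offset.
dbt_valid_to = deleted_at.astimezone(timezone.utc).replace(tzinfo=None)

print(dbt_valid_from)                 # 2022-09-16 14:30:00
print(dbt_valid_to)                   # 2022-09-16 12:45:00
print(dbt_valid_to < dbt_valid_from)  # True
```

Although the delete happened after the modification, the stored dbt_valid_to sorts before dbt_valid_from because the two naive values come from different time zones.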
Also, there is no validation that dbt_valid_to is after or equal to dbt_valid_from.
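The kind of validation meant here could look like the following Python sketch. The function name `find_inverted_rows` and the sample rows are hypothetical, and dbt itself does not ship such a check as far as this report shows. In a real project the same condition would more likely be written as a test against the snapshot table.

```python
from datetime import datetime

def find_inverted_rows(rows):
    """Return rows whose dbt_valid_to is earlier than dbt_valid_from."""
    return [
        row
        for row in rows
        # Current rows have no dbt_valid_to yet, so skip them.
        if row["dbt_valid_to"] is not None
        and row["dbt_valid_to"] < row["dbt_valid_from"]
    ]

# Hypothetical snapshot rows; only the relevant columns are shown.
snapshot_rows = [
    {"id": "abc",
     "dbt_valid_from": datetime(2022, 9, 16, 14, 30),
     "dbt_valid_to": datetime(2022, 9, 16, 12, 45)},   # inverted
    {"id": "def",
     "dbt_valid_from": datetime(2022, 9, 16, 10, 0),
     "dbt_valid_to": None},                            # still current
]

bad_rows = find_inverted_rows(snapshot_rows)
print([row["id"] for row in bad_rows])  # ['abc']
```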
Expected Behavior
That dbt_valid_to is never before dbt_valid_from for the same record.
That the documentation is updated to state that the field used for updated_at needs to be in the UTC time zone and that dbt_valid_to will be in the UTC time zone on hard deletes, or that an attribute is added to the snapshot config where one can specify what time zone should be used for the time stamps.
Steps To Reproduce
Snapshot code - data_snapshot.sql:
Create a table with data:
Run snapshot:
Remove source data:
Run snapshot:
dbt_valid_to is before dbt_valid_from
To complicate it further, though this is perhaps not a very common real-world scenario:
Add data with a timestamp that is before the last_modified in the previous data but after the dbt_valid_to value in the snapshot.
INSERT INTO raw_table VALUES ('abc', '2022-09-16 14:20:00.000000000');
Run snapshot:
The dbt_valid_from on the second set of data is before the dbt_valid_from in the first set of data.
Additional update of the data, the timestamp is after the timestamp in the first data:
Run snapshot:
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
snowflake
Additional Context
No response