With the current code, if the same customer performs the same activity multiple times at the same ts, the activity_occurrence and activity_repeated_at columns do not resolve deterministically: the window functions order by ts alone, so tied rows can be ranked differently from run to run depending on the database. I see this happening in my own data on Snowflake when the source system records a date rather than a timestamp, which I then cast to a timestamp for the stream.
This behavior should have ~no impact on results, but does make equality testing between dev and prod environments painful via the audit-helper dbt package or data-diff framework since these two columns can change with each run.
I would expect that adding activity_id to the order by as a tie-breaker would resolve this issue without introducing additional complications, assuming activity_id is unique per row. Something like:
{# Creates the two activity occurrence columns: activity_occurrence and activity_repeated_at #}
{% macro activity_occurrence() %}
    row_number() over (
        partition by coalesce (
            {{ safe_cast("customer", type_string()) }},
            {{ safe_cast("anonymous_customer_id", type_string()) }}
        ) order by ts asc, activity_id asc) as activity_occurrence,
    lead(ts) over (
        partition by coalesce (
            {{ safe_cast("customer", type_string()) }},
            {{ safe_cast("anonymous_customer_id", type_string()) }}
        ) order by ts asc, activity_id asc) as activity_repeated_at
{% endmacro %}
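To illustrate why the tie-breaker helps, here is a minimal sketch that runs the same window expressions in an in-memory SQLite database through Python's standard `sqlite3` module. The `stream` table, the sample rows, and the `occurrences` helper are hypothetical and only for illustration; it uses SQLite rather than Snowflake and assumes a SQLite build with window function support (3.25 or later). It loads the same rows in two different orders and checks that both columns come out the same.

```python
import sqlite3

# Hypothetical activity rows: (activity_id, customer, ts).
# The first two rows tie on ts, as happens when a date is cast to a timestamp.
ROWS = [
    ("a1", "c1", "2023-01-01 00:00:00"),
    ("a2", "c1", "2023-01-01 00:00:00"),
    ("a3", "c1", "2023-01-02 00:00:00"),
]

# Same window logic as the proposed macro, with activity_id as the tie-breaker.
QUERY = """
select
    activity_id,
    row_number() over (
        partition by customer order by ts asc, activity_id asc
    ) as activity_occurrence,
    lead(ts) over (
        partition by customer order by ts asc, activity_id asc
    ) as activity_repeated_at
from stream
order by activity_id
"""


def occurrences(rows):
    """Load rows into a fresh in-memory table and compute both columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table stream (activity_id text, customer text, ts text)")
    conn.executemany("insert into stream values (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# Insert the rows in two different physical orders.
forward = occurrences(ROWS)
backward = occurrences(list(reversed(ROWS)))

# With a unique tie-breaker the output no longer depends on input order.
print(forward == backward)
```

With only `order by ts asc`, the two tied rows could legitimately receive either occurrence number, which is what makes diffs between dev and prod noisy.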
What would be the best way for me to support resolving this issue?