allow start-end-time-range specification for BigQuery insert_overwrite #2396
Comments
Thanks for the issue. If I'm understanding you right, what you're asking for is a different way of configuring the static `insert_overwrite` strategy.

Current: you supply a list of partition dates, and dbt generates:

    merge into my_dataset.my_incr_model as DBT_INTERNAL_DEST
    using ( ... sql ...) as DBT_INTERNAL_SOURCE
    on FALSE
    when not matched by source
    and DBT_INTERNAL_DEST.session_start in ('2020-05-01', '2020-05-02')
    then delete
    when not matched then insert
    ( ... columns ... )
    values
    ( ... columns ... )

Desired: you could specify a start and end timestamp instead, and dbt would generate:

    merge into my_dataset.my_incr_model as DBT_INTERNAL_DEST
    using ( ... sql ...) as DBT_INTERNAL_SOURCE
    on FALSE
    when not matched by source
    and DBT_INTERNAL_DEST.session_start between '2020-04-01 00:00:00 PST' and '2020-05-01 00:00:00 PST'
    then delete
    when not matched then insert
    ( ... columns ... )
    values
    ( ... columns ... )

I think this specific use case is something you could accomplish today, in your own project, by overriding the macro so that the merge predicate becomes:

    {% set predicate -%}
        {{ partition_by.render(alias='DBT_INTERNAL_DEST') }} between
        {{ partitions[0] }} and {{ partitions[1] }}
    {%- endset %}

General case: here's what I'm trying to square: [...]
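With an override like that, the two entries in the `partitions` config become the endpoints of the range. A minimal sketch of what the model config might then look like (passing timestamp expressions as the two `partitions` entries is an assumption that only makes sense with the overridden predicate; names and values are illustrative):

```sql
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'session_start', 'data_type': 'timestamp'},
    -- assumed: raw SQL expressions rendered into the overridden `between` predicate
    partitions = [
      "timestamp('2020-04-01 00:00:00', 'America/Los_Angeles')",
      "timestamp('2020-05-01 00:00:00', 'America/Los_Angeles')"
    ]
  )
}}

select ...
```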
To my mind, the more coherent and compelling resolution here would be to use a different incremental strategy, halfway between [...]. I'd be curious to hear what other BQers think! In the long run, I'm especially interested in building out a more extensible framework for users to write and use their own incremental strategies without needing to copy-paste override all the materialization code (#2366).
Yes, you are right about my desired approach, and thank you for the suggested override; that's what we are doing now to override the existing behavior. For the general case, I would like to suggest that there is a general need for time-range partition replacement.
@hui-zheng I think BigQuery may have beaten us both to the punch here: hourly partitioning is now in beta.
Indeed!
@jtcohen6 Hi, hope all is well. I just want to continue the discussion here. I don't think BigQuery hourly partitioning is the final solution: hourly partitioning is not ideal and should not be used for long-range historical data, such as data spanning years. The fundamental limitation of the existing insert_overwrite macro is that it assumes the result for insert_overwrite always contains complete days of data in UTC time. If the to-be-inserted result contains any partial-day data in it, the [...]
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.
@hui-zheng do you remember what you ended up doing in this situation? I'm in a similar situation 2 years later.
Describe the feature
problem
Currently, the BigQuery `insert_overwrite` strategy uses a list of partitions for partition replacement. In static mode, it requires a config `partitions = partitions_to_replace`; in dynamic mode, it calculates the list of partitions from the inserted temp table.
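For reference, a static insert_overwrite config today looks roughly like this (model, column, and date values are illustrative; see the dbt BigQuery docs for the exact shape):

```sql
{% set partitions_to_replace = ["'2020-05-01'", "'2020-05-02'"] %}

{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'session_start', 'data_type': 'date'},
    partitions = partitions_to_replace
  )
}}

select ...
```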
There is a limitation in some situations. A common `insert_overwrite` use case is to re-process historical data (e.g. due to a source data change or model logic change); however, the re-processing date range may not fit precisely into full UTC days. For example, I only want to process data from 2020-04-01 00:00:00 PST to 2020-05-01 00:00:00 PST, in the "America/Los_Angeles" timezone, not in UTC. I understand that partitions have to be at the day level, but I don't want to be limited to replacing data only at the day level. I would like the ability to specify a range between any two timestamps.
I also understand that when upserting 6 hours of data, BigQuery still scans the whole day partition. As far as optimizing on-demand cost (bytes scanned) goes, the day partition is the atomic unit.
However, `insert_overwrite` is about more than just cost optimization; it first needs to fulfill the business requirements. It's important that it is flexible enough to replace exactly the given time range of data that needs to be replaced, and not touch data outside that range. In my use case #1, that means I am fine with some over-scanning (scanning one day of data), but I only want to replace the last 2 hours of data; I don't want to touch or change data outside those 2 hours. It's an important business requirement.
proposed solution
Enhance static `insert_overwrite` to accept a timestamp-range-based `replacement_range`, in addition to a list of UTC dates; something like the sketch below, in static `insert_overwrite`.
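A rough illustration of what such a config could look like (the `replacement_range` key and its shape are hypothetical, since this is the feature being proposed, not an existing dbt option):

```sql
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'session_start', 'data_type': 'timestamp'},
    -- hypothetical: replace exactly this timestamp range instead of a list of UTC dates
    replacement_range = {
      'start': "timestamp('2020-04-01 00:00:00', 'America/Los_Angeles')",
      'end':   "timestamp('2020-05-01 00:00:00', 'America/Los_Angeles')"
    }
  )
}}

select ...
```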
Additional context
This is BigQuery-specific.
Who will this benefit?
It would be beneficial for users who use BigQuery and dbt, where dbt incremental models are executed based on a timestamp range provided by some external process, such as a scheduler.
appendix #1 Our full use case and context
We use a scheduler (Airflow/Prefect/Dagster) to run dbt incremental models with a specifically defined time range for every dbt run (not using the incremental-on-max(stamp) approach).
We run a lot of time-series data models. Every hour, our scheduler (Airflow/Prefect/Dagster) triggers a dbt run that performs an incremental update for data in the defined time range (i.e. the past 2 hours). We also often do re-processing and backfills on those models, where we do a dbt incremental run for data only in a defined time range (i.e. Feb 2020, or the year 2019). We want to be very precise about the data time range to update and want the caller to provide that information to the dbt run. The `> this.max(stamp)` approach is not sufficient for us.
Below are the requirements for these time-range-defined incremental models
Currently, dbt's incremental materialization does not support these requirements natively. For now, we have implemented many of these features using vars, in-model logic, and a customized get_merge_sql() (due to the limitation of the insert_overwrite feature described above).
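A minimal sketch of the kind of vars-plus-in-model-logic workaround described above (the var names, model name, and filter column are illustrative assumptions, not our exact code):

```sql
-- invoked by the scheduler, e.g.:
-- dbt run --models my_incr_model --vars '{"start_ts": "2020-04-01 00:00:00", "end_ts": "2020-05-01 00:00:00"}'

select *
from {{ ref('my_source_model') }}
{% if is_incremental() %}
-- only rebuild the time range supplied by the caller
where session_start >= timestamp('{{ var("start_ts") }}', 'America/Los_Angeles')
  and session_start <  timestamp('{{ var("end_ts") }}', 'America/Los_Angeles')
{% endif %}
```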