snapshots should handle hard deletes #249
Comments
this might need to be a completely separate run through the table, which would eat up additional resources. if that turns out to be true, we should allow people to opt out of this in configuration if they know the source data never performs deletes.
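A rough sketch of what such an opt-out could look like in a snapshot block. The `check_hard_deletes` flag is hypothetical and not an existing dbt config; source and column names are illustrative:

```sql
{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at',
      -- hypothetical flag: skip the extra pass over the source table
      -- when the source is known to never hard-delete rows
      check_hard_deletes=false
    )
}}

select * from {{ source('erp', 'orders') }}

{% endsnapshot %}
```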
let me have a think about this. definitely a good idea
Just a note that this is a big problem for us, for some of our tables, but the issue isn't likely to be solvable by dbt alone. To get fully reliable data, we'll almost certainly have to solve the problem upstream, either by converting the table to soft deletes, or adding an audit table of deleted rows so we can do an anti-join later.
@jthandy is this something you're still interested in? My instinct is that there are many different ways for hard deletes to be represented in the warehouse. Depending on your replication, you might see:
Lmk if you have a different idea of how this would work
@drewbanin it wasn't super-clear in my original issue writeup, but I meant hard deletes in the table on the warehouse side. this can occur for potentially two reasons:
I don't think this is an issue with burning criticality, but I have personally run into this a couple of times in client work. if we're touching archive right now we should consider this for inclusion. I won't be offended if we decide not to implement it, but if we don't implement it then we should probably #wontfix it.
A regular archival job coupled with a join back to the source table will be able to detect hard deletes that happen on the warehouse side. Might be worth revisiting this after archival blocks are fully implemented. The new archival materialization will work with pluggable "strategies", and this could very well be one of them
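A minimal sketch of the join-back check being described, assuming dbt's standard `dbt_valid_to` snapshot column; table and column names are illustrative:

```sql
-- Find snapshot records that are still "current" but whose key no
-- longer exists in the source table, i.e. candidates for recording
-- a hard-delete state change.
select snap.id
from analytics.orders_snapshot as snap
left join source_db.orders as src
    on snap.id = src.id
where snap.dbt_valid_to is null   -- still open in the snapshot
  and src.id is null              -- but gone from the source
```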
Hi! Pitching in from a question I made in the #support slack channel https://getdbt.slack.com/archives/C2JRRQDTL/p1563527481312500

I believe this feature would add great value in certain use cases. I've been involved in multiple cases where this would solve a common problem (and also make the case for using dbt stronger, since the platform would handle this instead of the Extract processes). I've built many analytic platforms on top of ERP systems (often industry-specific niche ERPs). A common process nowadays is to extract/replicate the source systems on a daily basis and build downstream analytic models from those replicas (perfect use cases for dbt...). Often there is a need for history tracking, including tracking hard deletes of objects in the source systems (orders, customers, actions), since the source systems more often than not do not track history.

The data volumes I've come across in these cases are small to moderate. For tracking hard deletes you'll have to replicate the entire source table and compare it to what is already loaded, so it might get tough with huge amounts of data. However, in these cases, full extracts/replicas are made anyway.

So I'll definitely vote for further investigation of this issue :) Although I agree with jthandy above that it might not be a burning issue
Thanks for following up @alangsbo! I'm imagining this query taking some sort of time range as an argument. If a given record was last loaded more than X hours/days before the most recent loaded_at date, then dbt can consider it "deleted". Is that approximately what you had in mind?
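A rough sketch of that staleness-based check, assuming an illustrative `loaded_at` column and a hypothetical 3-day threshold:

```sql
-- Treat a record as deleted if it has not appeared in a load for
-- more than 3 days relative to the most recent load.
-- Table, column, and threshold are illustrative.
with latest_load as (
    select max(loaded_at) as max_loaded_at
    from source_db.orders
)
select o.id
from source_db.orders as o
cross join latest_load
where o.loaded_at < latest_load.max_loaded_at - interval '3 days'
```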
I was thinking about a simple join back to the land/replicated table from the source system, assuming that table gets dropped and reloaded every day, so that deleted rows are simply missing. If the backwards join on the specified key columns does not yield any rows, the record is deleted. I'm guessing your suggestion assumes some kind of persistence of rows in the land area; that would also be a good solution, where the history would be kept in the land layer too. So maybe both? :) Br/A
Ah! Ok - most of the ETL tools we use here will upsert new records into source tables. I didn't think about the drop + reload approach. I think that a join from the snapshot table back to the source table is a good approach here.
I had cause to play around with some approaches this evening. There's a world where this is a supported snapshot strategy
@drewbanin I had a look at that world, it looks good. I have one tool that subscribes to a postgres WAL replication slot - equivalent to MySQL binlog - and loads to Redshift replicating all transactions using a delete + insert strategy. It does that because upserts are not a real thing in RS and behind the scenes an update involves marking pages dirty, inserting new pages and waiting for vacuum to clean up the mess, so we cut to the chase. So yeah we do exactly what Tristan described here:
I personally think that soft deletes are the best option all round, but failing that, what those other ETL providers do is very sensible, i.e. construct a soft-deleted flag or ideally a deleted-at timestamp.

As an absolute last resort, the left join from snapshot to source works fine, and the strategy does work for both fully reloaded tables as well as propagated deletes. I think a worse position to be in, and where this can't help you, is to have a naive ETL tool that can't replicate hard deletes because they can't be detected using watermarks. The only solutions to that are soft deletes, full table reloads, or switching to a transaction log shipping tool.

The other challenge with deletes, both hard and soft, is propagating them up a chain of incremental dbt models that aggregate above the deleted-row granularity. For example, if you incrementally aggregate to a day level, and timestamp rows are subsequently deleted, how do you flush the deleted contribution out of those aggregates? Naively it implies periodically running a full refresh.
We are definitely running into this scenario. We do regular full reloads from our source systems into our warehouse. Our source systems either don't support tracking history on source data, or for whatever reason the feeds that we are getting are just querying the current state of various entities rather than tracking history. They are SaaS platforms which we are querying via APIs -- we don't have access to anything like SQL querying abilities or the underlying transaction log monitoring/output. The volumes of data aren't very large - usually somewhere between a thousand and tens of thousands of rows - so it's entirely practical to do a trunc-and-load of the entities from the various source systems at least daily.

We are creating the tables in our warehouse (BigQuery in this case), then configuring them as sources in dbt, then snapshotting the sources. But it would be great to have a comparison against the source data so that if a row disappears, the current "valid" record in the snapshot gets end-dated to support the hard delete. We don't have soft deletes from our current source systems, but I can also see the need for differing strategies to handle a delete in the snapshot: detecting a row that has disappeared versus looking for a soft-delete flag. Probably also a strategy that doesn't handle deletions per se. So maybe these options?
In the snapshot itself for any of these options, of course the "deletes" should always be soft. We would never want to actually hard-delete a record from the snapshot per se.
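A rough sketch of end-dating the current snapshot record for a row that has vanished from a reloaded source; `dbt_valid_to` is dbt's standard snapshot column, while the table and key names are illustrative:

```sql
-- Close out "current" snapshot records whose key no longer exists
-- in the freshly reloaded source table (a soft delete in the snapshot).
update analytics.customers_snapshot as snap
set dbt_valid_to = current_timestamp
where snap.dbt_valid_to is null
  and not exists (
      select 1
      from source_db.customers as src
      where src.id = snap.id
  )
```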
If you can choose, soft deletes are of course preferable. My client works in a very regulated business. In our use case we are not allowed to store data of "unsubscribed" users. We would very much like the hard-delete snapshots, where a delete in the source wipes all history of a user. I guess the use cases are all different, which makes this harder to implement.
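A minimal sketch of the "wipe all history" variant described here, assuming an illustrative `users` source and snapshot table:

```sql
-- Remove the entire snapshot history for users that have been
-- hard-deleted from the source (e.g. for regulatory reasons).
-- Table and column names are illustrative.
delete from analytics.users_snapshot as snap
where not exists (
    select 1
    from source_db.users as src
    where src.id = snap.id
)
```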
We've been running into this same scenario at GitLab too. We have custom ETL dumping postgres data into snowflake, but the application does constant hard deletes. My current workaround is to keep all of the historical ETL data (3 runs per day) and parse through all of the historical runs in dbt to figure out when a record was deleted. This is not very scalable as the tables are getting very large. It'd be amazing if snapshots could support this. What @jrandrews describes in option 1 is what I'd be interested in. If a record is included in Airflow jobs A, B, C but not D, I would want the snapshot to mark that record as deleted.
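A rough sketch of the "parse the historical runs" workaround, assuming each ETL run lands rows stamped with an illustrative `_loaded_at` column in an accumulated history table:

```sql
-- For each record key, find the last run in which it appeared; if
-- that run is older than the latest run overall, treat the record
-- as deleted around that time. Names are illustrative.
with last_seen as (
    select
        id,
        max(_loaded_at) as last_seen_at
    from raw.app_db.customers_history
    group by id
),
latest_run as (
    select max(_loaded_at) as max_loaded_at
    from raw.app_db.customers_history
)
select
    last_seen.id,
    last_seen.last_seen_at as deleted_at_approx
from last_seen
cross join latest_run
where last_seen.last_seen_at < latest_run.max_loaded_at
```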
@CaptainEli I'd love to hear more about what you do with custom ETL from postgres to snowflake. This is the wrong forum though, can I pm you? I'm currently migrating our (similar) custom tool from targeting Redshift to targeting Snowflake.

The more I think about it, the less I think dbt should be responsible for this. Further to my comments on propagating deletes through a dependency chain of aggregated tables, I think for historical aggregates it is both unnecessary and wrong. For example, if you have users/customer records that you are obliged to delete for PII reasons, the fact that the records are now gone doesn't mean they weren't valid records at the time the aggregates were calculated. Better than hard deleting is scrubbing the PII, if reproducing old aggregate calculations is important. This is not a new problem in the sphere of data warehousing; slowly-changing dimensions is an old Kimball idea, and data-vault methodology handles this very well.
@davehowell yea for sure or you can email me at [email protected]. I didn't build it, but the ETL code lives here https://gitlab.com/gitlab-data/analytics/-/tree/master/extract/postgres_pipeline
I came across source system hard deletes with one of my clients and I am currently in the process of manually checking for and adding rows for the deleted source rows. Having dbt be able to handle this would definitely be an improvement for me in this use case.
Currently, `dbt archive` stores a series of state changes for records in covered tables, but it does not store a state change for deleted records. Hard deletes in an OLTP database will likely be one of the primary use cases for archival, and we therefore want to make sure that we're capturing record deletion as a state change.