-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reset on connectors do not reset the SCD tables #5417
Comments
Thinking this might occur in regular connectors as well but not sure. |
Addditionally, seeing all the |
@danieldiamond this is happening because of how we are resetting jobs; the mechanism was never updated to look for the SCD tables. What is the impact of this? This is pretty low on our end - definitely open to moving it up if this is causing you pain. |
This is not critical at all. Also am happy to help try solve this one if you can point me to the right file(s) |
@davinchia actually running into issues which would be a massive concern to users and makes me think this should be refactored to critical. When resetting a connector, the SCD tables dont get deleted, which means for Note: If you have an extremely large connector and need to go through the pain of resetting it e.g. when adding new tables or modifying underlying schema. To go through a reset and full sync, only to find that the SCD now contains previous and current data (duplicated), this would be particularly unfortunate. |
@sherifnada @davinchia if either of you point me to the related codebase (mechanism where this is missed) i'm happy to take a look and try to submit a PR. imagine this isn't a very heavy task |
@ChristopheDuong is this something you can point to? |
I believe reset uses the same code path regardless of CDC, although I believe SCD is a result of the incremental dedup function. Are you using that Daniel? This would be here by setting an empty source. We could always check for these tables and try to overwrite them but that isn't too clear. Since these are generated by DBT, maybe we add a DBT operation in reset to do so? WDYT Chris? |
yeah im using incremental dedup. i also wonder if there are any implications with the normalization revamp going on now |
Yes, additional logic should be added in normalization when reset are triggered The implementation strategy as described in the comments from: Line 79 in 80e0a8c
Do we want normalization to always delete |
@oustynova This looks like a pretty pressing issue that was reported again recently. Can we have someone from the team investigate this and determine the root cause? Thanks! |
FYI @ChristopheDuong and team, this is a serious issue for database replication syncs using CDC. when airbyte loses the positiion of the binlog (when airbyte dies for a while for example), it needs to start again and sync the entire table. however in the SCD table it incorrectly labels an active row. this is a bug. and means the DB replication is selecting the wrong record as the active record. i'm 95% sure that any airbyte users using CDC on db replication are experiencing this issue and are not aware of it |
This issue was this long ago? I recently discovered this for Snowflake destination, for the latest version of 0.39.37-alpha, normalization docker: airbyte/normalization-snowflake:0.2.8. |
yeah TBH I'm not sure why this isn't resolved yet and can only assume it's because not enough airbyte users actually use incremental + dedupe. |
cc @grishick in case the team wasn't aware of this issue |
Zendesk ticket #1527 has been linked to this issue. |
Comment made from Zendesk by Marcos Marx on 2022-08-01 at 17:37:
|
Related support tickets that need this feature for a feasible workaround: https://github.com/airbytehq/oncall/issues/587, https://github.com/airbytehq/oncall/issues/581, https://github.com/airbytehq/oncall/issues/580, https://github.com/airbytehq/oncall/issues/574 |
Notes from grooming: |
Hey team! Please add your planning poker estimate with Zenhub @edgao @rodireich @ryankfu @subodh1810 @tuliren @akashkulk |
@rodireich might you be able to look into this in the current sprint? |
definitely. Assigned it to myself to take a look |
@rodireich how did you go with this? |
i honestly cant understand why more users aren't sharing this exact concern. either it's a deterrent from anyone wishing to use incremental+dedupe or users just are not aware |
Something that came out of data analytics office hours : |
This issue is over a year old. Can I ask what complexity is involved with working on a solution here? Is it not simply a matter of automatically deleting the SCD table in addition to the _AIRBYTE_RAW and final tables? @sherifnada @marcosmarxm @ChristopheDuong can I also confirm that the impact of this issue is clearly understood? Reviewing the comments from users above - IMO this should be top priority given it's quite literally bad data & incorrect final tables - a clear malfunctioning of any incremental + dedupe DB connectors where reset is involved (particularly as they're starting to be classified as GA). |
@danieldiamond yes the impact of this is understood. I believe this is in the current sprint for the Destination's team. The complexity here is Airbyte doesn't actually have a direct way to reset tables. We currently do so via simulating an empty source, and relying on downstream behaviour to reset the tables i.e. empty source -> empty raw tables -> empty final tables. Unfortunately, not as simple as adding the scd tables to the wipe list. We discussed internally and the long term solution is to add a new 'reset' operation to Destinations (see https://github.com/airbytehq/airbyte-internal-issues/issues/999). The short term solution is to implement this as part of normalisation. However it's not terribly straight-forward since the append+dedup implementations aren't that simple to begin with. All in all, it's on team's backlog and work should start in the next few weeks. @grishick is the manager and can provide more context! |
@davinchia thanks so much for the reply and for that context. Really appreciated. |
Enviroment
Current Behavior
Reset job in CDC does not reset SCD table e.g.
mytable
is empty,mytable_scd
is not empty,_airbyte_raw_mytable
is empty.Expected Behavior
All tables associated with connector should be reset
Logs
If applicable, please upload the logs from the failing operation.
For sync jobs, you can download the full logs from the UI by going to the sync attempt page and
clicking the download logs button at the top right of the logs display window.
LOG
Steps to Reproduce
Are you willing to submit a PR?
Unfortunately not at this time
┆Issue is synchronized with this Asana task by Unito
The text was updated successfully, but these errors were encountered: