Add dead-letter queue functionality when contract mode == discard_row
#1980
Labels
question
Feature description
Currently, when the data_mode discard_row is selected, either the row's index is simply deleted or, in the case of normalization, None is returned for an offending row or column. See for example here: https://github.com/dlt-hub/dlt/blob/devel/dlt/normalize/items_normalizers.py#L77
dlt already performs a data contract check as part of the compute, so simply discarding the results is quite wasteful.
Such offending records or outliers are usually of high interest to pipeline creators, since they give quick and actionable insights into new business logic being implemented in the source.
Instead of silently sorting offenders out, dlt could collect them in a separate data structure and expose them to users. That would be a powerful feature, extremely valuable for building pipeline and data logic based on offending rows: the equivalent of a dead-letter queue.
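To make the idea concrete, here is a minimal sketch of what "collect instead of discard" could look like. It is plain Python and does not use dlt's internals: DeadLetter, ContractResult, and apply_discard_row are hypothetical names invented for illustration, and the "contract" is reduced to a set of allowed column names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Set

Row = Dict[str, Any]


@dataclass
class DeadLetter:
    """One rejected row plus the reason it was rejected (hypothetical type)."""
    row: Row
    reason: str


@dataclass
class ContractResult:
    """Rows that passed the check and rows that would otherwise be dropped."""
    accepted: List[Row] = field(default_factory=list)
    dead_letters: List[DeadLetter] = field(default_factory=list)


def apply_discard_row(rows: Iterable[Row], allowed_columns: Set[str]) -> ContractResult:
    """Keep rows that only use known columns; collect the rest instead of dropping them.

    Assumption: a row "offends" when it carries a column outside allowed_columns.
    """
    result = ContractResult()
    for row in rows:
        unknown = sorted(set(row) - allowed_columns)
        if unknown:
            reason = "unknown columns: " + ", ".join(unknown)
            result.dead_letters.append(DeadLetter(row=row, reason=reason))
        else:
            result.accepted.append(row)
    return result


rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b", "discount_code": "X1"},  # new column from the source
]
result = apply_discard_row(rows, allowed_columns={"id", "name"})
for letter in result.dead_letters:
    print(letter.reason, letter.row)
```

The accepted rows continue through the pipeline as before, while result.dead_letters could be inspected in memory or written to a storage location of the user's choice.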
Are you a dlt user?
Yes, I use it for fun.
Use case
Quickly understand the data quality and integrity of a source integrated via dlt.
This would allow me to quickly go back to source stakeholders with actionable findings and remedy inconsistencies.
It would also greatly reduce data pipeline complexity: getting the benefit of a DLQ has so far typically required some sort of streaming solution, which needlessly duplicates many features dlt already has (schema contracts/versioning, data piping).
Proposed solution
Instead of returning None and simply deleting data objects in memory, I, as the data engineer, can choose either to keep the offending rows caught by discard_row in an in-memory data structure for further processing or to push them to a storage location.
Related issues
No response