Add dead-letter queue functionality when contract mode == discard_row #1980

Open
thenaturalist opened this issue Oct 23, 2024 · 0 comments


Feature description

Currently, when the contract mode discard_row is selected, an offending row is either simply deleted by its index or, during normalization, None is returned for the offending row or column.

See for example here: https://github.com/dlt-hub/dlt/blob/devel/dlt/normalize/items_normalizers.py#L77

dlt already performs the data contract check as part of its compute, so simply discarding the results is wasteful.

Such offending records or outliers are usually of high interest to pipeline creators, because they give quick, actionable insight into new business logic being implemented in the source.

Instead of silently dropping offenders, dlt could collect them in a separate data structure and expose them to users. That would provide the equivalent of a dead-letter queue (DLQ), a powerful feature that would be extremely valuable for building pipeline and data logic based on offending rows.

Are you a dlt user?

Yes, I use it for fun.

Use case

Quickly understand the data quality and integrity of a source integrated via dlt.
This would let me go back to source stakeholders quickly, with actionable findings, and remedy inconsistencies.

It would also greatly reduce pipeline complexity: getting the benefit of a DLQ has so far typically required some sort of streaming solution, which needlessly duplicates many features dlt already has (schema contracts and versioning, data piping).

Proposed solution

Instead of returning None and simply deleting data objects in memory, dlt would let me, as the data engineer, choose between keeping the offending rows caught by discard_row in an in-memory data structure for further processing and pushing them to a storage location.
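
To make the idea concrete, here is a minimal sketch in plain Python. It does not use dlt's internals, and every name in it (DeadLetterQueue, check_contract, filter_rows) is hypothetical rather than an existing dlt API. The contract check is also a stand-in: it only rejects rows with unexpected columns. The sketch shows offending rows going either into an in-memory list or, if a sink is supplied, to a storage target as JSON lines.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, List, Optional, Set, TextIO

Row = Dict[str, Any]


@dataclass
class DeadLetterQueue:
    """Hypothetical collector for rejected rows (not part of dlt)."""

    # Optional storage target, e.g. an open file; if None, rows stay in memory.
    sink: Optional[TextIO] = None
    rows: List[Dict[str, Any]] = field(default_factory=list)

    def put(self, row: Row, reason: str) -> None:
        record = {"row": row, "reason": reason}
        if self.sink is not None:
            # Push to storage as one JSON document per line.
            self.sink.write(json.dumps(record) + "\n")
        else:
            self.rows.append(record)


def check_contract(row: Row, allowed_columns: Set[str]) -> Optional[str]:
    """Stand-in contract check: return a reason if the row offends, else None."""
    extra = sorted(set(row) - allowed_columns)
    return f"unexpected columns: {extra}" if extra else None


def filter_rows(
    rows: Iterable[Row], allowed_columns: Set[str], dlq: DeadLetterQueue
) -> Iterator[Row]:
    """Yield conforming rows; route offenders to the DLQ instead of dropping them."""
    for row in rows:
        reason = check_contract(row, allowed_columns)
        if reason is None:
            yield row
        else:
            dlq.put(row, reason)


# Usage: the second row carries a column the contract does not allow.
dlq = DeadLetterQueue()
good = list(filter_rows([{"id": 1}, {"id": 2, "surprise": "x"}], {"id"}, dlq))
```

After this runs, good holds only the conforming row, while dlq.rows holds the rejected row together with the reason it was rejected, which is the information that is currently lost.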

Related issues

No response

@thenaturalist thenaturalist changed the title from "Add dead-letter queue functunality when contract mode == discard_row" to "Add dead-letter queue functionality when contract mode == discard_row" Oct 23, 2024
@rudolfix rudolfix added the question Further information is requested label Oct 28, 2024
@rudolfix rudolfix self-assigned this Oct 28, 2024
Projects
Status: Planned

2 participants