
Consider exposing config like dry_run_record_limit #1366

Open
aaronsteers opened this issue Jan 30, 2023 · 3 comments

aaronsteers commented Jan 30, 2023

We could theoretically expose a record limit in config for SDK-based taps. This would essentially be a more advanced implementation of the --test CLI flag.

However, it is important to properly disclaim the following caveats for any users who opt into this behavior:

  1. This mode will very likely break referential integrity expectations.
    • The 'first 100' records from the orders table, for instance, might not match all - or even any - of the customer IDs from the first 100 records of the customers table.
  2. This mode will break incremental sync expectations for many taps.
    • There's no guarantee that this will align with paging requirements of the upstream source, and for many taps, you would not be able to simply "run it again" to get the next 100 records per stream.
  3. The total number of records sent for child streams will be indeterminate, and could be unexpectedly more or less than the max.
    • If the first n parents don't have any children, then no records for the child stream would be synced. Example: I'm syncing with a max count of 10. My parent stream is Issues and my child stream is Issue comments. If my first 10 issues have no comments, then exactly zero Issue comments records will be synced, even though the max for that stream type was never met.
    • Conversely, if the 10 parent records each have exactly 5 child records (for instance), then the total synced for the child stream will be 50, since each pass through the child stream resets the counter. The total records synced could therefore be much more than the max.
    • Both of these behaviors are tolerable if communicated clearly to users. In most cases, users would not mind 50 records instead of 10 for child streams. And conversely, if child items are sparse, the user could increase the limit to something large enough to ensure at least one child item is sent.
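The counter-reset behavior in caveat 3 can be sketched with a small simulation. The function name and shape here are invented purely for illustration; this is not SDK code:

```python
# Hypothetical sketch: simulate a per-stream record limit whose child-stream
# counter resets each time the child stream is re-entered for a new parent.
def simulate_sync(parents, children_per_parent, limit):
    """Return (parent_count, child_count) synced under a naive per-pass limit."""
    parent_count = 0
    child_count = 0
    for _ in range(parents):
        if parent_count >= limit:
            break
        parent_count += 1
        # The child counter resets for every parent, so the child total can
        # exceed `limit` overall (or be zero if no parent has children).
        child_count += min(children_per_parent, limit)
    return parent_count, child_count

# 10 parents, each with 5 children, limit 10 -> 50 child records total.
print(simulate_sync(10, 5, 10))  # (10, 50)
# 10 parents with no children -> zero child records; the max is never met.
print(simulate_sync(10, 0, 10))  # (10, 0)
```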

For these reasons, I suggest we pay special attention to naming. I propose a dry_run_ prefix above, but there are other ways this could be communicated as well.

A similar approach would be something like dry_run: true to enable a developer-specified default (or an SDK default of something like 1000), possibly also allowing an integer in place of the boolean value, e.g. dry_run: 10 - with docs clearly explaining that true uses the built-in default and any other n overrides the limit at the user's request.
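A minimal sketch of how a tap might normalize such a bool-or-int dry_run setting. The helper name and the default of 1000 are assumptions drawn from this proposal, not actual SDK behavior:

```python
# Hypothetical helper (not part of the SDK): normalize a `dry_run` config
# value that may be a boolean or an integer record limit.
DEFAULT_DRY_RUN_LIMIT = 1000  # assumed built-in default from the proposal

def resolve_record_limit(dry_run):
    """Return the per-stream record limit, or None for a normal full sync."""
    if dry_run is True:
        return DEFAULT_DRY_RUN_LIMIT  # `dry_run: true` -> built-in default
    if dry_run is False or dry_run is None:
        return None                   # feature disabled
    return int(dry_run)               # `dry_run: 10` -> user override

print(resolve_record_limit(True))   # 1000
print(resolve_record_limit(10))     # 10
print(resolve_record_limit(False))  # None
```

Checking `is True` before the integer branch matters in Python, since `bool` is a subclass of `int` and `dry_run: true` would otherwise be read as a limit of 1.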

Alternative using the CLI

An alternative using just the CLI would be to accept this integer as an optional input to --test. So, --test today tries to sync zero or one records per stream, whereas if --test=N were specified, we'd try to sync (at least) N records per stream.
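The optional-value flag pattern can be sketched with stdlib argparse (for illustration only; the SDK's actual CLI wiring may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# `--test` with no value falls back to `const`; `--test 10` or `--test=10`
# overrides it with a user-supplied per-stream record limit.
parser.add_argument(
    "--test", nargs="?", type=int, const=1, default=None,
    metavar="N", help="sync up to N records per stream (default 1)",
)

print(parser.parse_args([]).test)                # None  (normal sync)
print(parser.parse_args(["--test"]).test)        # 1
print(parser.parse_args(["--test", "10"]).test)  # 10
```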

cc @kgpayne, @edgarrmondragon

Related:


pnadolny13 commented Feb 7, 2023

@aaronsteers you make good points here, but is this different from #1333? It sounds like a proposal for implementing that feature. I'll put my thoughts here, but it might fragment our discussion if this lives in two issues.

Around naming: when I hear dry_run, to me it means that nothing actually runs, like a compile step, whereas this dry run would actually sync real records, just not as many of them. I like the idea of including test in the naming, signaling that it shouldn't be used in production to sync small batches.

if the first n parents don't have any children, then no records for the child stream would be synced. Example: I'm syncing with max count of 10. My parent stream is Issues and my child stream is Issue comments. If my first 10 issues do not have comments, then exactly zero Issue comments records will be synced, even though the max for that stream type was never met.

It seems like there could be 2 different implementations that we can either choose from or implement both and let the user choose what's best for their use case:

  • test_stream_max_records - stop syncing after max records are synced. Meaning child streams might end with zero records because the parent reached its maximum and ended before the child got enough records.
  • test_stream_min_records - stop syncing after a minimum threshold. Meaning a parent stream could have far more than the minimum before the child reaches its stopping point i.e. 1000 slack messages (parent) and 100 thread (child) messages.
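The difference between the two proposed stopping rules could be sketched like so (the function and mode names are invented for this example):

```python
# Illustrative sketch of the two proposed stopping rules for a parent
# stream and its child stream. Each entry in `parent_child_counts` is the
# number of child records under one parent record.
def sync_with_limit(parent_child_counts, limit, mode):
    """mode="max": hard-stop once the parent stream hits `limit`.
    mode="min": keep syncing parents until the child also has `limit`."""
    parents = children = 0
    for n_children in parent_child_counts:
        if mode == "max" and parents >= limit:
            break
        if mode == "min" and parents >= limit and children >= limit:
            break
        parents += 1
        children += n_children
    return parents, children

# 1000 messages (parents); every 10th message has one thread reply (child).
counts = [1 if i % 10 == 0 else 0 for i in range(1000)]
print(sync_with_limit(counts, 100, "max"))  # (100, 10): child under-delivers
print(sync_with_limit(counts, 100, "min"))  # (991, 100): parent over-delivers
```

Under "max" the child stream ends with only 10 records; under "min" the tap extracts nearly ten times the limit from the parent stream before the child reaches its threshold, matching the Slack messages/threads scenario above.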

I guess my initial thought of how this would work is closer to test_stream_min_records, where it would keep running until it reaches the minimum required records. So the Slack tap might find 100 messages before it finds 100 thread messages, leading it to continue extracting 1000 messages before it gets the minimum 100 thread messages and can stop. I wondered if it would be better to not emit those extra 900 records, or instead make it a min_record_count. In this case there's still the issue that potentially all records need to be extracted from the source (regardless of whether they are emitted) before the minimum threshold is reached, or it's never reached. Having stream-level configs might make this easier if we could say "for this child stream the minimum is 5 records and everything else is 100", but ultimately it's the user's job to use this properly.

My initial idea for this in #1333 was for testing, where I want it to stop after X records because my start date isn't sufficient to limit the size of my dataset. Using test_stream_min_records while keeping my start_date would probably solve it. Worst case, I have to sync the full day of data (status quo, not a big deal); best case, I end after a minute of syncing because I have enough data. I know this might not fit everyone's use cases though.

What do you think? I'm not married to these ideas but wanted to share my thoughts in case it's helpful context.


stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

stale bot added the stale label Jul 18, 2023

pnadolny13 commented
Still relevant
