
Consider exposing config like dry_run_record_limit #1366

Open
aaronsteers opened this issue Jan 30, 2023 · 3 comments

aaronsteers commented Jan 30, 2023

We could theoretically expose a record limit in config for SDK-based taps. This would essentially be a more advanced implementation of the --test CLI flag.

However, it is important to properly disclaim the following caveats for any users who opt into this behavior:

  1. This mode will very likely break referential integrity expectations.
    • The 'first 100' records from the orders table, for instance, might not match all - or even any - of the customer IDs from the first 100 records of the customers table.
  2. This mode will break incremental sync expectations for many taps.
    • There's no guarantee that this will align with paging requirements of the upstream source, and for many taps, you would not be able to simply "run it again" to get the next 100 records per stream.
  3. The total number of records sent for child streams will be indeterminate, and could be unexpectedly more or less than the max.
    • If the first n parents don't have any children, then no records for the child stream would be synced. Example: I'm syncing with a max count of 10. My parent stream is Issues and my child stream is Issue comments. If my first 10 issues have no comments, then exactly zero Issue comments records will be synced, even though the max for that stream type was never met.
    • Conversely, if the 10 parent records each have exactly 5 child records (for instance), then the total synced for the child stream will be 50, since each pass through the child stream resets the counter. The total records synced could therefore be much more than the max.
    • Both of these behaviors are tolerable if communicated clearly to users. In most cases, users would not mind 50 records instead of 10 for child streams. And conversely, if child items are sparse, the user could increase the limit to something large enough to ensure at least one child item is sent.
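The counter-reset behavior in caveat 3 can be sketched with a small simulation. The function name and shape here are invented purely for illustration; this is not SDK code:

```python
# Hypothetical sketch: simulate a per-stream record limit whose child-stream
# counter resets each time the child stream is re-entered for a new parent.
def simulate_sync(parents, children_per_parent, limit):
    """Return (parent_count, child_count) synced under a naive per-pass limit."""
    parent_count = 0
    child_count = 0
    for _ in range(parents):
        if parent_count >= limit:
            break
        parent_count += 1
        # The child counter resets for every parent, so the child total can
        # exceed `limit` overall (or be zero if no parent has children).
        child_count += min(children_per_parent, limit)
    return parent_count, child_count

# 10 parents, each with 5 children, limit 10 -> 50 child records total.
print(simulate_sync(10, 5, 10))  # (10, 50)
# 10 parents with no children -> zero child records; the max is never met.
print(simulate_sync(10, 0, 10))  # (10, 0)
```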

For these reasons, I suggest we pay special attention to naming. I propose a dry_run_ prefix above, but there are other ways this could be communicated as well.

A similar approach would be something like dry_run: true to enable a developer-specified default (or an SDK default of something like 1000), possibly also allowing an integer in place of the boolean value, e.g. dry_run: 10 - with docs clearly explaining that true uses the built-in default and any other n overrides the limit at the user's request.
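A minimal sketch of how a tap might normalize such a bool-or-int dry_run setting. The helper name and the default of 1000 are assumptions drawn from this proposal, not actual SDK behavior:

```python
# Hypothetical helper (not part of the SDK): normalize a `dry_run` config
# value that may be a boolean or an integer record limit.
DEFAULT_DRY_RUN_LIMIT = 1000  # assumed built-in default from the proposal

def resolve_record_limit(dry_run):
    """Return the per-stream record limit, or None for a normal full sync."""
    if dry_run is True:
        return DEFAULT_DRY_RUN_LIMIT  # `dry_run: true` -> built-in default
    if dry_run is False or dry_run is None:
        return None                   # feature disabled
    return int(dry_run)               # `dry_run: 10` -> user override

print(resolve_record_limit(True))   # 1000
print(resolve_record_limit(10))     # 10
print(resolve_record_limit(False))  # None
```

Checking `is True` before the integer branch matters in Python, since `bool` is a subclass of `int` and `dry_run: true` would otherwise be read as a limit of 1.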

Alternative using the CLI

An alternative using just the CLI would be to accept this integer as an optional input to --test. So, --test today tries to sync zero or one records per stream, whereas if --test=N were specified, we'd try to sync (at least) N records per stream.
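The optional-value flag pattern can be sketched with stdlib argparse (for illustration only; the SDK's actual CLI wiring may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# `--test` with no value falls back to `const`; `--test 10` or `--test=10`
# overrides it with a user-supplied per-stream record limit.
parser.add_argument(
    "--test", nargs="?", type=int, const=1, default=None,
    metavar="N", help="sync up to N records per stream (default 1)",
)

print(parser.parse_args([]).test)                # None  (normal sync)
print(parser.parse_args(["--test"]).test)        # 1
print(parser.parse_args(["--test", "10"]).test)  # 10
```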

cc @kgpayne, @edgarrmondragon

Related:


pnadolny13 commented Feb 7, 2023

@aaronsteers you make good points here, but is this different from #1333? It sounds like a proposal for implementing that feature. I'll put my thoughts here, but it might fragment our discussion if this lives in two issues.

Around naming: when I hear dry_run, to me it means that nothing actually runs, like a compile step, whereas this dry run would actually sync real records, just not as many of them. I like the idea of including test in the naming, signaling that it shouldn't be used in production to sync small batches.

if the first n parents don't have any children, then no records for the child stream would be synced. Example: I'm syncing with max count of 10. My parent stream is Issues and my child stream is Issue comments. If my first 10 issues do not have comments, then exactly zero Issue comments records will be synced, even though the max for that stream type was never met.

It seems like there could be 2 different implementations that we can either choose from or implement both and let the user choose what's best for their use case:

  • test_stream_max_records - stop syncing after max records are synced. Meaning child streams might end with zero records because the parent reached its maximum and ended before the child got enough records.
  • test_stream_min_records - stop syncing after a minimum threshold. Meaning a parent stream could have far more than the minimum before the child reaches its stopping point i.e. 1000 slack messages (parent) and 100 thread (child) messages.
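The difference between the two proposed stopping rules could be sketched like so (the function and mode names are invented for this example):

```python
# Illustrative sketch of the two proposed stopping rules for a parent
# stream and its child stream. Each entry in `parent_child_counts` is the
# number of child records under one parent record.
def sync_with_limit(parent_child_counts, limit, mode):
    """mode="max": hard-stop once the parent stream hits `limit`.
    mode="min": keep syncing parents until the child also has `limit`."""
    parents = children = 0
    for n_children in parent_child_counts:
        if mode == "max" and parents >= limit:
            break
        if mode == "min" and parents >= limit and children >= limit:
            break
        parents += 1
        children += n_children
    return parents, children

# 1000 messages (parents); every 10th message has one thread reply (child).
counts = [1 if i % 10 == 0 else 0 for i in range(1000)]
print(sync_with_limit(counts, 100, "max"))  # (100, 10): child under-delivers
print(sync_with_limit(counts, 100, "min"))  # (991, 100): parent over-delivers
```

Under "max" the child stream ends with only 10 records; under "min" the tap extracts nearly ten times the limit from the parent stream before the child reaches its threshold, matching the Slack messages/threads scenario above.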

I guess my initial thought of how this would work is closer to test_stream_min_records, where it would keep running until it reaches the minimum required records. So the Slack tap might find 100 messages before it finds 100 thread messages, leading it to continue extracting 1000 messages before it gets the minimum 100 thread messages and can stop. I wondered if it would be better to not emit those extra 900 records, or instead make it a min_record_count. In this case there's still the issue that potentially all records need to be extracted from the source (regardless of whether they are emitted) before the minimum threshold is reached, or it's never reached. Having stream-level configs might make this easier if we could say "for this child stream the minimum is 5 records and everything else is 100", but ultimately it's the user's job to use this properly.

My initial idea for this in #1333 was for testing, where I want it to stop after X records because my start date isn't sufficient to limit the size of my dataset. Using test_stream_min_records while keeping my start_date would probably solve it. Worst case, I have to sync the full day of data (status quo, not a big deal); best case, I end after a minute of syncing because I have enough data. I know this might not fit everyone's use cases though.

What do you think? I'm not married to these ideas but wanted to share my thoughts in case it's helpful context.


stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

stale bot added the stale label Jul 18, 2023

pnadolny13 commented
Still relevant
