Add paradigm for stream filter constraints and max_records constraints #1119
Comments
@edgarrmondragon and @kgpayne - I updated the above with a possible Would love to get your thoughts.
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the
Still relevant
Currently, developers are left to inspect `self.replication_key`, `self.get_starting_replication_key_value()`, and `self.get_starting_timestamp()` themselves, within `get_records()` or `get_batches()`. This isn't great for a number of reasons.
To make the expected filters much more explicit, this proposal suggests that we pass something like a `StreamFilter` object to methods like `get_records()` and `get_batches()`, and perhaps also to methods like `get_url_params()`, which may be able to pass down queries to the API call.
`end_date` support in generic tap config #922, which in turn unlocks parallel processing of time partitions, as noted above.
`filtered_records = filters.apply(unfiltered_records)` and `include: bool = filters.eval(record_dict)`.
`StreamFilter.filters` set, and handle each filter in a custom manner if needed. (Such as sending eligible constraints as filters to the remote API.)
Pseudocode
Some possible pseudocode to get a feel for how this might look:
Details
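A minimal sketch of how such a filter object might look in Python. Only `StreamFilter`, its `filters` attribute, `apply()`, and `eval()` are named in the proposal; `FilterConstraint`, the operator table, the `max_records` attribute name, and the sample records are assumptions made for illustration and are not part of the SDK.

```python
from __future__ import annotations

import operator
from dataclasses import dataclass, field
from itertools import islice
from typing import Any, Callable, Iterable, Iterator

# Hypothetical operator table: maps a constraint's operator string to a function.
_OPERATORS: dict[str, Callable[[Any, Any], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


@dataclass(frozen=True)
class FilterConstraint:
    """One comparison against a single record property (assumed shape)."""

    property_name: str
    op: str
    value: Any

    def eval(self, record: dict) -> bool:
        # Records missing the property are excluded rather than raising.
        if self.property_name not in record:
            return False
        return _OPERATORS[self.op](record[self.property_name], self.value)


@dataclass
class StreamFilter:
    """A set of constraints plus an optional cap on the number of records."""

    filters: list[FilterConstraint] = field(default_factory=list)
    max_records: int | None = None

    def eval(self, record: dict) -> bool:
        """Return True if the record satisfies every constraint."""
        return all(f.eval(record) for f in self.filters)

    def apply(self, records: Iterable[dict]) -> Iterator[dict]:
        """Lazily yield matching records, stopping at max_records if set."""
        matching = (r for r in records if self.eval(r))
        return islice(matching, self.max_records)


# Usage: keep records on or after a start value, capped at two records.
unfiltered_records = [
    {"id": 1, "updated_at": "2023-01-01"},
    {"id": 2, "updated_at": "2023-02-01"},
    {"id": 3, "updated_at": "2023-03-01"},
    {"id": 4, "updated_at": "2023-04-01"},
]
filters = StreamFilter(
    filters=[FilterConstraint("updated_at", ">=", "2023-02-01")],
    max_records=2,
)
filtered_records = list(filters.apply(unfiltered_records))
include: bool = filters.eval(unfiltered_records[0])
```

Because `apply()` is lazy, a stream could stop pulling from the source as soon as the cap is reached. A stream that can push a constraint down to the API would instead read `filters.filters` directly and skip the inline evaluation for that constraint.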
Implementation for SQL taps
For SQL taps, we obviously would not use the inline Python-based evaluators, but instead we could map the filter constraints to `WHERE` clause filters and `LIMIT` restrictions, passed generically to SQLAlchemy.
As it relates to the `get_batches()` method, this could be introduced as a breaking change (sooner the better). Since there are no 'stable' `BATCH` message implementations as of this writing, it should be acceptable to make this change.
Implementing for `get_records()` implementations in a backwards-compatible manner.
In regards to existing taps that already implement `get_records()`:
`Stream.filter_records()` method that automatically applies the filterset to the records produced by `Stream.get_records()` - probably after `Stream.post_process()`, to ensure the properties are in the expected place.
`Stream.filter_records()` to no-op any filters they've already handled in `get_records()`.
`get_records()`, perhaps still preserving the generic `Stream.filter_records()` for convenience in the default implementation.
Related
This probably also resolves:
And would make this a pretty easy fast follow:
`end_date` support in generic tap config #922
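Returning to the SQL case described under "Implementation for SQL taps", the mapping from constraints to `WHERE` and `LIMIT` might be sketched as follows. To stay self-contained, this uses the standard-library `sqlite3` module and a hand-built parameterized query in place of SQLAlchemy, which is what the proposal actually names. `build_query`, the `(column, op, value)` tuple shape, the `order_by` argument, and the `orders` table are all hypothetical.

```python
import sqlite3

# Hypothetical whitelist of comparison operators a constraint may use.
ALLOWED_OPS = {">", ">=", "<", "<=", "="}


def build_query(table, constraints, order_by=None, max_records=None):
    """Translate (column, op, value) constraints into SQL text and bind parameters.

    Identifiers are interpolated directly, so `table` and column names are
    assumed to come from trusted catalog metadata, never from user input.
    """
    clauses, params = [], []
    for column, op, value in constraints:
        if op not in ALLOWED_OPS:
            raise ValueError(f"Unsupported operator: {op}")
        clauses.append(f'"{column}" {op} ?')
        params.append(value)  # values are always bound, never interpolated
    sql = f'SELECT * FROM "{table}"'
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if order_by is not None:
        sql += f' ORDER BY "{order_by}"'
    if max_records is not None:
        sql += " LIMIT ?"
        params.append(max_records)
    return sql, params


# Usage against an in-memory database with illustrative data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2023-01-01"), (2, "2023-02-01"), (3, "2023-03-01"), (4, "2023-04-01")],
)
sql, params = build_query(
    "orders",
    [("updated_at", ">=", "2023-02-01")],
    order_by="updated_at",
    max_records=2,
)
rows = conn.execute(sql, params).fetchall()
```

Pushing both the constraint and the record cap into the query means the database does the filtering, so no Python-side evaluation of each row is needed. An explicit `ORDER BY` on the replication key is included here because `LIMIT` without an ordering does not guarantee which rows are returned.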