
Add paradigm for stream filter constraints and max_records constraints #1119

Closed
aaronsteers opened this issue Oct 26, 2022 · 4 comments

@aaronsteers
Contributor

aaronsteers commented Oct 26, 2022

Currently, developers must inspect self.replication_key, self.get_starting_replication_key_value(), and self.get_starting_timestamp() themselves within get_records() or get_batches(). This is problematic for several reasons.

To make the expected filters much more explicit, this proposal suggests that we pass something like a StreamFilter object to methods like get_records() and get_batches(), and perhaps also to methods like get_url_params(), which may be able to pass down queries to the API call.

  • By localizing the filter to each method call, we also open up options for multiple partitions of the same dataset to be queried simultaneously.
  • By handling the filter rulesets generically, we unlock use cases that also want a "max" constraint, such as #922 (Feature: Add end_date support in generic tap config), which in turn unlocks parallel processing of time partitions, as noted above.
  • Developers get the option of using the generic apply, like filtered_records = filters.apply(unfiltered_records) and include: bool = filters.eval(record_dict).
  • Alternatively, developers can loop through the StreamFilter.filters set, and handle each filter in a custom manner if needed. (Such as sending eligible constraints as filters to the remote API.)

Pseudocode

Some possible pseudocode to get a feel for how this might look:

Details
from __future__ import annotations

from typing import Iterable

class Stream:
    # ...
    def get_batches(
        self,
        batch_config: BatchConfig,
        filters: StreamFilter,
        context: dict | None = None,
    ) -> Iterable[tuple[BaseBatchFileEncoding, list[str]]]:
        """Batch generator function.

        Developers are encouraged to override this method to customize batching
        behavior for databases, bulk APIs, etc.

        Args:
            batch_config: Batch config for this stream.
            filters: A StreamFilter object defining any restrictions which
                should be applied to the dataset.
            context: Stream partition or context dictionary.

        Yields:
            A tuple of (encoding, manifest) for each batch.
        """
import itertools
from operator import le, lt, gt, ge, eq, ne
from typing import Any, Callable, Iterable

class RecordFilter:
    def __init__(self, property_name: str, property_value: Any, operator: Callable[[Any, Any], bool]):
        # Only support a finite set of operators:
        if operator not in [le, lt, gt, ge, eq, ne]:
            raise ValueError(f"Unsupported operator: {operator.__name__}")

        self.property_name = property_name
        self.property_value = property_value
        self.operator = operator

    def eval(self, record: dict) -> bool:
        """Return True to keep, False to exclude."""
        return self.operator(record[self.property_name], self.property_value)

class StreamFilter:
    filters: list[RecordFilter]
    max_record_limit: int | None

    def eval(self, record: dict) -> bool:
        """Return True to keep the record, False to exclude."""
        return all(f.eval(record) for f in self.filters)

    def apply(self, records: Iterable[dict]) -> Iterable[dict]:
        """Can be called against a set of records to return only those which match."""
        matching = (record for record in records if self.eval(record))
        if self.max_record_limit is not None:
            yield from itertools.islice(matching, self.max_record_limit)
        else:
            yield from matching
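To make the semantics concrete, here is a self-contained, runnable variant of the sketch above (dataclasses are used here for brevity; the names and behavior follow the pseudocode), with a small usage example combining a replication-key lower bound and a max-records cap:

```python
from __future__ import annotations

import itertools
from dataclasses import dataclass, field
from operator import ge
from typing import Any, Callable, Iterable


@dataclass
class RecordFilter:
    property_name: str
    property_value: Any
    operator: Callable[[Any, Any], bool]

    def eval(self, record: dict) -> bool:
        """Return True to keep, False to exclude."""
        return self.operator(record[self.property_name], self.property_value)


@dataclass
class StreamFilter:
    filters: list[RecordFilter] = field(default_factory=list)
    max_record_limit: int | None = None

    def eval(self, record: dict) -> bool:
        """Return True only if every filter matches the record."""
        return all(f.eval(record) for f in self.filters)

    def apply(self, records: Iterable[dict]) -> Iterable[dict]:
        """Yield only matching records, stopping at max_record_limit if set."""
        matching = (r for r in records if self.eval(r))
        if self.max_record_limit is not None:
            matching = itertools.islice(matching, self.max_record_limit)
        yield from matching


# Example: keep records at or after a starting timestamp, capped at 2 records.
records = [
    {"id": 1, "updated_at": "2022-01-01"},
    {"id": 2, "updated_at": "2022-06-01"},
    {"id": 3, "updated_at": "2022-09-01"},
    {"id": 4, "updated_at": "2022-10-01"},
]
filters = StreamFilter(
    filters=[RecordFilter("updated_at", "2022-05-01", ge)],
    max_record_limit=2,
)
print([r["id"] for r in filters.apply(records)])  # → [2, 3]
```

Note that apply() is lazy: with a max_record_limit set, iteration over the source stops as soon as the cap is reached.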

Implementation for SQL taps

For SQL taps, we obviously would not use the inline Python-based evaluators, but instead we could map the filter constraints to WHERE clause filters and LIMIT restrictions, passed generically to SQLAlchemy.
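To illustrate the mapping, here is a hypothetical helper (the name render_constraints and the StreamFilter shape are assumptions from the pseudocode above) that renders the constraints as a parameterized WHERE/LIMIT suffix; a real SQL tap would build SQLAlchemy expressions instead, but the translation logic is the same. Stdlib sqlite3 is used here only to demonstrate the result:

```python
import operator
import sqlite3
from types import SimpleNamespace

# Illustrative mapping from the supported Python operators to SQL operators.
_SQL_OPS = {
    operator.eq: "=", operator.ne: "<>",
    operator.lt: "<", operator.le: "<=",
    operator.gt: ">", operator.ge: ">=",
}


def render_constraints(stream_filter):
    """Render a StreamFilter-shaped object as a WHERE/LIMIT suffix plus bind params."""
    clauses = [
        (f"{f.property_name} {_SQL_OPS[f.operator]} ?", f.property_value)
        for f in stream_filter.filters
    ]
    sql, params = "", []
    if clauses:
        sql += " WHERE " + " AND ".join(c for c, _ in clauses)
        params = [p for _, p in clauses]
    if stream_filter.max_record_limit is not None:
        sql += f" LIMIT {stream_filter.max_record_limit}"
    return sql, params


# Demo against an in-memory table:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [
    (1, "2022-01-01"), (2, "2022-06-01"), (3, "2022-09-01"), (4, "2022-10-01"),
])
sf = SimpleNamespace(
    filters=[SimpleNamespace(
        property_name="updated_at", property_value="2022-05-01", operator=operator.ge,
    )],
    max_record_limit=2,
)
suffix, params = render_constraints(sf)
rows = conn.execute("SELECT id FROM users" + suffix, params).fetchall()
print([r[0] for r in rows])  # → [2, 3]
```

With SQLAlchemy, each RecordFilter would instead become a column comparison passed to Select.where(), and max_record_limit would become Select.limit(), letting the database do all the filtering.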

As it relates to the get_batches() method, this could be introduced as a breaking change (the sooner the better). Since there are no 'stable' BATCH message implementations as of this writing, it should be acceptable to make this change.

Implementing get_records() support in a backwards-compatible manner

In regards to existing taps that already implement get_records():

  1. Internally we can add a new Stream.filter_records() method that automatically applies the filterset to the records produced by Stream.get_records() - probably after Stream.post_process(), to ensure the properties are in the expected place.
  2. For performance reasons, we can advise developers to override Stream.filter_records() to no-op any filters they've already handled in get_records().
  3. When SDK 1.0 releases, we would update the signature of get_records(), perhaps still preserving the generic Stream.filter_records() for convenience in the default implementation.
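The wiring in steps 1 and 2 might be sketched as follows (MyStream, get_record_stream, and the lambda-based StreamFilter stub are hypothetical stand-ins, not existing SDK names):

```python
from __future__ import annotations

import itertools
from dataclasses import dataclass, field


# Minimal stand-in for the StreamFilter sketch above (illustrative only;
# filters here are plain predicates rather than RecordFilter objects).
@dataclass
class StreamFilter:
    filters: list = field(default_factory=list)
    max_record_limit: int | None = None

    def eval(self, record: dict) -> bool:
        return all(f(record) for f in self.filters)

    def apply(self, records):
        matching = (r for r in records if self.eval(r))
        if self.max_record_limit is not None:
            matching = itertools.islice(matching, self.max_record_limit)
        yield from matching


class MyStream:
    """Hypothetical stream whose get_records() does no filtering itself."""

    def get_records(self, context):
        yield {"id": 1, "updated_at": "2022-01-01"}
        yield {"id": 2, "updated_at": "2022-06-01"}
        yield {"id": 3, "updated_at": "2022-09-01"}

    def post_process(self, record, context=None):
        return record

    def filter_records(self, records, filters):
        # Default: apply every remaining rule generically. Developers who
        # already push a rule down to the source in get_records() can
        # override this to no-op that rule for performance (step 2).
        yield from filters.apply(records)

    def get_record_stream(self, context, filters):
        # SDK-internal wiring (step 1):
        # get_records -> post_process -> filter_records.
        processed = (self.post_process(r, context) for r in self.get_records(context))
        yield from self.filter_records(processed, filters)


stream = MyStream()
sf = StreamFilter(filters=[lambda r: r["updated_at"] >= "2022-05-01"])
print([r["id"] for r in stream.get_record_stream(None, sf)])  # → [2, 3]
```

Because filter_records() runs after post_process(), it sees records with properties already in their final shape, which is what makes features like end_date support free for existing taps.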

Related

This probably also resolves:

And would make this a pretty easy fast follow:

@aaronsteers
Contributor Author

aaronsteers commented Oct 27, 2022

@edgarrmondragon and @kgpayne - I updated the above with a possible get_records() implementation proposal, complemented by a new Stream.filter_records() that applies any not-yet-applied rules after get_records() and post_process() are complete. This would allow us to add things like end_date support with no additional work from the developer, and without immediately having to change their get_records() implementations.

Would love to get your thoughts.

@stale

stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2023
@edgarrmondragon
Collaborator

Still relevant

@stale stale bot removed the stale label Jul 20, 2023

stale bot commented Jul 20, 2024

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 20, 2024
@stale stale bot closed this as completed Aug 10, 2024