
feat: Enable Vector database and retrieve_online_documents API #4061

Merged · 41 commits into master on Apr 15, 2024

Conversation

@HaoXuAI (Collaborator) commented Mar 31, 2024

What this PR does / why we need it:
RFC: https://docs.google.com/document/d/18IWzLEA9i2lDWnbfbwXnMCg3StlqaLVI-uRpQjr_Vos/edit#heading=h.9gaqqtox9jg6

Which issue(s) this PR fixes:

Fixes #3965

@franciscojavierarceo (Member) left a comment

Overall this looks great. I left some comments in the doc; I think the only thing worth doing is aligning on some of the naming conventions so they're closer to the industry and to what Feast is already doing.

Great work and can't wait to get this out! 🚀

top_k: int,
) -> OnlineResponse:
"""
Retrieves the top k cloeses document features.


Suggested change
Retrieves the top k cloeses document features.
Retrieves the top k closest document features.

def get_top_k_document_features(
self,
feature: str,
document: Union[str, np.ndarray],


Is `document` the user input? It's often referred to as the query.
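For what it's worth, a tiny sketch of the two roles in play (values and names are purely illustrative, not from the PR): the store holds document embeddings, and the caller supplies a query embedding that is compared against them to pick the top k.

```python
import numpy as np

# Stored document embeddings -- what the online document store holds.
document_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
])

# Query embedding -- the user input the comment above refers to as "the query".
query_embedding = np.array([1.0, 0.0, 0.0])

# Top-k by L2 distance, the simplest case.
distances = np.linalg.norm(document_embeddings - query_embedding, axis=1)
top_k = 2
top_k_indices = np.argsort(distances)[:top_k]
print(top_k_indices, distances[top_k_indices])
```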

@HaoXuAI HaoXuAI requested review from DvirDukhan and a team as code owners April 9, 2024 07:18
@HaoXuAI HaoXuAI requested review from sfc-gh-madkins and removed request for a team April 9, 2024 07:18
@HaoXuAI changed the title from "[Draft] feat: document store" to "feat: document store" on Apr 9, 2024
top_k: int,
) -> OnlineResponse:
"""
Retrieves the top k closest document features.


Suggested change
Retrieves the top k closest document features.
Retrieves the top k closest document features. Note, embeddings are a subset of features.

Args:
feature: The list of document features that should be retrieved from the online document store. These features can be
specified either as a list of string document feature references or as a feature service. String feature
references must have format "feature_view:feature", e.g, "document_fv:document_embedding_feature".


Suggested change
references must have format "feature_view:feature", e.g, "document_fv:document_embedding_feature".
references must have format "feature_view:feature", e.g, "document_fv:document_embeddings".
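For illustration, a minimal sketch of how a string feature reference in the "feature_view:feature" format would be used with the retrieval call; the parameter names (feature, query, top_k) are assumptions about the API shape discussed in this PR, not a verbatim copy of the merged signature.

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# String feature reference in the "feature_view:feature" format from the docstring above.
feature_ref = "document_fv:document_embeddings"
feature_view_name, feature_name = feature_ref.split(":")
print(feature_view_name, feature_name)  # document_fv document_embeddings

# Hypothetical call shape: one embedding feature reference, the query embedding,
# and how many nearest documents to return.
response = store.retrieve_online_documents(
    feature=feature_ref,
    query=[0.1, 0.2, 0.3, 0.4],  # embedding produced by the caller's own model
    top_k=5,
)
print(response.to_dict())
```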

@@ -1690,6 +1690,67 @@ def _get_online_features(
)
return OnlineResponse(online_features_response)

@log_exceptions_and_usage
def retrieve_online_documents(


There's probably something to be said for having a configurable distance metric, letting the user choose how the top_k is computed.

@HaoXuAI (Collaborator, Author) replied:

Yeah, there are a bunch of different algorithms/configs in PostgreSQL for retrieving the documents. We can support them in a follow-up after this PR.
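Not something this PR adds, but as a sketch of what a configurable metric could look like on top of pgvector: each supported metric maps to one of pgvector's distance operators, and the ORDER BY clause is built from that mapping. The table and column names below are hypothetical.

```python
# pgvector exposes one operator per distance metric:
#   <->  L2 distance,  <#>  negative inner product,  <=>  cosine distance
DISTANCE_OPERATORS = {
    "L2": "<->",
    "inner_product": "<#>",
    "cosine": "<=>",
}


def build_top_k_query(table_name: str, vector_column: str, metric: str, top_k: int) -> str:
    """Build a pgvector nearest-neighbour query for a hypothetical feature table."""
    op = DISTANCE_OPERATORS[metric]
    return (
        f"SELECT entity_key, feature_name, value, "
        f"{vector_column} {op} %s AS distance "
        f"FROM {table_name} "
        f"ORDER BY distance "
        f"LIMIT {top_k}"
    )


print(build_top_k_query("document_fv", "vector_value", "cosine", 5))
```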

"""

# Convert the embedding to a string to be used in postgres vector search
query_embedding_str = f"'[{','.join(str(el) for el in embedding)}]'"


Is this the best serialization we can do? It feels pretty brittle, but I get it.
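For context, a sketch of the concern: the PR builds the vector literal by string concatenation, while a slightly less brittle variant passes the same "[x,y,z]" literal as a bind parameter and lets Postgres cast it to the pgvector type. This assumes psycopg2 and the pgvector extension; the connection string, table, and column names are hypothetical.

```python
from typing import List

import psycopg2


def to_pgvector_literal(embedding: List[float]) -> str:
    # Same representation the PR builds inline: "[0.1,0.2,...]".
    return "[" + ",".join(str(el) for el in embedding) + "]"


conn = psycopg2.connect("dbname=feast user=feast")  # hypothetical connection string
with conn.cursor() as cur:
    # Parameterized variant: the literal travels as a bind parameter and is cast
    # to the vector type server-side, so no manual quoting inside the SQL string.
    cur.execute(
        "SELECT entity_key, value, vector_value <-> %s::vector AS distance "
        "FROM document_fv ORDER BY distance LIMIT %s",
        (to_pgvector_literal([0.1, 0.2, 0.3]), 5),
    )
    rows = cur.fetchall()
```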

@@ -47,6 +47,7 @@ def __init__(self, config: RepoConfig):
self.repo_config = config
self._offline_store = None
self._online_store = None
self._document_store = None
A collaborator commented:

This is no longer necessary


Thanks for addressing the issue. One question, though: what if we want to continue using Redis or any other online store for the usual features and use pgvector solely for embedding storage and search? Can we configure both an online store and a document store in feature_store.yaml?

@HaoXuAI (Collaborator, Author) replied:

I think Feast currently doesn't support multiple online stores, but that would be a good feature to add.


That could get complicated, but agreed it'd be good to add. I could imagine Redis plus another DB layer being an obvious combination.


Thank you again, guys, for these amazing features. Yes, having multiple online stores would make it easier to use the right database layer for the appropriate use case! 🙌

requested_feature: str,
embedding: List[float],
top_k: int,
) -> List[Tuple[Optional[datetime], Optional[Dict[str, ValueProto]]]]:
A collaborator commented:

Do we need Optional[Dict[str, ValueProto]]? If only a single feature can be searched, wouldn't Optional[ValueProto] be sufficient?
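For context, a minimal sketch of the tuple shape under discussion, built with what I believe is Feast's Value proto (a float_list_val carrying a FloatList); whether the dict holds only the embedding feature or additional values is exactly the open question here.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Import path as in the Feast codebase at the time of this PR (to the best of my knowledge).
from feast.protos.feast.types.Value_pb2 import FloatList, Value as ValueProto

# One retrieved document: (event timestamp, {feature name -> value}).
row: Tuple[Optional[datetime], Optional[Dict[str, ValueProto]]] = (
    datetime(2024, 4, 14),
    {"document_embeddings": ValueProto(float_list_val=FloatList(val=[0.1, 0.2, 0.3]))},
)

results: List[Tuple[Optional[datetime], Optional[Dict[str, ValueProto]]]] = [row]
print(results)
```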

@HaoXuAI changed the title from "feat: document store" to "feat: Enable Vector database and retrieve_online_documents API" on Apr 11, 2024

@pytest.mark.integration
@pytest.mark.universal_online_stores(only=["postgres"])
def test_retrieve_online_documents(


Will you be outputting the cosine similarity as well? That could be useful for debugging. It would be good to be able to test that the engine computes it... maybe not doable, though.

@HaoXuAI (Collaborator, Author) replied:

Should be possible. For some reason the integration test doesn't start up the Postgres container; I'm debugging it.
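A minimal sketch of the kind of check such a test could make, assuming the response exposes a distance value (the "distance" field name below is hypothetical): recompute cosine distance locally with numpy and compare it against what the engine returns.

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # pgvector's <=> operator returns cosine *distance*, i.e. 1 - cosine similarity.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query = np.array([1.0, 0.0, 0.0])
stored = np.array([0.8, 0.6, 0.0])
expected = cosine_distance(query, stored)

# In the integration test this could be compared against the store's output, e.g.:
# assert response.to_dict()["distance"][0] == pytest.approx(expected, rel=1e-5)
print(expected)
```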

@franciscojavierarceo (Member) left a comment

lgtm

@HaoXuAI (Collaborator, Author) commented Apr 15, 2024

> lgtm

thanks!

@HaoXuAI (Collaborator, Author) commented Apr 15, 2024

@franciscojavierarceo @tokoko FYI, this is the result of the API. There are still some TODOs I've added directly into the code; I'll add them to the document as well. It also feels like having a layer to abstract the vector online store might help control the input and output flow. Anyway, we have a working solution now.

[Screenshots: retrieve_online_documents API output, 2024-04-14]
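As a rough sketch of the data-model side behind a result like this (the view, field, and source names are hypothetical, and the exact schema Feast expects for vector retrieval may differ): a feature view whose embedding feature is an array of floats, backed here by a parquet source.

```python
from feast import FeatureView, Field, FileSource
from feast.types import Array, Float32, String

# Hypothetical parquet source holding pre-computed document embeddings.
documents_source = FileSource(
    path="data/document_embeddings.parquet",
    timestamp_field="event_timestamp",
)

document_fv = FeatureView(
    name="document_fv",
    entities=[],  # entityless view for simplicity; keying is project-specific
    schema=[
        Field(name="document_id", dtype=String),
        Field(name="document_embeddings", dtype=Array(Float32)),
    ],
    source=documents_source,
)
```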

@HaoXuAI merged commit ec19036 into master on Apr 15, 2024
16 checks passed
franciscojavierarceo pushed a commit that referenced this pull request Apr 16, 2024
# [0.36.0](v0.35.0...v0.36.0) (2024-04-16)

### Bug Fixes

* Add __eq__, __hash__ to SparkSource for correct comparison ([#4028](#4028)) ([e703b40](e703b40))
* Add conn.commit() to Postgresonline_write_batch.online_write_batch ([#3904](#3904)) ([7d75fc5](7d75fc5))
* Add missing __init__.py to embedded_go ([#4051](#4051)) ([6bb4c73](6bb4c73))
* Add missing init files in infra utils ([#4067](#4067)) ([54910a1](54910a1))
* Added registryPath parameter documentation in WebUI reference ([#3983](#3983)) ([5e0af8f](5e0af8f)), closes [#3974](#3974) [#3974](#3974)
* Adding missing init files in materialization modules ([#4052](#4052)) ([df05253](df05253))
* Allow trancated timestamps when converting ([#3861](#3861)) ([bdd7dfb](bdd7dfb))
* Azure blob storage support in Java feature server ([#2319](#2319)) ([#4014](#4014)) ([b9aabbd](b9aabbd))
* Bugfix for grabbing historical data from Snowflake with array type features. ([#3964](#3964)) ([1cc94f2](1cc94f2))
* Bytewax materialization engine fails when loading feature_store.yaml ([#3912](#3912)) ([987f0fd](987f0fd))
* CI unittest warnings ([#4006](#4006)) ([0441b8b](0441b8b))
* Correct the returning class proto type of StreamFeatureView to StreamFeatureViewProto instead of FeatureViewProto. ([#3843](#3843)) ([86d6221](86d6221))
* Create index only if not exists during MySQL online store update ([#3905](#3905)) ([2f99a61](2f99a61))
* Disable minio tests in workflows on master and nightly ([#4072](#4072)) ([c06dda8](c06dda8))
* Disable the Feast Usage feature by default. ([#4090](#4090)) ([b5a7013](b5a7013))
* Dump repo_config by alias ([#4063](#4063)) ([e4bef67](e4bef67))
* Extend SQL registry config with a sqlalchemy_config_kwargs key ([#3997](#3997)) ([21931d5](21931d5))
* Feature Server image startup in OpenShift clusters ([#4096](#4096)) ([9efb243](9efb243))
* Fix copy method for StreamFeatureView ([#3951](#3951)) ([cf06704](cf06704))
* Fix for materializing entityless feature views in Snowflake ([#3961](#3961)) ([1e64c77](1e64c77))
* Fix type mapping spark ([#4071](#4071)) ([3afa78e](3afa78e))
* Fix typo as the cli does not support shortcut-f option. ([#3954](#3954)) ([dd79dbb](dd79dbb))
* Get container host addresses from testcontainers ([#3946](#3946)) ([2cf1a0f](2cf1a0f))
* Handle ComplexFeastType to None comparison ([#3876](#3876)) ([fa8492d](fa8492d))
* Hashlib md5 errors in FIPS for python 3.9+ ([#4019](#4019)) ([6d9156b](6d9156b))
* Making the query_timeout variable as optional int because upstream is considered to be optional ([#4092](#4092)) ([fd5b620](fd5b620))
* Move gRPC dependencies to an extra ([#3900](#3900)) ([f93c5fd](f93c5fd))
* Prevent spamming pull busybox from dockerhub ([#3923](#3923)) ([7153cad](7153cad))
* Quickstart notebook example ([#3976](#3976)) ([b023aa5](b023aa5))
* Raise error when not able read of file source spark source ([#4005](#4005)) ([34cabfb](34cabfb))
* remove not use input parameter in spark source ([#3980](#3980)) ([7c90882](7c90882))
* Remove parentheses in pull_latest_from_table_or_query ([#4026](#4026)) ([dc4671e](dc4671e))
* Remove proto-plus imports ([#4044](#4044)) ([ad8f572](ad8f572))
* Remove unnecessary dependency on mysqlclient ([#3925](#3925)) ([f494f02](f494f02))
* Restore label check for all actions using pull_request_target ([#3978](#3978)) ([591ba4e](591ba4e))
* Revert mypy config ([#3952](#3952)) ([6b8e96c](6b8e96c))
* Rewrite Spark materialization engine to use mapInPandas ([#3936](#3936)) ([dbb59ba](dbb59ba))
* Run feature server w/o gunicorn on windows ([#4024](#4024)) ([584e9b1](584e9b1))
* SqlRegistry _apply_object update statement ([#4042](#4042)) ([ef62def](ef62def))
* Substrait ODFVs for online ([#4064](#4064)) ([26391b0](26391b0))
* Swap security label check on the PR title validation job to explicit permissions instead ([#3987](#3987)) ([f604af9](f604af9))
* Transformation server doesn't generate files from proto ([#3902](#3902)) ([d3a2a45](d3a2a45))
* Trino as an OfflineStore Access Denied when BasicAuthenticaion ([#3898](#3898)) ([49d2988](49d2988))
* Trying to import pyspark lazily to avoid the dependency on the library ([#4091](#4091)) ([a05cdbc](a05cdbc))
* Typo Correction in Feast UI Readme ([#3939](#3939)) ([c16e5af](c16e5af))
* Update actions/setup-python from v3 to v4 ([#4003](#4003)) ([ee4c4f1](ee4c4f1))
* Update typeguard version to >=4.0.0 ([#3837](#3837)) ([dd96150](dd96150))
* Upgrade sqlalchemy from 1.x to 2.x regarding PVE-2022-51668. ([#4065](#4065)) ([ec4c15c](ec4c15c))
* Use CopyFrom() instead of __deepycopy__() for creating a copy of protobuf object. ([#3999](#3999)) ([5561b30](5561b30))
* Using version args to install the correct feast version ([#3953](#3953)) ([b83a702](b83a702))
* Verify the existence of Registry tables in snowflake before calling CREATE sql command. Allow read-only user to call feast apply. ([#3851](#3851)) ([9a3590e](9a3590e))

### Features

* Add duckdb offline store ([#3981](#3981)) ([161547b](161547b))
* Add Entity df in format of a Spark Dataframe instead of just pd.DataFrame or string for SparkOfflineStore ([#3988](#3988)) ([43b2c28](43b2c28))
* Add gRPC Registry Server ([#3924](#3924)) ([373e624](373e624))
* Add local tests for s3 registry using minio ([#4029](#4029)) ([d82d1ec](d82d1ec))
* Add python bytes to array type conversion support proto ([#3874](#3874)) ([8688acd](8688acd))
* Add python client for remote registry server ([#3941](#3941)) ([42a7b81](42a7b81))
* Add Substrait-based ODFV transformation ([#3969](#3969)) ([9e58bd4](9e58bd4))
* Add support for arrays in snowflake ([#3769](#3769)) ([8d6bec8](8d6bec8))
* Added delete_table to redis online store ([#3857](#3857)) ([03dae13](03dae13))
* Adding support for Native Python feature transformations for ODFVs ([#4045](#4045)) ([73bc853](73bc853))
* Bumping requirements ([#4079](#4079)) ([1943056](1943056))
* Decouple transformation types from ODFVs ([#3949](#3949)) ([0a9fae8](0a9fae8))
* Dropping Python 3.8 from local integration tests and integration tests ([#3994](#3994)) ([817995c](817995c))
* Dropping python 3.8 requirements files from the project. ([#4021](#4021)) ([f09c612](f09c612))
* Dropping the support for python 3.8 version from feast ([#4010](#4010)) ([a0f7472](a0f7472))
* Dropping unit tests for Python 3.8 ([#3989](#3989)) ([60f24f9](60f24f9))
* Enable Arrow-based columnar data transfers  ([#3996](#3996)) ([d8d7567](d8d7567))
* Enable Vector database and retrieve_online_documents API ([#4061](#4061)) ([ec19036](ec19036))
* Kubernetes materialization engine written based on bytewax ([#4087](#4087)) ([7617bdb](7617bdb))
* Lint with ruff ([#4043](#4043)) ([7f1557b](7f1557b))
* Make arrow primary interchange for offline ODFV execution ([#4083](#4083)) ([9ed0a09](9ed0a09))
* Pandas v2 compatibility ([#3957](#3957)) ([64459ad](64459ad))
* Pull duckdb from contribs, add to CI ([#4059](#4059)) ([318a2b8](318a2b8))
* Refactor ODFV schema inference ([#4076](#4076)) ([c50a9ff](c50a9ff))
* Refactor registry caching logic into a separate class ([#3943](#3943)) ([924f944](924f944))
* Rename OnDemandTransformations to Transformations ([#4038](#4038)) ([9b98eaf](9b98eaf))
* Revert updating dependencies so that feast can be run on 3.11. ([#3968](#3968)) ([d3c68fb](d3c68fb)), closes [#3958](#3958)
* Rewrite ibis point-in-time-join w/o feast abstractions ([#4023](#4023)) ([3980e0c](3980e0c))
* Support s3gov schema by snowflake offline store during materialization ([#3891](#3891)) ([ea8ad17](ea8ad17))
* Update odfv test ([#4054](#4054)) ([afd52b8](afd52b8))
* Update pyproject.toml to use Python 3.9 as default ([#4011](#4011)) ([277b891](277b891))
* Update the Pydantic from v1 to v2 ([#3948](#3948)) ([ec11a7c](ec11a7c))
* Updating dependencies so that feast can be run on 3.11. ([#3958](#3958)) ([59639db](59639db))
* Updating protos to separate transformation ([#4018](#4018)) ([c58ef74](c58ef74))

### Reverts

* Reverting bumping requirements ([#4081](#4081)) ([1ba65b4](1ba65b4)), closes [#4079](#4079)
* Verify the existence of Registry tables in snowflake… ([#3907](#3907)) ([c0d358a](c0d358a)), closes [#3851](#3851)