Registering ODFV UDFs that operate on lists of numbers (e.g., cosine similarity of embeddings/vectors) throws an error #1995

Closed
Agent007 opened this issue Nov 4, 2021 · 5 comments · Fixed by #2002

Comments

@Agent007
Contributor

Agent007 commented Nov 4, 2021

Expected Behavior

Registering ODFV UDFs that operate on lists of numbers (e.g., cosine similarity of embeddings/vectors) should not throw errors.

Current Behavior

Defining an ODFV UDF such as cosine similarity and then running feast apply results in the following error:

    def feast_value_type_to_pandas_type(value_type: ValueType) -> Any:
        value_type_to_pandas_type: Dict[ValueType, str] = {
            ValueType.FLOAT: "float",
            ValueType.INT32: "int",
            ValueType.INT64: "int",
            ValueType.STRING: "str",
            ValueType.DOUBLE: "float",
            ValueType.BYTES: "bytes",
            ValueType.BOOL: "bool",
            ValueType.UNIX_TIMESTAMP: "datetime",
        }
        if value_type in value_type_to_pandas_type:
            return value_type_to_pandas_type[value_type]
        raise TypeError(
>           f"Casting to pandas type for type {value_type} failed. "
            f"Type {value_type} not found"
        )
E       TypeError: Casting to pandas type for type ValueType.DOUBLE_LIST failed. Type ValueType.DOUBLE_LIST not found

Steps to reproduce

Define an ODFV UDF for cosine similarity and try to register it:

from feast import Entity, Feature, FeatureView, ValueType
from feast.data_source import RequestDataSource
from feast.infra.offline_stores.file_source import FileSource
from feast.on_demand_feature_view import on_demand_feature_view
from google.protobuf.duration_pb2 import Duration

import numpy as np
import pandas as pd


item = Entity(
    name="item_id", 
    value_type=ValueType.INT64, 
    description="item ID",
)

items_fv = FeatureView(
    name="items",
    entities=["item_id"],  # must match the name of the Entity defined above
    features=[
        Feature(name="embedding", dtype=ValueType.DOUBLE_LIST),
    ],
    batch_source=FileSource(
        path="YOUR_PATH",
        event_timestamp_column="event_timestamp",
        created_timestamp_column="created",
    ),
    online=True,
    ttl=Duration(),
    tags={},
)

similarity_req = RequestDataSource(
    name="similarity_input", 
    schema={
        "vector": ValueType.DOUBLE_LIST,
    },
)

@on_demand_feature_view(
    inputs={
        "items": items_fv,
        "similarity_req": similarity_req,
    },
    features=[
        Feature(name="cos", dtype=ValueType.DOUBLE),
    ],
)
def similarity(features_df: pd.DataFrame) -> pd.DataFrame:
    if features_df.size == 0:
        return pd.DataFrame({"cos": [0.0]})  # give hint to Feast about return type
    vectors_a = features_df["embedding"].apply(np.array)
    vectors_b = features_df["vector"].apply(np.array)
    dot_products = vectors_a.mul(vectors_b).apply(sum)
    norms_q = vectors_a.apply(np.linalg.norm)
    norms_doc = vectors_b.apply(np.linalg.norm)
    df = pd.DataFrame()
    df["cos"] = dot_products / (norms_q * norms_doc)
    return df
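
To check the UDF logic separately from Feast registration, the cosine similarity computation can be run on a plain pandas DataFrame. The following is a minimal sketch, not the code from the repro above: the helper name cosine_similarity_rows and the sample data are made up for illustration, and it uses np.dot in a plain loop instead of the Series operations in the UDF.

```python
import numpy as np
import pandas as pd


def cosine_similarity_rows(features_df: pd.DataFrame) -> pd.DataFrame:
    """Row-wise cosine similarity between the "embedding" and "vector" columns.

    Hypothetical helper for illustration only; it is not part of Feast.
    Each cell in both columns is expected to hold a list of numbers.
    """
    cos = []
    for a, b in zip(features_df["embedding"], features_df["vector"]):
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        # cos(a, b) = (a . b) / (|a| * |b|)
        cos.append(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
    return pd.DataFrame({"cos": cos})


# Made-up sample rows: identical, 45 degrees apart, and orthogonal vectors.
sample_df = pd.DataFrame(
    {
        "embedding": [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
        "vector": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    }
)
result = cosine_similarity_rows(sample_df)
print(result)
```

If this runs cleanly, the computation itself handles list-valued columns, which is consistent with the failure coming from the type mapping used at registration time and not from the UDF body.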

Specifications

  • Version: 0.14.0
  • Platform: all
  • Subsystem: Python SDK

Possible Solution

Add the following 2 lines to feast_value_type_to_pandas_type() in type_map.py:

ValueType.FLOAT_LIST: "object",
ValueType.DOUBLE_LIST: "object",
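
For context, here is a self-contained sketch of what the function could look like with those two entries added. It assumes the mapping shown in the traceback above is the whole function. The ValueType enum below is a hypothetical stand-in so the snippet runs without Feast installed (its members and values are illustrative, not Feast's real definition), and whether this matches the change that was eventually merged is not confirmed here.

```python
from enum import Enum, auto
from typing import Dict


class ValueType(Enum):
    """Hypothetical stand-in for feast.ValueType (illustration only)."""

    FLOAT = auto()
    INT32 = auto()
    INT64 = auto()
    STRING = auto()
    DOUBLE = auto()
    BYTES = auto()
    BOOL = auto()
    UNIX_TIMESTAMP = auto()
    FLOAT_LIST = auto()
    DOUBLE_LIST = auto()


def feast_value_type_to_pandas_type(value_type: ValueType) -> str:
    value_type_to_pandas_type: Dict[ValueType, str] = {
        ValueType.FLOAT: "float",
        ValueType.INT32: "int",
        ValueType.INT64: "int",
        ValueType.STRING: "str",
        ValueType.DOUBLE: "float",
        ValueType.BYTES: "bytes",
        ValueType.BOOL: "bool",
        ValueType.UNIX_TIMESTAMP: "datetime",
        # Proposed additions: list-valued columns are held as Python
        # objects in pandas, so they map to the "object" dtype.
        ValueType.FLOAT_LIST: "object",
        ValueType.DOUBLE_LIST: "object",
    }
    if value_type in value_type_to_pandas_type:
        return value_type_to_pandas_type[value_type]
    raise TypeError(
        f"Casting to pandas type for type {value_type} failed. "
        f"Type {value_type} not found"
    )


print(feast_value_type_to_pandas_type(ValueType.DOUBLE_LIST))
```

With the two entries present, the lookup for ValueType.DOUBLE_LIST returns "object" and no longer falls through to the TypeError.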

@adchia
Collaborator

adchia commented Nov 4, 2021

thanks for filing this! we'll take a look at this

@Agent007
Contributor Author

Agent007 commented Nov 4, 2021

@adchia Thanks! I actually have the fix ready 😄 But feel free to submit it before I do.

@Agent007
Contributor Author

Agent007 commented Nov 4, 2021

Similar to #1640. @judahrand @karlhigley, FYI, you may be interested in this as well.

@Agent007 changed the title from "Registering ODFV UDFs that operating on lists of numbers (e.g., cosine similarity of embeddings/vectors) throws an error" to "Registering ODFV UDFs that operate on lists of numbers (e.g., cosine similarity of embeddings/vectors) throws an error" on Nov 4, 2021

@judahrand
Member

judahrand commented Nov 4, 2021

Yeah, I ran into this today - it looks like an easy fix. I'll stick a PR in over the next few days if you don't get it in. There are plenty of other issues with ODFVs though.

@Agent007
Contributor Author

Agent007 commented Nov 12, 2021

@judahrand's comment in #2013 is a good point. Adopting Arrow's data types as a standard could potentially prevent a whole class of type-conversion bugs, like the one the PR linked here addresses. And if Arrow Flight RPC is adopted all the way through the client SDKs, that could also potentially remove the need for type-conversion code and improve performance.
