Substrait-based on demand feature views #3945

Closed
tokoko opened this issue Feb 11, 2024 · 2 comments · Fixed by #3969
Labels
kind/feature New feature or request

Comments

@tokoko
Collaborator

tokoko commented Feb 11, 2024

Is your feature request related to a problem? Please describe.
On demand feature views as currently implemented are very limited. The only way to specify an odfv is through a Python function that takes a pandas DataFrame as input and returns another pandas DataFrame. This causes problems for both the offline and online interfaces:

  • Even the most scalable offline stores are forced to collect the whole dataset into a single pandas DataFrame to apply the odfv function. There's no way for offline stores to incorporate the computation into their own engines.
  • UDFs in odfvs are inherently bound to pandas and the Python runtime. Non-Python feature servers are stuck with figuring out how to run these functions when necessary. The Java feature server, for example, has a separate Python transformation service for this reason alone, but that's obviously a subpar solution, as the whole point of a Java feature server was to avoid a Python runtime in feature serving in the first place.

Describe the solution you'd like
Allow constructing odfvs as substrait plans. Substrait is a protobuf-based serialization format for relational algebra operations. It is meant to be used as a cross-language and cross-engine format for sharing logical or physical execution plans. It has a number of producers (tools that can generate substrait) and consumers (engines that can run substrait) in different languages.

  • Different offline stores will be able to inspect substrait plans and incorporate them into their transformations. Even where that's impossible, the default implementation inside feast that applies these functions will avoid pandas.
  • Most importantly, non-Python feature servers such as the Java feature server will be able to apply the functions without a separate Python component. The Apache Arrow Java implementation comes with Java bindings to the Acero query engine, which can consume substrait plans. (https://arrow.apache.org/docs/java/substrait.html#executing-queries-using-substrait-plans)

The example code in my PoC implementation looks something like this:

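from feast import Field, OnDemandFeatureView
from feast.types import Float64


def generate_substrait():
    # ibis and ibis-substrait are only needed to produce the plan, not by feast itself.
    import ibis
    from ibis_substrait.compiler.core import SubstraitCompiler

    compiler = SubstraitCompiler()

    # Unbound table describing the input columns.
    t = ibis.table([("conv_rate", "float"), ("acc_rate", "float")], "t")

    expr = t.select((t['conv_rate'] + t['acc_rate']).name('conv_rate_plus_acc_substrait'))

    # Serialize the protobuf plan to bytes.
    return compiler.compile(expr).SerializeToString()


# driver_stats_fv is assumed to be a feature view defined elsewhere in the repo.
substrait_odfv = OnDemandFeatureView(
    name='substrait_view',
    sources=[driver_stats_fv],
    schema=[
        Field(name="conv_rate_plus_acc_substrait", dtype=Float64)
    ],
    substrait_plan=generate_substrait()
)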

The Substrait plan object that feast accepts is plain bytes, so it introduces no external dependency. I'm using ibis and ibis-substrait to generate the plan. Right now that's the most practical way to generate a substrait plan in Python with a DataFrame-like API, but it could just as well be any other substrait producer.

Describe alternatives you've considered
An obvious alternative to substrait is sql-based odfvs, but using SQL has a number of important downsides:

  1. The presence of different SQL dialects means that it will be especially hard to ensure that SQL-based feature functions behave the same way across different offline store and online store implementations.
  2. The user is implicitly bound to their offline store and online store of choice, because the dialect used in the SQL strings has to match the offline store engine.
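
As a small illustration of the first point, the same arithmetic expression can produce different values depending on the engine. The sketch below uses SQLite from the Python standard library as a stand-in for an offline store; the expression and variable names are made up for the example. SQLite (like PostgreSQL) truncates integer division, whereas MySQL and BigQuery return a fractional result for the same `/` operator, much as the Python runtime does here.

```python
import sqlite3

# In-memory SQLite database standing in for an offline store engine.
conn = sqlite3.connect(":memory:")

# Integer operands: SQLite performs integer division and truncates.
(sqlite_result,) = conn.execute("SELECT 7 / 2").fetchone()

# The "same" expression evaluated by the Python runtime.
python_result = 7 / 2

print(sqlite_result)  # 3
print(python_result)  # 3.5

conn.close()
```

A feature defined as a SQL string would therefore need per-dialect care (explicit casts, for instance) to stay consistent between stores.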

Having said that, it probably makes sense to support both substrait-based and sql-based odfvs, because at the moment it might be easier for sql-based logic to be incorporated inside offline store engines.

@HaoXuAI
Collaborator

HaoXuAI commented Feb 14, 2024

This can be a really good feature.
Taking one step back, how about using only ibis? I'm not so familiar with the library, but it looks like it provides a DataFrame API on top of many engines?

@tokoko
Collaborator Author

tokoko commented Feb 14, 2024

That's a great question, thanks.

If we're talking about how it should be stored in the registry:

  • ibis can run on many engines, but it's still a Python library. Plus, it has no good way to serialize its DataFrames other than pickle or dill. If we were to do that, there would be no way for a Java process to either read or process whatever is stored in the registry, similar to how it can't process the UDFs stored in the registry right now. Another thing is that ibis eventually plans to use substrait as an internal representation, so once that happens, it will be a moot point.
  • It also leaves the opportunity to use other substrait producers in the future. AFAIK, there are no others with a Python interface right now, but it still leaves the door open for them.
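
To make the registry argument a bit more concrete, here is a toy sketch of the "plan as data" idea. It is not Substrait: a JSON expression tree stands in for the protobuf plan, and the node keys (`op`, `field`, `args`) and the `evaluate` and `apply_plan` helpers are invented for illustration. The point is only that a declarative plan is plain bytes that a process in any language can parse and interpret, whereas a pickled Python object can only be loaded by a Python process.

```python
import json

# Hypothetical, simplified stand-in for a serialized plan:
# conv_rate + acc_rate, exposed as conv_rate_plus_acc.
plan_bytes = json.dumps({
    "name": "conv_rate_plus_acc",
    "expr": {"op": "add", "args": [{"field": "conv_rate"}, {"field": "acc_rate"}]},
}).encode("utf-8")


def evaluate(expr, row):
    """Recursively evaluate an expression node against one row (a dict)."""
    if "field" in expr:
        return row[expr["field"]]
    if expr["op"] == "add":
        return sum(evaluate(arg, row) for arg in expr["args"])
    raise ValueError(f"unsupported op: {expr['op']}")


def apply_plan(serialized, row):
    """Decode the plan bytes and compute the single output feature."""
    plan = json.loads(serialized.decode("utf-8"))
    return {plan["name"]: evaluate(plan["expr"], row)}


result = apply_plan(plan_bytes, {"conv_rate": 0.25, "acc_rate": 0.5})
print(result)  # {'conv_rate_plus_acc': 0.75}
```

A Java or Go consumer could implement the same small interpreter over the same bytes, which is the property the registry needs.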

If we're talking about the user-facing interface, I agree that we can make the on_demand_feature_view decorator smarter, so that it distinguishes between functions based on their signature and creates the transformation object accordingly. Something like this would create an OnDemandSubstraitTransformation, for example:

from ibis.expr.types import Table

@on_demand_feature_view(
    sources=...,
    schema=...
)
def transformed_conv_rate(t) -> Table:
    return t.select((t['conv_rate'] + t['acc_rate']).name('conv_rate_plus_acc_substrait'))
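
A rough sketch of how that kind of dispatch could work, using only the standard library. Everything here is hypothetical: `Table` and `DataFrame` are empty stand-ins for `ibis.expr.types.Table` and `pandas.DataFrame`, and the two transformation classes are placeholders rather than feast's actual types. The decorator reads the function's return annotation and picks the transformation kind from it.

```python
import inspect


class Table:  # stand-in for ibis.expr.types.Table
    pass


class DataFrame:  # stand-in for pandas.DataFrame
    pass


class SubstraitTransformation:  # placeholder transformation type
    def __init__(self, fn):
        self.fn = fn


class PandasTransformation:  # placeholder transformation type
    def __init__(self, fn):
        self.fn = fn


def on_demand_feature_view(sources, schema):
    """Hypothetical decorator: choose a transformation from the return annotation."""
    def decorator(fn):
        returns = inspect.signature(fn).return_annotation
        if returns is Table:
            return SubstraitTransformation(fn)
        # Fall back to the pandas path for DataFrame or unannotated functions.
        return PandasTransformation(fn)
    return decorator


@on_demand_feature_view(sources=[], schema=[])
def ibis_style(t) -> Table:
    return t


@on_demand_feature_view(sources=[], schema=[])
def pandas_style(df) -> DataFrame:
    return df


print(type(ibis_style).__name__)    # SubstraitTransformation
print(type(pandas_style).__name__)  # PandasTransformation
```

Falling back to the pandas path for unannotated functions would keep existing odfv definitions working unchanged.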

The reason we can't do this now is that ibis follows NEP 29 and dropped Python 3.8 support a while ago. If we were to depend on ibis right now, we would be pinned to ibis-framework==5.1.0, which is ancient history for a library that's still not very mature. I suggested on our Slack that we follow NEP 29 ourselves, maybe you could chime in there :)
