
Modularize ingestion distributed compute engine support #444

Closed · ches opened this issue Jan 27, 2020 · 4 comments


ches (Member) commented Jan 27, 2020

This is a companion to #402 and the larger topic of storage engine modularization, which was realized in #529 and subsequent PRs that implemented the new interfaces.

Just as adding support for new storage engines tends to cause a dependency explosion for Feast ingestion & serving, the same is true for the Beam Runner / job management adapter glue in core (all of this could move to serving under future plans, but that won't change the fundamental problem this issue is about).

So for both storage and compute engines, I feel that some modularity strategy is needed: loose binding at build time, configurable at runtime. The goals would be to:

  • Minimize the dependency pains that developers and contributors to Feast must deal with when they are not actively working on a particular stack. The dependency trees are often large and fragile, especially in the Hadoop ecosystem (e.g., Hive and Spark).
  • Reduce deployment bloat for operators who wish to package Feast internally with only the module JARs needed to support their organization's stack. (IIRC, last I checked hadoop-common or hadoop-client leaves you with close to 200 MB of JARs, and beam-runners-spark and beam-sdks-java-io-hcatalog, among others, have these deps [as provided scope, but I believe the point stands].)

Possibilities might be OSGi or java.util.ServiceLoader (with Spring integration, or alternatives thereof). Open to other ideas!
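
For concreteness, here is a minimal sketch of the java.util.ServiceLoader route. Everything named here is hypothetical, not an existing Feast type; it just shows the shape of the loose binding:

```java
import java.util.Map;
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical SPI for pluggable compute engines. Each engine module
// (Dataflow, Flink, ...) would ship an implementation plus a
// META-INF/services/feast.spi.JobRunnerProvider file naming it, so core
// depends only on this interface rather than on each runner's dependency tree.
interface JobRunnerProvider {
  /** Identifier matched against runtime configuration, e.g. "dataflow". */
  String name();

  /** Builds a runner from engine-specific options. */
  Runnable createRunner(Map<String, String> options);
}

final class JobRunnerRegistry {
  /** Resolves the configured provider from whatever module JARs are on the classpath. */
  static Optional<JobRunnerProvider> lookup(String configuredName) {
    for (JobRunnerProvider provider : ServiceLoader.load(JobRunnerProvider.class)) {
      if (provider.name().equalsIgnoreCase(configuredName)) {
        return Optional.of(provider);
      }
    }
    return Optional.empty();
  }
}
```

A deployment would then bundle only the module JARs for the engines it actually uses, and the registry discovers them at startup without core declaring a compile-time dependency on any of them.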

Relates to #362

woop (Member) commented Jan 29, 2020

Agreed that this is a problem. I think we should define the exact extension points very clearly, though.

It can be hard to talk about this in the abstract, so the questions I see are:

  1. Which specific compute or storage engines do we already see a need to cover?
  2. At which specific points in the code base do we need to integrate a modularization layer?

Beyond that, I can see the introduction of this layer adding a lot of overhead and complexity in the short term, even though it will pay dividends if teams are starting to fork the code base now (which might already be the case with 0.3). I would want to make sure we get alignment on the future direction of Feast, so that we can stabilize the architecture before solidifying these modularization points.

stale bot commented Mar 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Mar 29, 2020
woop added the keep-open label Mar 29, 2020
stale bot removed the wontfix label Mar 29, 2020
dr3s (Collaborator) commented Jun 5, 2020

I'm extremely wary of this type of complexity within the service. I'm pretty biased, but I would prefer something more along the lines of these options:

  • Optimize for one implementation based upon open source (Flink and JDBC) that is packaged with Feast, plus connectors for managed services à la Dataflow and BigQuery. That should minimize the client libraries.
  • Modularity should be at the microservice boundary. For instance, Spark could be supported by running https://github.com/spark-jobserver/spark-jobserver, which bundles the Spark libraries (see the sketch below). Even Dataflow and BigQuery could be separate microservices and optional parts of the installation.
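
To illustrate the microservice-boundary option: core would submit work over HTTP instead of linking the engine's libraries. A rough sketch against spark-jobserver's job-submission endpoint (`POST /jobs?appName=...&classPath=...` per its README); the host, app name, job class, and config key below are placeholders, not real Feast wiring:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Submits an ingestion job through spark-jobserver's REST API, so Feast core
// never needs Spark on its own classpath. All names here are illustrative.
public class SparkJobServerClient {
  public static void main(String[] args) throws Exception {
    HttpClient http = HttpClient.newHttpClient();
    HttpRequest submit = HttpRequest.newBuilder()
        .uri(URI.create("http://spark-jobserver:8090/jobs"
            + "?appName=feast-ingestion&classPath=feast.ingestion.ImportJob"))
        // Job configuration is passed as the request body; this key is illustrative.
        .POST(HttpRequest.BodyPublishers.ofString("source.topic = feature-rows"))
        .build();
    HttpResponse<String> response = http.send(submit, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body()); // JSON describing the submitted job's id and status
  }
}
```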

I don't know what the state of the art is these days with OSGi or other SPI frameworks, but I don't think the complexity they bring is worth it.

woop (Member) commented Feb 8, 2021

Closing this issue since it is now stale. The Job Service manages jobs, and we have different launcher implementations available; currently we are using Spark exclusively.

woop closed this as completed Feb 8, 2021