
Ingestion with Spark: Job Management for Beam Spark Runner #362

Closed
ches opened this issue Dec 13, 2019 · 6 comments

ches commented Dec 13, 2019

We would like to run ingestion on Spark (Streaming), i.e. with the Beam Spark Runner. Thus, an implementation of Feast's job management is needed.

There are a couple of factors that make this a bit less straightforward than Google Cloud Dataflow:

  1. There is not a standard remote/HTTP API for job submission and management built into Spark*.
  2. The Beam Spark Runner does not upload your executable job artifact and submit it for you like it does for Dataflow, because of 1 and because there is no assumption of a cloud service like GCS for where to put it—conventions vary depending on how & where organizations run Spark: they might use S3, HDFS, or an artifact repository to ferry job packages to where they're accessible from the runtime (YARN, Mesos, Kubernetes, EMR).

* Other than starting a SparkContext connected to the remote cluster, in-process in Feast Core. I feel that isn't workable for a number of reasons, not least of which are heavy dependencies on Spark as a library, and the lifecycle of streaming ingestion jobs being unnecessarily coupled to that of the Feast Core instance.

Planned Approach

Job Management

We initially plan to implement JobManager using the Java client library for Apache Livy, a REST interface to Spark. This will use only an HTTP client, so it is light on dependencies and shouldn't get in the way of alternative JobManagers for Spark, should another organization wish to implement one for something other than Livy. (Edit: turns out that Livy's livy-http-client artifact still depends on Spark as a library, it's not a plain REST client, so we'll avoid that…)

We have internal experience and precedent using Livy, though not for Spark Streaming applications, so we have some uncertainty about whether it will work well. In case it doesn't, we'll probably try spark-jobserver, which does explicitly claim support for Streaming jobs.
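For illustration, submission through Livy's REST batch API could look roughly like the following. The endpoint, JAR path, and main class are placeholder values, not settled Feast configuration; the real JobManager would issue an equivalent POST from Java with a plain HTTP client.

```shell
# Placeholder endpoint and paths; adjust for your cluster.
LIVY_URL="http://livy.example.com:8998"

# Livy's batch API (POST /batches) accepts a JSON body naming the job
# JAR, the main class, and its arguments.
PAYLOAD='{"file": "hdfs:///jobs/feast-ingestion-spark.jar", "className": "feast.ingestion.ImportJob", "args": ["--runner=SparkRunner"]}'

# Submit the batch (requires a reachable Livy server):
# curl -s -X POST -H 'Content-Type: application/json' -d "$PAYLOAD" "$LIVY_URL/batches"

# Poll the job state afterwards with the returned batch id:
# curl -s "$LIVY_URL/batches/<batchId>/state"

echo "$PAYLOAD"
```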

Ingestion Job Artifact

We're a bit less certain about how users should get the Feast ingestion Beam job artifact to their Spark cluster, due to the above-mentioned variation in deployments.

Roughly speaking, Feast Ingestion would be packaged as an assembly JAR that includes beam-runners-spark as well. A new ingestion-spark module may be added to the Maven build, a POM that simply does that packaging.

Deployment itself may then need to rely on documentation.
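As a rough sketch of what such documentation might describe, assuming a module named ingestion-spark and an HDFS- or S3-backed cluster (all hypothetical here):

```shell
# Illustrative module name and paths; not part of Feast's build today.
ARTIFACT="ingestion-spark/target/feast-ingestion-spark.jar"

# 1. Build the assembly JAR, bundling beam-runners-spark:
#    mvn --projects ingestion-spark package

# 2. Ferry it somewhere the cluster can read, per your deployment:
#    hdfs dfs -put "$ARTIFACT" hdfs:///jobs/          # HDFS / YARN
#    aws s3 cp "$ARTIFACT" s3://my-bucket/jobs/       # S3 / EMR

echo "artifact: $ARTIFACT"
```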

Beam Spark Runner

A minor note, but we will use the "legacy", non-portable Beam Spark Runner. As the Beam docs cover, the runner based on Spark Structured Streaming is incomplete and only supports batch jobs, and the non-portable runner is still recommended for Java-only needs.

In theory this is runtime configuration for Feast users: if they want to try the portable runner, it should be possible, but we'll most likely be testing with the non-portable one.
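Since the runner is a standard Beam pipeline option, swapping it should in principle be a submission-time choice. A sketch, with placeholder JAR path and main class:

```shell
# The non-portable runner we plan to test with; the Structured
# Streaming based runner is currently batch-only.
RUNNER="SparkRunner"

# Hypothetical submission; the class and JAR path are placeholders.
# spark-submit --class feast.ingestion.ImportJob \
#   hdfs:///jobs/feast-ingestion-spark.jar \
#   --runner="$RUNNER"

echo "runner: $RUNNER"
```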

cc @smadarasmi

Reference issues to keep tabs on during implementation: #302, #361.


ches commented Dec 13, 2019

This can be labeled "enhancement" I think. I'd be glad to carry the triage role responsibly ☺️


woop commented Jan 29, 2020

  • Other than starting a SparkContext connected to the remote cluster, in-process in Feast Core. I feel that isn't workable for a number of reasons, not least of which are heavy dependencies on Spark as a library, and the lifecycle of streaming ingestion jobs being unnecessarily coupled to that of the Feast Core instance.

Agreed. I want to move to a world where Feast Core requires no privileged access and is just a simple registry. We can contain the complexity in either serving or in jobs (setting aside dependency management for a second).

Edit: turns out that Livy's livy-http-client artifact still depends on Spark as a library, it's not a plain REST client, so we'll avoid that…

Does this mean you are going for a different client, or a different server (non-Livy)?

We're a bit less certain about how users should get the Feast ingestion Beam job artifact to their Spark cluster, due to the above mentioned variation in deployments.

This does seem tricky. I don't have enough familiarity with the variation in deployment methods above to know how we can deal with this. My instinct is either to rein in the complexity by standardizing on a single solution (Flink or a Spark implementation), or to provide an interface or extension layer to development teams.


stale bot commented Mar 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Mar 29, 2020
@woop woop added the keep-open label Mar 29, 2020
@stale stale bot removed the wontfix This will not be worked on label Mar 29, 2020

dr3s commented Jun 5, 2020

Seems like https://github.com/spark-jobserver/spark-jobserver addresses both concerns in this issue, and it is used by DataStax.


dr3s commented Jun 9, 2020

This might be another option for modularity, using a Spark operator: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/sparkctl/README.md#create


woop commented Feb 8, 2021

Closing this issue. We have ingestion support for Spark with EMR, Dataproc, and Spark on Kubernetes.

@woop woop closed this as completed Feb 8, 2021