Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support better integration between Ray and Spark in passing ObjectRef without actually moving data #32

Open
klwuibm opened this issue Jul 21, 2021 · 3 comments
Assignees
Labels
ray-related Related to Ray

Comments

@klwuibm
Copy link
Contributor

klwuibm commented Jul 21, 2021

Overview

As a Codeflare user, I want to use Ray and Spark alternately to execute my end-to-end ML jobs. Some steps might be executed more efficiently using Ray, while others using Spark. The plasma store in Ray seems to provide an efficient way to share ObjectRef between Ray and Spark. Currently, RayDP project supports from Spark to Ray in some limited way, by running Spark as a Ray actor. However, ObjectRef cannot be shared easily in both directions, Spark-2-Ray and Ray-2-Spark.

Acceptance Criteria

  • Pandas dataframe created by remote tasks in local Ray plasma stores can be passed with ObjectRef to the Spark driver to create a Spark dataframe containing list of ObjectRef.
  • Once that is done, on the Spark side, the executors of Spark can then access to the original Pandas dataframe locally.
  • From Spark to Ray: Spark preserves groupby() partition semantics and writes these partitions to plasma store, instead of using hashPartition().

Questions

  • In RayDP, only the driver node knows about and can access Ray. The executors of PySpark doesn't have access to Ray. This will prevent the PySpark executors from accessing the Ray plasma store. As a result, it is not possible to seamlessly pass ObjectRef between Ray workers and Spark executors.

Assumptions

  • Ray and Spark can share data seamlessly by exchanging ObjectRef among Ray workers and Spark executors.

Reference

[Reference] I have opened an issue on the RayDP repo: oap-project/raydp#164

@klwuibm klwuibm added the ray-related Related to Ray label Jul 21, 2021
@raghukiran1224
Copy link
Contributor

@klwuibm Suggest that you fill in the rest of the issue template? :)

@raghukiran1224
Copy link
Contributor

Thanks @klwuibm !

@klwuibm
Copy link
Contributor Author

klwuibm commented Aug 4, 2021

This feature can be supported via the Ray Datasets (currently on alpha with some missing methods, such as ray.data.from_spark() and ds.to_spark()). For example, to exchange data from Spark to Pandas, one can do ds = ray.data.from_spark() followed by pdf = ds.to_pandas(). Similarly, from Pandas to Spark, one can do ds = ray.data.from_pandas() followed by sdf = ds.to_spark().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ray-related Related to Ray
Projects
None yet
Development

No branches or pull requests

3 participants