Can I use spark.createDataFrame() with a list of ObjectRef from various remote Ray workers? #164
I am trying to see if I can create a Spark DataFrame from a list of ObjectRef produced by several remote Ray tasks that have put large objects into their local plasma stores. I know that Spark can take a Pandas dataframe and turn it into a Spark dataframe, but I don't want the Spark driver to collect all the large objects from the Ray workers on remote nodes, because that can easily make the Spark driver run out of memory. It turns out, however, that ObjectRef is not one of the supported data types in spark.createDataFrame(). Is there any workaround?
Here is the code I was trying: …
And the following are the error messages: …
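The original snippet is elided above, so here is a hypothetical minimal reproduction of the described failure; the names and the two-task setup are illustrative, not the author's exact code.

```python
import ray
import raydp
import numpy as np
import pandas as pd

ray.init()

@ray.remote
def make_df(i):
    # each task materializes a pandas DataFrame in its node's plasma store
    return pd.DataFrame(np.random.randint(10, size=(3, 4)))

refs = [make_df.remote(i) for i in range(2)]  # list of ObjectRef[pd.DataFrame]

spark = raydp.init_spark('repro', num_executors=2,
                         executor_cores=2, executor_memory='1G')

# fails: Spark cannot infer a schema for ObjectRef, since it is not a
# supported Spark data type
sdf = spark.createDataFrame([[r] for r in refs])
```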
Comments
You might want to take a look at …
@kira-lin thanks for the comments. I wonder if you could clarify two questions. (1) Is there any connection between …
@kira-lin Thanks very much, I appreciate the help. Here is what I did, but it still doesn't work as I expected.
The error messages are: …
What you want to do is probably this:

```python
import ray
import raydp
import numpy as np
import pandas as pd
from pyspark.sql.types import BinaryType, StructType, StructField

ray.init()

@ray.remote
def create_small_dataframe(i):
    return pd.DataFrame(data=np.random.randint(5 * i, size=(3, 4)))

# these are ObjectRef[pd.DataFrame]
obj_ref1 = create_small_dataframe.remote(1)
obj_ref2 = create_small_dataframe.remote(2)

# use cloudpickle to serialize them into plain bytes
ser_obj_ref1 = ray.cloudpickle.dumps(obj_ref1)
ser_obj_ref2 = ray.cloudpickle.dumps(obj_ref2)
obj_refs = [[ser_obj_ref1], [ser_obj_ref2]]

# start spark
spark = raydp.init_spark('dataframe_with_obj_ref',
                         num_executors=2,
                         executor_cores=2,
                         executor_memory='1G')

# create the dataframe: each row holds one serialized ObjectRef as bytes
schema = StructType([StructField('Pandas_df_ref', BinaryType(), True)])
sdf = spark.createDataFrame(obj_refs, schema)

raydp.stop_spark()
ray.shutdown()
```

What do you want to do with this dataframe? You can check out PR #166, since it's very similar.
@kira-lin thanks very much. I will take a look at PR #166. The idea is to pass various …
Oh yes, I see. Just FYI: in ray-nightly a similar feature is under development, in ray.experimental.data. If you don't need it to be a Spark dataframe, you could just use their Ray Dataset, which has a from_pandas function. You can also use to_spark after from_pandas, but to_spark has not been merged yet.
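For reference, a sketch of that route using the stable ray.data API that ray.experimental.data later became; from_pandas_refs (the variant that accepts ObjectRefs) and to_spark are assumed to be available, which holds for recent Ray releases but not for the nightly discussed here.

```python
import ray
import raydp

ray.init()
spark = raydp.init_spark('ray_dataset_route', num_executors=2,
                         executor_cores=2, executor_memory='1G')

# obj_ref1/obj_ref2 are the ObjectRef[pd.DataFrame] from the earlier snippet
ds = ray.data.from_pandas_refs([obj_ref1, obj_ref2])

# convert to a Spark DataFrame without collecting everything on the driver
sdf = ds.to_spark(spark)
sdf.show()
```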
@kira-lin Thanks for your help in getting Spark to successfully create a DataFrame of serialized ObjectRefs.
I then got an error that puzzles me: it seems to indicate that Ray is not started, but I don't think I had stopped Ray at that point. Here are some of the error messages: …
The python function in … Besides, if you see many …
@kira-lin I think the problem is that a PySpark executor doesn't have access to Ray, even if I do a …
It fails again on the next step: …
I think that, due to lazy evaluation in Spark, the execution of … My question to you: is there a way in RayDP to pass Ray access from the PySpark driver to the PySpark workers during Spark initialization, or does only the Spark driver have access to Ray? To support good integration between Ray and Spark, the PySpark workers need to be able to access Ray, especially for reading data from the local plasma store. That way we can pass …
Have you tried this:

```python
import ray

def map_func(x):
    # command for executors to connect to the ray cluster
    # (ray.init will also work); guard against reconnecting on every row
    if not ray.is_initialized():
        ray.client().connect()
    # actual work using ray: bytes -> ObjectRef -> pandas DataFrame
    return [ray.get(ray.cloudpickle.loads(x['Pandas_df_ref']))]

myrdd = sdf.rdd.flatMap(map_func)
```

A normal process can access Ray as long as it connects to the Ray cluster via ray.client().connect().
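A per-partition variant of the same idea (a sketch, not from the thread): it connects once per partition rather than once per row, with the same is-initialized guard that a later snippet in this thread uses.

```python
import ray

def map_partition(rows):
    # connect once per executor task rather than once per row
    if not ray.is_initialized():
        ray.client().connect()
    for row in rows:
        ref = ray.cloudpickle.loads(row['Pandas_df_ref'])
        yield ray.get(ref)  # the deserialized pandas DataFrame

myrdd = sdf.rdd.mapPartitions(map_partition)
print(myrdd.count())  # forces evaluation on the executors
```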
@kira-lin Many thanks again. I tried your suggestion of initiating ray.client().connect() inside the map function …
However, I found that the original Pandas DataFrame …
You can try to use …
Here is what I tried: …
And here is what I got: …
Notice that in PR #166 the conversion is done like this:

```python
import ray
import pandas as pd

def _convert_blocks_to_dataframe(blocks):
    # connect to ray from the executor process
    if not ray.is_initialized():
        ray.client().connect()
    for block in blocks:
        dfs = []
        for b in block["ref"]:
            ref = ray.cloudpickle.loads(b)
            data = ray.get(ref)
            dfs.append(data.to_pandas())
        yield pd.concat(dfs)
```
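One plausible way to wire this up (purely illustrative; PR #166 may do it differently) is Spark's mapInPandas, whose contract matches the generator above: it receives an iterator of pandas blocks and yields pandas blocks back. The column name 'ref', the helper variable ser_refs, and the output schema 'a long, b long' are assumptions for the example; each serialized ref is taken to resolve to an Arrow table, hence the data.to_pandas() call above.

```python
from pyspark.sql.types import BinaryType, StructType, StructField

# ser_refs: serialized ObjectRefs (bytes), each resolving to an Arrow
# table with the assumed columns a and b
ref_schema = StructType([StructField('ref', BinaryType(), True)])
ref_df = spark.createDataFrame([[r] for r in ser_refs], ref_schema)

# each output block is the concatenation of the tables referenced
# by one input partition
result = ref_df.mapInPandas(_convert_blocks_to_dataframe,
                            schema='a long, b long')
result.show()
```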
@kira-lin thanks. I think I got it to work in my …
The only difference is that I didn't do … More importantly, regarding your pull request: do you know whether data locality is observed when your … My original motivation is to avoid moving big data from one node to another when converting a …
I think there is no big difference. We need to use …
You are right: it is possible that data will be fetched from remote nodes in this implementation. If we instead used remote tasks with ObjectRefs as their arguments, Ray would try to schedule them based on locality. But PySpark workers are not aware of this information, so there is no locality. This is not ready for huge datasets, and we are also looking into better ways to implement this, perhaps a Spark DataSource.
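To illustrate the locality point with plain Ray (a generic sketch, not RayDP code): when an ObjectRef is passed as an argument to a remote task, Ray can schedule the task on the node that already holds the object, whereas deserializing a ref inside an arbitrary PySpark worker gives Ray's scheduler no say.

```python
import ray
import numpy as np
import pandas as pd

ray.init()

@ray.remote
def produce():
    # the result lands in the plasma store of whichever node ran this task
    return pd.DataFrame(np.random.randint(10, size=(1000, 4)))

@ray.remote
def consume(df):
    # df arrives as an ObjectRef argument; ray resolves it and, when
    # possible, runs this task on the node that already holds the data
    return len(df)

ref = produce.remote()
print(ray.get(consume.remote(ref)))  # locality-aware path: prints 1000
```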
Closing as stale. This has since been fixed by implementing getPreferredLocations.