Hello! Thank you for this awesome library that lets me combine the advantages of Spark and Ray.
When I convert a Spark DataFrame to a Ray Dataset, I have only two options for specifying the owner of the serialized partitions:
- Each executor owns its own partitions (_use_owner=False).
- The RayDP Master is the owner of all serialized partitions (_use_owner=True).
Here is a usage scenario in which neither ownership option is satisfactory.
I want to do some preprocessing in Spark, convert the preprocessed DataFrame into a Ray Dataset, and then stop Spark (call raydp.stop_spark()) to free up Ray cluster resources. But after stopping Spark, I can't use the created Ray Dataset because the owner of the serialized partitions has died.
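To make the failure concrete, here is a minimal sketch of that flow (assuming RayDP's usual entry points; the spark_dataframe_to_ray_dataset import path is my assumption):

import ray
import raydp
from raydp.spark import spark_dataframe_to_ray_dataset  # assumed import path

ray.init(address="auto")
spark = raydp.init_spark(app_name="preprocess", num_executors=2,
                         executor_cores=2, executor_memory="4GB")

df = spark.range(0, 10000)  # stand-in for real Spark preprocessing

# Executors own the serialized blocks (_use_owner=False), so the Dataset
# is tied to their lifetime.
ds = spark_dataframe_to_ray_dataset(df, _use_owner=False)

raydp.stop_spark()  # frees the executors, and with them the block owners

ds.count()  # fails: the owners of the serialized partitions have died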
I suggest adding a way to pass in an actor that should become the owner of the serialized partitions. For example:
from dataclasses import dataclass
from typing import Callable, List, Optional

import ray
from ray import ObjectRef
from pyspark import sql


@dataclass
class ObjectsOwner:
    # Name of the owner actor
    actor_name: str
    # Function that stores the serialized parquet objects in the owner actor's
    # state and returns the result of the .remote() call
    set_reference_as_state: Callable[[ray.actor.ActorHandle, List[ObjectRef]], ObjectRef]


def _save_spark_df_to_object_store(df: sql.DataFrame, use_batch: bool = True,
                                   objects_owner: Optional[ObjectsOwner] = None):
    # call the java writer from python
    jvm = df.sql_ctx.sparkSession.sparkContext._jvm
    jdf = df._jdf
    object_store_writer = jvm.org.apache.spark.sql.raydp.ObjectStoreWriter(jdf)
    if objects_owner is None:
        records = object_store_writer.save(use_batch, "")
    else:
        records = object_store_writer.save(use_batch, objects_owner.actor_name)  # !
    record_tuples = [(record.objectId(), record.ownerAddress(), record.numRecords())
                     for record in records]
    blocks, block_sizes = _register_objects(record_tuples)  # existing RayDP helper
    if objects_owner is not None:
        actor = ray.get_actor(objects_owner.actor_name)
        ray.get(objects_owner.set_reference_as_state(actor, blocks))  # !
    return blocks, block_sizes
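Usage could then look like this; BlockHolder, set_blocks, and the wiring below are illustrative, not existing RayDP API:

@ray.remote
class BlockHolder:
    # Long-lived actor that keeps the serialized partitions alive in its state
    def __init__(self):
        self._blocks = []

    def set_blocks(self, blocks):
        self._blocks = blocks

# A detached, named actor outlives both the Spark session and the driver scope
holder = BlockHolder.options(name="block_holder", lifetime="detached").remote()

owner = ObjectsOwner(
    actor_name="block_holder",
    set_reference_as_state=lambda actor, blocks: actor.set_blocks.remote(blocks),
)

blocks, block_sizes = _save_spark_df_to_object_store(df, objects_owner=owner)
raydp.stop_spark()  # safe: "block_holder" now owns the blocks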
I hope that my suggestion will be useful.
Hi @max-509, thanks for using RayDP!
In this case, you can assign ownership to RayDPMaster (_use_owner=True) and use raydp.stop_spark(cleanup_data=False) to stop the session and free up the resources. Since cleanup_data is False, RayDPMaster is not actually killed, so the data remains accessible.
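Concretely (a sketch; spark_dataframe_to_ray_dataset stands in for whichever conversion path you use):

# RayDPMaster becomes the owner of the serialized partitions
ds = spark_dataframe_to_ray_dataset(df, _use_owner=True)

# Stop Spark but keep RayDPMaster, and therefore the blocks it owns, alive
raydp.stop_spark(cleanup_data=False)

ds.count()  # still works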
But yes, your suggestion makes sense: ownership should be assignable to a user-specified actor. This should be very easy to add; would you be willing to submit a PR?