
Add the ability to set the owner of objects when calling _save_spark_df_to_object_store() #375

Closed
max-509 opened this issue Aug 17, 2023 · 1 comment

max-509 (Contributor) commented Aug 17, 2023

Hello! Thank you for this awesome library, which lets me combine the advantages of Spark and Ray.

When I convert a Spark DataFrame into a Ray Dataset, I have only two options for specifying the owner of the serialized partitions:

  1. Each executor owns its own partitions (_use_owner=False).
  2. The RayDP Master owns all serialized partitions (_use_owner=True).

Here is a usage scenario in which neither ownership option is satisfactory.

I want to do some preprocessing in Spark, convert the preprocessed DataFrame into a Ray Dataset, and then stop Spark (call raydp.stop_spark()) to free up Ray cluster resources. But after stopping Spark, I can't use the created Ray Dataset, because the owner of the serialized tables has died.
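A minimal sketch of the failure mode (assuming a standard RayDP setup; ray.data.from_spark is used here for the conversion, and the init_spark arguments are illustrative):

import ray
import raydp

ray.init()
spark = raydp.init_spark(app_name="preprocess", num_executors=2,
                         executor_cores=2, executor_memory="2GB")

df = spark.range(0, 1000)         # stand-in for real Spark-side preprocessing
ds = ray.data.from_spark(df)      # partitions are serialized into the object store

raydp.stop_spark()                # frees the Spark resources in the Ray cluster

ds.count()                        # fails: the owner of the serialized blocks has died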
I suggest extending the function so that it accepts a description of the actor that should become the owner of the serialized partitions. For example:

from dataclasses import dataclass
from typing import Callable, List, Optional

import ray
from ray import ObjectRef
from pyspark import sql


@dataclass
class ObjectsOwner:
    # Name of the registered actor that should become the owner
    actor_name: str
    # Stores the serialized parquet ObjectRefs in the owner actor's state
    # and returns the ObjectRef produced by the .remote() call
    set_reference_as_state: Callable[[ray.actor.ActorHandle, List[ObjectRef]], ObjectRef]


def _save_spark_df_to_object_store(df: sql.DataFrame, use_batch: bool = True,
                                   objects_owner: Optional[ObjectsOwner] = None):
    # call the JVM-side writer from Python
    jvm = df.sql_ctx.sparkSession.sparkContext._jvm
    jdf = df._jdf
    object_store_writer = jvm.org.apache.spark.sql.raydp.ObjectStoreWriter(jdf)
    if objects_owner is None:
        records = object_store_writer.save(use_batch, "")
    else:
        records = object_store_writer.save(use_batch, objects_owner.actor_name)  # !

    record_tuples = [(record.objectId(), record.ownerAddress(), record.numRecords())
                     for record in records]
    blocks, block_sizes = _register_objects(record_tuples)

    if objects_owner is not None:
        actor = ray.get_actor(objects_owner.actor_name)
        ray.get(objects_owner.set_reference_as_state(actor, blocks))  # !

    return blocks, block_sizes
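For illustration, the proposed hook could be used with a named, detached actor like this (BlockHolder and set_blocks are hypothetical names, not part of RayDP):

import ray

@ray.remote
class BlockHolder:
    # Hypothetical long-lived actor that keeps the block references alive
    def __init__(self):
        self._blocks = []

    def set_blocks(self, blocks):
        self._blocks = blocks

holder = BlockHolder.options(name="block_holder", lifetime="detached").remote()

owner = ObjectsOwner(
    actor_name="block_holder",
    set_reference_as_state=lambda actor, blocks: actor.set_blocks.remote(blocks),
)
blocks, block_sizes = _save_spark_df_to_object_store(df, objects_owner=owner)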

I hope that my suggestion will be useful.

kira-lin (Collaborator) commented

Hi @max-509, thanks for using RayDP!
In this case, you can assign ownership to RayDPMaster and use raydp.stop_spark(cleanup_data=False) to stop the session and free up the resources. With cleanup_data set to False, the RayDPMaster actor is not killed, so the data is still accessible.
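A sketch of this workaround (assuming the _use_owner flag from the original post is exposed on the conversion call; the exact entry point may differ):

ds = ray.data.from_spark(df, _use_owner=True)  # RayDPMaster owns the serialized blocks
raydp.stop_spark(cleanup_data=False)           # executors are released, master stays alive
ds.count()                                     # the data is still accessible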

But yes, your suggestion makes sense: ownership should be assignable to a user-specified actor. This should be very easy to add. Are you willing to submit a PR?
