RayDP creates a Ray actor called RayDPSparkMaster, which then launches the Java process that acts like the Master in a traditional Spark cluster. By default, this actor can be scheduled onto any node in the Ray cluster. If you want it to run on a particular node, you can assign custom resources to that node and request those resources when starting RayDPSparkMaster by setting spark.ray.raydp_spark_master.actor.resource.* in init_spark.
As an example:
import raydp

raydp.init_spark(...,
    configs = {
        # ... other configs
        'spark.ray.raydp_spark_master.actor.resource.CPU': 0,
        'spark.ray.raydp_spark_master.actor.resource.spark_master': 1,  # Force the Spark master (driver-related) actor to run on the head node
    })
In the cluster config YAML:
available_node_types:
    ray.head.default:
        resources:
            CPU: 0  # Intentionally set to 0 so that no executors run on the head node
            spark_master: 100  # A large enough number so that all Spark master actors land on the head node
Similar to the master actor's node affinity, you can also schedule Spark executors onto a specific set of nodes using custom resources, via the configuration spark.ray.raydp_spark_executor.actor.resource.[RESOURCE_NAME]:
import raydp

spark = raydp.init_spark(...,
    configs = {
        # ...
        'spark.ray.raydp_spark_executor.actor.resource.spark_executor': 1,  # Schedule executors on nodes with the custom resource spark_executor
    })
And here is the cluster YAML with the custom resource:
# ... other Ray cluster config
available_node_types:
    spark_on_spot:  # Spark-only nodes
        resources:
            spark_executor: 100  # Custom resource; the name matches the one set in spark.ray.raydp_spark_executor.actor.resource.*
        min_workers: 2
        max_workers: 10  # Changing this also requires changing the global max_workers
        node_config:
            # ....
    general_spot:  # Nodes for general Ray workloads
        min_workers: 2
        max_workers: 10  # Changing this also requires changing the global max_workers
        node_config:
            # ...
One thing worth noting is that you can use spark.ray.raydp_spark_executor.actor.resource.cpu to oversubscribe CPU resources by setting the logical CPU count smaller than the number of cores per executor. In this case, you can schedule more executor cores than the total vCPUs on a node, which is useful if your workload is not CPU bound:
import raydp

spark = raydp.init_spark(app_name='RayDP Oversubscribe Example',
                         num_executors=1,
                         executor_cores=3,  # The executor can run 3 tasks in parallel
                         executor_memory=1 * 1024 * 1024 * 1024,
                         configs = {
                             # ...
                             'spark.ray.raydp_spark_executor.actor.resource.cpu': 1,  # The actor only occupies 1 logical CPU slot in Ray
                         })
RayDP supports Spark's External Shuffle Service. To enable it, you can either set spark.shuffle.service.enabled to true in spark-defaults.conf, or provide the config to raydp.init_spark, as shown below:
raydp.init_spark(..., configs={"spark.shuffle.service.enabled": "true"})
The user-provided config overrides those specified in spark-defaults.conf. By default, Spark loads spark-defaults.conf from $SPARK_HOME/conf; you can change this location by setting SPARK_CONF_DIR.
Similarly, you can also enable Dynamic Executor Allocation this way. However, Dynamic Executor Allocation currently must be used together with data persistence. For example, you can write a Spark DataFrame to HDFS as Parquet, as shown below:
ds = RayMLDataset.from_spark(..., fs_directory="hdfs://host:port/your/directory")
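For reference, here is a minimal sketch of enabling Dynamic Executor Allocation through init_spark. The property names are standard Spark configuration keys; the app name, executor sizes, and allocation bounds are illustrative only:
import raydp

# Minimal sketch: executor counts, memory, and allocation bounds are illustrative values.
spark = raydp.init_spark(app_name='RayDP Dynamic Allocation Example',
                         num_executors=1,
                         executor_cores=2,
                         executor_memory=1 * 1024 * 1024 * 1024,
                         configs = {
                             'spark.shuffle.service.enabled': 'true',      # required by dynamic allocation
                             'spark.dynamicAllocation.enabled': 'true',
                             'spark.dynamicAllocation.minExecutors': '1',  # illustrative bounds
                             'spark.dynamicAllocation.maxExecutors': '4',
                         })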
Because RayDP starts the Java executors, the classpath by default contains the absolute paths of Ray, RayDP, and Spark. When you run a Ray cluster on YARN (see Deploying on Yarn), the jar files are stored on HDFS and may have different absolute paths on each node. In such cases, the JVM cannot find the main class and Ray workers will fail to start.
To solve this, you can specify an extra classpath in init_spark by configuring raydp.executor.extraClassPath. Make sure your jar files are distributed to the same path(s) on all nodes of the Ray cluster.
raydp.init_spark(..., configs={"raydp.executor.extraClassPath": "/your/extra/jar/path:/another/path"})
RayDP provides a substitute for spark-submit in Apache Spark. You can run your Java or Scala application on a RayDP cluster by using bin/raydp-submit. You can add it to PATH for convenience. When using raydp-submit, you should specify the number of executors, the number of cores per executor, and the memory per executor via Spark properties, such as --conf spark.executor.cores=1, --conf spark.executor.instances=1 and --conf spark.executor.memory=500m. raydp-submit only supports Ray clusters; Spark standalone, Apache Mesos, and Apache YARN are not supported, so please use the traditional spark-submit in those cases. In addition, RayDP does not support cluster deploy mode.
Here is an example:
- To use raydp-submit, you need to start your Ray cluster in advance. Let's say your Ray address is 1.2.3.4:6379.
- You should use a Ray config file to provide your Ray cluster configuration to raydp-submit. You can create it using this script:
import json
import ray

ray.init(address="auto")
node = ray.worker.global_worker.node
options = {}
options["ray"] = {}
options["ray"]["run-mode"] = "CLUSTER"
options["ray"]["node-ip"] = node.node_ip_address
options["ray"]["address"] = node.address
options["ray"]["session-dir"] = node.get_session_dir_path()
ray.shutdown()
conf_path = "ray.conf"
with open(conf_path, "w") as f:
    json.dump(options, f)
The file should look like this:
{
    "ray": {
        "run-mode": "CLUSTER",
        "node-ip": "1.2.3.4",
        "address": "1.2.3.4:6379",
        "session-dir": "/tmp/ray/session_xxxxxx"
    }
}
- Run your application, for example: raydp-submit --ray-conf /path/to/ray.conf --class org.apache.spark.examples.SparkPi --conf spark.executor.cores=1 --conf spark.executor.instances=1 --conf spark.executor.memory=500m $SPARK_HOME/examples/jars/spark-examples.jar. Note that --ray-conf must be specified right after raydp-submit and before any Spark arguments.
RayDP can leverage Ray's placement group feature and schedule executors onto a specified placement group. This provides better control over the allocation of Spark executors on a Ray cluster, for example spreading the executors onto separate nodes or starting all executors on a single node. You can pass an already-created placement group to init_spark, as shown below:
raydp.init_spark(..., placement_group=pg)
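For example, here is a minimal sketch of creating a placement group and handing it to init_spark. The bundle sizes, executor settings, and app name are illustrative and should be adjusted to match your executor resource requirements:
import ray
import raydp
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Reserve two bundles spread across different nodes (illustrative sizes).
pg = placement_group([{"CPU": 2}, {"CPU": 2}], strategy="SPREAD")
ray.get(pg.ready())  # wait until the bundles are reserved

spark = raydp.init_spark(app_name="RayDP Placement Group Example",
                         num_executors=2,
                         executor_cores=2,
                         executor_memory=1 * 1024 * 1024 * 1024,
                         placement_group=pg)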
Alternatively, you can just specify the placement group strategy. RayDP will then create a corresponding placement group and manage its lifecycle, which means the placement group is created together with the SparkSession and removed when calling raydp.stop_spark(). The strategy can be "PACK", "SPREAD", "STRICT_PACK" or "STRICT_SPREAD". Please refer to the Placement Groups documentation for details.
raydp.init_spark(..., placement_group_strategy="SPREAD")
RayDP works the same way when using Ray client. However, the Spark driver will be on the local machine. This is convenient if you want to experiment in an interactive environment. If this is not desired, e.g. for performance reasons, you can define a Ray actor that calls init_spark and performs all the computation in its methods. This way, the Spark driver runs in the Ray cluster, which is similar to Spark's cluster deploy mode.
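As a rough sketch of that pattern (the actor name, method names, executor settings, and cluster address below are illustrative), you could wrap init_spark in a Ray actor so that the Spark driver lives in the Ray cluster rather than on the client machine:
import ray
import raydp

@ray.remote
class SparkDriver:
    def __init__(self):
        # The Spark driver is created inside the actor, i.e. inside the Ray cluster.
        self.spark = raydp.init_spark(app_name="RayDP via Ray Client",
                                      num_executors=2,
                                      executor_cores=2,
                                      executor_memory=1 * 1024 * 1024 * 1024)

    def count(self):
        # Perform all Spark computation inside actor methods.
        return self.spark.range(0, 1000).count()

    def stop(self):
        raydp.stop_spark()

ray.init(address="ray://<head-node-ip>:10001")  # connect through Ray client; address is a placeholder
driver = SparkDriver.remote()
print(ray.get(driver.count.remote()))
ray.get(driver.stop.remote())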
RayDP can read from or write to Hive, which might be useful if your data is stored in HDFS. If you want to enable this feature, please configure your environment as follows:
- Install Spark on each node of the Ray cluster and set the SPARK_HOME environment variable
- Copy your hdfs-site.xml and hive-site.xml to $SPARK_HOME/conf. If hostnames are used in these XML files, make sure /etc/hosts is set properly
- Test: you can check whether the Hive configuration works like this
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("select * from db.xxx").show()  # db is the database, xxx is an existing table
An example of using Hive with RayDP:
import ray
import raydp

ray.init(address="auto")
spark = raydp.init_spark(..., enable_hive=True)
spark.sql("select * from db.xxx").show()
You can use RayDP with Ray autoscaling. When you call raydp.init_spark, the autoscaler will try to increase the number of worker nodes if the current capacity of the cluster cannot meet the resource demands. However, there is currently a known issue (#20476) in Ray autoscaling: the autoscaler's default strategy is to avoid launching GPU nodes if there aren't any GPU tasks at all. So if you configure a single worker node type with GPUs, by default the autoscaler will not launch nodes to start Spark executors on them. To resolve this, you can either set the environment variable AUTOSCALER_CONSERVE_GPU_NODES to 0 or configure multiple node types so that at least one is a CPU-only node.
- Driver Log: By default, the Spark driver log level is WARN. After getting a Spark session by running spark = raydp.init_spark, you can change the log level, for example spark.sparkContext.setLogLevel("INFO"). You will also see some AppMaster INFO logs on the driver; this is because Ray redirects actor logs to the driver by default. To disable logging to the driver, you can set it in Ray init: ray.init(log_to_driver=False)
- Executor Log: The Spark executor logs are stored in Ray's logging directory. By default they are available at /tmp/ray/session_*/logs/java-worker-*.log
- Spark and Ray may use different log4j versions. For example, Spark 3.2 and older use log4j 1, while Ray has used log4j 2 for a long time, and they use different log4j configuration files. We can treat them as two groups: the Spark driver and the Ray workers. For the Spark driver, we follow Spark's log4j version, since we might otherwise break the console output. For the Ray workers (the rest of the JVM processes), we follow Ray's log4j version so that Spark logs can be printed correctly within Ray's realm.
- To use your customized log4j versions for the Spark driver and the Ray workers, you can set spark.preferClassPath and spark.ray.preferClassPath respectively to include your log4j jars in init_spark, as long as your log4j versions are compatible with Spark and Ray respectively. For example, if you want to use your own log4j 2 version, such as log4j-core-2.17.2.jar, with RayDP and Spark 3.3, you can init Spark as shown below:
raydp.init_spark(..., configs={'spark.preferClassPath': '<your path...>/log4j-core-2.17.2.jar'})
- For log4j config files, you can set spark.log4j.config.file.name and spark.ray.log4j.config.file.name for the Spark driver and the Ray workers respectively. For example, you can set Spark's log4j config file to log4j-cust.properties and Ray's to log4j2-cust.xml as shown below. Just make sure they are loadable from the classpath; you can put them in the preferred classpath.
raydp.init_spark(..., configs={'spark.log4j.config.file.name': '<your path...>/log4j-cust.properties', 'spark.ray.log4j.config.file.name': '<your path...>/log4j2-cust.xml'})
You can also set the environment variables SPARK_LOG4J_CONFIG_FILE_NAME and RAY_LOG4J_CONFIG_FILE_NAME to achieve the same:
export SPARK_LOG4J_CONFIG_FILE_NAME="<your path...>/log4j-cust.properties"
export RAY_LOG4J_CONFIG_FILE_NAME="<your path...>/log4j2-cust.xml"
You can then call init_spark without the override configs in Python:
raydp.init_spark(..., configs={})