Can you pass spark.driver.extraClassPath directly into raydp.init()? #414

Open · tanderson10 opened this issue Sep 4, 2024 · 0 comments
Hi,

I am trying to create a Spark session with raydp.init_spark() and I am running into an issue. I believe the problem comes from passing spark.driver.extraClassPath into the config: when I remove spark.driver.extraClassPath from the config in the code below, the script runs with no issues. I want to add jars to spark.driver.extraClassPath because I am using jars that the driver needs access to.

From my understanding, Spark on Ray runs in client mode. The documentation for spark.driver.extraClassPath states:

Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.

I have set spark.driver.extraClassPath in my default properties file but this did not seem to fix the issue.
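For reference, the entry in my spark-defaults.conf looked roughly like the following (the path here is illustrative; in practice it pointed at the same jar directory used in the script below):

spark.driver.extraClassPath /path/to/spark/jars/*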

Is there a way to set spark.driver.extraClassPath before the spark driver JVM starts?
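One workaround I have been considering, though I am not sure whether RayDP's launch path honors it, is putting the class path into PYSPARK_SUBMIT_ARGS before anything Spark-related runs, since PySpark's launch_gateway builds the spark-submit command from that environment variable. A minimal sketch, assuming the variable is read when the driver JVM is launched (the jar path is illustrative):

import glob
import os

# Hypothetical workaround: pass the driver class path through
# PYSPARK_SUBMIT_ARGS before the driver JVM is created.
# The trailing "pyspark-shell" token is required by PySpark's launcher.
jar_paths = glob.glob("/path/to/spark/jars/*")  # illustrative path
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path " + ":".join(jar_paths) + " pyspark-shell"
)

import raydp

# The environment variable must be set before this call starts the driver JVM.
spark = raydp.init_spark("RAYAPPTEST", 1, 1, "2G")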

Please let me know if I need to provide more information.

Environment Details

python==3.9.13
pyspark==3.5.1
ray==2.5.0
raydp==1.6.1

Reproducible Script

import raydp
import pyspark
import numpy as np
import glob
import os

spark_home = os.environ.get("SPARK_HOME", os.path.dirname(pyspark.__file__))
spark_jars = os.path.abspath(os.path.join(spark_home, "jars/*"))
jar_paths = glob.glob(spark_jars)

spark_config = {
    "spark.driver.host": "127.0.0.1",
    "spark.driver.bindAddress": "0.0.0.0",
    "spark.driver.memory": "10G",
    "spark.driver.maxResultSize": "6G",
    "spark.ui.port": "4041",
    "spark.jars": ",".join(jar_paths),
    # Removing the next line makes the script run without errors.
    "spark.driver.extraClassPath": ":".join(jar_paths),
    "spark.executor.extraClassPath": ":".join(jar_paths),
    "raydp.executor.extraClassPath": ":".join(jar_paths),
}

app_name = "RAYAPPTEST"
spark_app_id = f"Spark for {app_name} - {np.random.randint(1, 1000)}"
num_executors = 1
cores_per_executor = 1
memory_per_executor = "2G"
spark = raydp.init_spark(spark_app_id, num_executors, cores_per_executor, memory_per_executor, configs=spark_config)
print(spark)

Stack Trace

Exception in thread "main" org.apache.spark.SparkException: Master must either be yarn or start with spark, mesos, k8s, or local
	at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:1047)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:256)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:964)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "reproduce_ray_issue.py", line 23, in <module>
    spark = raydp.init_spark(spark_app_id, num_executors, cores_per_executor, memory_per_executor, configs=spark_config)
  File "/env/lib/python3.9/site-packages/raydp/context.py", line 215, in init_spark
    return _global_spark_context.get_or_create_session()
  File "/env/lib/python3.9/site-packages/raydp/context.py", line 122, in get_or_create_session
    self._spark_session = spark_cluster.get_spark_session()
  File "/env/lib/python3.9/site-packages/raydp/spark/ray_cluster.py", line 189, in get_spark_session
    spark_builder.appName(self._app_name).master(self.get_cluster_url()).getOrCreate()
  File "/env/lib/python3.9/site-packages/pyspark/sql/session.py", line 497, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/env/lib/python3.9/site-packages/pyspark/context.py", line 515, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/env/lib/python3.9/site-packages/pyspark/context.py", line 201, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/env/lib/python3.9/site-packages/pyspark/context.py", line 436, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/env/lib/python3.9/site-packages/pyspark/java_gateway.py", line 107, in launch_gateway
    raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.