Can you pass spark.driver.extraClassPath directly into raydp.init()? #414

Open · tanderson10 opened this issue Sep 4, 2024 · 0 comments
Hi,

I am trying to create a Spark session with raydp.init_spark() and I am running into an issue. I believe the problem comes from passing spark.driver.extraClassPath into the config: when I remove spark.driver.extraClassPath from the config in the code below, the script runs with no issues. I want to add jars to spark.driver.extraClassPath because I am using jars that the driver needs access to.

From my understanding, Spark on Ray runs in client mode. The documentation for spark.driver.extraClassPath states:

Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.

I have set spark.driver.extraClassPath in my default properties file but this did not seem to fix the issue.
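For reference, the entry in my spark-defaults.conf looked roughly like the following (the path here is illustrative; in practice it pointed at the same jar directory used in the script below):

spark.driver.extraClassPath /path/to/spark/jars/*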

Is there a way to set spark.driver.extraClassPath before the spark driver JVM starts?
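One workaround I have been considering, though I am not sure whether RayDP's launch path honors it, is putting the class path into PYSPARK_SUBMIT_ARGS before anything Spark-related runs, since PySpark's launch_gateway builds the spark-submit command from that environment variable. A minimal sketch, assuming the variable is read when the driver JVM is launched (the jar path is illustrative):

import glob
import os

# Hypothetical workaround: pass the driver class path through
# PYSPARK_SUBMIT_ARGS before the driver JVM is created.
# The trailing "pyspark-shell" token is required by PySpark's launcher.
jar_paths = glob.glob("/path/to/spark/jars/*")  # illustrative path
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path " + ":".join(jar_paths) + " pyspark-shell"
)

import raydp

# The environment variable must be set before this call starts the driver JVM.
spark = raydp.init_spark("RAYAPPTEST", 1, 1, "2G")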

Please let me know if I need to provide more information.

Environment Details

python==3.9.13
pyspark==3.5.1
ray==2.5.0
raydp==1.6.1

Reproducible Script

import raydp
import pyspark
import numpy as np
import glob
import os

spark_home = os.environ.get("SPARK_HOME", os.path.dirname(pyspark.__file__))
spark_jars = os.path.abspath(os.path.join(spark_home, "jars/*"))
jar_paths = glob.glob(spark_jars)

spark_config = {
    "spark.driver.host": "127.0.0.1",
    "spark.driver.bindAddress": "0.0.0.0",
    "spark.driver.memory": "10G",
    "spark.driver.maxResultSize": "6G",
    "spark.ui.port": "4041",
    "spark.jars": ",".join(jar_paths),
    # Removing the next line makes the script run without errors.
    "spark.driver.extraClassPath": ":".join(jar_paths),
    "spark.executor.extraClassPath": ":".join(jar_paths),
    "raydp.executor.extraClassPath": ":".join(jar_paths),
}

app_name = "RAYAPPTEST"
spark_app_id = f"Spark for {app_name} - {np.random.randint(1, 1000)}"
num_executors = 1
cores_per_executor = 1
memory_per_executor = "2G"
spark = raydp.init_spark(spark_app_id, num_executors, cores_per_executor, memory_per_executor, configs=spark_config)
print(spark)

Stack Trace

Exception in thread "main" org.apache.spark.SparkException: Master must either be yarn or start with spark, mesos, k8s, or local
	at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:1047)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:256)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:964)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "reproduce_ray_issue.py", line 23, in <module>
    spark = raydp.init_spark(spark_app_id, num_executors, cores_per_executor, memory_per_executor, configs=spark_config)
  File "/env/lib/python3.9/site-packages/raydp/context.py", line 215, in init_spark
    return _global_spark_context.get_or_create_session()
  File "/env/lib/python3.9/site-packages/raydp/context.py", line 122, in get_or_create_session
    self._spark_session = spark_cluster.get_spark_session()
  File "/env/lib/python3.9/site-packages/raydp/spark/ray_cluster.py", line 189, in get_spark_session
    spark_builder.appName(self._app_name).master(self.get_cluster_url()).getOrCreate()
  File "/env/lib/python3.9/site-packages/pyspark/sql/session.py", line 497, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/env/lib/python3.9/site-packages/pyspark/context.py", line 515, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/env/lib/python3.9/site-packages/pyspark/context.py", line 201, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/env/lib/python3.9/site-packages/pyspark/context.py", line 436, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/env/lib/python3.9/site-packages/pyspark/java_gateway.py", line 107, in launch_gateway
    raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.