-
**Using a certain number of GPUs for Spark on YARN**

This question is an extension of #9440, which asks how to use a certain number of GPUs in standalone mode. I recently migrated to YARN because I had issues getting RAPIDS to work with k8s, but the approach that worked for standalone mode does not seem to carry over. The discovery script appears to be only a suggestion: Spark still sets up an executor on a node even when the script returns an empty address array there. That is, if I have a set of nodes and try to make the job run on a specific node by having the discovery script report 0 GPUs on all the other nodes, it still uses one of those other nodes. I tried writing another discovery script as well, but it still does not work.

Custom discovery script:

```ruby
#!/usr/bin/env ruby
# frozen_string_literal: true

require 'json'
require 'rb_json5'
require 'socket'

# Number of GPU addresses each node is allowed to advertise to Spark.
NUM_GPUS = RbJSON5.parse('{"node1":0,"node2":0,"node3":0,"node4":1}')
hostname = Socket.gethostname

# Print the resource JSON Spark expects, limiting the GPU indices reported
# on this host to the count configured above.
puts({
  name: 'gpu',
  addresses: `nvidia-smi --query-gpu=index --format=csv,noheader`.split("\n").first(NUM_GPUS[hostname])
}.to_json)
```
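For reference, Spark expects a discovery script to print a single JSON object with a resource name and an array of addresses. Assuming nvidia-smi on node4 reports one GPU with index 0, the script above would print something like:

```json
{"name":"gpu","addresses":["0"]}
```

while on node1 through node3 the address list would be empty: `{"name":"gpu","addresses":[]}`.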
Even with this script, YARN would not necessarily choose node4. I suppose I could set the GPU maximum-allocation in yarn-site.xml, which I couldn't get to work, probably because I used the wrong option, but I would prefer not to change any node-level files or options, since that could affect other applications and I only want this to apply to the application I am about to run.

I use spark-rapids 24.08. I originally used the 24.10 jar because I was on Spark 3.5.2, but I ran into other weird issues, so I switched back to Spark 3.5.1. I am not sure whether those issues are still present, though, since now I cannot use the Python daemon. If it matters, this is my Spark configuration. Both the exclusive-mode discovery plugin and the Docker options are there because YARN does not seem to use isolation with Docker even with this configuration.

Spark configuration:

```
spark.master yarn
spark.rapids.sql.concurrentGpuTasks 2
spark.driver.memory 100g
spark.executor.memory 50g
spark.executor.cores 4
spark.executor.resource.gpu.amount 1
spark.task.cpus 1
spark.task.resource.gpu.amount 0.25
spark.rapids.memory.pinnedPool.size 50G
spark.sql.files.maxPartitionBytes 512m
spark.plugins com.nvidia.spark.SQLPlugin
spark.executor.resource.gpu.discoveryScript ./get_gpus_resources.rb
spark.executorEnv.PYTHONPATH [PATH]/rapids-4-spark_2.12-24.08.1.jar
spark.jars [PATH]/rapids-4-spark_2.12-24.08.1.jar
spark.rapids.ml.uvm.enabled true
spark.dynamicAllocation.enabled false
spark.executor.extraJavaOptions "-Duser.timezone=UTC"
spark.driver.extraJavaOptions "-Duser.timezone=UTC"
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.rapids.memory.gpu.pool NONE
spark.sql.execution.sortBeforeRepartition false
spark.rapids.sql.python.gpu.enabled true
spark.rapids.memory.pinnedPool.size 50g
spark.rapids.sql.batchSizeBytes 128m
spark.sql.adaptive.enabled false
spark.executorEnv.UCX_ERROR_SIGNALS ""
spark.executorEnv.UCX_MEMTYPE_CACHE n
spark.executorEnv.UCX_IB_RX_QUEUE_LEN 1024
spark.executorEnv.UCX_TLS cuda_copy,cuda_ipc,tcp,rc
spark.executorEnv.UCX_RNDV_SCHEME put_zcopy
spark.executorEnv.UCX_MAX_RNDV_RAILS 1
spark.executorEnv.NCCL_DEBUG INFO
spark.resources.discoveryPlugin com.nvidia.spark.ExclusiveModeGpuDiscoveryPlugin
spark.rapids.shuffle.manager com.nvidia.spark.rapids.spark351.RapidsShuffleManager
spark.rapids.shuffle.mode UCX
spark.shuffle.service.enabled false
spark.executorEnv.CUPY_CACHE_DIR /tmp/cupy_cache
spark.executorEnv.NCCL_DEBUG INFO
spark.submit.deployMode client
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE docker
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE docker
spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.hadoop.fs.s3a.access.key ...
spark.hadoop.fs.s3a.secret.key ...
spark.hadoop.fs.s3a.endpoint ...
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.attempts.maximum 1
spark.hadoop.fs.s3a.connection.establish.timeout 1000000
spark.hadoop.fs.s3a.connection.timeout 1000000
spark.hadoop.fs.s3a.connection.request.timeout 0
spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory file:///tmp/spark-events
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.logConf true
spark.executor.heartbeatInterval 1000000s
spark.network.timeout 10000001s
spark.sql.broadcastTimeout 1000000
spark.executorEnv.NCCL_DEBUG WARN
spark.driverEnv.NCCL_DEBUG WARN
spark.pyspark.python [PATH]/.venv/bin/python
spark.pyspark.driver.python [PATH]/.venv/bin/python
```
-
So it sounds like you are having issues setting up the resource manager side of things (k8s or YARN). There are a lot of factors that come into play here. k8s and YARN each have their own way of discovering and scheduling resources, and that is intertwined with what Spark does to find a specific GPU.

If you are running on YARN, the best approach is to configure it for cgroups and isolation (https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/on-premise.html#yarn-3-1-3-with-isolation-and-gpu-scheduling-enabled). It's not clear to me whether you got that all working. The YARN docs are at https://hadoop.apache.org/docs/r3.1.3/hadoop-yarn/hadoop-yarn-site/UsingGpus.html. This can certainly be a bit tricky to set up, and there are lots of different options we don't go into because they are outside the scope of just running Spark with Spark RAPIDS.

If you did not get isolation configured, see https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/on-premise.html#yarn-without-isolation. You can use that as a workaround; in that case YARN does not schedule the GPUs, so you have to make sure your Spark app is placed on nodes with GPUs, and then the discovery script, together with the GPUs being in EXCLUSIVE_PROCESS mode, will simply use whatever is there. This route gets tricky because you have to size everything you request appropriately so you don't end up with too many executors on a node and not enough GPUs.
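To make that sizing point concrete, here is a rough sketch with made-up node dimensions (4 GPUs and 48 cores per node; these numbers are illustrative, not from this thread). The idea is that the CPU and GPU requests should agree so that no more executors fit on a node than there are GPUs:

```
# Hypothetical GPU node: 4 GPUs, 48 cores
spark.executor.resource.gpu.amount 1
spark.executor.cores 12
# 48 cores / 12 cores per executor => at most 4 executors per node, one per GPU
spark.task.cpus 3
spark.task.resource.gpu.amount 0.25
# 12 / 3 = 4 concurrent tasks per executor, matching 1 GPU / 0.25 GPU per task
```

spark.executor.memory would need similar sizing so that memory does not allow more than four executors on a node either.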
If you have a set of nodes and are trying to target a specific one, the discovery script will not do it. You have to work through the mechanisms YARN provides for this, for instance node labels or queues (see the sketch below). Why do you want to target a specific node? Does it have specific hardware on it that you want?
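For instance, if the cluster admin defines a YARN node label and attaches it to the GPU nodes (the label name `gpu` below is just an assumption for illustration), the Spark side only needs the node label expressions:

```
spark.yarn.am.nodeLabelExpression gpu
spark.yarn.executor.nodeLabelExpression gpu
```

This only works when node labels are enabled and configured on the YARN side, which is a cluster-level change.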
Generally, pinning work to specific nodes isn't recommended with either resource manager. I do get the affinity part, though; I think what you really want is to tell it to prefer consolidating containers rather than spreading them across the nodes. But I am curious: are you seeing a big performance difference in your applications by grouping them onto the same nodes?
There is still an open issue in Spark to support YARN placement constraints (https://issues.apache.org/jira/browse/SPARK-26867). Beyond that, I don't know of an easy way to do it.
Kubernetes has some node affinity and pod topology features. There are other add-on schedulers too, but I haven't tried either of them for what you are asking here, so I don't have an answer for you.