-
**Using a certain number of GPUs for Spark on YARN**

This question is an extension of #9440, which asks how to use a certain number of GPUs in standalone mode. I recently migrated to YARN because I had issues getting RAPIDS to work with k8s, but the approach that worked for standalone mode does not seem to carry over. The discovery script appears to be only a suggestion: Spark still sets up an executor on a node even when the script returns an empty address array there. That is, if I have a set of nodes and try to make the job run on a specific node by having the discovery script report 0 GPUs on all the other nodes, it still uses one of those other nodes. I tried writing another discovery script as well, but it still does not work.

Custom discovery script:

```ruby
#!/usr/bin/env ruby
# frozen_string_literal: true

require 'json'
require 'rb_json5'
require 'socket'

# Number of GPU addresses each node is allowed to advertise to Spark.
NUM_GPUS = RbJSON5.parse('{"node1":0,"node2":0,"node3":0,"node4":1}')
hostname = Socket.gethostname

# Print the resource JSON Spark expects, limiting the GPU indices reported
# on this host to the count configured above.
puts({
  name: 'gpu',
  addresses: `nvidia-smi --query-gpu=index --format=csv,noheader`.split("\n").first(NUM_GPUS[hostname])
}.to_json)
```
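For reference, Spark expects a discovery script to print a single JSON object with a resource name and an array of addresses. Assuming nvidia-smi on node4 reports one GPU with index 0, the script above would print something like:

```json
{"name":"gpu","addresses":["0"]}
```

while on node1 through node3 the address list would be empty: `{"name":"gpu","addresses":[]}`.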
Even with this script, YARN would not necessarily choose node4. I suppose I could set the GPU maximum-allocation in yarn-site.xml, which I couldn't get to work, probably because I used the wrong option, but I would prefer not to change any node-level files or options, since that could affect other applications and I only want this to apply to the application I am about to run.

I use spark-rapids 24.08. I originally used the 24.10 jar because I was on Spark 3.5.2, but I ran into other weird issues, so I switched back to Spark 3.5.1. I am not sure whether those issues are still present, though, since now I cannot use the Python daemon. If it matters, this is my Spark configuration. Both the exclusive-mode discovery plugin and the Docker options are there because YARN does not seem to use isolation with Docker even with this configuration.

Spark configuration:

```
spark.master yarn
spark.rapids.sql.concurrentGpuTasks 2
spark.driver.memory 100g
spark.executor.memory 50g
spark.executor.cores 4
spark.executor.resource.gpu.amount 1
spark.task.cpus 1
spark.task.resource.gpu.amount 0.25
spark.rapids.memory.pinnedPool.size 50G
spark.sql.files.maxPartitionBytes 512m
spark.plugins com.nvidia.spark.SQLPlugin
spark.executor.resource.gpu.discoveryScript ./get_gpus_resources.rb
spark.executorEnv.PYTHONPATH [PATH]/rapids-4-spark_2.12-24.08.1.jar
spark.jars [PATH]/rapids-4-spark_2.12-24.08.1.jar
spark.rapids.ml.uvm.enabled true
spark.dynamicAllocation.enabled false
spark.executor.extraJavaOptions "-Duser.timezone=UTC"
spark.driver.extraJavaOptions "-Duser.timezone=UTC"
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.rapids.memory.gpu.pool NONE
spark.sql.execution.sortBeforeRepartition false
spark.rapids.sql.python.gpu.enabled true
spark.rapids.memory.pinnedPool.size 50g
spark.rapids.sql.batchSizeBytes 128m
spark.sql.adaptive.enabled false
spark.executorEnv.UCX_ERROR_SIGNALS ""
spark.executorEnv.UCX_MEMTYPE_CACHE n
spark.executorEnv.UCX_IB_RX_QUEUE_LEN 1024
spark.executorEnv.UCX_TLS cuda_copy,cuda_ipc,tcp,rc
spark.executorEnv.UCX_RNDV_SCHEME put_zcopy
spark.executorEnv.UCX_MAX_RNDV_RAILS 1
spark.executorEnv.NCCL_DEBUG INFO
spark.resources.discoveryPlugin com.nvidia.spark.ExclusiveModeGpuDiscoveryPlugin
spark.rapids.shuffle.manager com.nvidia.spark.rapids.spark351.RapidsShuffleManager
spark.rapids.shuffle.mode UCX
spark.shuffle.service.enabled false
spark.executorEnv.CUPY_CACHE_DIR /tmp/cupy_cache
spark.executorEnv.NCCL_DEBUG INFO
spark.submit.deployMode client
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE docker
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE docker
spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.hadoop.fs.s3a.access.key ...
spark.hadoop.fs.s3a.secret.key ...
spark.hadoop.fs.s3a.endpoint ...
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.attempts.maximum 1
spark.hadoop.fs.s3a.connection.establish.timeout 1000000
spark.hadoop.fs.s3a.connection.timeout 1000000
spark.hadoop.fs.s3a.connection.request.timeout 0
spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory file:///tmp/spark-events
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.logConf true
spark.executor.heartbeatInterval 1000000s
spark.network.timeout 10000001s
spark.sql.broadcastTimeout 1000000
spark.executorEnv.NCCL_DEBUG WARN
spark.driverEnv.NCCL_DEBUG WARN
spark.pyspark.python [PATH]/.venv/bin/python
spark.pyspark.driver.python [PATH]/.venv/bin/python
```
-
So it sounds like you are having issues setting up the resource manager side of things (k8s or YARN). There are a lot of factors that come into play here. k8s and YARN each have their own way of discovering and scheduling resources, and that is intertwined with what Spark does to find a specific GPU.

If you are running on YARN, the best approach is to configure it for cgroups and isolation (https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/on-premise.html#yarn-3-1-3-with-isolation-and-gpu-scheduling-enabled). It's not clear to me whether you got that all working. The YARN docs are at https://hadoop.apache.org/docs/r3.1.3/hadoop-yarn/hadoop-yarn-site/UsingGpus.html. This can certainly be a bit tricky to set up, and there are lots of different options we don't go into because they are outside the scope of just running Spark with Spark RAPIDS.

If you did not get isolation configured, see https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/on-premise.html#yarn-without-isolation. You can use that as a workaround; in that case YARN does not schedule the GPUs, so you have to make sure your Spark app is placed on nodes with GPUs, and then the discovery script, together with the GPUs being in EXCLUSIVE_PROCESS mode, will simply use whatever is there. This route gets tricky because you have to size everything you request appropriately so you don't end up with too many executors on a node and not enough GPUs.
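To make that sizing point concrete, here is a rough sketch with made-up node dimensions (4 GPUs and 48 cores per node; these numbers are illustrative, not from this thread). The idea is that the CPU and GPU requests should agree so that no more executors fit on a node than there are GPUs:

```
# Hypothetical GPU node: 4 GPUs, 48 cores
spark.executor.resource.gpu.amount 1
spark.executor.cores 12
# 48 cores / 12 cores per executor => at most 4 executors per node, one per GPU
spark.task.cpus 3
spark.task.resource.gpu.amount 0.25
# 12 / 3 = 4 concurrent tasks per executor, matching 1 GPU / 0.25 GPU per task
```

spark.executor.memory would need similar sizing so that memory does not allow more than four executors on a node either.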
If you have a set of nodes and are trying to target a specific one, the discovery script will not do it. You have to work through the mechanisms YARN provides for this, for instance node labels or queues (see the sketch below). Why do you want to target a specific node? Does it have specific hardware on it that you want?
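For instance, if the cluster admin defines a YARN node label and attaches it to the GPU nodes (the label name `gpu` below is just an assumption for illustration), the Spark side only needs the node label expressions:

```
spark.yarn.am.nodeLabelExpression gpu
spark.yarn.executor.nodeLabelExpression gpu
```

This only works when node labels are enabled and configured on the YARN side, which is a cluster-level change.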
Generally, pinning work to specific nodes isn't recommended with either resource manager. I do get the affinity part, though; I think what you really want is to tell it to prefer consolidating containers rather than spreading them across the nodes. But I am curious: are you seeing a big performance difference in your applications by grouping them onto the same nodes?
There is still an open issue in Spark to support YARN placement constraints (https://issues.apache.org/jira/browse/SPARK-26867). Beyond that, I don't know of an easy way to do it.
Kubernetes has some node affinity and pod topology features. There are other add-on schedulers too, but I haven't tried either of them for what you are asking here, so I don't have an answer for you.