-
Hi @hikrishn, thanks for the report. We're going to need more information to help diagnose the problem. Ideally, we could see the event logs from the CPU and GPU queries so we can see how the query was translated to the GPU, whether any parts were not able to be translated, and where time was spent during both the CPU and GPU queries. If you're not willing to share those publicly but are willing to do so privately, you could initiate a discussion at [email protected]. You could also try running the profiling tool to see where time is being spent in the queries and potentially gain some insights. Some basic questions:
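For reference, here's a sketch of invoking the profiling tool against the event logs. The jar version, output directory, and event log paths are placeholders for this example; check the RAPIDS tools documentation for the exact usage of your version:

```shell
# Sketch only -- jar version and paths are placeholders, adjust to your environment.
# Runs the RAPIDS Accelerator profiling tool over the CPU and GPU event logs
# and writes per-application reports (SQL plan info, operator metrics, properties).
java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --output-directory ./profile-out \
  s3a://<bucket>/spark-history/<cpu-eventlog> \
  s3a://<bucket>/spark-history/<gpu-eventlog>
```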
-
Thanks for the profiling log, this adds a lot more visibility into what could be problematic. There are a number of interesting things in the log, but the first thing that jumps out is that the SQL plan appears to have a Project that was not translated to the GPU. There are GpuColumnarToRow and GpuRowToColumnar operations with over 2 billion rows flowing through them, and that's very expensive. Is there a UDF or something similar in the plan?

The driver log should have had a warning emitted when this query ran stating that something was not translated to the GPU (search the driver logs for "cannot run on GPU because"). If for some reason you don't see that in the log, you can set spark.rapids.sql.explain=NOT_ON_GPU and you should see WARN messages in the driver log for this. I believe this is a large portion of the slowdown, as all of the input data is going through a very slow path.

If it is indeed a UDF causing the fallback, the RAPIDS Accelerator is not likely to have a quick solution for translating it to the GPU. However, we do have ways for users to implement a GPU version of a UDF to accelerate it. If it's falling back due to some Spark Catalyst expression that we have not implemented, that would be good to know as well, as we could try to prioritize implementing it.

Another curiosity in the profile log is that spark.sql.shuffle.partitions is set to 600 on the CPU but only 4 on the GPU. 4 seems pretty bad: the GPU cluster can run 88 tasks in parallel, but only 4 tasks will be able to run in the intermediate stages. I'm guessing the idea behind setting this to 4 is that there are 4 GPUs in the cluster, but there's no guarantee Spark will schedule the 4 shuffle partitions on separate nodes, and if it doesn't, that would be bad for performance. Even with a concurrent GPU threads setting of 9, the GPU won't have enough SMs to run each thread's work at full throughput (i.e., some kernels will fight for resources between the threads).
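To illustrate the log search described above, here is a small sketch that scans driver-log text for the fallback warnings. The sample log lines are made up for demonstration; real GpuOverrides messages will name the actual operators and expressions in your plan:

```python
def find_gpu_fallbacks(log_text):
    """Collect driver-log lines explaining why an operator or expression
    stayed on the CPU ("cannot run on GPU because ...")."""
    return [line.strip() for line in log_text.splitlines()
            if "cannot run on GPU because" in line]

# Hypothetical sample driver-log excerpt (not from the actual run).
sample = """\
23/05/01 INFO GpuOverrides: plan conversion starting
23/05/01 WARN GpuOverrides: !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
23/05/01 WARN GpuOverrides: !Expression <ScalaUDF> myUdf cannot run on GPU because no GPU implementation was provided
"""

for hit in find_gpu_fallbacks(sample):
    print(hit)
```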
In addition, throwing all the intermediate data at only 4 tasks means those 4 tasks have a lot of input. With too much input, the processing starts to spill to avoid dealing with batches that are too large, risking either blowing GPU memory or exceeding the cudf dataframe row count limits. We see evidence of this in GpuHashAggregate's sort time metric, implying that it is starting to perform out-of-core processing because the batch sizes are getting too big. I recommend setting the shuffle partition count to at least the task parallelism of the cluster (88 in the GPU cluster), and maybe higher if we still see significant out-of-core processing happening in the hash aggregates.

Speaking of batch sizes, I see a setting for spark.rapids.shuffle.batchSize. That's not a config the RAPIDS Accelerator recognizes, which is a good thing given all the batchSize configs are in bytes, and a 2K batch size would be really bad for performance. 😄

Executive summary: we need to figure out why that first Project in the query plan is not translating to the GPU and ideally solve that. Second, I'd look into increasing the shuffle partition count to make the hash aggregate processing more efficient. The good news is that, according to the metrics, the GPU seems to be doing very well on some portions of the query; the Parquet scan seems to be a lot faster. If we can get the query to run entirely on the GPU and hopefully avoid sort fallbacks in the hash aggregate, I think we could see excellent results for this query.
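To make the tuning suggestions concrete, here is a hypothetical revision of the GPU-side settings. The values are illustrative only: 88 comes from the cluster task parallelism discussed above, and spark.rapids.sql.batchSizeBytes is the byte-based batch size config the plugin actually recognizes (unlike spark.rapids.shuffle.batchSize):

```shell
# Illustrative values only -- tune for your cluster.
--conf spark.sql.shuffle.partitions=88            # at least the cluster's task parallelism; raise if aggregates still spill
--conf spark.rapids.sql.explain=NOT_ON_GPU        # log only the operators that fell back to the CPU
--conf spark.rapids.sql.batchSizeBytes=1073741824 # batch sizes are specified in bytes (1 GiB here)
```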
-
I have an Airflow DAG that reads Parquet data and executes a query that writes the result table to another S3 location.
GPU instance type we are using: g5.12xlarge
CPU instance type: m5.4xlarge
My spark-submit for the GPU run is as below:
/home/airflow/.local/lib/python3.8/site-packages/pyspark/bin/spark-submit --master k8s://https://xxxxx.eks.amazonaws.com --deploy-mode cluster --name xxx --conf spark.driver.extraJavaOptions=-Xms256m --conf spark.kubernetes.driver.label.project_name=xxx-PROJ --conf spark.kubernetes.driver.label.module_name=sql --conf spark.kubernetes.executor.label.project_name=xxx-PROJ --conf spark.kubernetes.executor.label.module_name=sql --conf spark.kubernetes.namespace=xxx-qa --conf spark.kubernetes.file.upload.path=file:///tmp --conf spark.kubernetes.container.image=xxx:5000/com/xxxx:1.2.3-58-xxxx --conf spark.eventLog.dir=s3a://xxxx-s3/spark-history --conf spark.eventLog.enabled=true --conf spark.driver.memory=15G --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.driverEnv.xxx_DATA_ROOT=s3a://xxxx-s3/var/xxx --conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=k8s://https://xxxx.eks.amazonaws.com --conf spark.kubernetes.container.image.pullPolicy=Always --conf spark.kubernetes.container.image.pullSecrets=xxxx-secret --conf spark.kubernetes.driverEnv.S3_ENDPOINT_URL= --conf spark.kubernetes.driver.podTemplateFile=/opt/airflow/conf/default-driver-pod.yaml --conf spark.kubernetes.executor.podTemplateFile=/opt/airflow/dags/s3/xxx-test/executor-pod-template-gpu.yaml --conf spark.executor.extraClassPath=/opt/xxx/spark/jars/* --conf spark.driver.extraClassPath=/opt/xxx/spark/jars/* --conf spark.shuffle.service.enabled=false --conf spark.dynamicAllocation.enabled=false --conf spark.kubernetes.driverEnv.MONGO_URL=xxx --class xxx.xxx.sql.ng.xxSQLComponent local:///opt/xxx/xxx-xxx/lib/xxx-full.jar -a tcxxx-PROJ -b PROJ -c s3a://xxx-xxx-s3//apps/xng-test/xxx/tcxxx/tcxxx_sql.jconf
--conf spark.driver.memory=15G
--conf spark.executor.instances=3
--conf spark.executor.resource.gpu.discoveryScript=/opt/xxx/sparkRapidsPlugin/getGpusResources.sh
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.executor.memory=45G
--conf spark.executor.cores=14
--conf spark.executor.resource.gpu.vendor=nvidia.com
--conf spark.executor.resource.gpu.amount=1
--conf spark.task.resource.gpu.amount=0.20
--conf spark.rapids.sql.concurrentGpuTasks=5
--conf spark.executor.memoryOverhead=10G
--conf spark.rapids.memory.pinnedPool.size=8G
--conf spark.rapids.memory.host.spillStorageSize=8G
--conf spark.sql.files.maxPartitionBytes=6144MB
--conf spark.rapids.sql.metrics.level=DEBUG
--conf spark.rapids.sql.enabled=true
--conf spark.rapids.sql.explain=ALL
--conf spark.sql.shuffle.partitions=5
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager
--conf spark.rapids.shuffle.mode=UCX
--conf spark.shuffle.service.enabled=false
--conf spark.rapids.shuffle.enabled=true
--conf spark.rapids.shuffle.transport.enabled=true
--conf spark.dynamicAllocation.enabled=false
--conf spark.executorEnv.UCX_ERROR_SIGNALS=
--conf spark.executorEnv.UCX_MEMTYPE_CACHE=n
--conf spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024
--conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp
--conf spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy
--conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1
--conf spark.executor.extraClassPath=/opt/xxx/sparkRapidsPlugin/rapids-4-spark_2.12-23.02.0-cuda11.jar
--conf spark.driver.extraClassPath=/opt/xxx/sparkRapidsPlugin/rapids-4-spark_2.12-23.02.0-cuda11.jar
My GPU Run ->
My CPU Run ->
The below GPU component takes more time ->