-
Hi @hikrishn, thanks for the report. We're going to need more information to help diagnose the problem. Ideally, we could see the event logs from the CPU and GPU queries so we can see how the query was translated to the GPU, whether any parts were not able to be translated, and where time was spent during both the CPU and GPU queries. If you're not willing to share those publicly but are willing to do so privately, you could initiate a discussion at [email protected]. You could also try running the profiling tool to see where time is being spent in the queries and potentially gain some insights. Some basic questions:
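For reference, here's a sketch of invoking the profiling tool against the event logs. The jar version, output directory, and event log paths are placeholders for this example; check the RAPIDS tools documentation for the exact usage of your version:

```shell
# Sketch only -- jar version and paths are placeholders, adjust to your environment.
# Runs the RAPIDS Accelerator profiling tool over the CPU and GPU event logs
# and writes per-application reports (SQL plan info, operator metrics, properties).
java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --output-directory ./profile-out \
  s3a://<bucket>/spark-history/<cpu-eventlog> \
  s3a://<bucket>/spark-history/<gpu-eventlog>
```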
-
Thanks for the profiling log, this adds a lot more visibility into what could be problematic. There are a number of interesting things in the log, but the first thing that jumps out is that the SQL plan appears to have a Project that was not translated to the GPU. There are GpuColumnarToRow and GpuRowToColumnar operations with over 2 billion rows flowing through them, and that's very expensive. Is there a UDF or something similar in the plan?

The driver log should have had a warning emitted when this query ran stating that something was not translated to the GPU (search the driver logs for "cannot run on GPU because"). If for some reason you don't see that in the log, you can set spark.rapids.sql.explain=NOT_ON_GPU and you should see WARN messages in the driver log for this. I believe this is a large portion of the slowdown, as all of the input data is going through a very slow path.

If it is indeed a UDF causing the fallback, the RAPIDS Accelerator is not likely to have a quick solution for translating it to the GPU. However, we do have ways for users to implement a GPU version of a UDF to accelerate it. If it's falling back due to some Spark Catalyst expression that we have not implemented, that would be good to know as well, as we could try to prioritize implementing it.

Another curiosity in the profile log is that spark.sql.shuffle.partitions is set to 600 on the CPU but only 4 on the GPU. 4 seems pretty bad: the GPU cluster can run 88 tasks in parallel, but only 4 tasks will be able to run in the intermediate stages. I'm guessing the idea behind setting this to 4 is that there are 4 GPUs in the cluster, but there's no guarantee Spark will schedule the 4 shuffle partitions on separate nodes, and if it doesn't, that would be bad for performance. Even with a concurrent GPU threads setting of 9, the GPU won't have enough SMs to run each thread's work at full throughput (i.e., some kernels will fight for resources between the threads).
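To illustrate the log search described above, here is a small sketch that scans driver-log text for the fallback warnings. The sample log lines are made up for demonstration; real GpuOverrides messages will name the actual operators and expressions in your plan:

```python
def find_gpu_fallbacks(log_text):
    """Collect driver-log lines explaining why an operator or expression
    stayed on the CPU ("cannot run on GPU because ...")."""
    return [line.strip() for line in log_text.splitlines()
            if "cannot run on GPU because" in line]

# Hypothetical sample driver-log excerpt (not from the actual run).
sample = """\
23/05/01 INFO GpuOverrides: plan conversion starting
23/05/01 WARN GpuOverrides: !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
23/05/01 WARN GpuOverrides: !Expression <ScalaUDF> myUdf cannot run on GPU because no GPU implementation was provided
"""

for hit in find_gpu_fallbacks(sample):
    print(hit)
```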
In addition, throwing all the intermediate data at only 4 tasks means those 4 tasks have a lot of input. With too much input, the processing starts to spill to avoid dealing with batches that are too large, risking either blowing GPU memory or exceeding the cudf dataframe row count limits. We see evidence of this in GpuHashAggregate's sort time metric, implying that it is starting to perform out-of-core processing because the batch sizes are getting too big. I recommend setting the shuffle partition count to at least the task parallelism of the cluster (88 in the GPU cluster), and maybe higher if we still see significant out-of-core processing happening in the hash aggregates.

Speaking of batch sizes, I see a setting for spark.rapids.shuffle.batchSize. That's not a config the RAPIDS Accelerator recognizes, which is a good thing given all the batchSize configs are in bytes, and a 2K batch size would be really bad for performance. 😄

Executive summary: we need to figure out why that first Project in the query plan is not translating to the GPU and ideally solve that. Second, I'd look into increasing the shuffle partition count to make the hash aggregate processing more efficient. The good news is that, according to the metrics, the GPU seems to be doing very well on some portions of the query; the Parquet scan seems to be a lot faster. If we can get the query to run entirely on the GPU and hopefully avoid sort fallbacks in the hash aggregate, I think we could see excellent results for this query.
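To make the tuning suggestions concrete, here is a hypothetical revision of the GPU-side settings. The values are illustrative only: 88 comes from the cluster task parallelism discussed above, and spark.rapids.sql.batchSizeBytes is the byte-based batch size config the plugin actually recognizes (unlike spark.rapids.shuffle.batchSize):

```shell
# Illustrative values only -- tune for your cluster.
--conf spark.sql.shuffle.partitions=88            # at least the cluster's task parallelism; raise if aggregates still spill
--conf spark.rapids.sql.explain=NOT_ON_GPU        # log only the operators that fell back to the CPU
--conf spark.rapids.sql.batchSizeBytes=1073741824 # batch sizes are specified in bytes (1 GiB here)
```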
-
I have an Airflow DAG that reads Parquet data and executes a query that writes the result table to another S3 location.
GPU instance type we are using: g5.12xlarge
CPU instance type: m5.4xlarge
My spark-submit for the GPU run is as below:
/home/airflow/.local/lib/python3.8/site-packages/pyspark/bin/spark-submit --master k8s://https://xxxxx.eks.amazonaws.com --deploy-mode cluster --name xxx --conf spark.driver.extraJavaOptions=-Xms256m --conf spark.kubernetes.driver.label.project_name=xxx-PROJ --conf spark.kubernetes.driver.label.module_name=sql --conf spark.kubernetes.executor.label.project_name=xxx-PROJ --conf spark.kubernetes.executor.label.module_name=sql --conf spark.kubernetes.namespace=xxx-qa --conf spark.kubernetes.file.upload.path=file:///tmp --conf spark.kubernetes.container.image=xxx:5000/com/xxxx:1.2.3-58-xxxx --conf spark.eventLog.dir=s3a://xxxx-s3/spark-history --conf spark.eventLog.enabled=true --conf spark.driver.memory=15G --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.driverEnv.xxx_DATA_ROOT=s3a://xxxx-s3/var/xxx --conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=k8s://https://xxxx.eks.amazonaws.com --conf spark.kubernetes.container.image.pullPolicy=Always --conf spark.kubernetes.container.image.pullSecrets=xxxx-secret --conf spark.kubernetes.driverEnv.S3_ENDPOINT_URL= --conf spark.kubernetes.driver.podTemplateFile=/opt/airflow/conf/default-driver-pod.yaml --conf spark.kubernetes.executor.podTemplateFile=/opt/airflow/dags/s3/xxx-test/executor-pod-template-gpu.yaml --conf spark.executor.extraClassPath=/opt/xxx/spark/jars/* --conf spark.driver.extraClassPath=/opt/xxx/spark/jars/* --conf spark.shuffle.service.enabled=false --conf spark.dynamicAllocation.enabled=false --conf spark.kubernetes.driverEnv.MONGO_URL=xxx --class xxx.xxx.sql.ng.xxSQLComponent local:///opt/xxx/xxx-xxx/lib/xxx-full.jar -a tcxxx-PROJ -b PROJ -c s3a://xxx-xxx-s3//apps/xng-test/xxx/tcxxx/tcxxx_sql.jconf
--conf spark.driver.memory=15G
--conf spark.executor.instances=3
--conf spark.executor.resource.gpu.discoveryScript=/opt/xxx/sparkRapidsPlugin/getGpusResources.sh
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.executor.memory=45G
--conf spark.executor.cores=14
--conf spark.executor.resource.gpu.vendor=nvidia.com
--conf spark.executor.resource.gpu.amount=1
--conf spark.task.resource.gpu.amount=0.20
--conf spark.rapids.sql.concurrentGpuTasks=5
--conf spark.executor.memoryOverhead=10G
--conf spark.rapids.memory.pinnedPool.size=8G
--conf spark.rapids.memory.host.spillStorageSize=8G
--conf spark.sql.files.maxPartitionBytes=6144MB
--conf spark.rapids.sql.metrics.level=DEBUG
--conf spark.rapids.sql.enabled=true
--conf spark.rapids.sql.explain=ALL
--conf spark.sql.shuffle.partitions=5
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager
--conf spark.rapids.shuffle.mode=UCX
--conf spark.shuffle.service.enabled=false
--conf spark.rapids.shuffle.enabled=true
--conf spark.rapids.shuffle.transport.enabled=true
--conf spark.dynamicAllocation.enabled=false
--conf spark.executorEnv.UCX_ERROR_SIGNALS=
--conf spark.executorEnv.UCX_MEMTYPE_CACHE=n
--conf spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024
--conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp
--conf spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy
--conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1
--conf spark.executor.extraClassPath=/opt/xxx/sparkRapidsPlugin/rapids-4-spark_2.12-23.02.0-cuda11.jar
--conf spark.driver.extraClassPath=/opt/xxx/sparkRapidsPlugin/rapids-4-spark_2.12-23.02.0-cuda11.jar
My GPU Run ->
My CPU Run ->
The below GPU component takes more time ->