-
Hey! I've been reading your guide about tuning Spark-RAPIDS. I am currently getting errors running a job due to what I assume is poor tuning. The error is:
I am running on one node with a Tesla P100 (16 GB), 117 GB of RAM, and 14 CPUs. These are the parameters I used, with an explanation of how I chose each one:
I am still somewhat confused about which configurations I should try so the system does not run out of memory, or better yet, gets the best possible runtime. Can anyone clarify how I should be tuning in this situation? In particular, I am having trouble seeing how these settings interact. For context, my input is about 650 GB of strings and my output is about 300 GB of two-column tables. Thanks!
Replies: 2 comments
-
We try to explain this in this section: https://nvidia.github.io/spark-rapids/docs/tuning-guide.html#number-of-tasks-per-executor. It sounds like we need to clarify it more.
This (spark.task.resource.gpu.amount) controls how many tasks can run per executor on the CPU side. Our plugin only supports 1 GPU per executor, so this is the reciprocal of how many tasks you want per executor, which is controlled by the spark.executor.cores configuration. So in your case you have 5 cores per executor, which means you can have 5 tasks per executor if spark.task.resource.gpu.amount is set to 1/5. This is purely how Spark does its scheduling, even without the RAPIDS plugin enabled. If you change either the executor cores or spark.task.resource.gpu.amount, it can affect the number of tasks per executor allowed to run from Spark's scheduling point of view.
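As a concrete sketch of that relationship (the values here are illustrative for the 5-cores-per-executor case discussed above, not a recommendation for your specific node):

```shell
# Illustrative only: 5 CPU cores per executor, so each task claims 1/5 (0.2)
# of the single GPU. Spark's scheduler will then allow up to 5 concurrent
# tasks per executor: executor.cores / task gpu amount = 5 / (5 * 0.2) = 5.
spark-submit \
  --conf spark.executor.cores=5 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.2 \
  ...
```

If you instead set spark.task.resource.gpu.amount=1, only one task could be scheduled on the executor at a time, even with 5 cores available.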
spark.rapids.sql.concurrentGpuTasks is purely a RAPIDS plugin configuration that controls how many of those tasks can run GPU operations at a time. If you are running out of memory, you either need to decrease the size of data each task processes at once (https://nvidia.github.io/spark-rapids/docs/tuning-guide.html#columnar-batch-size) or decrease spark.rapids.sql.concurrentGpuTasks. You may have seen this in the third paragraph of https://nvidia.github.io/spark-rapids/docs/tuning-guide.html#number-of-concurrent-tasks-per-gpu. Can you tell us what type of operation is happening when it runs out of memory?
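As a hedged starting point for a 16 GB GPU that is hitting out-of-memory errors (the exact values are illustrative and should be tuned against your own job, per the tuning guide linked above):

```shell
# Illustrative OOM mitigation: fewer tasks touching the GPU at once,
# and smaller columnar batches per task.
spark-submit \
  --conf spark.rapids.sql.concurrentGpuTasks=1 \
  --conf spark.rapids.sql.batchSizeBytes=536870912 \
  ...
```

Lowering concurrentGpuTasks trades some GPU utilization for headroom; shrinking the batch size reduces the peak memory each task needs, which matters with large string columns like yours.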
-
Please reopen the issue if you still have any questions.