-
I'm at the beginning of converting our HUGE data/Spark code from the CPU to the GPU. While I have extensive experience with GPUs/CUDA, I find it a bit hard to pinpoint exactly what the performance limitations I'm seeing are. Furthermore, I'm looking at many queries with different issues (such as unsupported features, strings, Parquet issues, etc.). I've followed all the performance tuning guides and the suggested actions there, but I would still be happy to get more insights and assistance if possible :) I am currently looking at the following query. CPU time is roughly the same as GPU time.
Fields x1 to x26 are either strings or booleans. Attached is the nvprof output. If I understand correctly, 50% of the compute time is string-related, and 20% is decoding the Parquet page information (i.e. not even decompressing the data itself)? I guess my questions are:
Any further assistance is more than welcome :)
-
What version of the RAPIDS Accelerator are you using? There have been recent performance optimizations in the newer releases. The profile traces also show a large Parquet decode time, which we've seen with some input files; cc @nvdbaranec, who is currently working on optimizing those cases. GPU traces are nice for figuring out what's going on with the GPU, but they don't do the best job of conveying what's happening at a high level with the query. I would suggest looking at the Spark SQL and job web UIs of the CPU and GPU queries to see if there are indications there of where the bottlenecks are. For example, which stage(s) are consuming the most time? Which operations within those stages appear to be the most expensive according to the SQL metrics? If it's an initial stage loading Parquet, does the buffer time greatly exceed the GPU decode time, indicating the query has a notable I/O bottleneck?
This is related to the "where are we spending all the time" question. I cannot tell just from the trace above whether reading Parquet is the real bottleneck for this query or not, since I cannot see the query plans and how much time is spent in each stage (or nodes within stages). The Parquet decode kernel time is pretty big, so I suspect it is a significant contributor. You could try varying the number of input tasks (e.g.: via changing the max input partition size config) to see if that noticeably changes the performance. That would indicate whether scaling the input data per task is helping increase the GPU efficiency.
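(Not from the original reply, but a minimal sketch of the experiment suggested above, assuming a Scala Spark session; the input path, stand-in query, and partition sizes are placeholders.)

```scala
// Sketch only: sweep spark.sql.files.maxPartitionBytes to vary the number of
// input tasks and see whether larger/smaller per-task input changes runtime.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-size-sweep").getOrCreate()

for (size <- Seq("128m", "256m", "512m", "1g")) {
  spark.conf.set("spark.sql.files.maxPartitionBytes", size)
  val start = System.nanoTime()
  spark.read.parquet("/path/to/input")   // placeholder input path
    .filter("x1 is not null")            // stand-in for the real query
    .count()
  val seconds = (System.nanoTime() - start) / 1e9
  println(f"maxPartitionBytes=$size -> $seconds%.1f s")
}
```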
This is because the first step is figuring out where the bottleneck is at a high level. The qualification/explain tool can show whether parts of the query are not running on the GPU and therefore incurring CPU<->GPU transition overhead. The profiling tool can examine the eventlog and extract metrics, which is similar to examining the Spark SQL web UI for your query. Once we have a high-level idea of where the bottleneck in the query is, we can focus more on what's going on in that area of the query.
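(A side note, not from the original reply: a quick way to see CPU fallbacks without running the standalone tools is the plugin's explain config; the query string below is a placeholder.)

```scala
// Sketch: have the RAPIDS plugin log which operators could not be placed on the GPU.
// Valid values for this config include NONE, NOT_ON_GPU, and ALL.
val yourQuery = "SELECT ..."  // placeholder for the actual query
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")
spark.sql(yourQuery).collect()
// The driver log will then list each operator that fell back to the CPU and why.
```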
This is difficult to automatically determine. The GPU parallelism of a task read is dependent on how many columns you're loading and how many data buffers are being loaded per column. The worst-case scenario is loading only one column that has only a single, large buffer to decompress and decode, as there aren't very many opportunities for parallelism there (i.e.: compression decode, Parquet decode). Having multiple buffers that need compression and Parquet decoding is where a lot of the GPU parallelism (and therefore performance) is derived during Parquet loads.
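(Not part of the original answer, but if you also control how the input files are written, one knob that influences how many buffers exist per column is the Parquet row group size; a rough sketch with an illustrative value and placeholder names follows.)

```scala
// Sketch: write Parquet with smaller row groups so each column chunk is split into
// more independently decodable pieces, giving the GPU more parallel work per file.
// The Parquet default block (row group) size is 128 MB; 32 MB here is illustrative.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)
df.write.parquet("/path/to/output")  // df and the output path are placeholders
```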
-
@jlowe I'm using RAPIDS Accelerator 21.12.0 with cudf 21.12.0. Changing spark.sql.files.maxPartitionBytes and spark.rapids.sql.concurrentGpuTasks didn't produce a significant performance change. Going through the explain plan, I see a few non-GPU-compliant ops; maybe this can explain the low performance. As for the last item, regarding the data size fed into the GPU, I wasn't clear about what I meant. I understand that saying "this work is too small/big for the GPU" is generally not trivial/doable. I was thinking more along the lines that RAPIDS could report how many columns/rows/Parquet pages (and any other performance-relevant properties) were used, and maybe that would give hints as to whether the Spark configuration/Parquet files are a reason for underutilizing the GPU. Hope it makes sense. Attached is the plan file for this query. I've replaced all field names as they can't be shared. Any further ideas would be greatly appreciated.
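(An aside, not from the original message: one way to dump a physical plan to a text file for redaction and sharing is Spark 3's EXPLAIN FORMATTED; the query and output path are placeholders.)

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Sketch: capture the formatted physical plan as text so it can be redacted and attached.
val yourQuery = "SELECT ..."  // placeholder for the actual query
val planText = spark.sql(s"EXPLAIN FORMATTED $yourQuery")
  .collect()
  .map(_.getString(0))
  .mkString("\n")
Files.write(Paths.get("/tmp/plan.txt"), planText.getBytes(StandardCharsets.UTF_8))
```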
-
Some Spark UI screenshots. `== Physical Plan ==`
-
Given the relatively high cost of the contiguous_split kernel from your traces, I suggest updating to the 22.02 release of the RAPIDS Accelerator and cudf. That includes some performance fixes for contiguous_split that may help your use-case.
According to the physical plan and stage runtime statistics, I don't think the impact of this is significant. The only operations that aren't running on the GPU are the Project and CollectLimit occurring right at the end of the query. This is the last stage of the job, which appears to have taken only 0.4 seconds, while the main stage took 1.1 minutes. Also note that the SQL metrics show only 26,080 rows went through that part, a small fraction of the 33+ million rows we started with. This job appears to be almost entirely about the initial stage, and specifically the Parquet load within that stage, which explains why changing other parts of the query doesn't seem to help much. Looking at the metrics associated with the Parquet load, buffer time is significant: the average task spends 3.6 seconds just fetching the data from the distributed filesystem, while the GPU takes only 305 milliseconds to decode it afterwards. So I/O overhead appears to be a significant factor for this query, which helps explain the low GPU utilization.
This isn't something that RAPIDS cudf supports reporting back as a side product of the load, although one could gather those metrics separately via Parquet tools that examine the footers of the Parquet files. That said, given the metrics above, this seems to be more about waiting for I/O than waiting for the GPU: the metrics show tasks spending over 10x as long waiting for raw Parquet data from the distributed filesystem as waiting for the GPU to decompress and decode it.
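(Not from the original reply: as a sketch of "Parquet tools that examine the footer", the parquet-mr API, callable from Scala, can report row groups and column chunks per file; the path is a placeholder, and parquet-hadoop needs to be on the classpath, which Spark already ships.)

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

import scala.collection.JavaConverters._  // Scala 2.12-style converters

// Sketch: inspect a Parquet footer. The number of row groups and column chunks per
// file is a rough proxy for how much parallel decode work the GPU gets per task.
val input  = HadoopInputFile.fromPath(new Path("/path/to/part-00000.parquet"), new Configuration())
val reader = ParquetFileReader.open(input)
try {
  val blocks = reader.getFooter.getBlocks.asScala
  println(s"row groups: ${blocks.size}")
  blocks.zipWithIndex.foreach { case (block, i) =>
    println(s"  row group $i: rows=${block.getRowCount}, column chunks=${block.getColumns.size}, " +
            s"compressed bytes=${block.getCompressedSize}")
  }
} finally {
  reader.close()
}
```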