Sum function is slow #679
Please try again with latest master:
It's not surprising that dot, which dispatches to highly-optimized CUBLAS kernels in this case, performs well.
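For context, here is a sketch of the two call paths being compared (assuming CuArrays is installed and functional; the specific CUBLAS routine named in the comment is an assumption):

```julia
using CuArrays
using LinearAlgebra: dot

x = CuArrays.rand(Float32, 1024)

dot(x, x)  # lowers to a CUBLAS dot kernel (presumably cublasSdot for Float32)
sum(x)     # goes through GPUArrays' generic mapreduce machinery
```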
Makes sense regarding the dot product. Master doesn't work for me (though the release version still works). Here is the output:
You need to upgrade both GPUArrays and CuArrays to master.
Perfect, that worked. Thanks for having a look!
With smaller arrays the CUDA performance is rather abysmal. For N = 256^2:
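The timing output was lost in extraction; the comparison would have looked roughly like this (a sketch, assuming BenchmarkTools alongside the setup from the MWE):

```julia
using CuArrays, BenchmarkTools

N = 256^2
x = CuArrays.rand(Float32, N)
x_cpu = Array(x)  # host copy, as a CPU baseline

@btime CuArrays.@sync sum($x)  # GPU reduction; launch overhead dominates at this size
@btime sum($x_cpu)             # plain CPU sum: no kernel launch, no transfer
```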
It's actually a bit worse than this - I should have included
This gives me:
If the issue is worth reopening, perhaps it should be moved to CUDA.jl?
Sure, feel free to open an issue about the performance on small arrays. Do know that the launch overhead alone is already several microseconds, let alone transferring the memory, so it's never going to be fast. And it's not possible to fall back to a CPU-based implementation here; that should be done at a higher level.
What do you mean by transferring the memory? Normally the array is in GPU memory to begin with if the code uses CuArrays. Falling back to a CPU sum is not very fast either, because that does involve a memory transfer: for small arrays it's faster than the CUDA sum, but much slower than it would be without the transfer. I added a case for this in the example:
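The added case presumably timed a device-to-host copy followed by a CPU sum, along these lines (a sketch; the actual snippet was lost in extraction):

```julia
using CuArrays, BenchmarkTools

N = 256^2
x = CuArrays.rand(Float32, N)

# CPU fallback: Array(x) performs the device-to-host transfer,
# which dominates the total cost for small arrays
@btime sum(Array($x))
```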
Output:
Describe the bug
Summing a vector is very slow: slower than allocating a new vector and taking a dot product, and much slower than taking a dot product with a pre-allocated vector.
To Reproduce
The Minimal Working Example (MWE) for this bug:
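The snippet itself was lost in extraction; a minimal sketch of the comparison the report describes (direct sum vs. dot with a freshly allocated and with a pre-allocated vector of ones; the size N is illustrative, not from the report):

```julia
using CuArrays, BenchmarkTools
using LinearAlgebra: dot

N = 2^20                       # illustrative size
x = CuArrays.rand(Float32, N)
o = CuArrays.ones(Float32, N)  # pre-allocated vector of ones

@btime CuArrays.@sync sum($x)                              # the slow path reported here
@btime CuArrays.@sync dot(CuArrays.ones(Float32, $N), $x)  # allocates ones each time
@btime CuArrays.@sync dot($o, $x)                          # pre-allocated dot (fastest per the report)
```

Note that dot(o, x) with a vector of ones is mathematically the same reduction as sum(x), which is what makes the comparison meaningful.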
Output:
Expected behavior
Fast summing
Build log
Environment details (please complete this section)
Details on Julia:
CUDA: toolkit and driver version
Toolkit: 10.2
Driver: 441.66