This repository has been archived by the owner on Mar 12, 2021. It is now read-only.
When doing repeated matrix-vector multiplications with a 10000×10000 matrix, performance drops significantly (by a factor of 50 to 100) after ~250 iterations. This is probably related to #323, even though the code should (ideally) not allocate additional arrays, I think, and nvidia-smi also reports that memory usage stays under 1 GB.
using CuArrays
n = 10000
a = rand(Float32, n) |> CuArray
b = rand(Float32, n) |> CuArray
c = rand(Float32, n, n) |> CuArray
for i in 1:1000
    @time b .= c * a
end
I tried it on different machines under Julia 1.1.0 with a Tesla M10, GTX 690, and GTX 1070 Ti, running Ubuntu 18.04 (CUDA 10.1), Ubuntu 16.04 (CUDA 8.0), and Arch Linux (CUDA 10.1), respectively. I tried both add CuArrays and add CuArrays#master.
Questions:
Can you reproduce this?
Is there some obvious problem I'm not seeing?
If it is GC-related: is there an easy way to prevent the allocations / the slowdown?
Regarding matrix multiplications, it seems like using mul! is more efficient. With the code
using CuArrays
using LinearAlgebra  # provides mul!
n = 10000
a = rand(Float32, n) |> CuArray
b = rand(Float32, n) |> CuArray
c = rand(Float32, n, n) |> CuArray
for i in 1:2000
    @time mul!(b, c, a)
end
it takes about 0.000013 seconds per iteration at first. However, it still slows down to 0.002 sec/iteration after about 1000 iterations.
By the way, 0.002 sec/iteration is still 10-20x faster than doing the same computation on the CPU (two 10-core Intel(R) Xeon(R) Silver 4114 CPUs @ 2.20GHz, with the MKL backend), which is already quite satisfactory. Maybe something unexpected is happening with the timing?
Thanks for your response. I took a closer look at CuArrays.jl and realized that I was tricked by the asynchronous nature of CuArrays. Replacing @time ... with @time CuArrays.@sync ... shows the "slower" runtime right from the beginning. I was just confused because I saw no real performance improvement for my algorithm using the M10 compared to the server-CPU - but that just seems to be the harsh truth.
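To make the fix above concrete, here is a minimal sketch of the synchronized benchmark (assuming a CUDA-capable device; CuArrays.@sync is the macro from the CuArrays.jl package discussed in this thread). The iteration count of 5 is arbitrary, just enough to see the steady-state timing:

```julia
using CuArrays
using LinearAlgebra  # provides mul!

n = 10000
a = rand(Float32, n) |> CuArray
b = rand(Float32, n) |> CuArray
c = rand(Float32, n, n) |> CuArray

# GPU kernel launches are asynchronous: without synchronization, @time only
# measures the cost of queueing the operation, not of running it. Wrapping
# the call in CuArrays.@sync blocks until the GPU finishes, so @time reports
# the true per-iteration cost from the very first iteration.
for i in 1:5
    @time CuArrays.@sync mul!(b, c, a)
end
```

This explains the apparent "slowdown": the early iterations only timed the launch overhead, while later iterations implicitly waited on the queue once it filled up.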