Performance issue with v2.1.0 compared with v1.7.3 #701

findmyway · 2020-05-03T17:25:46Z

Describe the bug
CuArrays v2.1.0 is slower than v1.7.3 for small models.

To Reproduce
The Minimal Working Example (MWE) for this bug:

(@v1.4) pkg> st
  [587475ba] Flux v0.10.4
  [3a865a2d] CuArrays v2.1.0 #master (https://github.com/JuliaGPU/CuArray
  [be33ccc6] CUDAnative v3.0.4

julia> using Flux,CuArrays

julia> model = Chain(
           Dense(4, 128, relu),
           Dense(128, 128, relu),
           Dense(128, 2),
       ) |> gpu
Chain(Dense(4, 128, relu), Dense(128, 128, relu), Dense(128, 2))

julia> @benchmark  CuArrays.@sync model($(cu(rand(4))))
BenchmarkTools.Trial: 
  memory estimate:  8.80 KiB
  allocs estimate:  276
  --------------
  minimum time:     93.864 μs (0.00% GC)
  median time:      115.179 μs (0.00% GC)
  mean time:        125.542 μs (1.97% GC)
  maximum time:     50.622 ms (48.86% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> CuArrays.version()
v"10.1.243"

For comparison:

(@v1.4) pkg> st
  [be33ccc6] CUDAnative v2.10.2
  [3a865a2d] CuArrays v1.7.3
  [587475ba] Flux v0.10.3

julia> @benchmark  CuArrays.@sync model($(cu(rand(4))))
BenchmarkTools.Trial: 
  memory estimate:  8.16 KiB
  allocs estimate:  223
  --------------
  minimum time:     45.627 μs (0.00% GC)
  median time:      74.875 μs (0.00% GC)
  mean time:        85.175 μs (2.61% GC)
  maximum time:     32.836 ms (33.09% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> CUDAdrv.version()
v"10.1.0"

Environment details
Details on Julia:

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

Additional context
Tested on an RTX 2080 Ti.

Note that the model above is quite small. For some larger models, performance is similar between v2.1.0 and v1.7.3. However, I'd still like to understand why there is a significant difference for small models.


maleadt · 2020-05-04T06:37:08Z

Bisected to bd38b15.

maleadt · 2020-05-04T07:44:46Z

Could you verify #704 works?

findmyway · 2020-05-04T08:16:34Z

Yes, I can confirm it works. 🎉

Thanks!

maleadt · 2020-05-04T11:17:43Z

Great. Thanks for the report!

findmyway added the bug label May 3, 2020

maleadt added performance and removed bug labels May 4, 2020

maleadt mentioned this issue May 4, 2020

Repopulate the pool from freed blocks before allocating. #704

Merged

maleadt closed this as completed in #704 May 4, 2020
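
For readers who want a feel for what the title of #704 describes (repopulating the pool from freed blocks before allocating), here is a toy, CPU-only sketch of that general idea. It is not the CuArrays implementation: `ToyPool`, `alloc!`, `free!`, and `repopulate!` are hypothetical names invented for illustration, and plain `Vector{UInt8}` buffers stand in for GPU memory.

```julia
# Hypothetical toy pool; all names here are illustrative, not CuArrays API.
mutable struct ToyPool
    available::Dict{Int,Vector{Vector{UInt8}}}  # cached blocks, keyed by size in bytes
    freed::Vector{Vector{UInt8}}                # released blocks not yet returned to the cache
    fresh_allocations::Int                      # number of slow-path allocations so far
end

ToyPool() = ToyPool(Dict{Int,Vector{Vector{UInt8}}}(), Vector{UInt8}[], 0)

# Releasing a block is cheap: it is only queued.
free!(pool::ToyPool, block::Vector{UInt8}) = (push!(pool.freed, block); nothing)

# Move every queued block back into the size-keyed cache.
function repopulate!(pool::ToyPool)
    while !isempty(pool.freed)
        block = pop!(pool.freed)
        push!(get!(pool.available, length(block), Vector{UInt8}[]), block)
    end
    return nothing
end

function alloc!(pool::ToyPool, nbytes::Int)
    # Repopulate first, so recently freed blocks can be reused
    # instead of falling through to a fresh allocation.
    repopulate!(pool)
    cached = get(pool.available, nbytes, nothing)
    if cached !== nothing && !isempty(cached)
        return pop!(cached)
    end
    pool.fresh_allocations += 1          # slow path: nothing cached for this size
    return Vector{UInt8}(undef, nbytes)
end

pool = ToyPool()
a = alloc!(pool, 512)   # slow path, nothing is cached yet
free!(pool, a)
b = alloc!(pool, 512)   # served from the freed block
```

In this sketch the second `alloc!` returns the same buffer that was just freed, so a workload that repeatedly allocates and frees small, same-sized buffers stays on the cached path. Whether this mirrors the actual cause of the regression is an assumption; the thread only gives the bisected commit and the PR title.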
