Specialization fixes for mapreducedim. #316

Merged
merged 1 commit into from
Jul 24, 2020

@maleadt maleadt commented Jul 24, 2020

Fixes #302

julia> k = KnetArray{Float32}(rand(10,100));

julia> c = CuArray{Float32}(rand(10,100));

julia> @benchmark sum(k)
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     8.516 μs (0.00% GC)
  median time:      8.788 μs (0.00% GC)
  mean time:        8.817 μs (0.00% GC)
  maximum time:     23.335 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark sum(c)
BenchmarkTools.Trial: 
  memory estimate:  1.08 KiB
  allocs estimate:  37
  --------------
  minimum time:     12.593 μs (0.00% GC)
  median time:      19.867 μs (0.00% GC)
  mean time:        19.804 μs (0.00% GC)
  maximum time:     325.308 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark sum(k,dims=1)
BenchmarkTools.Trial: 
  memory estimate:  288 bytes
  allocs estimate:  11
  --------------
  minimum time:     2.899 μs (0.00% GC)
  median time:      3.016 μs (0.00% GC)
  mean time:        3.072 μs (0.00% GC)
  maximum time:     163.371 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark sum(c,dims=1)
BenchmarkTools.Trial: 
  memory estimate:  960 bytes
  allocs estimate:  33
  --------------
  minimum time:     3.236 μs (0.00% GC)
  median time:      3.493 μs (0.00% GC)
  mean time:        4.332 μs (4.12% GC)
  maximum time:     5.562 ms (32.07% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> @benchmark sum(abs2, k)
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     8.926 μs (0.00% GC)
  median time:      9.180 μs (0.00% GC)
  mean time:        9.288 μs (0.00% GC)
  maximum time:     128.632 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark sum(abs2, c)
BenchmarkTools.Trial: 
  memory estimate:  1.08 KiB
  allocs estimate:  37
  --------------
  minimum time:     13.336 μs (0.00% GC)
  median time:      20.882 μs (0.00% GC)
  mean time:        20.743 μs (0.00% GC)
  maximum time:     66.558 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

From 10x to less than 50% overhead, going by the minimum times.
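As a rough check of the "less than 50%" figure, the sketch below recomputes the relative overhead of `CuArray` over `KnetArray` from the minimum times reported above. The names `knet_min`, `cuarray_min`, and `overhead` are labels made up for this illustration; only the numbers come from the benchmark output.

```julia
# Minimum times in microseconds, copied from the benchmark output above.
# `knet_min` and `cuarray_min` are hypothetical names used only in this sketch.
knet_min    = (sum = 8.516,  sum_dims1 = 2.899, sum_abs2 = 8.926)
cuarray_min = (sum = 12.593, sum_dims1 = 3.236, sum_abs2 = 13.336)

# Relative overhead of the CuArray timing over the KnetArray timing, in percent.
overhead(c, k) = 100 * (c / k - 1)

# Mapping over two NamedTuples with the same field names yields a NamedTuple.
overheads = map(overhead, cuarray_min, knet_min)

for (name, pct) in pairs(overheads)
    println(rpad(String(name), 10), round(pct; digits = 1), " %")
end
```

All three ratios come out below 50% when computed this way. The median times for the scalar reductions show a larger gap, so the figure depends on which statistic is compared.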

@denizyuret at this point the largest issues are gone, and it would be good to port over some of the tricks that Knet uses. For example, I think the scalar reductions here avoid allocating an output container (I couldn't see a cudaMalloc in the profiler), which might account for the remaining overhead.

@maleadt added the "cuda array" and "performance" labels Jul 24, 2020
@denizyuret

Looks good. The scalar reduction code is in Knet/deps/cuda20.jl. How can I help with the port?

codecov bot commented Jul 24, 2020

Codecov Report

Merging #316 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #316      +/-   ##
==========================================
+ Coverage   79.33%   79.34%   +0.01%     
==========================================
  Files         155      155              
  Lines        8902     8900       -2     
==========================================
  Hits         7062     7062              
+ Misses       1840     1838       -2     
Impacted Files Coverage Δ
src/mapreduce.jl 100.00% <100.00%> (+4.25%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5dc771d...3107df7.

@maleadt maleadt merged commit afaec8e into master Jul 24, 2020
@maleadt maleadt deleted the tb/mapreduce branch July 24, 2020 20:50

Successfully merging this pull request may close these issues.

Performance: sum