
Improve mapreduce performance #646

Merged · 3 commits · Mar 31, 2020

Conversation

@wongalvis14 (Contributor) commented Mar 25, 2020

More than 3-fold improvement over the latest implementation

Benchmarking function from #611
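
For reference, the function being benchmarked (the same definition that appears in the sessions below):

function pi_mc_cu(nsamples)
    xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
    mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
end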

First stage: using the maximum number of threads a single block can hold as the number of blocks, perform the reduction with serial iteration where needed.

Second stage: reduction within a single block, with no serial iteration.

This approach aims to strike a balance between per-thread workload, kernel launch overhead, and exhaustion of parallel resources.
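
A minimal sketch of the scheme (illustrative names, a hard-coded 256-thread block, and Float32 data for brevity; this is not the actual src/mapreduce.jl code):

using CUDAnative, CuArrays

# Stage 1 launches many blocks; each thread serially folds a strided slice,
# then the block combines its threads' accumulators in shared memory.
# Stage 2 reruns the same kernel with a single block over the partials.
function reduce_block_kernel(op, out, xs, neutral)
    tid = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + tid
    stride = blockDim().x * gridDim().x
    acc = neutral
    while i <= length(xs)      # serial iteration (only stage 1 loops here)
        acc = op(acc, xs[i])
        i += stride
    end
    shared = @cuStaticSharedMem(Float32, 256)
    shared[tid] = acc
    sync_threads()
    d = blockDim().x ÷ 2
    while d > 0                # tree reduction within the block
        tid <= d && (shared[tid] = op(shared[tid], shared[tid + d]))
        sync_threads()
        d ÷= 2
    end
    tid == 1 && (out[blockIdx().x] = shared[1])
    return
end

xs = CuArrays.rand(10_000_000)
partials = CuArrays.zeros(Float32, 256)
result = CuArrays.zeros(Float32, 1)
@cuda threads=256 blocks=256 reduce_block_kernel(+, partials, xs, 0f0)    # stage 1
@cuda threads=256 blocks=1 reduce_block_kernel(+, result, partials, 0f0)  # stage 2
total = Array(result)[1]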

New impl:
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  16.98 KiB
  allocs estimate:  468
  --------------
  minimum time:     2.520 ms (0.00% GC)
  median time:      2.536 ms (0.00% GC)
  mean time:        2.584 ms (0.64% GC)
  maximum time:     15.600 ms (50.62% GC)
  --------------
  samples:          1930
  evals/sample:     1

Old recursion impl:
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  17.05 KiB
  allocs estimate:  472
  --------------
  minimum time:     4.059 ms (0.00% GC)
  median time:      4.076 ms (0.00% GC)
  mean time:        4.130 ms (0.64% GC)
  maximum time:     23.199 ms (63.12% GC)
  --------------
  samples:          1209
  evals/sample:     1

Latest serial impl:
BenchmarkTools.Trial: 
  memory estimate:  7.81 KiB
  allocs estimate:  242
  --------------
  minimum time:     8.544 ms (0.00% GC)
  median time:      8.579 ms (0.00% GC)
  mean time:        8.622 ms (0.27% GC)
  maximum time:     26.172 ms (41.80% GC)
  --------------
  samples:          580
  evals/sample:     1

@maleadt (Member) commented Mar 25, 2020

Great! Seems like a solid improvement. I'll have a closer look soon; it would be nice if we could keep deriving the launch configuration via a configuration function instead of having to compute it by hand.
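
(For context, a minimal sketch of launch configuration via the occupancy API in the CUDAdrv/CUDAnative of that era; dummy_kernel and the sizes are illustrative, not package code:)

using CUDAdrv, CUDAnative, CuArrays

function dummy_kernel(ys, xs)   # stand-in kernel, for illustration only
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i <= length(xs) && (ys[i] = 2f0 * xs[i])
    return
end

xs = CuArrays.rand(1_000_000)
ys = similar(xs)
kernel = @cuda launch=false dummy_kernel(ys, xs)   # compile without launching
config = launch_configuration(kernel.fun)          # occupancy-based recommendation
threads = min(length(xs), config.threads)
blocks = cld(length(xs), threads)
kernel(ys, xs; threads=threads, blocks=blocks)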

@MasonProtter commented:

So it's still a factor of 4 behind the old tagged release (1.7.2) though right?

@MasonProtter commented Mar 25, 2020

@wongalvis14 and I discussed on Slack. Here are timings on my machine for

  • CuArrays v1.7.2
(@v1.4) pkg> add CuArrays#v1.7.2
   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
   Updating `~/.julia/environments/v1.4/Project.toml`
  [3a865a2d] + CuArrays v1.7.2 #v1.7.2 (https://github.com/JuliaGPU/CuArrays.jl.git)
   Updating `~/.julia/environments/v1.4/Manifest.toml`
  [3895d2a7] + CUDAapi v3.1.0
  [c5f51814] + CUDAdrv v6.0.0
  [be33ccc6] + CUDAnative v2.10.2
  [3a865a2d] + CuArrays v1.7.2 #v1.7.2 (https://github.com/JuliaGPU/CuArrays.jl.git)
  [0c68f7d7] + GPUArrays v2.0.1
  [929cbde3] + LLVM v1.3.4
  [a759f4b9] + TimerOutputs v0.5.3

julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
┌ Warning: Incompatibility detected between CUDA and LLVM 8.0+; disabling debug info emission for CUDA kernels
└ @ CUDAnative ~/.julia/packages/CUDAnative/hfulr/src/CUDAnative.jl:114
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
[ Info: Building the CUDAnative run-time library for your sm_75 device, this might take a while...
BenchmarkTools.Trial: 
  memory estimate:  4.61 KiB
  allocs estimate:  126
  --------------
  minimum time:     594.163 μs (0.00% GC)
  median time:      658.573 μs (0.00% GC)
  mean time:        671.493 μs (2.87% GC)
  maximum time:     2.311 ms (55.14% GC)
  --------------
  samples:          7424
  evals/sample:     1

  • CuArrays Master (today)
(@v1.4) pkg> add https://github.com/JuliaGPU/CuArrays.jl.git#master
   Updating git-repo `https://github.com/JuliaGPU/CuArrays.jl.git`
  Resolving package versions...
   Updating `~/.julia/environments/v1.4/Project.toml`
  [3a865a2d] + CuArrays v2.0.0 #master (https://github.com/JuliaGPU/CuArrays.jl.git)
   Updating `~/.julia/environments/v1.4/Manifest.toml`
  [3895d2a7] + CUDAapi v4.0.0
  [c5f51814] + CUDAdrv v6.2.1
  [be33ccc6] + CUDAnative v3.0.1
  [f68482b8] + Cthulhu v1.0.0
  [3a865a2d] + CuArrays v2.0.0 #master (https://github.com/JuliaGPU/CuArrays.jl.git)
  [0c68f7d7] + GPUArrays v3.1.0
  [929cbde3] + LLVM v1.3.4
  [dc548174] + TerminalMenus v0.1.0
  [a759f4b9] + TimerOutputs v0.5.3

julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
[ Info: Building the CUDAnative run-time library for your sm_75 device, this might take a while...
BenchmarkTools.Trial: 
  memory estimate:  7.81 KiB
  allocs estimate:  245
  --------------
  minimum time:     10.014 ms (0.00% GC)
  median time:      10.159 ms (0.00% GC)
  mean time:        10.198 ms (0.31% GC)
  maximum time:     11.559 ms (9.85% GC)
  --------------
  samples:          491
  evals/sample:     1

  • Alvis's PR branch:
(@v1.4) pkg> add https://github.com/wongalvis14/CuArrays.jl.git#mapreduce
   Updating git-repo `https://github.com/wongalvis14/CuArrays.jl.git`
   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
   Updating `~/.julia/environments/v1.4/Project.toml`
  [3a865a2d] + CuArrays v2.0.0 #mapreduce (https://github.com/wongalvis14/CuArrays.jl.git)
   Updating `~/.julia/environments/v1.4/Manifest.toml`
  [3895d2a7] + CUDAapi v4.0.0
  [c5f51814] + CUDAdrv v6.2.1
  [be33ccc6] + CUDAnative v3.0.1
  [f68482b8] + Cthulhu v1.0.0
  [3a865a2d] + CuArrays v2.0.0 #mapreduce (https://github.com/wongalvis14/CuArrays.jl.git)
  [0c68f7d7] + GPUArrays v3.1.0
  [929cbde3] + LLVM v1.3.4
  [dc548174] + TerminalMenus v0.1.0
  [a759f4b9] + TimerOutputs v0.5.3

julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  11.58 KiB
  allocs estimate:  357
  --------------
  minimum time:     7.527 ms (0.00% GC)
  median time:      7.715 ms (0.00% GC)
  mean time:        7.795 ms (0.52% GC)
  maximum time:     10.703 ms (13.98% GC)
  --------------
  samples:          642
  evals/sample:     1

I'm seeing marginal gains, but still a very large regression relative to 1.7.2.

@MasonProtter commented:

@maleadt mentioned in #611 that it could be because mapreduce(f, op, xs, ys) is falling back on reduce(op, map(f, xs, ys)), but looking at the old implementation (https://github.com/JuliaGPU/CuArrays.jl/blob/v1.7.2/src/mapreduce.jl) I don't see any obvious mechanism in the old version for fusing the map with the reduce when there are multiple containers.
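
To illustrate the distinction (a sketch, not the package internals):

using CuArrays

xs = CuArrays.rand(1_000_000); ys = CuArrays.rand(1_000_000)
f = (x, y) -> (x^2 + y^2) < 1.0

# unfused fallback: map materializes an intermediate CuArray,
# then the reduction makes a second pass over it
reduce(+, map(f, xs, ys), init=0)

# fused: f is applied on the fly inside the reduction kernel,
# with no intermediate array
mapreduce(f, +, xs, ys, init=0)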

@MasonProtter commented Mar 26, 2020

Ohh, I see. Yeah, I missed that.

@wongalvis14 (Contributor, Author) commented:

This implementation is faster than the old one on 1D array mapreduce:

v1.7

julia> @benchmark pi_mc_cu(100000000)
BenchmarkTools.Trial: 
  memory estimate:  3.75 KiB
  allocs estimate:  113
  --------------
  minimum time:     4.068 ms (0.00% GC)
  median time:      7.378 ms (0.00% GC)
  mean time:        25.011 ms (0.15% GC)
  maximum time:     698.870 ms (0.00% GC)
  --------------
  samples:          200
  evals/sample:     1

New impl:

julia> @benchmark pi_mc_cu(100000000)
BenchmarkTools.Trial: 
  memory estimate:  14.91 KiB
  allocs estimate:  426
  --------------
  minimum time:     2.941 ms (0.00% GC)
  median time:      5.904 ms (0.00% GC)
  mean time:        20.800 ms (0.47% GC)
  maximum time:     723.428 ms (0.00% GC)
  --------------
  samples:          253
  evals/sample:     1

@maleadt (Member) commented Mar 31, 2020

Continuing the approach of this PR, which already improved performance by a good 25% (I can't reproduce @wongalvis14's timings with my GPU), I'm now selecting a launch configuration based on the recommended grid size as returned by the occupancy API. Together with #663 that brings us back to the original performance. Not sure how the old GPUArrays implementation did that though, as it launched multiple blocks without device-wide synchronization (i.e. it only used a single kernel)...

EDIT: ha, it did the reduction on the CPU, sneaky little bastard! https://github.com/JuliaGPU/GPUArrays.jl/blob/fc08102f999e999fd3c6ac176bda0af450925032/src/mapreduce.jl#L179-L180
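
In other words (a sketch of the linked trick, with stand-in data for the per-block results): a single kernel leaves one partial result per block, and the final combine happens on the host:

using CuArrays

partials = CuArrays.rand(Float32, 1024)   # stand-in for one partial result per block
result = reduce(+, Array(partials))       # final reduction on the CPU, not the GPU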

@maleadt (Member) commented Mar 31, 2020

bors r+

@bors bot (Contributor) commented Mar 31, 2020

Build succeeded

bors bot merged commit 5047dc9 into JuliaGPU:master on Mar 31, 2020