
Improve mapreduce performance #646

Merged · 3 commits · Mar 31, 2020

Conversation

@wongalvis14 (Contributor) commented Mar 25, 2020

More than 3-fold improvement over the latest implementation

Benchmarking function from #611
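
For reference, the function being benchmarked (the same definition that appears in the sessions below):

function pi_mc_cu(nsamples)
    xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
    mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
end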

First stage: using the maximum number of threads a single block can hold as the number of blocks, perform the reduction with serial iteration where needed.

Second stage: reduction within a single block, with no serial iteration.

This approach aims to strike a balance between per-thread workload, kernel launch overhead, and exhaustion of parallel resources.
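
A minimal sketch of the scheme (illustrative names, a hard-coded 256-thread block, and Float32 data for brevity; this is not the actual src/mapreduce.jl code):

using CUDAnative, CuArrays

# Stage 1 launches many blocks; each thread serially folds a strided slice,
# then the block combines its threads' accumulators in shared memory.
# Stage 2 reruns the same kernel with a single block over the partials.
function reduce_block_kernel(op, out, xs, neutral)
    tid = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + tid
    stride = blockDim().x * gridDim().x
    acc = neutral
    while i <= length(xs)      # serial iteration (only stage 1 loops here)
        acc = op(acc, xs[i])
        i += stride
    end
    shared = @cuStaticSharedMem(Float32, 256)
    shared[tid] = acc
    sync_threads()
    d = blockDim().x ÷ 2
    while d > 0                # tree reduction within the block
        tid <= d && (shared[tid] = op(shared[tid], shared[tid + d]))
        sync_threads()
        d ÷= 2
    end
    tid == 1 && (out[blockIdx().x] = shared[1])
    return
end

xs = CuArrays.rand(10_000_000)
partials = CuArrays.zeros(Float32, 256)
result = CuArrays.zeros(Float32, 1)
@cuda threads=256 blocks=256 reduce_block_kernel(+, partials, xs, 0f0)    # stage 1
@cuda threads=256 blocks=1 reduce_block_kernel(+, result, partials, 0f0)  # stage 2
total = Array(result)[1]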

New impl:
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  16.98 KiB
  allocs estimate:  468
  --------------
  minimum time:     2.520 ms (0.00% GC)
  median time:      2.536 ms (0.00% GC)
  mean time:        2.584 ms (0.64% GC)
  maximum time:     15.600 ms (50.62% GC)
  --------------
  samples:          1930
  evals/sample:     1

Old recursion impl:
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  17.05 KiB
  allocs estimate:  472
  --------------
  minimum time:     4.059 ms (0.00% GC)
  median time:      4.076 ms (0.00% GC)
  mean time:        4.130 ms (0.64% GC)
  maximum time:     23.199 ms (63.12% GC)
  --------------
  samples:          1209
  evals/sample:     1

Latest serial impl:
BenchmarkTools.Trial: 
  memory estimate:  7.81 KiB
  allocs estimate:  242
  --------------
  minimum time:     8.544 ms (0.00% GC)
  median time:      8.579 ms (0.00% GC)
  mean time:        8.622 ms (0.27% GC)
  maximum time:     26.172 ms (41.80% GC)
  --------------
  samples:          580
  evals/sample:     1

@maleadt (Member) commented Mar 25, 2020

Great! Seems like a solid improvement. I'll have a closer look soon; it would be nice if we could keep deriving the launch configuration via a configuration function instead of having to compute it by hand.
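
(For context, a minimal sketch of launch configuration via the occupancy API in the CUDAdrv/CUDAnative of that era; dummy_kernel and the sizes are illustrative, not package code:)

using CUDAdrv, CUDAnative, CuArrays

function dummy_kernel(ys, xs)   # stand-in kernel, for illustration only
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i <= length(xs) && (ys[i] = 2f0 * xs[i])
    return
end

xs = CuArrays.rand(1_000_000)
ys = similar(xs)
kernel = @cuda launch=false dummy_kernel(ys, xs)   # compile without launching
config = launch_configuration(kernel.fun)          # occupancy-based recommendation
threads = min(length(xs), config.threads)
blocks = cld(length(xs), threads)
kernel(ys, xs; threads=threads, blocks=blocks)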

@MasonProtter commented:

So it's still a factor of 4 behind the old tagged release (1.7.2) though right?

@MasonProtter commented Mar 25, 2020

@wongalvis14 and I discussed on Slack. Here are timings on my machine for

  • CuArrays v1.7.2
(@v1.4) pkg> add CuArrays#v1.7.2
   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
   Updating `~/.julia/environments/v1.4/Project.toml`
  [3a865a2d] + CuArrays v1.7.2 #v1.7.2 (https://github.com/JuliaGPU/CuArrays.jl.git)
   Updating `~/.julia/environments/v1.4/Manifest.toml`
  [3895d2a7] + CUDAapi v3.1.0
  [c5f51814] + CUDAdrv v6.0.0
  [be33ccc6] + CUDAnative v2.10.2
  [3a865a2d] + CuArrays v1.7.2 #v1.7.2 (https://github.com/JuliaGPU/CuArrays.jl.git)
  [0c68f7d7] + GPUArrays v2.0.1
  [929cbde3] + LLVM v1.3.4
  [a759f4b9] + TimerOutputs v0.5.3

julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
┌ Warning: Incompatibility detected between CUDA and LLVM 8.0+; disabling debug info emission for CUDA kernels
└ @ CUDAnative ~/.julia/packages/CUDAnative/hfulr/src/CUDAnative.jl:114
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
[ Info: Building the CUDAnative run-time library for your sm_75 device, this might take a while...
BenchmarkTools.Trial: 
  memory estimate:  4.61 KiB
  allocs estimate:  126
  --------------
  minimum time:     594.163 μs (0.00% GC)
  median time:      658.573 μs (0.00% GC)
  mean time:        671.493 μs (2.87% GC)
  maximum time:     2.311 ms (55.14% GC)
  --------------
  samples:          7424
  evals/sample:     1

  • CuArrays Master (today)
(@v1.4) pkg> add https://github.com/JuliaGPU/CuArrays.jl.git#master
   Updating git-repo `https://github.com/JuliaGPU/CuArrays.jl.git`
  Resolving package versions...
   Updating `~/.julia/environments/v1.4/Project.toml`
  [3a865a2d] + CuArrays v2.0.0 #master (https://github.com/JuliaGPU/CuArrays.jl.git)
   Updating `~/.julia/environments/v1.4/Manifest.toml`
  [3895d2a7] + CUDAapi v4.0.0
  [c5f51814] + CUDAdrv v6.2.1
  [be33ccc6] + CUDAnative v3.0.1
  [f68482b8] + Cthulhu v1.0.0
  [3a865a2d] + CuArrays v2.0.0 #master (https://github.com/JuliaGPU/CuArrays.jl.git)
  [0c68f7d7] + GPUArrays v3.1.0
  [929cbde3] + LLVM v1.3.4
  [dc548174] + TerminalMenus v0.1.0
  [a759f4b9] + TimerOutputs v0.5.3

julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
[ Info: Building the CUDAnative run-time library for your sm_75 device, this might take a while...
BenchmarkTools.Trial: 
  memory estimate:  7.81 KiB
  allocs estimate:  245
  --------------
  minimum time:     10.014 ms (0.00% GC)
  median time:      10.159 ms (0.00% GC)
  mean time:        10.198 ms (0.31% GC)
  maximum time:     11.559 ms (9.85% GC)
  --------------
  samples:          491
  evals/sample:     1

  • Alvis's PR branch:
(@v1.4) pkg> add https://github.com/wongalvis14/CuArrays.jl.git#mapreduce
   Updating git-repo `https://github.com/wongalvis14/CuArrays.jl.git`
   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
   Updating `~/.julia/environments/v1.4/Project.toml`
  [3a865a2d] + CuArrays v2.0.0 #mapreduce (https://github.com/wongalvis14/CuArrays.jl.git)
   Updating `~/.julia/environments/v1.4/Manifest.toml`
  [3895d2a7] + CUDAapi v4.0.0
  [c5f51814] + CUDAdrv v6.2.1
  [be33ccc6] + CUDAnative v3.0.1
  [f68482b8] + Cthulhu v1.0.0
  [3a865a2d] + CuArrays v2.0.0 #mapreduce (https://github.com/wongalvis14/CuArrays.jl.git)
  [0c68f7d7] + GPUArrays v3.1.0
  [929cbde3] + LLVM v1.3.4
  [dc548174] + TerminalMenus v0.1.0
  [a759f4b9] + TimerOutputs v0.5.3

julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  11.58 KiB
  allocs estimate:  357
  --------------
  minimum time:     7.527 ms (0.00% GC)
  median time:      7.715 ms (0.00% GC)
  mean time:        7.795 ms (0.52% GC)
  maximum time:     10.703 ms (13.98% GC)
  --------------
  samples:          642
  evals/sample:     1

I'm seeing marginal gains, but still a very large regression relative to 1.7.2.

@MasonProtter commented:

@maleadt mentioned in #611 that it could be because mapreduce(f, op, xs, ys) is falling back on reduce(op, map(f, xs, ys)), but looking at the old implementation (https://github.com/JuliaGPU/CuArrays.jl/blob/v1.7.2/src/mapreduce.jl) I don't see any obvious mechanism in the old version for fusing the map with the reduce when there are multiple containers.
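
To illustrate the distinction (a sketch, not the package internals):

using CuArrays

xs = CuArrays.rand(1_000_000); ys = CuArrays.rand(1_000_000)
f = (x, y) -> (x^2 + y^2) < 1.0

# unfused fallback: map materializes an intermediate CuArray,
# then the reduction makes a second pass over it
reduce(+, map(f, xs, ys), init=0)

# fused: f is applied on the fly inside the reduction kernel,
# with no intermediate array
mapreduce(f, +, xs, ys, init=0)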

@MasonProtter commented Mar 26, 2020

Ohh, I see. Yeah, I missed that.

@wongalvis14 (Contributor, Author) commented:

This implementation is faster than the old one on 1D array mapreduce:

v1.7

julia> @benchmark pi_mc_cu(100000000)
BenchmarkTools.Trial: 
  memory estimate:  3.75 KiB
  allocs estimate:  113
  --------------
  minimum time:     4.068 ms (0.00% GC)
  median time:      7.378 ms (0.00% GC)
  mean time:        25.011 ms (0.15% GC)
  maximum time:     698.870 ms (0.00% GC)
  --------------
  samples:          200
  evals/sample:     1

New impl:

julia> @benchmark pi_mc_cu(100000000)
BenchmarkTools.Trial: 
  memory estimate:  14.91 KiB
  allocs estimate:  426
  --------------
  minimum time:     2.941 ms (0.00% GC)
  median time:      5.904 ms (0.00% GC)
  mean time:        20.800 ms (0.47% GC)
  maximum time:     723.428 ms (0.00% GC)
  --------------
  samples:          253
  evals/sample:     1

@maleadt (Member) commented Mar 31, 2020

Continuing the approach of this PR, which already improved performance by a good 25% (I can't reproduce @wongalvis14's timings with my GPU), I'm now selecting a launch configuration based on the recommended grid size as returned by the occupancy API. Together with #663 that brings us back to the original performance. Not sure how the old GPUArrays implementation did that though, as it launched multiple blocks without device-wide synchronization (i.e. it only used a single kernel)...

EDIT: ha, it did the reduction on the CPU, sneaky little bastard! https://github.com/JuliaGPU/GPUArrays.jl/blob/fc08102f999e999fd3c6ac176bda0af450925032/src/mapreduce.jl#L179-L180
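
In other words (a sketch of the linked trick, with stand-in data for the per-block results): a single kernel leaves one partial result per block, and the final combine happens on the host:

using CuArrays

partials = CuArrays.rand(Float32, 1024)   # stand-in for one partial result per block
result = reduce(+, Array(partials))       # final reduction on the CPU, not the GPU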

@maleadt (Member) commented Mar 31, 2020

bors r+

@bors bot (Contributor) commented Mar 31, 2020

Build succeeded

bors bot merged commit 5047dc9 into JuliaGPU:master on Mar 31, 2020