Optimize accumulate #447

maleadt · 2019-10-10T10:44:58Z

Greatly improves performance. For each input length i, the three timings are, in order, CPU, reference GPU and optimized GPU:

i = 10
  8.239 ns (0 allocations: 0 bytes)
  74.607 μs (305 allocations: 9.88 KiB)
  5.928 μs (22 allocations: 768 bytes)
i = 100
  85.683 ns (0 allocations: 0 bytes)
  114.921 μs (462 allocations: 14.81 KiB)
  6.047 μs (23 allocations: 784 bytes)
i = 1000
  1.040 μs (0 allocations: 0 bytes)
  154.329 μs (629 allocations: 19.91 KiB)
  5.966 μs (28 allocations: 864 bytes)
i = 10000
  10.284 μs (0 allocations: 0 bytes)
  198.071 μs (861 allocations: 26.84 KiB)
  67.132 μs (236 allocations: 7.53 KiB)
i = 100000
  102.629 μs (0 allocations: 0 bytes)
  237.085 μs (1035 allocations: 32.05 KiB)
  66.363 μs (246 allocations: 7.69 KiB)
i = 1000000
  1.056 ms (0 allocations: 0 bytes)
  286.495 μs (1234 allocations: 37.64 KiB)
  68.353 μs (255 allocations: 7.83 KiB)
i = 10000000
  10.783 ms (0 allocations: 0 bytes)
  334.199 μs (1570 allocations: 46.20 KiB)
  129.853 μs (469 allocations: 14.61 KiB)
i = 100000000
  110.446 ms (0 allocations: 0 bytes)
  1.001 ms (1763 allocations: 51.72 KiB)
  130.578 μs (479 allocations: 14.77 KiB)

The most important contribution is reducing the number of kernels to 1 for small arrays (N=500 to 1000, depending on the GPU), and to 4 for most other arrays, but the kernels themselves are much faster too. However, our kernel launch code paths have slowed down over the years, so we're paying a significant penalty (tens of μs) for every kernel launch.
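To illustrate why the kernel count depends on the input size, here is a CPU-only sketch of a generic blocked inclusive scan. It is not the code from this PR: `blocked_scan` and the `block` keyword are hypothetical names, and the three passes below are a textbook decomposition, not a description of the four kernels mentioned above. The point is only that an input fitting in a single block needs one pass, while larger inputs need extra passes to combine per-block totals.

```julia
# Hypothetical CPU sketch of a blocked inclusive scan; each pass stands in
# for what would be a separate kernel launch on a GPU.
function blocked_scan(op, xs::Vector{T}; block::Int=4) where {T}
    n = length(xs)
    out = copy(xs)
    nblocks = cld(n, block)
    aggregates = Vector{T}(undef, nblocks)

    # Pass 1: scan every block independently and record its total.
    for b in 1:nblocks
        lo = (b - 1) * block + 1
        hi = min(b * block, n)
        for i in (lo + 1):hi
            out[i] = op(out[i - 1], out[i])
        end
        aggregates[b] = out[hi]
    end

    # Pass 2: scan the per-block totals.
    for b in 2:nblocks
        aggregates[b] = op(aggregates[b - 1], aggregates[b])
    end

    # Pass 3: fold the total of all preceding blocks into each later block.
    for b in 2:nblocks
        lo = (b - 1) * block + 1
        hi = min(b * block, n)
        for i in lo:hi
            out[i] = op(aggregates[b - 1], out[i])
        end
    end
    return out
end

# Usage: agrees with Base for an associative operator.
blocked_scan(+, collect(1:10)) == cumsum(1:10)
```

When `n <= block`, passes 2 and 3 do no work, which corresponds to the single-launch case for small arrays.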

As this kernel is used by a bunch of other operations, this also speeds up e.g. findall (#446 (comment)):

# GPU reference
julia> for i in (100, 1_000, 10_000, 100_000)
       @btime findall($(CuArray(rand(Bool, i))));
       end
  127.818 μs (505 allocations: 16.27 KiB)
  162.533 μs (672 allocations: 21.36 KiB)
  210.293 μs (900 allocations: 28.23 KiB)
  243.953 μs (1071 allocations: 33.39 KiB)

# GPU optimized
julia> for i in (100, 1_000, 10_000, 100_000)
       @btime findall($(CuArray(rand(Bool, i))));
       end
  38.430 μs (128 allocations: 4.16 KiB)
  43.782 μs (136 allocations: 4.28 KiB)
  105.097 μs (338 allocations: 10.88 KiB)
  103.674 μs (348 allocations: 11.03 KiB)
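For context on why a faster scan helps findall, here is a CPU-only sketch of the usual scan-then-scatter formulation. It is an assumption about the general technique, not a copy of the GPU implementation; `findall_via_scan` is a hypothetical name.

```julia
# Hypothetical CPU sketch: findall on a Bool vector via a prefix sum.
function findall_via_scan(mask::Vector{Bool})
    # offsets[i] is the number of `true` entries in mask[1:i], so for a
    # `true` entry it is also that entry's position in the output.
    offsets = cumsum(mask)
    n = isempty(offsets) ? 0 : offsets[end]
    out = Vector{Int}(undef, n)
    for i in eachindex(mask)
        if mask[i]
            out[offsets[i]] = i   # scatter index i to its output slot
        end
    end
    return out
end

# Usage: agrees with Base.findall.
mask = [true, false, true, true, false]
findall_via_scan(mask) == findall(mask)
```

On a GPU the scatter loop is trivially parallel, so the cost of this formulation is dominated by the scan and by launch overhead.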

cc @ChrisRackauckas

maleadt · 2019-10-10T12:42:28Z

Oh interesting, cumprod fails since we hard-code 0 as the neutral element (the Base.accumulate API does not have a keyword argument for that). I'll need to restructure the kernel a little, or add a v0 argument.
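A minimal CPU sketch of the failure mode, assuming the scan seeds its running value with the neutral element. How the kernel actually uses the literal is not shown in this thread, and `scan_with_neutral` is a hypothetical helper:

```julia
# Hypothetical helper: an inclusive scan whose running value is seeded
# with an explicitly supplied neutral element.
function scan_with_neutral(op, xs::Vector{T}, neutral::T) where {T}
    out = similar(xs)
    acc = neutral                 # running prefix, starts at the seed
    for i in eachindex(xs)
        acc = op(acc, xs[i])      # fold in the current element
        out[i] = acc
    end
    return out
end

xs = [1, 2, 3, 4]

scan_with_neutral(+, xs, 0) == cumsum(xs)                  # true: 0 is neutral for +
scan_with_neutral(*, xs, 0) == cumprod(xs)                 # false: every entry collapses to 0
scan_with_neutral(*, xs, one(eltype(xs))) == cumprod(xs)   # true: 1 is neutral for *
```

Passing the seed explicitly corresponds to the second option mentioned above (a v0-style argument); deriving it from the operator and element type would be another design.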

maleadt added 2 commits October 10, 2019 12:40

Add an optimized implementation of accumulate.

efae2ed

Minor improvements to other kernels.

1fcb3ba

maleadt added the performance label Oct 10, 2019

Improve test coverage.

b140c2f

ChrisRackauckas mentioned this pull request Oct 10, 2019

continuous callback tests pass SciML/DiffEqGPU.jl#22

Closed

maleadt added 10 commits October 11, 2019 11:49

Fixes to copyto with SubArray.

e0b9678

Don't launch unnecessary threads for indexing kernels.

03c38d0

Generalize accumulate to multiple dimensions.

35fcbde

Don't hard-code +.

240d98c

No need for pow2-aggregates -- avoid broadcasting copy.

cb2bfe5

Implement another Base method.

8dec783

Remove literal 0.

77a468c

Fast-path for non multidimensional scan.

bcf29b3

Use global state to provide a neutral element.

8a3aa2f

Support for initializing.

cccf8c7

maleadt force-pushed the tb/accumulate branch from 246fff3 to cccf8c7 Compare October 14, 2019 06:41

maleadt merged commit 36697d3 into master Oct 14, 2019

bors bot deleted the tb/accumulate branch October 14, 2019 07:20

maleadt mentioned this pull request Oct 14, 2019

Extend accumulate! #68

Closed

maleadt referenced this pull request Nov 4, 2019

Test released Flux.

7642038
