This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Optimize accumulate #447

Merged
merged 13 commits into from
Oct 14, 2019

@maleadt maleadt commented Oct 10, 2019

Ref #445

Greatly improves performance. For each array length i, the three timings are, in order: CPU, reference GPU, optimized GPU.

i = 10
  8.239 ns (0 allocations: 0 bytes)
  74.607 μs (305 allocations: 9.88 KiB)
  5.928 μs (22 allocations: 768 bytes)
i = 100
  85.683 ns (0 allocations: 0 bytes)
  114.921 μs (462 allocations: 14.81 KiB)
  6.047 μs (23 allocations: 784 bytes)
i = 1000
  1.040 μs (0 allocations: 0 bytes)
  154.329 μs (629 allocations: 19.91 KiB)
  5.966 μs (28 allocations: 864 bytes)
i = 10000
  10.284 μs (0 allocations: 0 bytes)
  198.071 μs (861 allocations: 26.84 KiB)
  67.132 μs (236 allocations: 7.53 KiB)
i = 100000
  102.629 μs (0 allocations: 0 bytes)
  237.085 μs (1035 allocations: 32.05 KiB)
  66.363 μs (246 allocations: 7.69 KiB)
i = 1000000
  1.056 ms (0 allocations: 0 bytes)
  286.495 μs (1234 allocations: 37.64 KiB)
  68.353 μs (255 allocations: 7.83 KiB)
i = 10000000
  10.783 ms (0 allocations: 0 bytes)
  334.199 μs (1570 allocations: 46.20 KiB)
  129.853 μs (469 allocations: 14.61 KiB)
i = 100000000
  110.446 ms (0 allocations: 0 bytes)
  1.001 ms (1763 allocations: 51.72 KiB)
  130.578 μs (479 allocations: 14.77 KiB)

The most important contribution is reducing the number of kernels to 1 for small arrays (N=500 to 1000, depending on the GPU), and to 4 for most other arrays, but the kernels themselves are much faster too. However, our kernel launch code paths have slowed down over the years, so we're paying a significant penalty (tens of μs) for every kernel launch.
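
To illustrate why the number of passes depends on the array size, here is a minimal CPU sketch of a block-wise inclusive scan. It is not the kernel from this PR: the function name `blocked_cumsum`, the `block_size` parameter, and the three-pass structure are assumptions made for illustration. On a GPU each pass would roughly correspond to a kernel launch, so an array that fits in a single block needs only one.

```julia
# Hypothetical CPU model of a block-wise inclusive scan.
# `blocked_cumsum` and `block_size` are illustrative names, not part of CuArrays.
function blocked_cumsum(xs::Vector{T}; block_size::Int = 4) where {T<:Number}
    @assert block_size >= 2 "block_size must be at least 2 so the recursion shrinks"
    n = length(xs)
    out = similar(xs)
    nblocks = cld(n, block_size)
    totals = Vector{T}(undef, nblocks)

    # Pass 1: scan each block independently and record its total.
    for b in 1:nblocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        acc = zero(T)
        for i in lo:hi
            acc += xs[i]
            out[i] = acc
        end
        totals[b] = acc
    end

    # A single block is already complete: the one-pass case.
    nblocks <= 1 && return out

    # Pass 2: scan the per-block totals (recursively, block-wise as well).
    offsets = blocked_cumsum(totals; block_size = block_size)

    # Pass 3: add the running total of all preceding blocks to each later block.
    for b in 2:nblocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        for i in lo:hi
            out[i] += offsets[b - 1]
        end
    end
    return out
end
```

Under these assumptions, `blocked_cumsum(collect(1:10))` matches `cumsum(1:10)`, and inputs no longer than `block_size` return after the first pass.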

As this kernel is used by a bunch of other operations, this also speeds up e.g. findall (#446 (comment)):

# GPU reference
julia> for i in (100, 1_000, 10_000, 100_000)
       @btime findall($(CuArray(rand(Bool, i))));
       end
  127.818 μs (505 allocations: 16.27 KiB)
  162.533 μs (672 allocations: 21.36 KiB)
  210.293 μs (900 allocations: 28.23 KiB)
  243.953 μs (1071 allocations: 33.39 KiB)

# GPU optimized
julia> for i in (100, 1_000, 10_000, 100_000)
       @btime findall($(CuArray(rand(Bool, i))));
       end
  38.430 μs (128 allocations: 4.16 KiB)
  43.782 μs (136 allocations: 4.28 KiB)
  105.097 μs (338 allocations: 10.88 KiB)
  103.674 μs (348 allocations: 11.03 KiB)

cc @ChrisRackauckas


maleadt commented Oct 10, 2019

Oh interesting, cumprod fails since we hard-code 0 as the neutral element (the Base.accumulate API does not have a keyword argument for that). I'll need to restructure the kernel a little, or add a v0 argument.
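
A small CPU sketch of the failure mode, assuming a scan that pads its working buffer with a neutral element. The function `padded_scan` and its `neutral` argument are hypothetical names for illustration, not the actual kernel or the proposed `v0` signature. With `+`, padding with 0 is harmless; with `*`, the same 0 wipes out every product.

```julia
# Hypothetical CPU model of a padded inclusive scan (Hillis-Steele style).
# `padded_scan` and `neutral` are illustrative names, not the CuArrays API.
function padded_scan(op, xs::AbstractVector{T}, neutral::T) where {T}
    n = length(xs)
    n == 0 && return T[]
    m = nextpow(2, n)           # pad to a power of two, as a tree scan would
    buf = fill(neutral, m)
    buf[1:n] .= xs
    offset = 1
    while offset < m
        prev = copy(buf)
        for i in 1:m
            # Positions without a left partner combine with the neutral element.
            left = i > offset ? prev[i - offset] : neutral
            buf[i] = op(left, prev[i])
        end
        offset *= 2
    end
    return buf[1:n]
end

xs = [1, 2, 3, 4, 5]
sum_ok    = padded_scan(+, xs, 0)   # 0 is neutral for +
prod_bad  = padded_scan(*, xs, 0)   # 0 is absorbing for *, so results collapse
prod_good = padded_scan(*, xs, 1)   # 1 is neutral for *
```

In this model the neutral element has to match the operator, which is why it must either be passed in or derived from `op` instead of being fixed to 0.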

@maleadt maleadt merged commit 36697d3 into master Oct 14, 2019
@bors bors bot deleted the tb/accumulate branch October 14, 2019 07:20
@maleadt maleadt mentioned this pull request Oct 14, 2019