
Parallelizing BallTree Construction #132

Open · wants to merge 2 commits into master from parallel-ball-tree

Conversation

@SebastianAment commented Jan 7, 2022

Overview

This PR parallelizes the construction of BallTree structures, achieving a roughly 5× speedup for n = 1_000_000 points with 8 threads.

The implementation uses @spawn and @sync, which requires raising the Julia compatibility entry to 1.3 and incrementing the minor version of this package.
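The @spawn/@sync pattern for a recursive build can be sketched as follows. This is a minimal illustration, not the PR's actual code: `build_range` is a hypothetical stand-in for the tree-build recursion, and here it just counts nodes.

```julia
# Minimal sketch of parallelizing a recursive divide-and-conquer build
# with Base.Threads.@spawn (requires Julia >= 1.3). `build_range` is a
# hypothetical stand-in for the real tree-build recursion.
using Base.Threads: @spawn

function build_range(lo::Int, hi::Int, cutoff::Int)
    hi - lo <= 1 && return 1                    # leaf node
    mid = (lo + hi) >>> 1
    if hi - lo < cutoff
        # Small subtree: recurse serially to avoid task overhead.
        left  = build_range(lo, mid, cutoff)
        right = build_range(mid, hi, cutoff)
    else
        # Large subtree: build the left half on a spawned task while
        # this task builds the right half, then wait for the result.
        task  = @spawn build_range(lo, mid, cutoff)
        right = build_range(mid, hi, cutoff)
        left  = fetch(task)
    end
    return left + right + 1                     # internal node
end

build_range(1, 1_000_001, 1024)    # node count for n = 1_000_000 points
```

Here `fetch` takes the place of `@sync`; the same structure works with `@sync`-wrapped `@spawn`s when no return value is needed.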

Benchmarks

Setup

using NearestNeighbors
using BenchmarkTools
d = 100

On Master

n = 100;
X = randn(d, n);
@btime T = BallTree(X);
  1.244 ms (23 allocations: 174.83 KiB)

n = 10_000;
X = randn(d, n);
@btime T = BallTree(X);
  372.398 ms (26 allocations: 16.95 MiB)

n = 100_000;
X = randn(d, n);
@btime T = BallTree(X);
  7.989 s (26 allocations: 169.53 MiB)

n = 1_000_000;
X = randn(d, n);
@btime T = BallTree(X);
  161.170 s (26 allocations: 1.66 GiB)

With this PR (updated after further edits with improved allocations)

n = 100;
X = randn(d, n);
@btime T = BallTree(X);
  813.417 μs (244 allocations: 189.97 KiB)

n = 10_000;
X = randn(d, n);
@btime T = BallTree(X);
  101.158 ms (25348 allocations: 18.70 MiB)

n = 100_000;
X = randn(d, n);
@btime T = BallTree(X);
  2.816 s (253697 allocations: 187.03 MiB)

n = 1_000_000;
X = randn(d, n);
@btime T = BallTree(X);
  33.461 s (2527680 allocations: 2.13 GiB)

Further, the PR still allows for sequential execution with the parallel = false keyword:

n = 100;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  1.090 ms (24 allocations: 174.06 KiB)

n = 10_000;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  362.205 ms (27 allocations: 16.95 MiB)

n = 100_000;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  8.262 s (27 allocations: 169.53 MiB)

n = 1_000_000;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  150.437 s (25 allocations: 1.66 GiB)

Summary

  • The parallel implementation yields a speedup even for small datasets of n = 100 data points, and achieves a roughly 3× speedup for n = 100_000 points.

  • Compared to the sequential code, memory allocation is up by about 10–20% in size and considerably in number, because the parallel code needs to allocate temporary arrays to avoid race conditions, while the sequential code reuses a single temporary. If allocations, rather than execution speed, are the concern, one can always use the parallel = false flag this PR provides.

  • The sequential option parallel = false maintains the same allocation behavior as the master branch with comparable performance. Notably, the sequential path of this PR is consistently about 20% faster on the n = 100 test case than master.

The experiments were run on a 2021 MacBook Pro with an M1 Pro and 8 threads.

@SebastianAment force-pushed the parallel-ball-tree branch 2 times, most recently from e40c3f0 to f6acba9 on January 8, 2022.
@KristofferC (Owner) left a comment:

Thanks for working on this.

The parallel implementation yields a speed up for even small datasets of n = 100 data points,

But from what I understand, the parallel building only happens if the size is larger than DEFAULT_BALLTREE_MIN_PARALLEL_SIZE, which is 1024? What gives the speed improvement for small trees?


Since the structure of creating a BallTree and a KDTree is pretty much the same, the same could be applied there?


You seem to have an extra commit not related to the tree building in this PR.

@@ -88,3 +57,14 @@ function create_bsphere(m::Metric,

return HyperSphere(SVector{N,T}(center), rad)
end

@inline function interpolate(::M, c1::V, c2::V, x, d) where {V <: AbstractVector, M <: NormMetric}
@KristofferC (Owner):

Why move this function?

@SebastianAment (Author):

I had two versions locally: the previous one, and this one without the array buffer variable ab. It turns out that in the sequential code, the compiler is able to eliminate the allocations without explicitly pre-allocating a buffer variable. In the parallel code, sharing an array buffer leads to race conditions, which is why I wrote this modification.

I can move it back to where it was in the file.
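The race being avoided can be illustrated with a toy column-mean kernel (hypothetical names, not the package's code): a single shared scratch buffer is fine serially, but spawned tasks that mutate the same buffer race, so the parallel path allocates a fresh buffer per call.

```julia
# Toy illustration (hypothetical names) of the buffer trade-off: serial
# code can reuse one scratch buffer, but spawned tasks mutating a shared
# buffer race, so each parallel call allocates its own.
using Base.Threads: @spawn

# Serial-friendly: reuses a caller-provided buffer. Two spawned tasks
# sharing the same `buf` would overwrite each other's partial sums.
function centroid!(buf::Vector{Float64}, X::Matrix{Float64}, idxs)
    fill!(buf, 0.0)
    for j in idxs, i in eachindex(buf)
        buf[i] += X[i, j]
    end
    buf ./= length(idxs)
    return buf
end

# Parallel-safe: allocates a fresh buffer per call, so tasks share no
# mutable state -- the source of the extra allocations reported above.
centroid(X::Matrix{Float64}, idxs) = centroid!(zeros(size(X, 1)), X, idxs)

X  = [1.0 3.0; 2.0 4.0]
t  = @spawn centroid(X, 1:1)     # left "subtree" on another task
c2 = centroid(X, 2:2)            # right "subtree" on this task
c1 = fetch(t)
```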

high::Int,
tree_data::TreeData,
reorder::Bool,
parallel::Val{true},
@KristofferC (Owner):

Using a Val and a separate function like this feels a bit awkward. Couldn't one just look at parallel_size in the original build_BallTree function and then decide whether to call the parallel function or the serial one?

@SebastianAment (Author):

Using type dispatch on the parallel variable is important, because the compiler is able to get rid of temporary allocations during sequential execution. I can isolate the recursive component of the function though, and only use the Val(true) dispatch for that. If we only use a regular if statement on a Bool, performance during sequential execution will take a hit compared to the status quo.
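The dispatch idea can be sketched like this (hypothetical signatures, not the package's actual API): lifting the Bool into a Val moves the branch into the type domain, so the compiler specializes the serial method on its own.

```julia
# Sketch of dispatching on Val (hypothetical functions, not the package's
# API). Lifting the Bool into the type domain lets the compiler compile
# the serial method separately, keeping its allocation profile intact.
build_subtree(data, ::Val{true})  = "parallel path"  # would @spawn subtrees
build_subtree(data, ::Val{false}) = "serial path"    # would reuse one buffer

# Entry point: branch on the runtime flag once, then dispatch on the
# type-level flag for the whole recursion.
function build(data; parallel::Bool = true)
    return parallel ? build_subtree(data, Val(true)) :
                      build_subtree(data, Val(false))
end
```

A plain `if parallel` inside the recursion would re-check a runtime Bool at every node; dispatching on `Val` pays that cost once at the entry point.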

@SebastianAment (Author):

The parallel implementation yields a speed up for even small datasets of n = 100 data points,

But from what I understand, the parallel building only happens if the size is larger than DEFAULT_BALLTREE_MIN_PARALLEL_SIZE, which is 1024? What gives the speed improvement for small trees?

This was run with a prior version where parallel_size = 0. A larger parallel_size seems beneficial for larger problems, where parallelization plays a bigger role.

Since the structure of creating a BallTree and a KDTree is pretty much the same, the same could be applied there?

I have a parallelized KDTree implementation locally too, but wanted to finish this one first. Do you prefer having everything in the same PR?

You seem to have an extra commit not related to the tree building in this PR.

Yes, maybe this wasn't smart in retrospect. I thought at the time that this PR would be easy to merge and just built on top of it. Would you like me to edit the commit history of the current PR?
