Parallelizing BallTree Construction #132
Conversation
Thanks for working on this.

> The parallel implementation yields a speed up for even small datasets of n = 100 data points

But from what I understand, the parallel building only happens if the size is larger than `DEFAULT_BALLTREE_MIN_PARALLEL_SIZE`, which is 1024? What gives the speed improvement for small trees?

Since the structure of creating a `BallTree` and a `KDTree` is pretty much the same, the same could be applied there?
You seem to have an extra commit not related to the tree building in this PR.
```diff
@@ -88,3 +57,14 @@ function create_bsphere(m::Metric,
     return HyperSphere(SVector{N,T}(center), rad)
 end

+@inline function interpolate(::M, c1::V, c2::V, x, d) where {V <: AbstractVector, M <: NormMetric}
```
Why move this function?
I had two versions locally: the previous one, and this one without the array buffer variable `ab`. It turns out that in the sequential code, the compiler is able to get rid of the allocations without explicitly pre-allocating an `ArrayBuffer` variable. In the parallel code, having an array buffer leads to race conditions, which is why I wrote this modification.

I can move it back to where it was in the file.
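To illustrate the trade-off being described (a minimal sketch with hypothetical names, not the PR's actual code): a shared pre-allocated scratch buffer is safe when calls run one after another, but races as soon as two spawned tasks write to it concurrently, whereas a buffer-free version computes each result locally and stays task-safe:

```julia
using Base.Threads: @spawn  # requires Julia >= 1.3

# Shared scratch buffer, reused across calls: fine sequentially,
# but two concurrent tasks writing into it would race.
const scratch = zeros(3)

function midpoint_buffered!(buf, a, b)
    @. buf = (a + b) / 2   # mutates the shared buffer
    return copy(buf)
end

# Buffer-free version: each call builds its result locally, so
# concurrent tasks cannot interfere; the compiler can often elide
# the temporary in sequential code anyway.
midpoint(a, b) = (a .+ b) ./ 2

# Sequential use of the buffered version is correct:
midpoint_buffered!(scratch, [0.0, 0.0, 0.0], [2.0, 2.0, 2.0])  # -> [1.0, 1.0, 1.0]

# Under @spawn, only the buffer-free variant is safe:
t1 = @spawn midpoint([0.0, 0.0, 0.0], [2.0, 2.0, 2.0])
t2 = @spawn midpoint([1.0, 1.0, 1.0], [3.0, 3.0, 3.0])
fetch(t1), fetch(t2)
```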
```julia
    high::Int,
    tree_data::TreeData,
    reorder::Bool,
    parallel::Val{true},
```
Using a `Val` and a separate function like this feels a bit awkward. Couldn't one just look at `parallel_size` in the original `build_BallTree` function and then decide whether to call the parallel function or the serial one?
Using type dispatch on the parallel variable is important, because the compiler is able to get rid of temporary allocations during sequential execution. I can isolate the recursive component of the function, though, and only use the `Val(true)` dispatch for that. If we only use a regular `if` statement on a `Bool`, performance during sequential execution will take a hit compared to the status quo.
This was run with a prior version where
I have a parallelized KDTree implementation locally too, but wanted to finish this one first. Do you prefer having everything in the same PR?
Yes, maybe this wasn't smart in retrospect. I thought at the time that this PR would be easy to merge and just built on top of it. Would you like me to edit the commit history of the current PR?
Overview

This PR parallelizes the construction of `BallTree` structures, achieving a speedup of a factor of 5 for `n = 1_000_000` points with 8 threads. The implementation uses `@spawn` and `@sync`, which requires raising the Julia compatibility entry to 1.3 and incrementing the minor version of this package.

Benchmarks
Setup
On Master
With this PR (updated after further edits with improved allocations)
Further, the PR still allows for sequential execution with the `parallel = false` keyword:
Summary

The parallel implementation yields a speedup even for small datasets of `n = 100` data points, and achieves a speedup of a factor of 3 for `n = 100_000` points.

Compared to the sequential code, memory allocation is up by about 10-20% in size and considerably in number. This is because the parallel code needs to allocate temporary arrays to avoid race conditions, while the sequential code reuses a single temporary. If allocations, rather than execution speed, are the concern, one can always use the `parallel = false` flag this PR provides.

The sequential option `parallel = false` maintains the same allocation behavior and comparable performance as the master branch. Notably, the sequential branch of this PR is consistently 20% faster on the `n = 100` test case compared to master.

The experiments were run on a 2021 MacBook Pro with an M1 Pro and 8 threads.