WaterLily v1.0.0
Upgraded the solver for backend-agnostic execution:
- New version of the `@loop` macro, which integrates KernelAbstractions.jl (KA) to run multi-threaded on CPUs and GPUs. This replaces the `@simd` version of `@loop` as well as the previous multi-threading code. Each `@loop <expr> over <I in R>` is expanded into a `@kernel` function and then run on the backend of the first variable in `<expr>` (see the usage sketch after this list).
- BREAKING CHANGE: Many high-level functions either don't compile on GPUs, don't run correctly, or run much slower than expected, even things as simple as `sum` or `LinearAlgebra.norm2`. These have been replaced in the code-base with lower-level functions, but unfortunately users will need to take extra care when defining things like `AutoBody(sdf, map)` functions (the constructor sketch after this list shows one GPU-friendly `sdf`).
- PERFORMANCE NOTE: KA allocates on the CPU on every loop. Reverting `@loop` to use `@simd` restores a perfectly non-allocating `sim_step!`. We tried other tools such as Polyester.jl, which had better multi-threading performance for small simulations, but large simulations are where we need the speed-up, so we chose KA.
- PERFORMANCE NOTE: `@loop` is not fully optimized. For example, there is an execution overhead for each `@loop` call on GPUs. A few of the loops have been combined to help reduce this overhead, but many more would require major refactoring or modification of the `@loop` macro. Despite this, we benchmarked up to a 182x speed-up with GPU execution.
- BREAKING CHANGE: The `Simulation` constructor arguments have changed. `dims` is now the internal field dimension `(L,2L)`, not `(L+2,2L+2)`, and `U` must now be an `NTuple`.
- The `Simulation` constructor also takes a new `mem=Array` argument, which can be set to `CUDA.CuArray` or `AMDGPU.ROCArray` to set up simulations on GPUs (see the constructor sketch after this list). The `Flow` and `Poisson` structs now use `AbstractArray`s for all fields to accommodate those array types.
- DEFAULT CHANGE: `sim_step!(remeasure=true)` is now the default as that is the safer (but slower) option.
- `Poisson` now shares memory for the `L`, `x`, and `z` fields with `Flow` to reduce the memory footprint. The `z` field holds the RHS vector and is mutated by `solve!`.
- The `SOR!` and `GS!` smoothers are not thread-safe and have been replaced with a Jacobi-preconditioned conjugate-gradient smoother held in the new routines `Jacobi!` and `pcg!`.
- PERFORMANCE NOTE: Because of the poor scaling on small fields, the number of multi-grid levels has been set to a default of `maxlevels=4`. The optimal number of levels is likely to be simulation and backend dependent.
- PERFORMANCE NOTE: `pcg!` requires a lot of inner products, which are somewhat slow. Switching to the data-driven approximate-inverse smoother may be beneficial in the future.
- Because of the poor scaling on small fields, the multi-grid-style recursive `apply_sdf!` has been replaced with `measure_sdf!`, which simply `@loop`s `body.sdf()`.
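A rough usage sketch of the new `@loop` syntax, assuming `@loop` and `inside` can be imported from WaterLily (they are internal utilities and may not be exported); the field `p` is purely illustrative:

```julia
using WaterLily: @loop, inside  # assumed imports; these internals may not be exported

# A small scalar field; build it as a CuArray (or ROCArray) to target a GPU instead.
p = zeros(Float32, 66, 130)

# Each `@loop <expr> over <I in R>` expands into a KernelAbstractions @kernel
# and runs on the backend of the first variable in <expr> (here, p).
@loop p[I] = I[1] + I[2] over I ∈ inside(p)
```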
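And a sketch of a set-up under the new constructor, assuming the familiar flow-past-a-circle parameters; the `circle` helper, Reynolds number, grid sizes, and the `ν` keyword follow the existing examples and are illustrative, not part of this release note:

```julia
using WaterLily
# using CUDA   # uncomment and pass mem=CUDA.CuArray to run the same set-up on an NVIDIA GPU

# Illustrative helper based on the standard circle example.
function circle(n, m; Re=250, U=1, mem=Array)
    radius, center = m/8, m/2
    # The signed-distance function uses simple element-wise math on the small
    # position vector so it compiles inside GPU kernels.
    body = AutoBody((x, t) -> √sum(abs2, x .- center) - radius)
    # dims is now the interior size (n, m), not (n+2, m+2),
    # and the free-stream velocity must be an NTuple such as (U, 0).
    Simulation((n, m), (U, 0), radius; ν=U*radius/Re, body, mem)
end

sim = circle(3*2^6, 2^7)                        # CPU by default
# sim = circle(3*2^6, 2^7; mem=CUDA.CuArray)    # or AMDGPU.ROCArray for AMD GPUs
sim_step!(sim, 1.0)                             # remeasure=true is now the (safer, slower) default
```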
There have also been many changes to the code outside of `src` to support the upgrade:
- The testing cases have been massively expanded. In particular, there are tests for every major function on CPU, CUDA, and AMDGPU backends.
- The benchmarks have been massively expanded. In particular, benchmarks for each function within `mom_step!` as well as for the 3D TGV and donut cases can be compared against previous commits, including pre-1.0 versions.
- The examples have been brought up-to-date, including GPU execution for the 3D examples and a new jellyfish example demonstrating a deforming geometry.
The only (intentional) modelling change was to add `correct_div!(σ)` to `Body.jl` to enable the deformable jellyfish example. This has nothing to do with the backend upgrade and should have been added to master and then merged in, but it wasn't.
Closed issues:
Merged pull requests:
- add function addBody (#35) (@Blagneaux)
- Update for new Makie (#37) (@asinghvi17)
- Boundary conditions kernel and dependencies (#38) (@b-fg)
- Flow.jl MWE (#39) (@b-fg)
- Moved creation of boundary conditions array out of Flow (#40) (@b-fg)
- Cleaned up CUDAEnv/Flow.jl and fixed allowscalar in tests. Fixed BCs too. (#41) (@b-fg)
- Started porting Flow.jl using KernelAbstractions.jl [WIP] (#43) (@b-fg)
- Changed from cu to CuArray the way to create arrays in GPU memory. (#45) (@b-fg)
- Added CUDAEnv/benchmark.jl where it breaks down mom_step. (#46) (@b-fg)
- mom_step benchmark (#47) (@b-fg)
- Added AMDGPU package (#48) (@b-fg)
- Update to 1.0 (#49) (@weymouth)