
Update to 1.0 #49

Merged: 109 commits merged into master from the CUDA branch on Apr 26, 2023

Conversation

@weymouth (Collaborator) commented Apr 26, 2023

Upgraded the solver for backend-agnostic execution:

  • New version of the @loop macro which integrates KernelAbstractions.jl (KA) to run multi-threaded on CPUs and GPUs. This replaces the @simd version of @loop as well as the previous multi-threading code. Each @loop <expr> over <I in R> is expanded into a @kernel function and then run on the backend of the first variable in <expr> (see the sketch just after this list).
  • BREAKING CHANGE: Many high-level functions either don't compile, run incorrectly, or run much more slowly than expected on GPUs; this includes things as simple as sum or LinearAlgebra's norm. These have been replaced in the code-base with lower-level functions, but unfortunately, users will need to take extra care when defining things like AutoBody(sdf, map) functions.
  • PERFORMANCE NOTE: KA allocates on the CPU for every loop call. Reverting @loop to use @simd restores a perfectly non-allocating sim_step!. We tried other tools like Polyester.jl, which had better multi-threading performance for small simulations, but large simulations are where we need the speed-up, so we chose KA.
  • PERFORMANCE NOTE: @loop is not fully optimized. For example, there is an execution overhead for each @loop call on GPUs. A few of the loops have been combined to help reduce this overhead, but many more would require major refactoring or modification of the @loop macro. Despite this, we benchmarked up to a 182x speed-up with GPU execution.
  • BREAKING CHANGE: The Simulation constructor arguments have changed. dims is now the internal field dimension (L,2L), not (L+2,2L+2), and U must now be an NTuple.
  • The Simulation constructor also takes a new mem=Array argument which can be set to CUDA.CuArray or AMDGPU.ROCArray to set up simulations on GPUs. The Flow and Poisson structs now use AbstractArrays for all fields to accommodate those array types (see the usage sketch at the end of this description).
  • DEFAULT CHANGE: sim_step!(remeasure=true) is now the default as that is the safer (but slower) option.
  • Poisson now shares memory for the L, x, and z fields with Flow to reduce the memory footprint. The z field holds the RHS vector and is mutated by solve!.
  • The SOR! and GS! smoothers are not thread-safe, so they have been replaced with a Jacobi-preconditioned conjugate-gradient smoother held in the new routines Jacobi! and pcg! (a sketch of the method also follows this list).
  • PERFORMANCE NOTE: Because of the poor scaling on small fields, the number of multi-grid levels now defaults to maxlevels=4. The optimal number of levels is likely to be simulation- and backend-dependent.
  • PERFORMANCE NOTE: pcg! requires a lot of inner products, which are somewhat slow. Switching to the data-driven approximate inverse smoother may be beneficial in the future.
  • Because of the poor scaling on small fields, the multi-grid-style recursive apply_sdf! has been replaced with measure_sdf!, which simply @loops body.sdf().
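
To give a flavour of the @loop design, here is a hand-written sketch of roughly what an expression like `@loop a[I] = b[I] + 1 over I ∈ R` turns into. This is illustrative only: the kernel name and launch details below are simplified and are not the macro's exact output.

```julia
using KernelAbstractions

# Illustrative sketch only: the real macro also renames composite arguments
# like `flow.u` (see the discussion of grab!/rep further down this page).
@kernel function _loop_kernel!(a, @Const(b), R)
    i = @index(Global, Linear)
    @inbounds I = R[i]          # map the linear work-item index to a CartesianIndex in R
    @inbounds a[I] = b[I] + 1
end

a, b = zeros(66, 66), rand(66, 66)
R = CartesianIndices((2:65, 2:65))               # interior cells only
backend = KernelAbstractions.get_backend(a)      # CPU() for a plain Array, GPU backend for device arrays
_loop_kernel!(backend, 64)(a, b, R, ndrange=length(R))
KernelAbstractions.synchronize(backend)
```

Running the same expression with `a` stored as a CUDA or AMDGPU array launches the kernel on that device instead, which is what makes the code backend-agnostic.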
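
For reference, here is a minimal sketch of a Jacobi-preconditioned conjugate-gradient solve on a generic matrix system. It only illustrates the method; the actual Jacobi! and pcg! routines work directly on the Poisson stencil fields, and the function below is not their implementation.

```julia
using LinearAlgebra

# Generic Jacobi-preconditioned CG; note the inner products per iteration,
# which are the cost flagged in the performance note above.
function jacobi_pcg!(x, A, b; tol=1e-6, maxiter=length(b))
    Dinv = 1 ./ diag(A)          # Jacobi preconditioner: inverse of the diagonal
    r = b - A * x                # initial residual
    z = Dinv .* r                # preconditioned residual
    p = copy(z)
    rz = dot(r, z)
    for _ in 1:maxiter
        Ap = A * p
        α = rz / dot(p, Ap)
        x .+= α .* p
        r .-= α .* Ap
        norm(r) < tol && break
        z .= Dinv .* r
        rz2 = dot(r, z)
        p .= z .+ (rz2 / rz) .* p
        rz = rz2
    end
    return x
end
```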

There have also been many changes to the code outside of src to support the upgrade:

  • The testing cases have been massively expanded. In particular, there are tests for every major function on CPU, CUDA, and AMDGPU backends.
  • The benchmarks have been massively expanded. In particular, benchmarks for each function within mom_step! as well as the 3D TGV and donut cases can be compared against previous commits, including pre 1.0 versions.
  • The examples have been brought up-to-date, including GPU execution for the 3D examples and a new jellyfish example demonstrating a deforming geometry.

The only (intentional) modelling change was to add correct_div!(σ) to Body.jl to enable the deformable jellyfish example. This has nothing to do with the backend upgrade and should have been added to master and then merged in - but it wasn't.
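
To make the new Simulation interface concrete, here is a usage sketch. The geometry, viscosity, and resolution values are purely illustrative, and the exact keyword spellings should be checked against the docs and examples.

```julia
using WaterLily, CUDA

# Illustrative set-up only: a sphere in a uniform flow, run on a CUDA GPU.
L = 64
body = AutoBody((x, t) -> √sum(abs2, x .- L) - L/4)  # keep sdf/map functions simple and GPU-friendly
sim = Simulation((4L, 2L, 2L),       # dims are now the internal field size (no ghost cells)
                 (1.0, 0.0, 0.0),    # U must now be an NTuple
                 L;                  # length scale
                 ν = L/250,          # illustrative viscosity choice (Re = 250 here)
                 body,
                 mem = CUDA.CuArray) # or AMDGPU.ROCArray; defaults to Array for the CPU
sim_step!(sim, 10)                   # remeasure=true is now the default
```
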

Gabe Weymouth and others added 30 commits March 24, 2023 17:25
Cleaned up the benchmarking code to use a 2D and 3D sphere test, as well as adding profiling.

2D is running fine, but 3D has some array allocation issue that I need to track down and fix.

The code is currently spending around 80% of its time on loops created with the @loop macro. I added Polyester.@batch to this macro, which does a nice job of maintaining approximately the same optimizations on 1 thread while improving performance on a multi-core machine. This required changing a few closures in MultiLevelPoisson, but otherwise, it's a 1-line improvement.
The memory leak was (of course) from the small closure changes. I made separate closure functions, and this let Polyester work and killed the allocations. I've also added a function DISABLE_PUSH() which I can use to quickly stop pushing to the CFL and Poisson count arrays. This lets me ensure the allocations are zero when running on a single thread. When running multi-threaded, there are allocations created for each loop - currently this is around O(100) allocations and only O(10 KiB).
The flux calculation in conv_diff! contributes to the forcing on the cells on either side of the face. I was doing this all in one loop, but this introduces a multi-threading conflict, so I was getting random forcing. I now pass in an array to calculate and store the flux, and then apply it to the force in a second loop, which avoids the conflict.
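
The pattern, in miniature (a sketch of the idea with a placeholder 1D flux model; this is not the actual conv_diff! code):

```julia
# Loop 1 fills the face flux array; loop 2 lets each cell update only itself,
# so no two threads ever write to the same entry.
function face_flux_update!(r, u, flux)
    N = length(flux)
    Threads.@threads for i in 1:N                    # compute and store the flux on every face
        @inbounds flux[i] = 0.5 * (u[i] + u[i+1])    # placeholder flux model
    end
    Threads.@threads for i in 2:N                    # each cell i only touches r[i]
        @inbounds r[i] = flux[i] - flux[i-1]
    end
    return r
end

u = rand(65); flux = zeros(64); r = zeros(64)
face_flux_update!(r, u, flux)
```
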
Updated the benchmark to check across problem sizes, 2D vs 3D as well as single vs multi-threaded. Compute time per time step is linear with length(u). Multi-threading on my machine gives a 2x speedup, except for very small jobs.
Adding a folder to help develop the code needed to make WaterLily run on GPUs
1. divergence.jl is a test case comparing the current scalar approach to a GPU approach using KernelAbstractions.jl and OffsetArrays. It's seeing around a 100x speed up for the given (large 2D) test case. Also has a sketch of a new Flow data type.
2. macros.jl defines a macro to automate the creation of kernels using @loop and @inside. This should allow a really easy switch to GPU computing.
3. clamped.jl is an attempt to avoid OffsetArrays and the ghost cells altogether. It works, and it's just as fast on the GPU (since it's bandwidth limited) but it's 4x slower on the CPU. So we'll stick to OffsetArrays.
Boundary conditions kernel and dependencies

`] activate CUDAEnv; resolve` will activate that project TOML without cluttering up the main one. Functions to get the new Flow working (commented out for the moment). Small improvements on bcs.jl in this line too.
Switching the code over to general arrays, starting with using OffsetArrays in Poisson. The new file tests this with and without offset and it gives identical solutions and times.

Note! This is a breaking change. The current versions of MultiLevelPoisson, Flow, and many other things won't work. They all need to be changed over.
The Poisson test was running the solver without resetting the initial guess, so the timings were not accurate. Also, Polyester's allocations were confusing the process, so I've removed it.
Added a sketch of the Flow struct (using a mapping to convert to CuArrays, which might be a nice way to do it).

The previous version of the @loop macro failed when given an expression with composite values like
```
@loop flow.div[I] = div_operator(I,flow.u)
```
which is a problem since there are TONS of these in the code-base.
Problem 1: `:(flow.u)` is an Expr not a Symbol, so `grab!()` broke this up into `:flow, :u`, which isn't what we want to pass to the function. This is solved by simply aborting the recursion when `head==:.`.
Problem 2: `:(flow.u)` is not a valid function argument, meaning you cannot write
```
function f(flow.div,flow.u) # invalid arguments
   I = @Index()
   flow.div[I] = div_operator(I,flow.u)
end
```
Instead you need to rename these arguments inside the function *including in the expression* and pass the full composite values at the call site, i.e.
```
function f(div,u) # rename arguments
   I = @Index()
   div[I] = div_operator(I,u) # replace arguments in expression
end
f(flow.div,flow.u) # call with full composite values
```
This was pretty easy, with some help from a discourse post. Now it's running and tested to match the hand-written kernel.
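
In sketch form, the idea behind the fixed grab!/rep pair looks something like this (simplified; not a copy of the real code in util.jl):

```julia
# Replace a composite access like `:(flow.u)` with its field name `:u` so it can
# be used as a kernel-function argument; leave anything else untouched.
rep(ex) = ex
rep(ex::Expr) = ex.head == :. ? Symbol(ex.args[2].value) : ex

# Walk an expression and collect the symbols (and composites) it depends on,
# aborting the recursion at `head == :.` so `flow.u` is kept whole.
grab!(syms, ex) = nothing
grab!(syms, ex::Symbol) = ex in syms || push!(syms, ex)
function grab!(syms, ex::Expr)
    ex.head == :. && return (ex in syms || push!(syms, ex))
    start = ex.head == :call ? 2 : 1     # skip the function name in call expressions
    foreach(a -> grab!(syms, a), ex.args[start:end])
end

syms = []
grab!(syms, :(flow.div[I] = div_operator(I, flow.u)))
args = rep.(syms)   # [:div, :I, :u] become the kernel arguments; call with flow.div, I, flow.u
```
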
I tried using the lambda function f to adapt the arrays, but Flow.jl:68 throws an error when doing so.
It makes little sense since all the other CUDA arrays are created without any issues when using this method.
I have commented out most of the WaterLily.jl and test\runtests.jl files. The plan is to roll out the changes to the code base one file at a time, and add the tests as we go.

Currently tested most of the serial code in util, including OA and the new version of inside().
To do so, KernelAbstractions, CUDA, and Adapt have been added to the Project.toml.
Another package has also been added, OutMacro, which is helpful for debugging/testing.
Added BC! and its kernel into utils.jl and a test for this.
Added adapt! into utils.jl, which can be used anywhere to adapt an array to the corresponding backend defined in utils.jl.
Finally, added a commented-out version of apply! and its kernel into utils.jl, but this could be written using the new @loop.
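
For context, the underlying Adapt.jl pattern is just the following (a sketch; the adapt! helper in utils.jl presumably wraps something like this for the backend chosen in utils.jl):

```julia
using Adapt, CUDA

x  = rand(Float32, 8)
xg = adapt(CuArray, x)   # move the data to the GPU array type
xc = adapt(Array, xg)    # and back to a host Array
```
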
… passing in Github.

Also bumped WaterLily version to 1.0.0 (all this is a major breaking change, so in my book this requires bumping the first digit).
Modified the GitHub workflow to use at least Julia 1.6, as defined in the Project.toml.
Added test for both CPU and GPU backends.
Removed text from tests and removed OutMacro package.
Moved creation of boundary conditions array out of Flow
The main thing is using the new KernelAbstractions versions of @inside and @loop. Cleaned up the file a bit and updated the apply! tests.

I took out the backend stuff for now. We can add it back later if needed.
…U or GPU arrays.

It also uses the new apply! with @loop.
Fixed allowscalar in tests. According to the docs, this is the correct way to allow scalar indexing for a single line or block.
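
For reference, the block-scoped form from the CUDA.jl docs looks like this (minimal sketch of how it is used in a test):

```julia
using CUDA, Test

a = CUDA.zeros(4)
# Scalar indexing of a CuArray is disallowed by default in tests and scripts;
# the macro form only permits it for the wrapped expression.
@test CUDA.@allowscalar a[1] == 0f0
```
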
Sooooo many bug fixes, in Poisson and in the @loop macro. Tests are now running and working on the CPU in 2D.
There is also a bug in the BC! code, but I've side-stepped this for the moment.
Fix to ensure grab!() applies rep() to the arguments, and added test in the util test set.
Ran Poisson test for CUDA arrays and it worked without change... which seems wrong since SOR!() seems to use scalar indexing. I switched to use GS!() instead, but we need to figure out if scalar indexing is being checked properly.
Cleaned up CUDAEnv/Flow.jl and removed CUDAEnv/bcs.jl, since the latter is already in utils.jl and tested.
Added OutMacro.jl package as extra since it is useful for debugging.
compiling and running, but not working on large grids.
These still fail on GPU since the L2 function in utils.jl uses scalar indexing.
Tests for utils.jl are all passing.
b-fg and others added 22 commits April 24, 2023 16:39
The solver can be run on an AMD GPU using Julia 1.9.0. This has been tested on the BSC AMD cluster with a TGV case.
To run on AMD, simply change the array type from Array to ROCArray.
Tests now also consider AMD devices if available.
The TOML.jl package had to be explicitly added too, even though it is part of the standard library - unsure why.
This seems like a nice way to generalize the Makie video generation. Changed the TGV example to use lambda_2 contours.
I was getting a visual artifact between the two levels on either side of zero. Setting `levels` to an even number gets rid of it.
Added AMDGPU package
codecov bot commented Apr 26, 2023

Codecov Report

Merging #49 (46d1e01) into master (a340b5a) will increase coverage by 6.02%.
The diff coverage is 96.52%.

❗ Current head 46d1e01 differs from pull request most recent head a82eb10. Consider uploading reports for the commit a82eb10 to get more accurate results

```diff
@@            Coverage Diff             @@
##           master      #49      +/-   ##
==========================================
+ Coverage   87.69%   93.71%   +6.02%
==========================================
  Files           8        8
  Lines         325      382      +57
==========================================
+ Hits          285      358      +73
+ Misses         40       24      -16
```
| Impacted Files | Coverage Δ |
|---|---|
| src/Body.jl | 80.00% <75.00%> (-3.34%) ⬇️ |
| src/WaterLily.jl | 83.33% <80.00%> (-0.88%) ⬇️ |
| src/util.jl | 84.37% <89.28%> (+7.58%) ⬆️ |
| src/AutoBody.jl | 85.71% <92.30%> (-3.58%) ⬇️ |
| src/MultiLevelPoisson.jl | 98.50% <97.43%> (+9.41%) ⬆️ |
| src/Flow.jl | 98.64% <100.00%> (+13.90%) ⬆️ |
| src/Metrics.jl | 100.00% <100.00%> (ø) |
| src/Poisson.jl | 100.00% <100.00%> (+4.08%) ⬆️ |


Review comments on test/runtests.jl and benchmark/donut/donut.jl (outdated, resolved).
@weymouth weymouth linked an issue Apr 26, 2023 that may be closed by this pull request
@weymouth weymouth added the enhancement New feature or request label Apr 26, 2023
@weymouth weymouth merged commit 247b886 into master Apr 26, 2023
@weymouth weymouth deleted the CUDA branch June 8, 2023 19:51
Successfully merging this pull request may close the linked issue: Use KernelAbstractions for loops?