
Update to 1.0 #49

Merged: 109 commits merged into master from the CUDA branch on Apr 26, 2023

Conversation

@weymouth (Collaborator) commented Apr 26, 2023

Upgraded the solver for backend-agnostic execution:

  • New version of the @loop macro which integrates KernelAbstractions.jl (KA) to run multi-threaded on CPUs and GPUs. This replaces the @simd version of @loop as well as the previous multi-threading code. Each @loop <expr> over <I in R> is expanded into a @kernel function and then run on the backend of the first variable in <expr> (see the sketch just after this list).
  • BREAKING CHANGE: Many high-level functions either don't compile, run incorrectly, or run much more slowly than expected on GPUs; this includes things as simple as sum or LinearAlgebra's norm. These have been replaced in the code-base with lower-level functions, but unfortunately, users will need to take extra care when defining things like AutoBody(sdf, map) functions.
  • PERFORMANCE NOTE: KA allocates on the CPU for every loop call. Reverting @loop to use @simd restores a perfectly non-allocating sim_step!. We tried other tools like Polyester.jl, which had better multi-threading performance for small simulations, but large simulations are where we need the speed-up, so we chose KA.
  • PERFORMANCE NOTE: @loop is not fully optimized. For example, there is an execution overhead for each @loop call on GPUs. A few of the loops have been combined to help reduce this overhead, but many more would require major refactoring or modification of the @loop macro. Despite this, we benchmarked up to a 182x speed-up with GPU execution.
  • BREAKING CHANGE: The Simulation constructor arguments have changed. dims is now the internal field dimension (L,2L), not (L+2,2L+2), and U must now be an NTuple.
  • The Simulation constructor also takes a new mem=Array argument which can be set to CUDA.CuArray or AMDGPU.ROCArray to set up simulations on GPUs. The Flow and Poisson structs now use AbstractArrays for all fields to accommodate those array types (see the usage sketch at the end of this description).
  • DEFAULT CHANGE: sim_step!(remeasure=true) is now the default as that is the safer (but slower) option.
  • Poisson now shares memory for the L, x, and z fields with Flow to reduce the memory footprint. The z field holds the RHS vector and is mutated by solve!.
  • The SOR! and GS! smoothers are not thread-safe, so they have been replaced with a Jacobi-preconditioned conjugate-gradient smoother held in the new routines Jacobi! and pcg! (a sketch of the method also follows this list).
  • PERFORMANCE NOTE: Because of the poor scaling on small fields, the number of multi-grid levels now defaults to maxlevels=4. The optimal number of levels is likely to be simulation- and backend-dependent.
  • PERFORMANCE NOTE: pcg! requires a lot of inner products, which are somewhat slow. Switching to the data-driven approximate inverse smoother may be beneficial in the future.
  • Because of the poor scaling on small fields, the multi-grid-style recursive apply_sdf! has been replaced with measure_sdf!, which simply @loops body.sdf().
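
To give a flavour of the @loop design, here is a hand-written sketch of roughly what an expression like `@loop a[I] = b[I] + 1 over I ∈ R` turns into. This is illustrative only: the kernel name and launch details below are simplified and are not the macro's exact output.

```julia
using KernelAbstractions

# Illustrative sketch only: the real macro also renames composite arguments
# like `flow.u` (see the discussion of grab!/rep further down this page).
@kernel function _loop_kernel!(a, @Const(b), R)
    i = @index(Global, Linear)
    @inbounds I = R[i]          # map the linear work-item index to a CartesianIndex in R
    @inbounds a[I] = b[I] + 1
end

a, b = zeros(66, 66), rand(66, 66)
R = CartesianIndices((2:65, 2:65))               # interior cells only
backend = KernelAbstractions.get_backend(a)      # CPU() for a plain Array, GPU backend for device arrays
_loop_kernel!(backend, 64)(a, b, R, ndrange=length(R))
KernelAbstractions.synchronize(backend)
```

Running the same expression with `a` stored as a CUDA or AMDGPU array launches the kernel on that device instead, which is what makes the code backend-agnostic.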
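
For reference, here is a minimal sketch of a Jacobi-preconditioned conjugate-gradient solve on a generic matrix system. It only illustrates the method; the actual Jacobi! and pcg! routines work directly on the Poisson stencil fields, and the function below is not their implementation.

```julia
using LinearAlgebra

# Generic Jacobi-preconditioned CG; note the inner products per iteration,
# which are the cost flagged in the performance note above.
function jacobi_pcg!(x, A, b; tol=1e-6, maxiter=length(b))
    Dinv = 1 ./ diag(A)          # Jacobi preconditioner: inverse of the diagonal
    r = b - A * x                # initial residual
    z = Dinv .* r                # preconditioned residual
    p = copy(z)
    rz = dot(r, z)
    for _ in 1:maxiter
        Ap = A * p
        α = rz / dot(p, Ap)
        x .+= α .* p
        r .-= α .* Ap
        norm(r) < tol && break
        z .= Dinv .* r
        rz2 = dot(r, z)
        p .= z .+ (rz2 / rz) .* p
        rz = rz2
    end
    return x
end
```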

There have also been many changes to the code outside of src to support the upgrade:

  • The testing cases have been massively expanded. In particular, there are tests for every major function on CPU, CUDA, and AMDGPU backends.
  • The benchmarks have been massively expanded. In particular, benchmarks for each function within mom_step! as well as the 3D TGV and donut cases can be compared against previous commits, including pre 1.0 versions.
  • The examples have been brought up-to-date, including GPU execution for the 3D examples and a new jellyfish example demonstrating a deforming geometry.

The only (intentional) modelling change was to add correct_div!(σ) to Body.jl to enable the deformable jellyfish example. This has nothing to do with the backend upgrade and should have been added to master and then merged in - but it wasn't.
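
To make the new Simulation interface concrete, here is a usage sketch. The geometry, viscosity, and resolution values are purely illustrative, and the exact keyword spellings should be checked against the docs and examples.

```julia
using WaterLily, CUDA

# Illustrative set-up only: a sphere in a uniform flow, run on a CUDA GPU.
L = 64
body = AutoBody((x, t) -> √sum(abs2, x .- L) - L/4)  # keep sdf/map functions simple and GPU-friendly
sim = Simulation((4L, 2L, 2L),       # dims are now the internal field size (no ghost cells)
                 (1.0, 0.0, 0.0),    # U must now be an NTuple
                 L;                  # length scale
                 ν = L/250,          # illustrative viscosity choice (Re = 250 here)
                 body,
                 mem = CUDA.CuArray) # or AMDGPU.ROCArray; defaults to Array for the CPU
sim_step!(sim, 10)                   # remeasure=true is now the default
```
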

Gabe Weymouth and others added 30 commits March 24, 2023 17:25
Cleaned up the benchmarking code to use a 2D and 3D sphere test, as well as adding profiling.

2D is running fine, but 3D has some array allocation issue that I need to track down and fix.

The code is currently spending around 80% of its time on loops created with the @loop macro. I added Polyester.@batch to this macro, which does a nice job of maintaining approximately the same optimizations on 1 thread while improving performance on a multi-core machine. This required changing a few closures in MultiLevelPoisson, but otherwise, it's a 1-line improvement.
The memory leak was (of course) from the small closure changes. I made separate closure functions, and this let Polyester work and killed the allocations. I've also added a function DISABLE_PUSH() which I can use to quickly stop pushing to the CFL and Poisson count arrays. This lets me ensure the allocations are zero when running on a single thread. When running multi-threaded, there are allocations created for each loop - currently this is around O(100) allocations and only O(10 KiB).
The flux calculation in conv_diff! contributes to the forcing on the cells on either side of the face. I was doing this all in one loop, but this introduces a multi-threading conflict, so I was getting random forcing. I now pass in an array to calculate and store the flux, and then apply it to the force in a second loop, which avoids the conflict.
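
The pattern, in miniature (a sketch of the idea with a placeholder 1D flux model; this is not the actual conv_diff! code):

```julia
# Loop 1 fills the face flux array; loop 2 lets each cell update only itself,
# so no two threads ever write to the same entry.
function face_flux_update!(r, u, flux)
    N = length(flux)
    Threads.@threads for i in 1:N                    # compute and store the flux on every face
        @inbounds flux[i] = 0.5 * (u[i] + u[i+1])    # placeholder flux model
    end
    Threads.@threads for i in 2:N                    # each cell i only touches r[i]
        @inbounds r[i] = flux[i] - flux[i-1]
    end
    return r
end

u = rand(65); flux = zeros(64); r = zeros(64)
face_flux_update!(r, u, flux)
```
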
Updated the benchmark to check across problem sizes, 2D vs 3D as well as single vs multi-threaded. Compute time per time step is linear with length(u). Multi-threading on my machine gives a 2x speedup, except for very small jobs.
Adding a folder to help develop the code needed to make WaterLily run on GPUs
1. divergence.jl is a test case comparing the current scalar approach to a GPU approach using KernelAbstractions.jl and OffsetArrays. It's seeing around a 100x speed up for the given (large 2D) test case. Also has a sketch of a new Flow data type.
2. macros.jl defines a macro to automate the creation of kernels using @loop and @inside. This should allow a really easy switch to GPU computing.
3. clamped.jl is an attempt to avoid OffsetArrays and the ghost cells altogether. It works, and it's just as fast on the GPU (since it's bandwidth limited) but it's 4x slower on the CPU. So we'll stick to OffsetArrays.
Boundary conditions kernel and dependencies

`] activate CUDAEnv; resolve` will activate that project TOML without cluttering up the main one. Functions to get the new Flow working (commented out for the moment). Small improvements on bcs.jl in this line too.
Switching the code over to general arrays, starting with using OffsetArrays in Poisson. The new file tests this with and without offset and it gives identical solutions and times.

Note! This is a breaking change. The current versions of MultiLevelPoisson, Flow, and many other things won't work. They all need to be changed over.
The Poisson test was running the solver without resetting the initial guess, so the timings were not accurate. Also, Polyester's allocations were confusing the process, so I've removed it.
Added a sketch of the Flow struct (using a mapping to convert to CuArrays, which might be a nice way to do it).

The previous version of the @loop macro failed when given an expression with composite values like
```
@loop flow.div[I] = div_operator(I,flow.u)
```
which is a problem since there are TONS of these in the code-base.
Problem 1: `:(flow.u)` is an Expr not a Symbol, so `grab!()` broke this up into `:flow, :u`, which isn't what we want to pass to the function. This is solved by simply aborting the recursion when `head==:.`.
Problem 2: `:(flow.u)` is not a valid function argument, meaning you cannot write
```
function f(flow.div,flow.u) # invalid arguments
   I = @Index()
   flow.div[I] = div_operator(I,flow.u)
end
```
Instead you need to rename these arguments inside the function *including in the expression* and pass the full composite values at the call site, i.e.
```
function f(div,u) # rename arguments
   I = @Index()
   div[I] = div_operator(I,u) # replace arguments in expression
end
f(flow.div,flow.u) # call with full composite values
```
This was pretty easy, with some help from a discourse post. Now it's running and tested to match the hand-written kernel.
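
In sketch form, the idea behind the fixed grab!/rep pair looks something like this (simplified; not a copy of the real code in util.jl):

```julia
# Replace a composite access like `:(flow.u)` with its field name `:u` so it can
# be used as a kernel-function argument; leave anything else untouched.
rep(ex) = ex
rep(ex::Expr) = ex.head == :. ? Symbol(ex.args[2].value) : ex

# Walk an expression and collect the symbols (and composites) it depends on,
# aborting the recursion at `head == :.` so `flow.u` is kept whole.
grab!(syms, ex) = nothing
grab!(syms, ex::Symbol) = ex in syms || push!(syms, ex)
function grab!(syms, ex::Expr)
    ex.head == :. && return (ex in syms || push!(syms, ex))
    start = ex.head == :call ? 2 : 1     # skip the function name in call expressions
    foreach(a -> grab!(syms, a), ex.args[start:end])
end

syms = []
grab!(syms, :(flow.div[I] = div_operator(I, flow.u)))
args = rep.(syms)   # [:div, :I, :u] become the kernel arguments; call with flow.div, I, flow.u
```
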
I tried using the lambda function f to adapt the arrays, but Flow.jl:68 throws an error when doing so.
It makes little sense since all the other CUDA arrays are created without any issues when using this method.
I have commented out most of the WaterLily.jl and test\runtests.jl files. The plan is to roll out the changes to the code base one file at a time, and add the tests as we go.

Currently tested most of the serial code in util, including OA and the new version of inside().
To do so, KernelAbstractions, CUDA, and Adapt have been added to the Project.toml.
Another package has also been added, OutMacro, which is helpful for debugging/testing.
Added BC! and its kernel into utils.jl and a test for this.
Added adapt! into utils.jl, which can be used anywhere to adapt an array to the corresponding backend defined in utils.jl.
Finally, added a commented-out version of apply! and its kernel into utils.jl, but this could be written using the new @loop.
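
For context, the underlying Adapt.jl pattern is just the following (a sketch; the adapt! helper in utils.jl presumably wraps something like this for the backend chosen in utils.jl):

```julia
using Adapt, CUDA

x  = rand(Float32, 8)
xg = adapt(CuArray, x)   # move the data to the GPU array type
xc = adapt(Array, xg)    # and back to a host Array
```
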
… passing in Github.

Also bumped WaterLily version to 1.0.0 (all this is a major breaking change, so in my book this requires bumping the first digit).
Modified the GitHub workflow to use at least Julia 1.6, as defined in the Project.toml.
Added test for both CPU and GPU backends.
Removed text from tests and removed OutMacro package.
Moved creation of boundary conditions array out of Flow
The main thing is using the new KernelAbstractions versions of @inside and @loop. Cleaned up the file a bit and updated the apply! tests.

I took out the backend stuff for now. We can add it back later if needed.
…U or GPU arrays.

It also uses the new apply! with @loop.
Fixed allowscalar in tests. According to the docs, this is the correct way to allow scalar indexing for a single line or block.
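
For reference, the block-scoped form from the CUDA.jl docs looks like this (minimal sketch of how it is used in a test):

```julia
using CUDA, Test

a = CUDA.zeros(4)
# Scalar indexing of a CuArray is disallowed by default in tests and scripts;
# the macro form only permits it for the wrapped expression.
@test CUDA.@allowscalar a[1] == 0f0
```
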
Sooooo many bug fixes, in Poisson and in the @loop macro. Tests are now running and working on the CPU in 2D.
There is also a bug in the BC! code, but I've side-stepped this for the moment.
Fix to ensure grab!() applies rep() to the arguments, and added test in the util test set.
Ran Poisson test for CUDA arrays and it worked without change... which seems wrong since SOR!() seems to use scalar indexing. I switched to use GS!() instead, but we need to figure out if scalar indexing is being checked properly.
Cleaned up CUDAEnv/Flow.jl and removed CUDAEnv/bcs.jl, since the latter is already in utils.jl and tested.
Added OutMacro.jl package as extra since it is useful for debugging.
compiling and running, but not working on large grids.
These still fail on GPU since the L2 function in utils.jl uses scalar indexing.
Tests for utils.jl are all passing.
b-fg and others added 22 commits April 24, 2023 16:39
The solver can be run on an AMD GPU using Julia 1.9.0. This has been tested on the BSC AMD cluster with a TGV case.
To run on AMD, simply change the array type from Array to ROCArray.
Tests now also consider AMD devices if available.
The TOML.jl package had to be explicitly added too, even though it is part of the standard library - unsure why.
This seems like a nice way to generalize the Makie video generation. Changed the TGV example to use lambda_2 contours.
I was getting a visual artifact between the two levels on either side of zero. Setting `levels` to an even number gets rid of it.
Added AMDGPU package
codecov bot commented Apr 26, 2023

Codecov Report

Merging #49 (46d1e01) into master (a340b5a) will increase coverage by 6.02%.
The diff coverage is 96.52%.

❗ Current head 46d1e01 differs from pull request most recent head a82eb10. Consider uploading reports for the commit a82eb10 to get more accurate results

```diff
@@            Coverage Diff             @@
##           master      #49      +/-   ##
==========================================
+ Coverage   87.69%   93.71%   +6.02%
==========================================
  Files           8        8
  Lines         325      382      +57
==========================================
+ Hits          285      358      +73
+ Misses         40       24      -16
```
| Impacted Files | Coverage Δ |
|---|---|
| src/Body.jl | 80.00% <75.00%> (-3.34%) ⬇️ |
| src/WaterLily.jl | 83.33% <80.00%> (-0.88%) ⬇️ |
| src/util.jl | 84.37% <89.28%> (+7.58%) ⬆️ |
| src/AutoBody.jl | 85.71% <92.30%> (-3.58%) ⬇️ |
| src/MultiLevelPoisson.jl | 98.50% <97.43%> (+9.41%) ⬆️ |
| src/Flow.jl | 98.64% <100.00%> (+13.90%) ⬆️ |
| src/Metrics.jl | 100.00% <100.00%> (ø) |
| src/Poisson.jl | 100.00% <100.00%> (+4.08%) ⬆️ |


Review comments on test/runtests.jl and benchmark/donut/donut.jl (outdated, resolved).
@weymouth weymouth linked an issue Apr 26, 2023 that may be closed by this pull request
@weymouth weymouth added the enhancement New feature or request label Apr 26, 2023
@weymouth weymouth merged commit 247b886 into master Apr 26, 2023
@weymouth weymouth deleted the CUDA branch June 8, 2023 19:51
Successfully merging this pull request may close the linked issue: Use KernelAbstractions for loops?