Update to 1.0 #49
Merged
Conversation
Cleaned up the benchmarking code to use a 2D and 3D sphere test, as well as adding profiling. 2D is running fine, but 3D has some array allocation issue that I need to track down and fix. The code is currently spending around 80% of its time in loops created with the `@loop` macro. I added `Polyester.@batch` to this macro, which does a nice job of maintaining approximately the same optimizations on one thread while improving performance on a multi-core machine. This required changing a few closures in MultiLevelPoisson, but otherwise it's a one-line improvement.
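For reference, a minimal sketch of the serial vs. threaded pattern being swapped inside the macro (the loop bodies and function names here are purely illustrative, not the actual `@loop` expansion):

```julia
using Polyester  # provides the @batch threading macro

# Serial version: @simd lets the compiler vectorize the inner loop.
function scale_simd!(a, b, c)
    @simd for i in eachindex(a, b)
        @inbounds a[i] = c * b[i]
    end
end

# Threaded version: @batch keeps roughly the same single-thread
# optimizations but spreads iterations across cores when available.
function scale_batch!(a, b, c)
    @batch for i in eachindex(a, b)
        @inbounds a[i] = c * b[i]
    end
end
```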
The memory leak was (of course) from the small closure changes. I made separate closure functions, which let Polyester work and killed the allocations. I've also added a function `DISABLE_PUSH()` which I can use to quickly stop pushing to the CFL and Poisson count arrays. This lets me ensure the allocations are zero when running on a single thread. When running multi-threaded, allocations are created for each loop - currently around O(100) allocations totalling only O(10 KiB).
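A rough sketch of the kind of allocation check this enables (the step function name is hypothetical; `@allocated` is the standard Base macro):

```julia
step!(sim)                       # warm-up call so compilation is not counted
bytes = @allocated step!(sim)    # bytes allocated by one time step
# With diagnostic pushes disabled and a single thread, this should be zero.
@assert bytes == 0 "unexpected allocations: $bytes bytes"
```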
The flux calculation in conv_diff! contributes to the forcing on the cells on either side of the face. I was doing this all in one loop, but this introduced a multi-threading conflict, so I was getting random forcing. I've passed in an array to calculate and store the flux, and then apply it to the force, splitting the work into two loops to avoid the conflict.
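A minimal sketch of that two-pass pattern on a 1D stencil for clarity (function and array names are illustrative, not the actual conv_diff! code):

```julia
# Pass 1: compute and store the flux on every face. Each iteration writes
# only to its own face entry, so there is no write conflict.
function face_flux!(flux, u)
    Threads.@threads for i in 2:length(u)
        @inbounds flux[i] = 0.5 * (u[i-1] + u[i])  # illustrative face value
    end
end

# Pass 2: apply the stored flux to the cell forcing. Each iteration writes
# only to its own cell, avoiding the race a single fused loop would create
# (two faces updating the same cell concurrently).
function apply_flux!(force, flux)
    Threads.@threads for i in 2:length(force)-1
        @inbounds force[i] = flux[i+1] - flux[i]
    end
end
```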
Updated the benchmark to check across problem sizes, 2D vs 3D, and single vs multi-threaded execution. Compute time per time step is linear in length(u). Multi-threading on my machine gives a 2x speed-up, except for very small jobs.
Adding a folder to help develop the code needed to make WaterLily run on GPUs:
1. `divergence.jl` is a test case comparing the current scalar approach to a GPU approach using KernelAbstractions.jl and OffsetArrays (a rough kernel sketch follows below). It sees around a 100x speed-up for the given (large 2D) test case. It also has a sketch of a new Flow data type.
2. `macros.jl` defines a macro to automate the creation of kernels using `@loop` and `@inside`. This should allow a really easy switch to GPU computing.
3. `clamped.jl` is an attempt to avoid OffsetArrays and the ghost cells altogether. It works, and it's just as fast on the GPU (since it's bandwidth limited), but it's 4x slower on the CPU, so we'll stick to OffsetArrays.
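For orientation, a hedged sketch of the kind of KernelAbstractions.jl kernel being compared here; the array layout and names are assumptions rather than the contents of `divergence.jl`, and the launch API follows current KernelAbstractions conventions:

```julia
using KernelAbstractions

# Central-difference divergence of a 2D staggered velocity field:
# u[:,:,1] is the x-component, u[:,:,2] the y-component.
@kernel function div_kernel!(σ, @Const(u))
    i, j = @index(Global, NTuple)
    σ[i, j] = (u[i+1, j, 1] - u[i, j, 1]) + (u[i, j+1, 2] - u[i, j, 2])
end

# Launch on whatever backend holds σ; swapping to GPU arrays (and a GPU
# backend) is what gives the reported speed-up.
n = 256
σ = zeros(n, n); u = zeros(n + 1, n + 1, 2)
backend = KernelAbstractions.get_backend(σ)
div_kernel!(backend, 64)(σ, u; ndrange = size(σ))
KernelAbstractions.synchronize(backend)
```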
Boundary conditions kernel and dependencies
`] activate CUDAEnv; resolve` will activate that project TOML without cluttering up the main Project.toml. Functions to get the new Flow working (commented out for the moment). Small improvements on bcs.jl in this line too.
Switching the code over to general arrays, starting with using OffsetArrays in Poisson. The new file tests this with and without the offset, and it gives identical solutions and times. Note! This is a breaking change. The current versions of MultiLevelPoisson, Flow, and many other things won't work; they all need to be changed over.
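For context, a small sketch of the idea behind the change, using the real OffsetArrays API with illustrative names: the ghost layer lives at indices 0 and N+1, so interior loops keep their natural 1:N range.

```julia
using OffsetArrays

N = 8
# Interior cells are 1:N in each direction; ghost cells sit at 0 and N+1.
p = OffsetArray(rand(N + 2, N + 2), 0:N+1, 0:N+1)
lap = zeros(N, N)

# The stencil reaches into the ghost layer without any index shifting
# or bounds errors.
for I in CartesianIndices(lap)
    lap[I] = p[I + CartesianIndex(1, 0)] + p[I - CartesianIndex(1, 0)] +
             p[I + CartesianIndex(0, 1)] + p[I - CartesianIndex(0, 1)] - 4p[I]
end
```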
Flow.jl MWE
The Poisson test was running the solver without resetting the initial guess, so the timings were not accurate. Also, Polyester's allocations were confusing the process, so I've removed it.
Added a sketch of the Flow struct (using a mapping to convert to CuArrays, which might be a nice way to do it). The previous version of the `@loop` macro failed when given an expression with composite values like

```julia
@loop flow.div[I] = div_operator(I,flow.u)
```

which is a problem since there are TONS of these in the code-base.

Problem 1: `:(flow.u)` is an `Expr`, not a `Symbol`, so `grab!()` broke this up into `:flow, :u`, which isn't what we want to pass to the function. This is solved by simply aborting the recursion when `head==:.`.

Problem 2: `:(flow.u)` is not a valid function argument, meaning you cannot write

```julia
function f(flow.div,flow.u) # invalid arguments
    I = @Index()
    flow.div[I] = div_operator(I,flow.u)
end
```

Instead you need to replace these with the composite value itself *including in the expression*, i.e.

```julia
function f(div,u) # rename arguments
    I = @Index()
    div[I] = div_operator(I,u) # replace arguments in expression
end
f(flow.div,flow.u) # call with full composite values
```

This was pretty easy, with some help from a discourse post. Now it's running and tested to match the hand-written kernel.
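A hedged sketch of the symbol-gathering idea for Problem 1 (this is not the actual `grab!()` code, just an illustration of stopping the recursion at dot expressions):

```julia
# Walk an expression and collect the values a generated kernel would need
# as arguments, treating composite accesses like :(flow.u) as a single unit
# instead of recursing into their parts.
function collect_args!(args, ex)
    ex isa Symbol && return push!(args, ex)
    ex isa Expr || return args
    ex.head == Symbol(".") && return push!(args, ex)  # stop at flow.u etc.
    for a in ex.args
        collect_args!(args, a)
    end
    return args
end

collect_args!(Any[], :(flow.div[I] = div_operator(I, flow.u)))
# → Any[:(flow.div), :I, :div_operator, :I, :(flow.u)]
#   (duplicates and function names would still need filtering, as in the
#    real macro)
```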
I tried using the lambda function f to adapt the arrays, but Flow.jl:68 throws an error when doing so. It makes little sense since all the other CUDA arrays are created without any issues when using this method.
I have commented out most of the WaterLily.jl and test/runtests.jl files. The plan is to roll out the changes to the code base one file at a time, adding tests as we go. Currently most of the serial code in util is tested, including OA and the new version of `inside()`.
To do so, KernelAbstractions, CUDA, and Adapt have been added to the Project.toml. Another package, OutMacro, has also been added, which is helpful for debugging/testing. Added `BC!` and its kernel into utils.jl, plus a test for it. Added `adapt!` into utils.jl, which can be used anywhere to adapt an array to the corresponding backend defined in utils.jl. Finally, added a commented-out version of `apply!` and its kernel into utils.jl, but this could be written using the new `@loop`.
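A hedged sketch of what a backend-adapting helper along those lines can look like, using the real Adapt.jl API (the helper name and the backend constant are assumptions, not the actual utils.jl code):

```julia
using Adapt

# Hypothetical backend switch: set to CUDA.CuArray (or AMDGPU.ROCArray)
# to move data to the GPU, or leave as Array for the CPU.
const BACKEND_ARRAY = Array

# Adapt.adapt converts the storage of an array (or of a struct that opts
# in via Adapt.@adapt_structure) to the requested array type.
to_backend(x) = adapt(BACKEND_ARRAY, x)

u = to_backend(rand(Float32, 16, 16))  # stays an Array here; becomes a
                                       # CuArray if BACKEND_ARRAY = CuArray
```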
… passing in GitHub. Also bumped the WaterLily version to 1.0.0 (all this is a major breaking change, so in my book this requires bumping the first digit). Modified the GitHub workflow to use at least Julia 1.6, as defined in the Project.toml.
Added tests for both CPU and GPU backends. Removed text from the tests and removed the OutMacro package.
Moved creation of boundary conditions array out of Flow
…U or GPU arrays. It also uses the new `apply!` with `@loop`. Fixed `allowscalar` in the tests. According to the docs, this is the correct way to handle `allowscalar` for single lines and blocks.
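A short hedged sketch of that pattern using the documented CUDA.jl API (requires a CUDA device; the test values are illustrative):

```julia
using CUDA, Test

a = CUDA.zeros(4)
CUDA.allowscalar(false)   # error loudly if code falls back to slow
                          # element-by-element indexing of GPU arrays
@test sum(a) == 0         # reductions run as proper GPU kernels

CUDA.@allowscalar begin   # explicitly permit scalar indexing for one block,
    @test a[1] == 0       # e.g. to check a single value inside a test
end
```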
Sooooo many bug fixes, in Poisson and in the `@loop` macro. Tests are now running and working on CPU in 2D. There is also a bug in the `BC!` code, but I've side-stepped this for the moment.
Fix to ensure `grab!()` applies `rep()` to the arguments, and added a test in the util test set. Ran the Poisson test for CUDA arrays and it worked without change... which seems wrong, since `SOR!()` appears to use scalar indexing. I switched to `GS!()` instead, but we need to figure out whether scalar indexing is being checked properly.
Cleaned up CUDAEnv/Flow.jl and removed CUDAEnv/bcs.jl, since the latter is already in utils.jl and tested. Added OutMacro.jl package as extra since it is useful for debugging.
compiling and running, but not working on large grids.
These still fail on the GPU since the L2 function in utils.jl uses scalar indexing. Tests for utils.jl are all passing.
The solver can be run on an AMD GPU using Julia 1.9.0. This has been tested on the BSC AMD cluster with a TGV case. To run on AMD, simply change the array type from Array to ROCArray. Tests now also consider AMD devices if available. The TOML.jl package had to be explicitly added too, even though it is part of the standard library - unsure why.
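A small sketch of how that array-type switch and the "if available" test check can look, using the real AMDGPU.jl entry points (the variable names are illustrative):

```julia
using AMDGPU

# Pick the array type once; the rest of the code is written against
# AbstractArray, so CPU and AMD GPU runs share the same source.
ArrayT = AMDGPU.functional() ? ROCArray : Array
u = ArrayT(zeros(Float32, 64, 64, 2))  # lives on the GPU when a ROCm
                                       # device is available
```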
This seems like a nice way to generalize the Makie video generation. Changed the TGV example to use lambda_2 contours.
I was getting a visual artifact between the two levels on either side of zero. Setting `levels` to an even number gets rid of it.
Added AMDGPU package
Codecov Report
```
@@            Coverage Diff             @@
##           master      #49      +/-   ##
==========================================
+ Coverage   87.69%   93.71%   +6.02%
==========================================
  Files           8        8
  Lines         325      382      +57
==========================================
+ Hits          285      358      +73
+ Misses         40       24      -16
```
b-fg reviewed Apr 26, 2023
Upgraded the solver for backend agnostic execution:

- New `@loop` macro, which integrates KernelAbstractions.jl (KA) to run multi-threaded on CPUs and GPUs. This replaces the `@simd` version of `@loop` as well as the previous multi-threading code. Each `@loop <expr> over <I in R>` is expanded into a `@kernel function` and then run on the backend of the first variable in `<expr>`.
- Functions such as `sum` or `LinearAlgebra:norm2` have been replaced in the code-base with lower-level functions, but unfortunately, users will need to take extra care when defining things like `AutoBody(sdf, map)` functions.
- Reverting `@loop` to use `@simd` restores a perfectly non-allocating `sim_step!`. We tried other tools like Polyester.jl, which had better multi-threading performance for small simulations, but large simulations are where we need the speed-up, so we chose KA.
- `@loop` is not fully optimized. For example, there is an execution overhead for each `@loop` call on GPUs. A few of the loops have been combined to help reduce this overhead, but many more would require major refactoring or modification of the `@loop` macro. Despite this, we benchmarked up to a 182x speed-up with GPU execution.
- The `Simulation` constructor arguments have changed. `dims` is now the internal field dimension `(L,2L)`, not `(L+2,2L+2)`, and `U` must now be an `NTuple`.
- The `Simulation` constructor also takes a new `mem=Array` argument which can be set to `CUDA.CuArray` or `AMDGPU.ROCArray` to set up simulations on GPUs (see the usage sketch after this list). The `Flow` and `Poisson` structs now use `AbstractArray`s for all fields to accommodate those array types.
- `sim_step!(remeasure=true)` is now the default, as that is the safer (but slower) option.
- `Poisson` now shares memory for the `L`, `x`, and `z` fields with `Flow` to reduce the memory footprint. The `z` field holds the RHS vector and is mutated by `solve!`.
- The `SOR!` and `GS!` smoothers are not thread-safe, and have been replaced with a Jacobi-preconditioned conjugate-gradient smoother held in the new routines `Jacobi!` and `pcg!`.
- `MultiLevelPoisson` now uses `maxlevels=4`. The optimal number of levels is likely to be simulation and backend dependent.
- `pcg!` requires a lot of inner products, which are somewhat slow. Switching to the data-driven approximate inverse smoother may be beneficial in the future.
- `apply_sdf!` has been replaced with `measure_sdf!`, which simply `@loop`s `body.sdf()`.
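As a rough usage sketch of the new constructor interface described above: the grid size, boundary values, and length scale are purely illustrative, and only the keywords mentioned in this PR are shown, so treat the exact call as an assumption rather than the full API.

```julia
using CUDA, WaterLily

# dims is now the interior size (L, 2L) rather than (L+2, 2L+2), U must be
# an NTuple, and mem selects the array backend (Array, CUDA.CuArray, or
# AMDGPU.ROCArray). All numeric values below are illustrative.
L = 64
sim = Simulation((L, 2L), (1.0, 0.0), L; mem = CUDA.CuArray)
sim_step!(sim, 1.0; remeasure = true)  # remeasure=true is now the default
```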
There have also been many changes to the code outside of `src` to support the upgrade: `mom_step!` as well as the 3D TGV and donut cases can be compared against previous commits, including pre-1.0 versions.

The only (intentional) modelling change was to add `correct_div!(σ)` to `Body.jl` to enable the deformable jellyfish example. This has nothing to do with the backend upgrade and should have been added to master and then merged in - but it wasn't.