WaterLily + AutoDiff workflow #129
Conversation
The goal of this PR is to enable auto-diff with WaterLily, à la issue #52. This PR is IN-PROGRESS! A minor issue is that certain parts of the code declare types which they should not; I've removed some of these in the first commit. The second, far more serious, issue is the field arrays within `Flow` and `Poisson`. These are pre-allocated, which is incompatible with ForwardDiff, which needs to be able to allocate Dual arrays for variables "downstream" of the derivative being taken. I've included an example "test.jl" file which fails with ForwardDiff both when the simulation is pre-allocated as normal and when it is pre-allocated with a Dual type. The last issue is KernelAbstractions. It's possible that this will not be an issue once the types are sorted out, but the error messages are so bad that it's impossible to debug, so I've changed `@loop` to just use `@simd` for the time being.
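To illustrate the pre-allocation problem with a standalone toy (not WaterLily code): ForwardDiff needs to store Dual numbers in every array the parameter touches, and a pre-allocated Float64 buffer cannot hold them.

```julia
using ForwardDiff

buffer = zeros(3)                          # pre-allocated Float64 storage, like the Flow/Poisson fields
f_inplace(θ) = (buffer .= θ .* (1:3); sum(buffer))
# ForwardDiff.derivative(f_inplace, 0.5)   # errors: a Dual cannot be stored in a Float64 array

f_outofplace(θ) = sum(θ .* (1:3))          # allocates storage matching the Dual's own type
ForwardDiff.derivative(f_outofplace, 0.5)  # 6.0
```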
Got the very hacky prototype test working. Checked it against a finite difference and it looks correct! This function uses non-in-place (out-of-place?) calls to `AutoBody` and `measure` to generate the (nasty compound) Dual type needed for the derivative. Then the Flow is generated using that type, and everything proceeds pretty much as normal from there. This works because the example takes the derivative with respect to a geometric parameter. If we were taking derivatives with respect to a flow parameter, I'm not sure how different the generated type would need to be, and we would need out-of-place function calls to find out (which isn't generally how WaterLily is set up).
I've cleaned up the `step_force` functions so I can compare like-to-like for the mutating and non-mutating versions, with or without using Duals. The two functions return identical values up to the 6th significant digit. I'm guessing this small error comes from something which is not quite zeroed properly in `mom_step!`, but I'm not sure. The Dual version of the non-mutating function is weirdly sensitive to how things are initialized: when the function is first called with Dual types, it throws the `Cannot determine ordering of Dual tags` error, but if you re-evaluate the function and then call it again, it works. I don't know why yet.
I can reproduce the error too. I am looking into it.
The derivative function automatically infers the Dual type, so we do not need to create it manually. The next step is to expand this derivative function so that we do not need to allocate a new Simulation struct for every time step.
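For reference, here is a minimal sketch of a derivative helper that infers the Dual type automatically. This is the general ForwardDiff tagging pattern, not necessarily the exact function used in this PR.

```julia
using ForwardDiff: Dual, Tag, partials

# Hedged sketch: build the tagged Dual type from the function and parameter,
# seed the derivative, push it through f, and extract the partial.
function derivative_of(f, θ::Real)
    T = typeof(Tag(f, typeof(θ)))    # tag ties the Dual type to this function
    y = f(Dual{T}(θ, one(θ)))        # seed dθ/dθ = 1
    partials(y, 1)                   # df/dθ
end

derivative_of(x -> x^2 + 3x, 2.0)    # 7.0
```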
I used a Ref to wrap theta such that it updates the Simulation automatically without needing to deconstruct it or anything. This actually vastly cleans up both the Finite Difference and AD examples. The workflow is really clean now and the results match.
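A minimal, self-contained sketch of that Ref trick (the names here are illustrative, not the exact example code): the mapping closure reads `θ[]` on every call, so mutating the Ref updates the geometry without rebuilding the Simulation.

```julia
using StaticArrays

L = 32
θ = Ref(π/36)                        # wrap the parameter in a Ref

# The closure captures the Ref, not its value, so it always sees the latest θ[].
body_map(x, t) = begin
    s, c = sincos(θ[])
    SA[c -s; s c] * (x - SA[L, L])
end

θ[] = π/18   # rotate the body; the existing Simulation picks this up on the next step
```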
Commented out the old `@loop` tests until those are (hopefully) added back in. Tests are passing.
Nice, works as expected!
Moved test.jl to examples/ to test the CUDA backend. Currently not working, as it complains about a non-bits type for `RefValue{Float64}`.
KA with the CPU backend is working. However, trying to use KA with `CuArray` is failing even if using …:

```
LoadError: MethodError: no method matching setindex!(::CUDA.CuRefValue{Float64}, ::Float64)
Closest candidates are:
  setindex!(::Ref, ::Any, ::CartesianIndex{0})
   @ Base multidimensional.jl:1933
```

I guess it means that we cannot directly modify the pointer value in GPU memory? Maybe @maleadt has an idea on how to circumvent this? KA on GPU with …

On the other hand, when using AD and without modifying the `CuRefValue`:

```julia
T = typeof(Tag(step_force!, Float64)) # make a tag
θAD = Ref(Dual{T}(π/36, 0.)) |> cudaconvert # wrap the Dual parameter in a CuRefValue
```

there seems to be a problem when executing the …
That … The second problem sounds like something we need to fix in the code base. Does that bit work on the CPU as is?
When running with a single thread (`julia -t 1 ...`), WaterLily will not use KA; instead it will run every kernel using a for loop with `@simd`. For more than 1 thread, it will automatically launch the KA kernels. Note that both non-KA and KA kernels are compiled during precompilation, and only at runtime is the specific kernel selected based on `Threads.nthreads()`.
Note that some tests were not passing when using Float32 in `loc`, since they were comparing Float64 to Float32. This can be changed to Float32 then, but the tests should be modified accordingly.
…or Sim::Float32 and equally Sim::Float64
The error in this test stems from the change in the `loc` function to use Float32 instead of Rational.
Interestingly, with the new generalization of `@loop`:

A)

```julia
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args
    return quote
        @simd for $I ∈ $R
            @fastmath @inbounds $ex
        end
    end |> esc
end
```

B)

```julia
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)        # get arguments and replace composites in `ex`
    setdiff!(sym,[I])    # don't want to pass I as an argument
    @gensym(kern, kern_) # generate unique kernel function names for serial and KA execution
    return quote
        function $kern($(rep.(sym)...),::Val{1})
            @simd for $I ∈ $R
                @fastmath @inbounds $ex
            end
        end
        @kernel function $kern_($(rep.(sym)...),@Const(I0)) # replace composite arguments
            $I = @index(Global,Cartesian)
            $I += I0
            @fastmath @inbounds $ex
        end
        function $kern($(rep.(sym)...),_)
            $kern_(get_backend($(sym[1])),64)($(sym...),$R[1]-oneunit($R[1]),ndrange=size($R))
        end
        $kern($(sym...),Val{Threads.nthreads()}()) # dispatch to SIMD for -t 1, or KA otherwise
    end |> esc
end
```

resulting in the following benchmarks of …
The new `@simd` version of `@loop` lets us aim for 0 allocations in the code. We're not there yet:

1. This version of `@loop` still allocates. Maybe the `grab!` function is to blame.
2. The `residual!` and `Linf` functions had allocation bugs and have been fixed.
3. The `perdir` tag causes a ton of small allocations.
   - a. Calling `BC!` at all isn't needed within the Poisson solver if not using periodic conditions, so we might want to rethink this approach.
   - b. The `[insideI]` inner product was allocating a bunch and is also only needed when periodic.
4. The time-variable BCs are causing a few small allocations. I think this is from the unknown type of `U`. We should be able to fix this by passing a function as a different argument (see the sketch below).
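A standalone sketch of the item-4 fix (illustrative names, not the actual WaterLily signatures): passing the boundary velocity as its own argument lets dispatch specialize on whether it is a plain tuple or a time-dependent function, avoiding the boxing and small allocations that an abstractly typed field causes.

```julia
bc_value(U::Function, t) = U(t)      # unsteady BC: evaluate at the current time
bc_value(U, t)           = U         # steady BC: return as-is, allocation-free

bc_value(t -> (cos(t), sin(t)), 0.5) # (0.8775..., 0.4794...)
bc_value((1.0, 0.0),            0.5) # (1.0, 0.0)
```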
The simple SIMD version of `@loop`:

```julia
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args
    return quote
        @simd for $I ∈ $R
            @fastmath @inbounds $ex
        end
    end |> esc
end
```

still shows 62 allocations on … EDIT: Nvm, it is 0 in …
Interestingly enough, the simple …

Current "functional" …

Since the functional …
The periodic conditions are still off, and this threw quite a few errors stemming from some old commits that assume we will call `BC!` within `solve!` many, many times. Hopefully, that will no longer be true soon, so I've cleaned those up. NOTE: The "WaterLily.jl" testset is still throwing a weird GPU error, but I couldn't track it down at the moment.
@b-fg, what is the test case you are using for timing in those comments above?
This one:

```julia
using WaterLily, StaticArrays, BenchmarkTools

function make_sim(θ;L=32,U=1,Re=100,mem=Array)
    function map(x,t)
        s,c = sincos(θ)
        SA[c -s; s c]*(x-SA[L,L])
    end
    function sdf(ξ,t)
        p = ξ-SA[0,clamp(ξ[1],-L/2,L/2)]
        √(p'*p)-2
    end
    Simulation((2L,2L),(U,0),L,ν=U*L/Re,body=AutoBody(sdf,map),T=Float32,mem=mem)
end

sim = make_sim(π/36; mem=Array)
sim_step!(sim)
@btime mom_step!($sim.flow, $sim.pois)
```
I can replicate that on my Windows PC at home. With … vs … I also checked the impact of array size using … So the amount of allocations does not grow with array size and is consistently a little faster. Certainly not an issue.
Reintegrated unsteady BCs without allocation by passing the dt arrays instead of the Flow struct and only taking the sum if it's actually needed.
Are tests passing locally for …?
GPU is still broken.
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #129      +/-   ##
==========================================
- Coverage   96.02%   94.21%    -1.82%
==========================================
  Files          12       12
  Lines         503      501       -2
==========================================
- Hits          483      472      -11
- Misses         20       29       +9
```

☔ View full report in Codecov by Sentry.
AD still works for my tests, but I should add some to the testset.
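A hedged sketch of what such a test might look like, reusing the `make_sim` from the thread above; the scalar functional, step count, and tolerance here are placeholders, not the actual test that was added.

```julia
using Test, ForwardDiff

@testset "ForwardDiff through the simulation" begin
    f(θ) = (sim = make_sim(θ); sim_step!(sim); sum(sim.flow.p))  # illustrative functional
    h  = 1e-3
    fd = (f(π/36 + h) - f(π/36 - h)) / 2h       # central finite-difference check
    @test ForwardDiff.derivative(f, π/36) ≈ fd rtol = 1e-2
end
```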
…HREADS=1). Added an allocations test in the pipeline. Conditional loading of the CUDA and AMDGPU backends on the pipeline to avoid huge error messages.
More testing shows that `typeof(loc)` actually wasn't the problem with GPUs; it was only the default time in `measure!`. All (non-broken) tests passing on CPU & GPU. The AD test still works (and with Float32), but is still not working on GPUs.
How can the code coverage be worse? I didn't change the number of lines or functions; I reduced the number of arguments in two functions!?
It's also complaining about src/util.jl#L105, which you did not touch at all... Nvm, I think we are good.
I changed the default `perdir=()` so that I could test for an empty tuple in dispatch and loop `for i in perdir` otherwise. This is in a new `perBC!` function which is called within Poisson and after `solver!` is called in Flow. Unskipped the periodic tests and added a second allocation test for periodic flows. Passing and non-allocating.
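A minimal, self-contained sketch of that empty-tuple dispatch (the ghost-cell copy here is illustrative; it is not WaterLily's actual `perBC!` implementation):

```julia
# perdir = (): no periodic directions, so dispatch to a no-op.
perBC!(a::AbstractArray, ::Tuple{}) = a

# Otherwise loop only over the periodic directions, copying ghost cells (2D example).
function perBC!(a::AbstractMatrix, perdir)
    for i in perdir
        if i == 1
            a[1, :] .= a[end-1, :];  a[end, :] .= a[2, :]
        else
            a[:, 1] .= a[:, end-1];  a[:, end] .= a[:, 2]
        end
    end
    a
end

A = rand(6, 6)
perBC!(A, ())      # returns A untouched, no allocations
perBC!(A, (1, 2))  # applies periodic ghost-cell copies in both directions
```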
The scalar `BC!` function is never used in the code! I got rid of it and moved the `perBC!` call into `solver!` so it returns the "correct" solution.
Really tandem plates... But I think we can be forgiven.
Doing some review, shouldn't the …
Went over it once again, all looks good except for the 2 minor details above!
@michel2323 has been working on adding Enzyme support to KernelAbstractions, which together with Checkpointing.jl adds reverse-mode capabilities. We just gave a tutorial on using Enzyme for codes like this: https://dj4earth.github.io/MPE24, and in particular … We are still working through some minutiae for full GPU support, but we are getting close.
Thanks @vchuravy! That's very useful indeed. Currently ForwardDiff works as expected on CPUs, but yields a fairly cryptic error message on GPUs. Eventually I think that we will move to Enzyme, probably when the GPU support under KA is ready. And thanks for working on that! It will truly bring this solver to a whole new feature level :)