
WaterLily + AutoDiff workflow #129

Merged · weymouth merged 27 commits into master on Jun 19, 2024

Conversation

@weymouth (Collaborator) commented Jun 10, 2024

The goal of this PR is to enable auto-diff with WaterLily, à la issue #52. This PR is IN-PROGRESS!

A minor issue is that certain parts of the code declare types where they should not. I've removed some of these in the first commit.

The second and far more serious issue is the field arrays within `Flow` and `Poisson`. These are pre-allocated, which is incompatible with ForwardDiff: it needs to be able to allocate `Dual` arrays for variables "downstream" of the derivative being taken. I've included an example `test.jl` file which fails with ForwardDiff both when the simulation is pre-allocated as normal and when pre-allocated with a `Dual` type.

The last issue is KernelAbstractions. It's possible that this will not be an issue once the types are sorted out, but the error messages are so bad that it's impossible to debug. I've changed `@loop` to just use `@simd` for the time being.

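For background, here is a minimal sketch (not from this PR) of the pre-allocation clash: ForwardDiff can only differentiate through storage whose eltype can hold its `Dual` numbers.

```julia
using ForwardDiff

# a toy "solver" that writes into a pre-allocated buffer, then reduces it
f(θ, buf) = (buf .= θ .* (1:2); sum(abs2, buf))

buf = zeros(2)                                  # pre-allocated Float64 storage
# ForwardDiff.derivative(θ -> f(θ, buf), 0.5)   # fails: Duals can't be stored in Float64s
ForwardDiff.derivative(0.5) do θ
    f(θ, similar(buf, typeof(θ)))               # works: buffer eltype follows θ
end                                             # d(5θ²)/dθ at 0.5 == 5.0
```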
@weymouth weymouth added the enhancement New feature or request label Jun 10, 2024
@weymouth weymouth requested a review from b-fg June 10, 2024 14:46
Got the very hacky prototype test working. Checked it against a finite difference and it looks correct!

This function uses non-mutating (out-of-place) `AutoBody` and `measure` calls to generate the (nasty compound) `Dual` type needed for the derivative. The `Flow` is then generated using that type, and everything proceeds pretty much as normal from there.

This works because the example takes the derivative with respect to a geometric parameter. If we were taking derivatives with respect to a flow parameter, I'm not sure how different the generated type would need to be, and we would need out-of-place function calls to find out (which isn't generally how WaterLily is set up).
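A hedged sketch of that type-generation step, mirroring the `Tag` usage quoted later in this thread (`step_force!` is the PR's test function; everything else here is illustrative):

```julia
using ForwardDiff: Dual, Tag

T  = typeof(Tag(step_force!, Float64))  # tag ties the Duals to this derivative
θD = Dual{T}(π/36, 1.0)                 # parameter value plus unit seed ∂θ/∂θ = 1
D  = typeof(θD)                         # the compound Dual type; the Flow can then
                                        # be generated with eltype D
```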
I've cleaned up the `step_force` functions so I can compare like-to-like for the mutating and non-mutating versions, with or without using Duals.

The two functions return identical values up to 6 significant digits. I'm guessing this small error comes from something not quite zeroed properly in `mom_step`, but I'm not sure.

The Dual version of the non-mutating function is weirdly sensitive to how things are initialized. When the function is first called with Dual types, it throws a `Cannot determine ordering of Dual tags` error, but if you re-evaluate the function and then call it again, it works. I don't know why yet.
@b-fg (Member) commented Jun 12, 2024

I can reproduce the error too. I am looking into it.

b-fg and others added 3 commits June 12, 2024 14:57
The derivative function automatically infers the `Dual` type, so we do not need to create it manually.
The next step is to expand this derivative function so that we do not need to allocate a new `Simulation` struct for every time step.
I used a `Ref` to wrap `θ` so that it updates the `Simulation` automatically, without needing to deconstruct it or anything. This actually vastly cleans up both the finite-difference and AD examples. The workflow is really clean now and the results match.
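A hedged sketch of that `Ref`-based workflow (the constructor and step function names are illustrative, not the PR's exact API):

```julia
using ForwardDiff

θ = Ref{Real}(π/36)              # loosely typed so it can hold a Float64 or a Dual
sim = make_sim(θ)                # hypothetical constructor whose `map` reads θ[]
force(t) = (θ[] = t; step_force!(sim))   # update the parameter in place, then step

dF_AD = ForwardDiff.derivative(force, π/36)       # AD
h = 1e-4                                          # central finite difference
dF_FD = (force(π/36 + h) - force(π/36 - h)) / 2h  # should match dF_AD closely
```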
Commented out the old `@loop` tests until those are (hopefully) added back in. Tests are passing.
@b-fg (Member) commented Jun 12, 2024

Nice, works as expected!

Moved test.jl to examples/ to test the CUDA backend. Currently not working: it complains about a non-bits type for `RefValue{Float64}`.
@b-fg (Member) commented Jun 12, 2024

KA with the CPU backend is working. However, trying to use KA with `CuArray` is failing even when using `θ = Ref(π/36) |> cudaconvert`, which returns a `CuRefValue`. Without AD, it fails when trying to modify the `CuRefValue`, i.e. `θ[] = 0.5`:

```
LoadError: MethodError: no method matching setindex!(::CUDA.CuRefValue{Float64}, ::Float64)

Closest candidates are:
  setindex!(::Ref, ::Any, ::CartesianIndex{0})
   @ Base multidimensional.jl:1933
```

I guess this means that we cannot directly modify the pointer value in GPU memory? Maybe @maleadt has an idea on how to circumvent this? KA on GPU with `θ = Ref(π/36) |> cudaconvert` (no AD) works well though.

On the other hand, when using AD and without modifying the `CuRefValue`,

```julia
T = typeof(Tag(step_force!, Float64))        # make a tag
θAD = Ref(Dual{T}(π/36, 0.)) |> cudaconvert  # wrap the Dual parameter in a CuRefValue
```

there seems to be a problem when executing the `measure` kernel of `Body`, on `BC!`: https://github.com/weymouth/WaterLily.jl/blob/d9161b7dcc262e4744ac88649b8d21b7310f354d/src/Body.jl#L46
Maybe related to the `BC!` problem on GPU you mentioned before, @weymouth?

@weymouth (Collaborator, Author)

That `Ref` problem on GPUs doesn't have anything to do with WaterLily in particular. We can play around with really small MWEs and see what works. Maybe post something on Slack.

The second problem sounds like something we need to fix in the code base. Does that bit work on the CPU as-is?
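A small MWE along those lines (assuming CUDA.jl and no WaterLily code; it reproduces the quoted error) might be:

```julia
using CUDA

θ = Ref(π/36) |> cudaconvert   # CUDA.CuRefValue{Float64}
θ[] = 0.5                      # MethodError: no setindex! for CuRefValue{Float64}
```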

@weymouth weymouth changed the title Create WaterLily structs which are AutoDiff compatible WaterLily + AutoDiff workflow Jun 13, 2024
b-fg added 4 commits June 14, 2024 18:50
When running with a single thread (`julia -t 1 ...`), WaterLily will not use KA; instead it will run every kernel using a for loop with `@simd`. For more than one thread, it will automatically launch the KA kernels.

Note that both the non-KA and KA kernels are compiled during precompilation; only at runtime is the specific kernel selected, based on `Threads.nthreads()`.
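The runtime-dispatch idea in miniature (illustrative, not the actual `@loop` code):

```julia
kern(args, ::Val{1}) = "serial @simd path"        # one thread → plain loop
kern(args, ::Val)    = "KernelAbstractions path"  # otherwise → KA kernel

kern(nothing, Val(Threads.nthreads()))  # both methods compile; the runtime value picks one
```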
Note that some tests were not passing when using Float32 in `loc`, since they were comparing Float64 to Float32. This can be changed to Float32, but the tests should then be modified accordingly.
The error in this test stems from changing the `loc` function to use Float32 instead of Rational.
@b-fg (Member) commented Jun 14, 2024

Interestingly, with the new generalization of `@loop` to use `@simd` or KA based on `Threads.nthreads()`, I get a speedup when using a single thread compared to the bare `@simd` implementation, even though there are many more allocations.

A)

```julia
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args
    return quote
        @simd for $I ∈ $R
            @fastmath @inbounds $ex
        end
    end |> esc
end
```

B)

```julia
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym(kern, kern_) # generate unique kernel function names for serial and KA execution
    return quote
        function $kern($(rep.(sym)...),::Val{1})
            @simd for $I ∈ $R
                @fastmath @inbounds $ex
            end
        end
        @kernel function $kern_($(rep.(sym)...),@Const(I0)) # replace composite arguments
            $I = @index(Global,Cartesian)
            $I += I0
            @fastmath @inbounds $ex
        end
        function $kern($(rep.(sym)...),_)
            $kern_(get_backend($(sym[1])),64)($(sym...),$R[1]-oneunit($R[1]),ndrange=size($R))
        end
        $kern($(sym...),Val{Threads.nthreads()}()) # dispatch to SIMD for -t 1, or KA otherwise
    end |> esc
end
```

resulting in the following benchmarks of `@btime sim_step!($sim)`:

```
A) 776.913 μs (180 allocations: 250.23 KiB)
B) 656.911 μs (1376 allocations: 269.77 KiB)
```

b-fg and others added 3 commits June 14, 2024 19:43
The new `@simd` version of `@loop` lets us aim for 0 allocations in the code. We're not there yet:
1. This version of `@loop` still allocates. Maybe the `grab!` function is to blame.
2. The `residual!` and `Linf` functions had allocation bugs, which have been fixed.
3. The `perdir` tag causes a ton of small allocations.
   - a. Calling `BC!` at all isn't needed within the Poisson solver if not using periodic BCs, so we might want to rethink this approach.
   - b. The `[insideI]` inner product was allocating a bunch and is also only needed when periodic.
4. The time-variable BCs cause a few small allocations. I think this comes from the unknown type of `U`. We should be able to fix this by passing a function as a separate argument (see the sketch below).
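An illustrative sketch of item 4 (hypothetical names, not the PR's code): passing the boundary velocity as a function argument gives it a concrete closure type, so the compiler can specialize instead of boxing a loosely typed field.

```julia
# Ufun's concrete closure type F is known at compile time → no boxing/allocation
apply_U!(u, Ufun::F, t) where {F} = (u .= Ufun(t); nothing)

u = zeros(4)
apply_U!(u, t -> sin(t), 0.5)   # each closure gets its own specialized method
```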
@b-fg (Member) commented Jun 16, 2024

The simple SIMD version of `@loop`

```julia
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args
    return quote
        @simd for $I ∈ $R
            @fastmath @inbounds $ex
        end
    end |> esc
end
```

still shows 62 allocations on `sim_step!` for me. How did you write it to get 0 allocations, @weymouth? Or was that at `mom_step!`?

EDIT: Nvm, it is 0 in `mom_step!` indeed.

@b-fg (Member) commented Jun 16, 2024

Interestingly enough, the simple `@loop` version produces 0 allocations, but is slower.

Simple `@loop`:

```
576.174 μs (0 allocations: 0 bytes)
```

Current "functional" `@loop`:

```
434.004 μs (486 allocations: 7.84 KiB)
```

Since the functional `@loop` is faster (?) and does not incur many allocations (these are coming from `grab!` and `rep`), I think the current form is acceptable.

The periodic conditions are still off, and this threw quite a few errors stemming from some old commits that assume we will call `BC!` within `solve!` many, many times. Hopefully that will no longer be true soon, so I've cleaned those up.

NOTE: The "WaterLily.jl" testset is still throwing a weird GPU error, but I couldn't track it down at the moment.
@weymouth (Collaborator, Author)

@b-fg, what is the test case you are using for timing in those comments above?

@b-fg (Member) commented Jun 17, 2024

This one:

```julia
using WaterLily, StaticArrays, BenchmarkTools

function make_sim(θ;L=32,U=1,Re=100,mem=Array)
    function map(x,t)
        s,c = sincos(θ)
        SA[c -s; s c]*(x-SA[L,L])
    end
    function sdf(ξ,t)
        p = ξ-SA[0,clamp(ξ[1],-L/2,L/2)]
        √(p'*p)-2
    end
    Simulation((2L,2L),(U,0),L,ν=U*L/Re,body=AutoBody(sdf,map),T=Float32,mem=mem)
end

sim = make_sim(π/36; mem=Array)
sim_step!(sim)
@btime mom_step!($sim.flow, $sim.pois)
```

@weymouth (Collaborator, Author) commented Jun 17, 2024

I can replicate that on my Windows PC at home. With `sim = make_sim(0f0)` I'm getting

```
796.000 μs (0 allocations: 0 bytes)
```

vs

```
626.200 μs (413 allocations: 6.70 KiB)
```

I also checked the impact of array size using `sim = make_sim(0f0,L=256)`, which gives

```
31.799 ms (0 allocations: 0 bytes)
```

vs

```
27.886 ms (264 allocations: 4.38 KiB)
```

So the number of allocations does not grow with array size, and the functional version is consistently a little faster. Certainly not an issue.

Reintegrated unsteady BCs without allocation by passing the `dt` arrays instead of the `Flow` struct, and only taking the sum if it's actually needed.
@b-fg (Member) commented Jun 17, 2024

Are tests passing locally for CuArrays too? What else is needed for this PR?

GPU is still broken.
codecov bot commented Jun 18, 2024

Codecov Report

Attention: Patch coverage is 95.91837% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.21%. Comparing base (f6c5b34) to head (7750fb6). Report is 2 commits behind head on master.

Current head 7750fb6 differs from pull request most recent head 9bac6a0. Please upload reports for the commit 9bac6a0 to get more accurate results.

| Files | Patch % | Lines |
| --- | --- | --- |
| src/Poisson.jl | 83.33% | 1 Missing ⚠️ |
| src/util.jl | 93.75% | 1 Missing ⚠️ |

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #129      +/-   ##
==========================================
- Coverage   96.02%   94.21%   -1.82%     
==========================================
  Files          12       12              
  Lines         503      501       -2     
==========================================
- Hits          483      472      -11     
- Misses         20       29       +9     
```

weymouth and others added 3 commits June 18, 2024 19:08
AD still works for my tests, but I should add some to the testset.
…HREADS=1).

Added an allocations test to the pipeline.
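Such an allocation check can look like this sketch, reusing the `make_sim` case from earlier in the thread (the actual pipeline test may differ):

```julia
using Test

sim = make_sim(Float32(π/36))
sim_step!(sim)            # warm-up step so compilation doesn't count as allocation
@test (@allocated mom_step!(sim.flow, sim.pois)) == 0
```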
Conditional loading of the CUDA and AMDGPU backends on the pipeline, to avoid a huge error message.
More testing shows that `typeof(loc)` actually wasn't the problem with GPUs; it was only the default time in `measure!`. All (non-broken) tests pass on CPU & GPU. The AD test still works (and with Float32), but still not on GPUs.
@weymouth (Collaborator, Author)

How can the code coverage be worse? I didn't change the number of lines or functions; I only reduced the number of arguments in two functions!

@b-fg (Member) commented Jun 19, 2024

It's also complaining about src/util.jl#L105, which you did not touch at all... Nvm, I think we are good.

I changed the default `perdir=()` so that I could test for an empty tuple in dispatch, and loop `for i in perdir` otherwise. This is in a new `perBC!` function which is called within `Poisson` and after `solver!` is called in `Flow`.
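A hedged sketch of that empty-tuple dispatch (the `perBC!` name is the PR's; the bodies here are illustrative):

```julia
perBC!(a, ::Tuple{}) = nothing      # default perdir=() → no periodic work at all
function perBC!(a, perdir)
    for i in perdir                 # only loop over the directions that are periodic
        # apply the periodic halo copy along direction i (omitted)
    end
end
```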

Unskipped the periodic tests and added a second allocation test for periodic flows. Passing and non-allocating.
The scalar `BC!` function is never used in the code! I got rid of it and moved the `perBC!` call into `solver!` so it returns the "correct" solution.
Really tandem plates... But I think we can be forgiven.
@b-fg b-fg marked this pull request as ready for review June 19, 2024 14:36
@b-fg (Member) commented Jun 19, 2024

Doing some review: shouldn't the `T(0)` be `zero(T)` in
https://github.com/weymouth/WaterLily.jl/blob/b97d700dea197c2683f70d68753edaa591329bbf/src/WaterLily.jl#L71?
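(For context, general Julia behavior rather than anything in this file: `zero(T)` is the generic spelling and is defined even for types like `Dual`.)

```julia
using ForwardDiff
zero(Float32)                              # 0.0f0
zero(ForwardDiff.Dual{Nothing,Float64,2})  # Dual{Nothing}(0.0,0.0,0.0)
```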

@b-fg (Member) commented Jun 19, 2024

Went over it once again, all looks good except for the 2 minor details above!

@weymouth weymouth merged commit 1525197 into master Jun 19, 2024
43 checks passed
weymouth referenced this pull request Jun 19, 2024
@b-fg b-fg mentioned this pull request Jun 19, 2024
@vchuravy
@michel2323 has been working on adding Enzyme support to KernelAbstractions, which together with Checkpointing.jl adds reverse-mode capabilities.

We just gave a tutorial on using Enzyme for codes like this: https://dj4earth.github.io/MPE24, and in particular
https://dj4earth.github.io/MPE24/Burgers_tutorial/burgers_tutorial.html

We are still working through some minutiae for full GPU support, but we are getting close.

@b-fg (Member) commented Jun 19, 2024

Thanks @vchuravy! That's very useful indeed. Currently ForwardDiff works as expected on CPUs, but yields a fairly cryptic error message on GPUs. Eventually I think that we will move to Enzyme, probably when the GPU support under KA is ready. And thanks for working on that! It will truly bring this solver to a whole new feature level :)
