
WaterLily + AutoDiff workflow #129

Merged · weymouth merged 27 commits into master on Jun 19, 2024

Conversation

@weymouth (Collaborator) commented Jun 10, 2024

The goal of this PR is to enable auto-diff with WaterLily, à la issue #52. This PR is IN-PROGRESS!

A minor issue is that certain parts of the code declare types where they should not. I've removed some of these in the first commit.

The second and far more serious issue is the field arrays within `Flow` and `Poisson`. These are pre-allocated, which is incompatible with ForwardDiff: it needs to be able to allocate `Dual` arrays for variables "downstream" of the derivative being taken. I've included an example `test.jl` file which fails with ForwardDiff both when the simulation is pre-allocated as normal and when pre-allocated with a `Dual` type.

The last issue is KernelAbstractions. It's possible that this will not be an issue once the types are sorted out, but the error messages are so bad that it's impossible to debug. I've changed `@loop` to just use `@simd` for the time being.

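For background, here is a minimal sketch (not from this PR) of the pre-allocation clash: ForwardDiff can only differentiate through storage whose eltype can hold its `Dual` numbers.

```julia
using ForwardDiff

# a toy "solver" that writes into a pre-allocated buffer, then reduces it
f(θ, buf) = (buf .= θ .* (1:2); sum(abs2, buf))

buf = zeros(2)                                  # pre-allocated Float64 storage
# ForwardDiff.derivative(θ -> f(θ, buf), 0.5)   # fails: Duals can't be stored in Float64s
ForwardDiff.derivative(0.5) do θ
    f(θ, similar(buf, typeof(θ)))               # works: buffer eltype follows θ
end                                             # d(5θ²)/dθ at 0.5 == 5.0
```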
@weymouth weymouth added the enhancement New feature or request label Jun 10, 2024
@weymouth weymouth requested a review from b-fg June 10, 2024 14:46
Got the very hacky prototype test working. Checked it against a finite difference and it looks correct!

This function uses non-mutating (out-of-place) `AutoBody` and `measure` calls to generate the (nasty compound) `Dual` type needed for the derivative. The `Flow` is then generated using that type, and everything proceeds pretty much as normal from there.

This works because the example takes the derivative with respect to a geometric parameter. If we were taking derivatives with respect to a flow parameter, I'm not sure how different the generated type would need to be, and we would need out-of-place function calls to find out (which isn't generally how WaterLily is set up).
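A hedged sketch of that type-generation step, mirroring the `Tag` usage quoted later in this thread (`step_force!` is the PR's test function; everything else here is illustrative):

```julia
using ForwardDiff: Dual, Tag

T  = typeof(Tag(step_force!, Float64))  # tag ties the Duals to this derivative
θD = Dual{T}(π/36, 1.0)                 # parameter value plus unit seed ∂θ/∂θ = 1
D  = typeof(θD)                         # the compound Dual type; the Flow can then
                                        # be generated with eltype D
```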
I've cleaned up the `step_force` functions so I can compare like-to-like for the mutating and non-mutating versions, with or without using Duals.

The two functions return identical values up to 6 significant digits. I'm guessing this small error comes from something not quite zeroed properly in `mom_step`, but I'm not sure.

The Dual version of the non-mutating function is weirdly sensitive to how things are initialized. When the function is first called with Dual types, it throws a `Cannot determine ordering of Dual tags` error, but if you re-evaluate the function and then call it again, it works. I don't know why yet.
@b-fg (Member) commented Jun 12, 2024

I can reproduce the error too. I am looking into it.

b-fg and others added 3 commits June 12, 2024 14:57
The derivative function automatically infers the `Dual` type, so we do not need to create it manually.
The next step is to expand this derivative function so that we do not need to allocate a new `Simulation` struct for every time step.
I used a `Ref` to wrap `θ` so that it updates the `Simulation` automatically, without needing to deconstruct it or anything. This actually vastly cleans up both the finite-difference and AD examples. The workflow is really clean now and the results match.
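A hedged sketch of that `Ref`-based workflow (the constructor and step function names are illustrative, not the PR's exact API):

```julia
using ForwardDiff

θ = Ref{Real}(π/36)              # loosely typed so it can hold a Float64 or a Dual
sim = make_sim(θ)                # hypothetical constructor whose `map` reads θ[]
force(t) = (θ[] = t; step_force!(sim))   # update the parameter in place, then step

dF_AD = ForwardDiff.derivative(force, π/36)       # AD
h = 1e-4                                          # central finite difference
dF_FD = (force(π/36 + h) - force(π/36 - h)) / 2h  # should match dF_AD closely
```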
Commented out the old `@loop` tests until those are (hopefully) added back in. Tests are passing.
@b-fg (Member) commented Jun 12, 2024

Nice, works as expected!

Moved test.jl to examples/ to test the CUDA backend. Currently not working: it complains about a non-bits type for `RefValue{Float64}`.
@b-fg (Member) commented Jun 12, 2024

KA with the CPU backend is working. However, trying to use KA with `CuArray` is failing even when using `θ = Ref(π/36) |> cudaconvert`, which returns a `CuRefValue`. Without AD, it fails when trying to modify the `CuRefValue`, i.e. `θ[] = 0.5`:

```
LoadError: MethodError: no method matching setindex!(::CUDA.CuRefValue{Float64}, ::Float64)

Closest candidates are:
  setindex!(::Ref, ::Any, ::CartesianIndex{0})
   @ Base multidimensional.jl:1933
```

I guess this means that we cannot directly modify the pointer value in GPU memory? Maybe @maleadt has an idea on how to circumvent this? KA on GPU with `θ = Ref(π/36) |> cudaconvert` (no AD) works well though.

On the other hand, when using AD and without modifying the `CuRefValue`,

```julia
T = typeof(Tag(step_force!, Float64))        # make a tag
θAD = Ref(Dual{T}(π/36, 0.)) |> cudaconvert  # wrap the Dual parameter in a CuRefValue
```

there seems to be a problem when executing the `measure` kernel of `Body`, on `BC!`: https://github.com/weymouth/WaterLily.jl/blob/d9161b7dcc262e4744ac88649b8d21b7310f354d/src/Body.jl#L46
Maybe related to the `BC!` problem on GPU you mentioned before, @weymouth?

@weymouth (Collaborator, Author)

That `Ref` problem on GPUs doesn't have anything to do with WaterLily in particular. We can play around with really small MWEs and see what works. Maybe post something on Slack.

The second problem sounds like something we need to fix in the code base. Does that bit work on the CPU as-is?
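A small MWE along those lines (assuming CUDA.jl and no WaterLily code; it reproduces the quoted error) might be:

```julia
using CUDA

θ = Ref(π/36) |> cudaconvert   # CUDA.CuRefValue{Float64}
θ[] = 0.5                      # MethodError: no setindex! for CuRefValue{Float64}
```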

@weymouth weymouth changed the title Create WaterLily structs which are AutoDiff compatible WaterLily + AutoDiff workflow Jun 13, 2024
b-fg added 4 commits June 14, 2024 18:50
When running with a single thread (`julia -t 1 ...`), WaterLily will not use KA; instead it will run every kernel using a for loop with `@simd`. For more than one thread, it will automatically launch the KA kernels.

Note that both the non-KA and KA kernels are compiled during precompilation; only at runtime is the specific kernel selected, based on `Threads.nthreads()`.
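The runtime-dispatch idea in miniature (illustrative, not the actual `@loop` code):

```julia
kern(args, ::Val{1}) = "serial @simd path"        # one thread → plain loop
kern(args, ::Val)    = "KernelAbstractions path"  # otherwise → KA kernel

kern(nothing, Val(Threads.nthreads()))  # both methods compile; the runtime value picks one
```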
Note that some tests were not passing when using Float32 in `loc`, since they were comparing Float64 to Float32. This can be changed to Float32, but the tests should then be modified accordingly.
The error in this test stems from changing the `loc` function to use Float32 instead of Rational.
@b-fg (Member) commented Jun 14, 2024

Interestingly, with the new generalization of `@loop` to use `@simd` or KA based on `Threads.nthreads()`, I get a speedup when using a single thread compared to the bare `@simd` implementation, even though there are many more allocations.

A)

```julia
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args
    return quote
        @simd for $I ∈ $R
            @fastmath @inbounds $ex
        end
    end |> esc
end
```

B)

```julia
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym(kern, kern_) # generate unique kernel function names for serial and KA execution
    return quote
        function $kern($(rep.(sym)...),::Val{1})
            @simd for $I ∈ $R
                @fastmath @inbounds $ex
            end
        end
        @kernel function $kern_($(rep.(sym)...),@Const(I0)) # replace composite arguments
            $I = @index(Global,Cartesian)
            $I += I0
            @fastmath @inbounds $ex
        end
        function $kern($(rep.(sym)...),_)
            $kern_(get_backend($(sym[1])),64)($(sym...),$R[1]-oneunit($R[1]),ndrange=size($R))
        end
        $kern($(sym...),Val{Threads.nthreads()}()) # dispatch to SIMD for -t 1, or KA otherwise
    end |> esc
end
```

resulting in the following benchmarks of `@btime sim_step!($sim)`:

```
A) 776.913 μs (180 allocations: 250.23 KiB)
B) 656.911 μs (1376 allocations: 269.77 KiB)
```

b-fg and others added 3 commits June 14, 2024 19:43
The new `@simd` version of `@loop` lets us aim for 0 allocations in the code. We're not there yet:
1. This version of `@loop` still allocates. Maybe the `grab!` function is to blame.
2. The `residual!` and `Linf` functions had allocation bugs, which have been fixed.
3. The `perdir` tag causes a ton of small allocations.
   - a. Calling `BC!` at all isn't needed within the Poisson solver if not using periodic BCs, so we might want to rethink this approach.
   - b. The `[insideI]` inner product was allocating a bunch and is also only needed when periodic.
4. The time-variable BCs cause a few small allocations. I think this comes from the unknown type of `U`. We should be able to fix this by passing a function as a separate argument (see the sketch below).
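An illustrative sketch of item 4 (hypothetical names, not the PR's code): passing the boundary velocity as a function argument gives it a concrete closure type, so the compiler can specialize instead of boxing a loosely typed field.

```julia
# Ufun's concrete closure type F is known at compile time → no boxing/allocation
apply_U!(u, Ufun::F, t) where {F} = (u .= Ufun(t); nothing)

u = zeros(4)
apply_U!(u, t -> sin(t), 0.5)   # each closure gets its own specialized method
```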
@b-fg (Member) commented Jun 16, 2024

The simple SIMD version of `@loop`

```julia
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args
    return quote
        @simd for $I ∈ $R
            @fastmath @inbounds $ex
        end
    end |> esc
end
```

still shows 62 allocations on `sim_step!` for me. How did you write it to get 0 allocations, @weymouth? Or was that at `mom_step!`?

EDIT: Nvm, it is 0 in `mom_step!` indeed.

@b-fg (Member) commented Jun 16, 2024

Interestingly enough, the simple `@loop` version produces 0 allocations, but is slower.

Simple `@loop`:

```
576.174 μs (0 allocations: 0 bytes)
```

Current "functional" `@loop`:

```
434.004 μs (486 allocations: 7.84 KiB)
```

Since the functional `@loop` is faster (?) and does not incur many allocations (these are coming from `grab!` and `rep`), I think the current form is acceptable.

The periodic conditions are still off, and this threw quite a few errors stemming from some old commits that assume we will call `BC!` within `solve!` many, many times. Hopefully that will no longer be true soon, so I've cleaned those up.

NOTE: The "WaterLily.jl" testset is still throwing a weird GPU error, but I couldn't track it down at the moment.
@weymouth (Collaborator, Author)

@b-fg, what is the test case you are using for timing in those comments above?

@b-fg (Member) commented Jun 17, 2024

This one:

```julia
using WaterLily, StaticArrays, BenchmarkTools

function make_sim(θ;L=32,U=1,Re=100,mem=Array)
    function map(x,t)
        s,c = sincos(θ)
        SA[c -s; s c]*(x-SA[L,L])
    end
    function sdf(ξ,t)
        p = ξ-SA[0,clamp(ξ[1],-L/2,L/2)]
        √(p'*p)-2
    end
    Simulation((2L,2L),(U,0),L,ν=U*L/Re,body=AutoBody(sdf,map),T=Float32,mem=mem)
end

sim = make_sim(π/36; mem=Array)
sim_step!(sim)
@btime mom_step!($sim.flow, $sim.pois)
```

@weymouth (Collaborator, Author) commented Jun 17, 2024

I can replicate that on my Windows PC at home. With `sim = make_sim(0f0)` I'm getting

```
796.000 μs (0 allocations: 0 bytes)
```

vs

```
626.200 μs (413 allocations: 6.70 KiB)
```

I also checked the impact of array size using `sim = make_sim(0f0,L=256)`, which gives

```
31.799 ms (0 allocations: 0 bytes)
```

vs

```
27.886 ms (264 allocations: 4.38 KiB)
```

So the number of allocations does not grow with array size, and the functional version is consistently a little faster. Certainly not an issue.

Reintegrated unsteady BCs without allocation by passing the `dt` arrays instead of the `Flow` struct, and only taking the sum if it's actually needed.
@b-fg (Member) commented Jun 17, 2024

Are tests passing locally for CuArrays too? What else is needed for this PR?

GPU is still broken.
codecov bot commented Jun 18, 2024

Codecov Report

Attention: Patch coverage is 95.91837% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.21%. Comparing base (f6c5b34) to head (7750fb6). Report is 2 commits behind head on master.

Current head 7750fb6 differs from pull request most recent head 9bac6a0. Please upload reports for the commit 9bac6a0 to get more accurate results.

| Files | Patch % | Lines |
| --- | --- | --- |
| src/Poisson.jl | 83.33% | 1 Missing ⚠️ |
| src/util.jl | 93.75% | 1 Missing ⚠️ |

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #129      +/-   ##
==========================================
- Coverage   96.02%   94.21%   -1.82%     
==========================================
  Files          12       12              
  Lines         503      501       -2     
==========================================
- Hits          483      472      -11     
- Misses         20       29       +9     
```

weymouth and others added 3 commits June 18, 2024 19:08
AD still works for my tests, but I should add some to the testset.
…HREADS=1).

Added an allocations test to the pipeline.
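Such an allocation check can look like this sketch, reusing the `make_sim` case from earlier in the thread (the actual pipeline test may differ):

```julia
using Test

sim = make_sim(Float32(π/36))
sim_step!(sim)            # warm-up step so compilation doesn't count as allocation
@test (@allocated mom_step!(sim.flow, sim.pois)) == 0
```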
Conditional loading of the CUDA and AMDGPU backends on the pipeline, to avoid a huge error message.
More testing shows that `typeof(loc)` actually wasn't the problem with GPUs; it was only the default time in `measure!`. All (non-broken) tests pass on CPU & GPU. The AD test still works (and with Float32), but still not on GPUs.
@weymouth (Collaborator, Author)

How can the code coverage be worse? I didn't change the number of lines or functions; I only reduced the number of arguments in two functions!

@b-fg (Member) commented Jun 19, 2024

It's also complaining about src/util.jl#L105, which you did not touch at all... Nvm, I think we are good.

I changed the default `perdir=()` so that I could test for an empty tuple in dispatch, and loop `for i in perdir` otherwise. This is in a new `perBC!` function which is called within `Poisson` and after `solver!` is called in `Flow`.
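A hedged sketch of that empty-tuple dispatch (the `perBC!` name is the PR's; the bodies here are illustrative):

```julia
perBC!(a, ::Tuple{}) = nothing      # default perdir=() → no periodic work at all
function perBC!(a, perdir)
    for i in perdir                 # only loop over the directions that are periodic
        # apply the periodic halo copy along direction i (omitted)
    end
end
```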

Unskipped the periodic tests and added a second allocation test for periodic flows. Passing and non-allocating.
The scalar `BC!` function is never used in the code! I got rid of it and moved the `perBC!` call into `solver!` so it returns the "correct" solution.
Really tandem plates... But I think we can be forgiven.
@b-fg b-fg marked this pull request as ready for review June 19, 2024 14:36
@b-fg (Member) commented Jun 19, 2024

Doing some review: shouldn't the `T(0)` be `zero(T)` in
https://github.com/weymouth/WaterLily.jl/blob/b97d700dea197c2683f70d68753edaa591329bbf/src/WaterLily.jl#L71?
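(For context, general Julia behavior rather than anything in this file: `zero(T)` is the generic spelling and is defined even for types like `Dual`.)

```julia
using ForwardDiff
zero(Float32)                              # 0.0f0
zero(ForwardDiff.Dual{Nothing,Float64,2})  # Dual{Nothing}(0.0,0.0,0.0)
```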

@b-fg (Member) commented Jun 19, 2024

Went over it once again, all looks good except for the 2 minor details above!

@weymouth weymouth merged commit 1525197 into master Jun 19, 2024
43 checks passed
weymouth referenced this pull request Jun 19, 2024
@b-fg b-fg mentioned this pull request Jun 19, 2024
@vchuravy
@michel2323 has been working on adding Enzyme support to KernelAbstractions, which together with Checkpointing.jl adds reverse-mode capabilities.

We just gave a tutorial on using Enzyme for codes like this: https://dj4earth.github.io/MPE24, and in particular
https://dj4earth.github.io/MPE24/Burgers_tutorial/burgers_tutorial.html

We are still working through some minutiae for full GPU support, but we are getting close.

@b-fg (Member) commented Jun 19, 2024

Thanks @vchuravy! That's very useful indeed. Currently ForwardDiff works as expected on CPUs, but yields a fairly cryptic error message on GPUs. Eventually I think that we will move to Enzyme, probably when the GPU support under KA is ready. And thanks for working on that! It will truly bring this solver to a whole new feature level :)
