Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arbitrary tracers #452

Merged
merged 66 commits into from
Oct 17, 2019
Merged

Arbitrary tracers #452

merged 66 commits into from
Oct 17, 2019

Conversation

glwagner
Copy link
Member

@glwagner glwagner commented Oct 9, 2019

This PR introduces 'arbitrary tracer' functionality to Oceananigans.

With this PR, the user is now able to specify which tracers they would like to include in their simulation via the Model constructor:

model = Model(..., tracers=(:b, :c1, :c2), ...)

User-facing considerations

  1. The specified tracers must be consistent with the specified buoyancy model. This consistency is ensured by validate_buoyancy in the Model constructor. For BuoyancyTracer, the user must specify a tracer named :b. For SeawaterBuoyancy, the user must specify tracers named :T and :S. (In the future, we could also allow the names of these buoyancy, temperature, and salinity tracers to be specified in the constructor for the respective buoyancy models, eg SeawaterBuoyancy(temperature=:T, salinity=:S).)

  2. The constructor for ModelForcing (which will have to be merged with Forcing API refactor #444 when it is approved) now allows the user to specify forcing for any of the tracer fields. If the user specifies a forcing for a tracer field that does not exist, an error is thrown. If a user does not specify a forcing for some tracer field, a default zero forcing is applied.

  3. The same logic for ModelForcing applies to BoundaryConditions / SolutionBoundaryConditions. When boundary conditions for a tracer field are unspecified, a default tracer boundary condition is inferred from the x-velocity field: tracers either inherit Periodic boundary conditions, or are given a Flux, Nothing boundary condition for all other cases.

  4. For closures, all tracer-diffusivity-related fields permit users to specify either:

  • a constant (applied to all tracers), or
  • a NamedTuple with fields for every tracer.

In the future, we could possibly implement some mixed behavior where the user specifies both a default and particular values for tracer diffusivity. This would be useful in the (possibly rare) use-case of a large number of tracers with the same diffusivity, but where one or two of them require a special, different diffusivity. I am not sure this API is necessary so I left it for future work.

Diffusivity-like fields include:

  • κ (constant component of tracer diffusivity) for ConstantIsotropicDiffusivity, AnisotropicMinimumDissipation, and ConstantSmagorinsky
  • Pr (turbulent Prandtl number) for ConstantSmagorinsky
  • κh and κv for ConstantAnisotropicDiffusivity.

Internal algorithmic considerations

This implementation includes a major refactor of the time-stepping algorithm. In particular, kernels are launched for each tracer for all operations that involve tracers. This differs from the previous pattern, in which a single kernel was called in some cases (for example, to update the velocity and tracer fields, or to store previous source terms). The reason for this change is because I ran into some issues (dynamic function invocations) using ntuple to unroll a loop over tracers inside the kernel. In addition, I think that with a large number of tracers the kernels may become too large and their performance could degrade (but I'm not sure).

This refactoring of the algorithm means we need to

  • benchmark the changes in this PR to see if there are any significant changes in model performance.

@ali-ramadhan, can you help with this?

If there are changes in model performance, we can work on unrolling loops over the tracer fields inside our kernels. This is probably possible; it just requires some debugging. We would probably also want to make sure that this doesn't lead to poor performance for up to O(10) tracers.

If any of this PR is not satisfactory, I'm happy to work on it and iterate until the PR is in mergeable shape.

Resolves #25
Resolves #430
Partially addresses #448

…r of tracers; making AMD closures accept arbitrary tracers
…cleanup of docstrings and overlapping AMD functionality
…rs are consistent with the specified buoyancy model. Changes non-dimensional model to use a buoyancy tracer.
@glwagner
Copy link
Member Author

Unrolling inside loops may be the way to go then.

Copy link
Member

@ali-ramadhan ali-ramadhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! Not much to disagree with so mostly just had minor comments.

  1. Good to make sure model is consistent with validate_buoyancy. I'm okay with being able to change the name of temperature and salinity, although probably not something I used. On Slack you suggested changing :T to which I'd be in favor of.

  2. Nice!

  3. Seems reasonable. Side note: Would be nice to have something like validate_boundary_conditions in the future to make inferences like this safe by e.g. checking that if one x boundary condition is Periodic then all other x BCs are also periodic.

  4. Out of curiousity, would a user ever want to run some tracers with LES and other tracers with constant diffusivity? Seems like it'll be a weird/inconsistent model.

Things to do before merging:

  • The TurbulenceClosures module has been quite significantly refactored, but we don't have an LES regression test (kind of my fault, I'll open an issue). I wonder if it's worth rerunning the stratified Couette flow verification as a manual regression test? Or we can add an LES regression test in master and then make sure this branch produces the same numbers.
  • Need to figure out why the model has slowed down by a factor of ~2-3x. Will try benchmarking the arbitrary-tracers-inner-loops branch again.

Other comments:

src/buoyancy.jl Show resolved Hide resolved
src/buoyancy.jl Show resolved Hide resolved
src/buoyancy.jl Show resolved Hide resolved
src/time_steppers.jl Show resolved Hide resolved
src/time_steppers.jl Show resolved Hide resolved
src/time_steppers.jl Outdated Show resolved Hide resolved
src/models.jl Show resolved Hide resolved
@glwagner
Copy link
Member Author

Out of curiousity, would a user ever want to run some tracers with LES and other tracers with constant diffusivity? Seems like it'll be a weird/inconsistent model.

That's not out of the question. We could make the closure itself a tuple for different tracers. This functionality could also be hacked together for smagorinsky if Pr = Inf for one or more tracers. However, we don't flexibly permit the mixing and matching of different closures in general. Something to ponder for the future, in the general category of how we allow users to more flexibly change the equation they are solving.

@glwagner
Copy link
Member Author

I've pushed some commits that fix the performance issue. This is done by using the Val type to pass the tracer index into the kernel functions that compute ∇_κ_∇c, rather than using the raw, unwrapped, integer tracer_index, eg

https://github.com/climate-machine/Oceananigans.jl/blob/606d40444038d058e32be71655a7239986114a56/src/time_steppers.jl#L172

The flux divergence for constant isotropic diffusivity, for example, is now

https://github.com/climate-machine/Oceananigans.jl/blob/606d40444038d058e32be71655a7239986114a56/src/TurbulenceClosures/constant_isotropic_diffusivity.jl#L42

Benchmarks on cyclops are now

Master

Oceananigans package status:
    Status `~/.julia/environments/v1.1/Project.toml`
  [9e8cae18] Oceananigans v0.11.1 #master (https://github.com/climate-machine/Oceananigans.jl.git)

┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/fLiQ1/src/indexing.jl:16
Running static ocean benchmark:  32× 32× 32  (GPU, Float64)...
Running static ocean benchmark:  64× 64× 64  (GPU, Float64)...
Running static ocean benchmark: 128×128×128  (GPU, Float64)...
Running static ocean benchmark: 256×256×256  (GPU, Float64)...
 ──────────────────────────────────────────────────────────────────────────────────────
        Static ocean benchmarks                Time                   Allocations      
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                82.9s / 0.58%           8.26GiB / 0.37%    

 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 256×256×256  (GPU, Float64)       10    287ms  59.8%  28.7ms   7.73MiB  24.5%   791KiB
  32× 32× 32  (GPU, Float64)       10   75.9ms  15.8%  7.59ms   8.28MiB  26.3%   848KiB
 128×128×128  (GPU, Float64)       10   66.7ms  13.9%  6.67ms   7.79MiB  24.7%   798KiB
  64× 64× 64  (GPU, Float64)       10   50.6ms  10.5%  5.06ms   7.73MiB  24.5%   791KiB
 ──────────────────────────────────────────────────────────────────────────────────────

arbitrary-tracers-outer-loops

Oceananigans package status:
    Status `~/.julia/environments/v1.1/Project.toml`
  [9e8cae18] Oceananigans v0.11.0 #arbitrary-tracers-outer-loops (https://github.com/climate-machine/Oceananigans.jl.git)

┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/fLiQ1/src/indexing.jl:16
Running static ocean benchmark:  32× 32× 32  (GPU, Float64)...
Running static ocean benchmark:  64× 64× 64  (GPU, Float64)...
Running static ocean benchmark: 128×128×128  (GPU, Float64)...
Running static ocean benchmark: 256×256×256  (GPU, Float64)...
 ──────────────────────────────────────────────────────────────────────────────────────
        Static ocean benchmarks                Time                   Allocations      
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                85.2s / 0.56%           8.46GiB / 0.43%    

 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 256×256×256  (GPU, Float64)       10    277ms  57.6%  27.7ms   9.16MiB  24.6%   938KiB
  32× 32× 32  (GPU, Float64)       10   82.2ms  17.1%  8.22ms   9.71MiB  26.1%  0.97MiB
 128×128×128  (GPU, Float64)       10   65.9ms  13.7%  6.59ms   9.16MiB  24.6%   938KiB
  64× 64× 64  (GPU, Float64)       10   55.5ms  11.6%  5.55ms   9.16MiB  24.6%   938KiB
 ──────────────────────────────────────────────────────────────────────────────────────

arbitrary-tracers-inner-loops

Oceananigans package status:
    Status `~/.julia/environments/v1.1/Project.toml`
  [9e8cae18] Oceananigans v0.11.0 #arbitrary-tracers-inner-loops (https://github.com/climate-machine/Oceananigans.jl.git)

┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/fLiQ1/src/indexing.jl:16
Running static ocean benchmark:  32× 32× 32  (GPU, Float64)...
Running static ocean benchmark:  64× 64× 64  (GPU, Float64)...
Running static ocean benchmark: 128×128×128  (GPU, Float64)...
Running static ocean benchmark: 256×256×256  (GPU, Float64)...
 ──────────────────────────────────────────────────────────────────────────────────────
        Static ocean benchmarks                Time                   Allocations      
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                86.4s / 0.56%           8.40GiB / 0.37%    

 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 256×256×256  (GPU, Float64)       10    288ms  59.9%  28.8ms   7.71MiB  24.5%   790KiB
  32× 32× 32  (GPU, Float64)       10   78.3ms  16.3%  7.83ms   8.39MiB  26.6%   859KiB
 128×128×128  (GPU, Float64)       10   62.0ms  12.9%  6.20ms   7.71MiB  24.5%   790KiB
  64× 64× 64  (GPU, Float64)       10   52.6ms  10.9%  5.26ms   7.71MiB  24.5%   790KiB
 ──────────────────────────────────────────────────────────────────────────────────────

@ali-ramadhan
Copy link
Member

Here are the results of benchmark_tracers.jl between the arbitrary-tracers-outer-loops branch (this PR) and the arbitrary-tracers-inner-loops branch.

Outer loops is faster in all cases.

PS: arbitrary-tracers-inner-loops errored for 0 active + 0 passive tracers so this particular benchmark isn't included below, but maybe it doesn't matter as outer loops seem to perform much better.


arbitrary-tracers-outer-loops

 ───────────────────────────────────────────────────────────────────────────────────────────────────────────
                      Tracer benchmarks                             Time                   Allocations      
                                                            ──────────────────────   ───────────────────────
                      Tot / % measured:                           119s / 2.11%           19.7GiB / 0.48%    

 Section                                            ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 256×256×256 0 active +  0 passive (GPU, Float64)       10    206ms  8.24%  20.6ms   6.11MiB  6.37%   625KiB
 256×256×256 0 active +  1 passive (GPU, Float64)       10    234ms  9.36%  23.4ms   7.62MiB  7.94%   780KiB
 256×256×256 0 active +  2 passive (GPU, Float64)       10    258ms  10.3%  25.8ms   9.17MiB  9.56%   939KiB
 256×256×256 1 active +  0 passive (GPU, Float64)       10    232ms  9.28%  23.2ms   7.68MiB  8.01%   787KiB
 256×256×256 2 active +  0 passive (GPU, Float64)       10    266ms  10.6%  26.6ms   9.14MiB  9.53%   936KiB
 256×256×256 2 active +  3 passive (GPU, Float64)       10    348ms  13.9%  34.8ms   13.8MiB  14.4%  1.38MiB
 256×256×256 2 active +  5 passive (GPU, Float64)       10    409ms  16.3%  40.9ms   17.0MiB  17.7%  1.70MiB
 256×256×256 2 active + 10 passive (GPU, Float64)       10    551ms  22.0%  55.1ms   25.4MiB  26.5%  2.54MiB
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────

arbitrary-tracers-inner-loops

 ───────────────────────────────────────────────────────────────────────────────────────────────────────────
                      Tracer benchmarks                             Time                   Allocations      
                                                            ──────────────────────   ───────────────────────
                      Tot / % measured:                           120s / 2.20%           19.1GiB / 0.33%    

 Section                                            ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 256×256×256 0 active +  1 passive (GPU, Float64)       10    241ms  9.14%  24.1ms   7.09MiB  10.8%   726KiB
 256×256×256 0 active +  2 passive (GPU, Float64)       10    273ms  10.3%  27.3ms   7.68MiB  11.7%   786KiB
 256×256×256 1 active +  0 passive (GPU, Float64)       10    237ms  8.99%  23.7ms   7.06MiB  10.8%   723KiB
 256×256×256 2 active +  0 passive (GPU, Float64)       10    280ms  10.6%  28.0ms   7.79MiB  11.9%   798KiB
 256×256×256 2 active +  3 passive (GPU, Float64)       10    408ms  15.5%  40.8ms   9.80MiB  15.0%  0.98MiB
 256×256×256 2 active +  5 passive (GPU, Float64)       10    494ms  18.7%  49.4ms   11.1MiB  16.9%  1.11MiB
 256×256×256 2 active + 10 passive (GPU, Float64)       10    705ms  26.7%  70.5ms   15.0MiB  22.9%  1.50MiB
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────

@glwagner
Copy link
Member Author

Nice! Can post 32^3 results just to make sure we don't kill small models? I think its probable they will be comparable.

@glwagner
Copy link
Member Author

This is great news btw!

@ali-ramadhan
Copy link
Member

Hmmm, interestingly in your benchmarks the 128^3 models were faster than the 32^3 models!

Benchmark results from 32^3 models:

For 1-2 tracers inner and outer loops seem the same, but for many tracers, inner loops seems better now on the GPU. For the CPU, outer loops is still much better.

Not sure what to make of this. We're definitely not saturating the GPU so maybe it's not worth considering the case of small GPU models too much?

Outer loops seems like the way to go?

Outer loops (this PR)

───────────────────────────────────────────────────────────────────────────────────────────────────────────
                      Tracer benchmarks                             Time                   Allocations      
                                                            ──────────────────────   ───────────────────────
                      Tot / % measured:                           132s / 0.60%           15.9GiB / 0.63%    

 Section                                            ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────
  32× 32× 32 0 active +  0 passive (CPU, Float64)       10   32.2ms  4.06%  3.22ms    551KiB  0.52%  55.1KiB
  32× 32× 32 0 active +  1 passive (CPU, Float64)       10   39.2ms  4.95%  3.92ms    642KiB  0.61%  64.2KiB
  32× 32× 32 0 active +  2 passive (CPU, Float64)       10   42.1ms  5.32%  4.21ms    736KiB  0.70%  73.6KiB
  32× 32× 32 1 active +  0 passive (CPU, Float64)       10   38.4ms  4.85%  3.84ms    652KiB  0.62%  65.2KiB
  32× 32× 32 2 active +  0 passive (CPU, Float64)       10   43.2ms  5.46%  4.32ms    747KiB  0.71%  74.7KiB
  32× 32× 32 2 active +  3 passive (CPU, Float64)       10   55.6ms  7.02%  5.56ms   0.98MiB  0.96%   100KiB
  32× 32× 32 2 active +  5 passive (CPU, Float64)       10   63.9ms  8.07%  6.39ms   1.15MiB  1.12%   118KiB
  32× 32× 32 2 active + 10 passive (CPU, Float64)       10   80.5ms  10.2%  8.05ms   1.59MiB  1.54%   162KiB

  32× 32× 32 0 active +  0 passive (GPU, Float64)       10   29.1ms  3.67%  2.91ms   6.11MiB  5.94%   625KiB
  32× 32× 32 0 active +  1 passive (GPU, Float64)       10   33.2ms  4.19%  3.32ms   7.62MiB  7.41%   780KiB
  32× 32× 32 0 active +  2 passive (GPU, Float64)       10   36.9ms  4.66%  3.69ms   9.12MiB  8.88%   934KiB
  32× 32× 32 1 active +  0 passive (GPU, Float64)       10   33.5ms  4.23%  3.35ms   7.63MiB  7.42%   781KiB
  32× 32× 32 2 active +  0 passive (GPU, Float64)       10   37.2ms  4.70%  3.72ms   9.14MiB  8.89%   936KiB
  32× 32× 32 2 active +  3 passive (GPU, Float64)       10   60.6ms  7.65%  6.06ms   13.8MiB  13.4%  1.38MiB
  32× 32× 32 2 active +  5 passive (GPU, Float64)       10   68.3ms  8.63%  6.83ms   17.0MiB  16.5%  1.70MiB
  32× 32× 32 2 active + 10 passive (GPU, Float64)       10   97.9ms  12.4%  9.79ms   25.4MiB  24.7%  2.54MiB
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────

Inner loops

 ───────────────────────────────────────────────────────────────────────────────────────────────────────────
                      Tracer benchmarks                             Time                   Allocations      
                                                            ──────────────────────   ───────────────────────
                      Tot / % measured:                           129s / 0.81%           16.4GiB / 2.72%    

 Section                                            ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────
  32× 32× 32 0 active +  1 passive (CPU, Float64)       10   83.2ms  7.98%  8.32ms   55.6MiB  12.2%  5.56MiB
  32× 32× 32 0 active +  2 passive (CPU, Float64)       10   81.5ms  7.81%  8.15ms   55.7MiB  12.2%  5.57MiB
  32× 32× 32 1 active +  0 passive (CPU, Float64)       10   79.2ms  7.59%  7.92ms   55.6MiB  12.2%  5.56MiB
  32× 32× 32 2 active +  0 passive (CPU, Float64)       10   81.1ms  7.77%  8.11ms   55.7MiB  12.2%  5.57MiB
  32× 32× 32 2 active +  3 passive (CPU, Float64)       10    120ms  11.5%  12.0ms   56.0MiB  12.3%  5.60MiB
  32× 32× 32 2 active +  5 passive (CPU, Float64)       10    129ms  12.4%  12.9ms   56.1MiB  12.3%  5.61MiB
  32× 32× 32 2 active + 10 passive (CPU, Float64)       10    188ms  18.0%  18.8ms   56.6MiB  12.4%  5.66MiB
 
  32× 32× 32 0 active +  1 passive (GPU, Float64)       10   30.3ms  2.90%  3.03ms   7.05MiB  1.54%   722KiB
  32× 32× 32 0 active +  2 passive (GPU, Float64)       10   39.8ms  3.82%  3.98ms   7.73MiB  1.69%   792KiB
  32× 32× 32 1 active +  0 passive (GPU, Float64)       10   30.0ms  2.88%  3.00ms   7.06MiB  1.55%   723KiB
  32× 32× 32 2 active +  0 passive (GPU, Float64)       10   32.8ms  3.14%  3.28ms   7.70MiB  1.69%   789KiB
  32× 32× 32 2 active +  3 passive (GPU, Float64)       10   39.2ms  3.76%  3.92ms   9.69MiB  2.12%  0.97MiB
  32× 32× 32 2 active +  5 passive (GPU, Float64)       10   44.1ms  4.23%  4.41ms   11.1MiB  2.43%  1.11MiB
  32× 32× 32 2 active + 10 passive (GPU, Float64)       10   65.4ms  6.27%  6.54ms   15.1MiB  3.30%  1.51MiB
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────

@ali-ramadhan
Copy link
Member

The stratified Couette flow verficiation test was failing (after merging this PR into master) because stratified_couette_flow.jl uses κ = model.closure.κ but that's a NamedTuple now so it should be κ = model.closure.κ.T.

I've fixed this. Hopefully tests will pass now.

…--- cannot directly use checkpointing because underlying model type signatures have changed; must manually load fields as in other regression tests
@glwagner
Copy link
Member Author

@ali-ramadhan I've addressed your concerns: we now have a LES regression test, and some changes to inner kernels means that performance is now improved by this PR over master (for large models).

Copy link
Member

@ali-ramadhan ali-ramadhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@ali-ramadhan ali-ramadhan merged commit e60a723 into master Oct 17, 2019
@ali-ramadhan ali-ramadhan deleted the arbitrary-tracers-outer-loops branch October 17, 2019 19:07
arcavaliere pushed a commit to arcavaliere/Oceananigans.jl that referenced this pull request Nov 6, 2019
Arbitrary tracers

Former-commit-id: e60a723
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup 🧹 Paying off technical debt feature 🌟 Something new and shiny numerics 🧮 So things don't blow up and boil the lobsters alive performance 🏍️ So we can get the wrong answer even faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Each tracer should have its own diffusivity Arbitrary tracer fields.
2 participants