Consider running GC when allocating and synchronizing #2304
Conversation
Codecov Report

Attention: Patch coverage is …

```
@@            Coverage Diff             @@
##           master    #2304       +/-   ##
===========================================
- Coverage   71.86%   60.45%   -11.42%
===========================================
  Files         155      155
  Lines       15020    14959       -61
===========================================
- Hits        10794     9043     -1751
- Misses       4226     5916     +1690
```

View full report in Codecov by Sentry.
```diff
 mutable struct AllocStats
   alloc_count::Threads.Atomic{Int}
   alloc_bytes::Threads.Atomic{Int}

   free_count::Threads.Atomic{Int}
   free_bytes::Threads.Atomic{Int}

-  total_time::MaybeAtomicFloat64
+  total_time::Threads.Atomic{Float64}
```
Can we use `@atomic total_time::Float64`?
I guess we could update almost all of these then?
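For context on the two styles being discussed, here is a minimal sketch contrasting a `Threads.Atomic` wrapper field with a Julia 1.7+ `@atomic` field. The struct and field names are illustrative only, not the PR's actual code.

```julia
# Illustrative sketch only; `AtomicFieldStats` and `WrappedStats` are made-up names.

# Style suggested in the review: per-field atomics (Julia 1.7+).
mutable struct AtomicFieldStats
    @atomic total_time::Float64
end

s = AtomicFieldStats(0.0)
@atomic s.total_time += 1.5              # atomic read-modify-write on the field itself

# Style currently in the diff: a boxed Threads.Atomic wrapper.
struct WrappedStats
    total_time::Threads.Atomic{Float64}
end

w = WrappedStats(Threads.Atomic{Float64}(0.0))
Threads.atomic_add!(w.total_time, 1.5)   # atomic add through the wrapper object
```

The `@atomic` field avoids the extra allocation and indirection of the `Threads.Atomic` box, which is presumably why the reviewer suggests it.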
Trying this PR on my example from FluxML/Flux.jl#2414 on my NVIDIA RTX 4080 on Windows 11 ... it hung Julia and made the process unkillable in Task Manager, and eventually I got a BSOD for KMODE_EXCEPTION_NOT_HANDLED in the NVIDIA driver.

EDIT: apparently an unkillable process is often caused by a driver fault: https://superuser.com/questions/136272/how-can-i-kill-an-unkillable-process

EDIT2: Though this is in comparison with 5.2, not the current master ... and the base of the branch hung my whole system 🙃
The example in https://discourse.julialang.org/t/gpu-memory-usage-increasing-on-each-epoch-flux/112942/2
Thanks, the example from https://discourse.julialang.org/t/gpu-memory-usage-increasing-on-each-epoch-flux/112942 was useful. I tuned the heuristic: in low-memory situations it significantly improves performance, while with more reasonable amounts of memory available it smooths out the cost of garbage collection, resulting in slightly shorter pauses and more consistent execution times. It's possible to disable the heuristic by setting

Taking the Flux example from Discourse:

```julia
using Flux
using MLUtils: DataLoader
using CUDA
using NVTX

function increasing_gpu_memory_usage()
    n_obs = 300_000
    n_feature = 1000
    X = rand(n_feature, n_obs)
    y = rand(1, n_obs)
    train_data = DataLoader((X, y) |> gpu; batchsize = 2048, shuffle = false)

    model = Dense(n_feature, 1) |> gpu
    loss(m, _x, _y) = Flux.Losses.mse(m(_x), _y)
    opt_state = Flux.setup(Flux.Adam(), model)

    # warm-up run before timing
    Flux.train!(loss, model, train_data, opt_state)

    total_time = @elapsed begin
        CUDA.@profile external=true for epoch in 1:100
            NVTX.@range "Epoch $epoch" begin
                train_time = @elapsed Flux.train!(loss, model, train_data, opt_state)
                @info "Epoch $(epoch) train time $(round(train_time, digits=3))"
            end
        end
    end
    @info "Total time $(round(total_time, digits=3))"
    return
end

isinteractive() || increasing_gpu_memory_usage()
```
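As an aside, the "4GiB memory limit" mentioned below can be imposed in several ways; the sketch here uses CUDA.jl's `JULIA_CUDA_HARD_MEMORY_LIMIT` environment variable and is only an assumption about how such a run could be set up, not necessarily how it was configured for these measurements.

```julia
# Hedged sketch: cap pool growth at 4 GiB via an environment variable.
# The variable must be set before CUDA.jl initializes its memory pool.
ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "4GiB"

using CUDA
CUDA.rand(10)   # first GPU use initializes the pool under the configured limit
```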
Old behavior, with a 4GiB memory limit:

Note that it takes a while to reach steady-state, so I'm only showing the final epochs. Enabling the new heuristic:
Part of the advantage seems to come from the fact that collecting earlier makes it possible for memory to become available to the memory pool without having to explicitly synchronize. Before, we called the GC when we were already at 100% memory usage, and because memory gets freed asynchronously (i.e. it only becomes available when the free actually executes), that often meant we also had to wait for the GPU to finish its current work. Now, by collecting earlier, we give the free a chance to materialize, obviating the explicit synchronization.

Everybody, please test this out on your code, or share (easily reproducible) MWEs that illustrate problems.
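To make that explanation concrete, here is a minimal sketch of an "early GC" allocation path. The helpers `pool_usage` and `try_alloc` and the 90% threshold are hypothetical stand-ins, not CUDA.jl's actual internals or the exact heuristic this PR implements.

```julia
using CUDA

# Hypothetical stand-ins so the sketch runs on its own; NOT CUDA.jl internals.
pool_usage() = 0.95                      # pretend the pool is 95% full
try_alloc(bytes) = zeros(UInt8, bytes)   # pretend the allocation succeeds

const GC_PRESSURE_THRESHOLD = 0.9        # assumed knob: collect before hitting 100%

function alloc_with_early_gc(bytes)
    # Collect incrementally once the pool is nearly full, so that asynchronous
    # frees get a chance to materialize before memory actually runs out.
    if pool_usage() > GC_PRESSURE_THRESHOLD
        GC.gc(false)                     # quick incremental collection
    end

    ptr = try_alloc(bytes)
    if ptr === nothing
        # Only when the allocation really fails do we pay for a full collection
        # plus a device synchronization to wait for pending frees.
        GC.gc(true)
        CUDA.synchronize()
        ptr = try_alloc(bytes)
    end
    ptr === nothing && throw(OutOfMemoryError())
    return ptr
end

alloc_with_early_gc(2^20)                # e.g. a 1 MiB request
```

The point of collecting before the pool is completely full is that frees already queued on the GPU can complete while other work proceeds, so the expensive full-collection-plus-synchronize fallback is needed far less often.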
I don't consider this ready, but am going to go ahead and merge this to avoid excessive conflicts with the memory refactor I'm doing in #2335.
Implements #2303; cc @gbaraldi