Consider running GC when allocating and synchronizing #2304
Conversation
Codecov Report

Attention: Patch coverage is …

```
@@            Coverage Diff             @@
##           master    #2304       +/-   ##
===========================================
- Coverage   71.86%   60.45%   -11.42%
===========================================
  Files         155      155
  Lines       15020    14959       -61
===========================================
- Hits        10794     9043     -1751
- Misses       4226     5916     +1690
```

View full report in Codecov by Sentry.
```diff
 mutable struct AllocStats
   alloc_count::Threads.Atomic{Int}
   alloc_bytes::Threads.Atomic{Int}

   free_count::Threads.Atomic{Int}
   free_bytes::Threads.Atomic{Int}

-  total_time::MaybeAtomicFloat64
+  total_time::Threads.Atomic{Float64}
```
Can we use `@atomic total_time::Float64`?
I guess we could update almost all of these then?
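For context on the two styles being discussed, here is a minimal sketch contrasting a `Threads.Atomic` wrapper field with a Julia 1.7+ `@atomic` field. The struct and field names are illustrative only, not the PR's actual code.

```julia
# Illustrative sketch only; `AtomicFieldStats` and `WrappedStats` are made-up names.

# Style suggested in the review: per-field atomics (Julia 1.7+).
mutable struct AtomicFieldStats
    @atomic total_time::Float64
end

s = AtomicFieldStats(0.0)
@atomic s.total_time += 1.5              # atomic read-modify-write on the field itself

# Style currently in the diff: a boxed Threads.Atomic wrapper.
struct WrappedStats
    total_time::Threads.Atomic{Float64}
end

w = WrappedStats(Threads.Atomic{Float64}(0.0))
Threads.atomic_add!(w.total_time, 1.5)   # atomic add through the wrapper object
```

The `@atomic` field avoids the extra allocation and indirection of the `Threads.Atomic` box, which is presumably why the reviewer suggests it.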
Trying this PR on my example from FluxML/Flux.jl#2414 on my NVIDIA RTX 4080 on Windows 11 ... it hung Julia and made the process unkillable in Task Manager, and eventually I got a BSOD for KMODE_EXCEPTION_NOT_HANDLED in the NVIDIA driver.

EDIT: apparently an unkillable process is often caused by a driver fault: https://superuser.com/questions/136272/how-can-i-kill-an-unkillable-process

EDIT2: Though this is in comparison with 5.2, not the current master ... and the base of the branch hung my whole system 🙃
The example in https://discourse.julialang.org/t/gpu-memory-usage-increasing-on-each-epoch-flux/112942/2
Thanks, the example from https://discourse.julialang.org/t/gpu-memory-usage-increasing-on-each-epoch-flux/112942 was useful. I tuned the heuristic: in low-memory situations it significantly improves performance, while with more reasonable amounts of memory available it smooths out the cost of garbage collection, resulting in slightly shorter pauses and more consistent execution times. It's possible to disable the heuristic by setting

Taking the Flux example from Discourse:

```julia
using Flux
using MLUtils: DataLoader
using CUDA
using NVTX

function increasing_gpu_memory_usage()
    n_obs = 300_000
    n_feature = 1000
    X = rand(n_feature, n_obs)
    y = rand(1, n_obs)
    train_data = DataLoader((X, y) |> gpu; batchsize = 2048, shuffle = false)

    model = Dense(n_feature, 1) |> gpu
    loss(m, _x, _y) = Flux.Losses.mse(m(_x), _y)
    opt_state = Flux.setup(Flux.Adam(), model)

    # warm-up run before timing
    Flux.train!(loss, model, train_data, opt_state)

    total_time = @elapsed begin
        CUDA.@profile external=true for epoch in 1:100
            NVTX.@range "Epoch $epoch" begin
                train_time = @elapsed Flux.train!(loss, model, train_data, opt_state)
                @info "Epoch $(epoch) train time $(round(train_time, digits=3))"
            end
        end
    end
    @info "Total time $(round(total_time, digits=3))"
    return
end

isinteractive() || increasing_gpu_memory_usage()
```
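As an aside, the "4GiB memory limit" mentioned below can be imposed in several ways; the sketch here uses CUDA.jl's `JULIA_CUDA_HARD_MEMORY_LIMIT` environment variable and is only an assumption about how such a run could be set up, not necessarily how it was configured for these measurements.

```julia
# Hedged sketch: cap pool growth at 4 GiB via an environment variable.
# The variable must be set before CUDA.jl initializes its memory pool.
ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "4GiB"

using CUDA
CUDA.rand(10)   # first GPU use initializes the pool under the configured limit
```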
Old behavior, with a 4GiB memory limit:

Note that it takes a while to reach steady-state, so I'm only showing the final epochs. Enabling the new heuristic:
Part of the advantage seems to come from the fact that collecting earlier makes it possible for memory to become available to the memory pool without having to explicitly synchronize. Before, we called the GC when we were already at 100% memory usage, and because memory gets freed asynchronously (i.e. it only becomes available when the free actually executes), that often meant we also had to wait for the GPU to finish its current work. Now, by collecting earlier, we give the free a chance to materialize, obviating the explicit synchronization.

Everybody, please test this out on your code, or share (easily reproducible) MWEs that illustrate problems.
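To make that explanation concrete, here is a minimal sketch of an "early GC" allocation path. The helpers `pool_usage` and `try_alloc` and the 90% threshold are hypothetical stand-ins, not CUDA.jl's actual internals or the exact heuristic this PR implements.

```julia
using CUDA

# Hypothetical stand-ins so the sketch runs on its own; NOT CUDA.jl internals.
pool_usage() = 0.95                      # pretend the pool is 95% full
try_alloc(bytes) = zeros(UInt8, bytes)   # pretend the allocation succeeds

const GC_PRESSURE_THRESHOLD = 0.9        # assumed knob: collect before hitting 100%

function alloc_with_early_gc(bytes)
    # Collect incrementally once the pool is nearly full, so that asynchronous
    # frees get a chance to materialize before memory actually runs out.
    if pool_usage() > GC_PRESSURE_THRESHOLD
        GC.gc(false)                     # quick incremental collection
    end

    ptr = try_alloc(bytes)
    if ptr === nothing
        # Only when the allocation really fails do we pay for a full collection
        # plus a device synchronization to wait for pending frees.
        GC.gc(true)
        CUDA.synchronize()
        ptr = try_alloc(bytes)
    end
    ptr === nothing && throw(OutOfMemoryError())
    return ptr
end

alloc_with_early_gc(2^20)                # e.g. a 1 MiB request
```

The point of collecting before the pool is completely full is that frees already queued on the GPU can complete while other work proceeds, so the expensive full-collection-plus-synchronize fallback is needed far less often.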
I don't consider this ready, but am going to go ahead and merge this to avoid excessive conflicts with the memory refactor I'm doing in #2335.
Implements #2303; cc @gbaraldi