ResNet spending much time in CuArrays GC #149
Not much we can do at that point though, since the GPU is OOM, i.e. JuliaGPU/CuArrays.jl#270. Querying memory and occasionally calling into the GC could help.
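A minimal sketch of that idea, assuming CUDAdrv's Mem.info() for querying free/total device memory; the threshold and function name are illustrative, not CuArrays API:

```julia
using CUDAdrv: Mem

# Sketch only: poll free device memory and nudge the Julia GC when it gets low,
# so finalizers can return dead CuArrays to the pool. The threshold is arbitrary.
function maybe_gc(threshold = 0.10)
    free, total = Mem.info()     # bytes of free and total device memory
    if free / total < threshold
        GC.gc(false)             # incremental collection
    end
    return free / total
end
```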
I am already running a very small batch size on the ResNet model. Since the GC pressure from the CUDA arrays is low, will we always end up in this situation (GPU memory being full) sooner or later?
Yes, and extending the main GC to keep track of additional memory pressure and/or separate object pools doesn't seem like it'll be happening. Maybe we could maintain our own pressure metric and occasionally call out to the GC during allocation.
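A rough sketch of what such a pressure metric could look like; the names alloc_pressure and PRESSURE_LIMIT are made up for illustration and are not part of CuArrays:

```julia
# Sketch only: count bytes handed out by the pool and call into the Julia GC
# once the running total crosses a limit, so finalizers can release buffers.
const PRESSURE_LIMIT = UInt(1) << 30    # e.g. 1 GiB of pool allocations
const alloc_pressure = Ref(UInt(0))

function note_allocation!(bytes)
    alloc_pressure[] += bytes
    if alloc_pressure[] > PRESSURE_LIMIT
        GC.gc(false)                    # cheap, incremental collection
        alloc_pressure[] = UInt(0)
    end
    return nothing
end
```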
While it is possible to do manual memory management in some cases, how about temporaries, like the intermediates allocated inside layer code?
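For illustration, this is the kind of early freeing that can help for temporaries, assuming CuArrays.unsafe_free! is available in your CuArrays version (calling it on a buffer that is still referenced elsewhere is unsafe). The function itself is hypothetical, not Flux code:

```julia
using CuArrays

# Hypothetical helper: free the intermediate as soon as we know it is dead,
# instead of waiting for the GC to run its finalizer.
function scaled_colsum(x::CuArray)
    tmp = x .* 2f0                 # temporary intermediate
    y = sum(tmp; dims = 1)
    CuArrays.unsafe_free!(tmp)     # hand the buffer back to the pool early
    return y
end
```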
In case it's useful, I threw this together: FluxML/Flux.jl#598. For a ResNet there's a risk that we're just holding on to too much memory even with the GC, and making it thrash.
I don't think it is the parameters of the model that are taking up space, but all the temporaries that we allocate until we hit OOM and the GC kicks in.
@KristofferC Have you got the code for this model? I'm having a look at some memory allocation performance improvements.
Hopefully https://github.com/KristofferC/resnet should work. Just message me otherwise.
I'm getting an image loading problem with https://github.com/KristofferC/resnet:
The image in question changes upon every run. How much memory does your model need for the initial run? EDIT: it also fails on my 6 GB Titan, so it doesn't seem like an OOM. To see which allocations are being an issue (i.e. where we need to free early), you could run with …
Seems like you just can't load images using … Sorry, but I am not sure how to debug where the allocations are coming from. Just enable that?
I'm having a look at this right now (in the context of denizyuret/Knet.jl#417, but that shouldn't matter).
@KristofferC Could you have a try with JuliaGPU/CuArrays.jl#277?
Will try!
OK, so now I get:
So most of the time is now being spent outside my tracked regions, and I will need to update them to include more. Note, however:
Previously, the graph of GPU usage looked like [plot of GPU memory usage, before]; now it looks like [plot of GPU memory usage, after].
It's weird; it is like I am hitting different "modes" on the GPU. Now I had quite a fast run:
I'm seeing something similar: denizyuret/Knet.jl#417 (comment). EDIT: I found some logic bugs in the manager though. Fixing those, although they shouldn't cause the nondeterminism we're seeing.
With those fixes I'm not seeing the changing behavior any more. Anyhow, I'm pretty sure I got rid of the costly gc(true) calls. If you have any more timing results or insights as to where your ResNet model might run into CuArrays problems, just open a new issue. I won't have time to look into profiling it myself, though.
Ah, found the FileIO issue: I needed ImageMagick.jl. Strange how that error didn't surface when running under your benchmark script (interrupting the hang showed a stacktrace into a yield).
GC time is definitely still relevant.
We're spending way too much time in there. There's no good option though: the reclaim is slow because of the calls it has to make into the CUDA driver to actually release memory. The first thing to try is to sprinkle some early frees of temporaries higher up the stack. An alternative approach could be to split larger blocks in order to fulfill allocation requests, but that will require yet another rework of the allocator.
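A toy sketch of the block-splitting idea; the Block type and pool here are stand-ins, not the actual CuArrays allocator data structures:

```julia
# Sketch only: serve a small request by splitting a larger cached block,
# instead of allocating fresh device memory or reclaiming the whole pool.
struct Block
    ptr::Ptr{Nothing}
    bytes::Int
end

function alloc_by_splitting!(pool::Vector{Block}, bytes::Int)
    idx = findfirst(b -> b.bytes >= bytes, pool)
    idx === nothing && return nothing              # no cached block is large enough
    big = pool[idx]
    deleteat!(pool, idx)
    if big.bytes > bytes
        # keep the tail of the block around for future requests
        push!(pool, Block(big.ptr + bytes, big.bytes - bytes))
    end
    return Block(big.ptr, bytes)
end
```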
You've done so much on the CuArrays.jl side, maybe it's fair we do some on the Flux side ;).
Yeah, seeing how JuliaGPU/CuArrays.jl#279 (comment) shows that the allocator performs really well on the CuArrays test suite (which also allocates a ton), I'd say that adapting the higher-level layers will benefit us more. I tried returning larger buffers, but that increases memory pressure too much to be beneficial on the ResNet model. I'll leave this issue open until I have the time to develop some better tools for tracing which outstanding objects are hurting the allocator.
Took a while, but I'm finally having another look at this. Working with a super simple CuArrays allocator (a straight CUDA malloc that, when it fails, calls into the GC and tries again, with memory likewise freed directly), the ResNet model by @KristofferC only works if I allow 6 GB of GPU memory to be allocated. This is without pooling, so there are no additional allocations and no memory set aside or kept alive. That's pretty bad, right, seeing https://github.com/JuliaGPU/CuArrays.jl/issues/273#issuecomment-460661339 and how PyTorch consumes about 4 GB (but with pooled memory, so it's really using less).
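A minimal sketch of that allocation strategy; raw_malloc and is_oom stand in for the actual CUDA driver call and error check, which differ between CUDAdrv versions:

```julia
# Sketch only: attempt a raw device allocation; on out-of-memory, run a full
# GC so CuArray finalizers release dead buffers, then retry exactly once.
function alloc_or_gc(raw_malloc, is_oom, bytes::Integer)
    try
        return raw_malloc(bytes)
    catch err
        is_oom(err) || rethrow()
        GC.gc(true)                 # full collection, runs pending finalizers
        return raw_malloc(bytes)    # retry; let it throw if still OOM
    end
end
```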
Let's close this, see #137 (comment). Please open new issues with updated MWEs if the problem persists.
I was profiling why a ResNet model (https://github.com/KristofferC/resnet) was running extremely slowly on Flux.
Sprinkling in some timed sections using https://github.com/KristofferC/TimerOutputs.jl and training the model a little bit, I got:
(edit: the timings below are stale due to changes in CuArrays; see https://github.com/JuliaGPU/CuArrays.jl/issues/273#issuecomment-461943376 for an update)
The gc true section refers to only this line: https://github.com/JuliaGPU/CuArrays.jl/blob/61e25a2d239da77a5e8f3dc9746f9f62cd9e1380/src/memory.jl#L256

It seems this line is being called too often compared to how expensive a gc(true) call is.
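For reference, the timed sections above were collected with TimerOutputs.jl; a minimal sketch of that kind of instrumentation, with placeholder functions rather than the actual ResNet training code, looks like:

```julia
using TimerOutputs

const to = TimerOutput()

# Placeholder work standing in for the real forward pass, backward pass and
# parameter update of the model.
fake_forward(x, y) = sum(abs2, x .- y)
fake_backward(l)   = fill(l, 3)
fake_update!(gs)   = nothing

function timed_step!(x, y)
    l  = @timeit to "forward"  fake_forward(x, y)
    gs = @timeit to "backward" fake_backward(l)
    @timeit to "update" fake_update!(gs)
    return l
end

for _ in 1:10
    timed_step!(rand(Float32, 8), rand(Float32, 8))
end
print_timer(to)
```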