When I set all GPU variables to nothing and call CUDA.reclaim(), my GPU memory remains full (does not go back to initial usage).
Currently the model being loaded onto the GPU is a BERT model from Transformers.jl; it is only moved to the GPU when training or testing, and is offloaded back to the CPU when not in use.
All the code to create the BERT model lives in a module called BERTModule, which has no global variables. I create a few BERT models in the main module's global scope by calling functions from BERTModule, and I store them in global variables there. I then train and predict with each of the models, which causes my GPU memory usage to increase quickly. When I afterwards set all of those global variables to nothing and call CUDA.reclaim(), my GPU memory usage either drops by a few tens or hundreds of MB or not at all — nowhere close to its initial value.
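For reference, the free-then-reclaim pattern described above can be sketched as below. Note that CUDA.jl device buffers are freed by Julia's garbage collector, so `CUDA.reclaim()` can only return memory whose backing `CuArray`s have already been collected; running `GC.gc()` first gives the finalizers a chance to run. (The array here is a stand-in for a model's GPU buffers, not the actual BERT model.)

```julia
using CUDA

# Stand-in for a model's parameters living on the GPU.
model = CUDA.rand(Float32, 1024, 1024)

# ... train / predict ...

model = nothing   # drop the last reference to the GPU buffers
GC.gc()           # let Julia's GC finalize the underlying CuArrays
CUDA.reclaim()    # then ask CUDA.jl to return cached memory to the driver
```

Without the intervening `GC.gc()`, the buffers may still be reachable from the GC's point of view, which would explain `reclaim()` recovering little or nothing.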
Furthermore, when training the many BERT models sequentially, I ran out of memory while training one of them. I then called CUDA.reclaim(), which reclaimed a small amount of GPU memory, tried training the model again, and it worked. The error message from the REPL for this case is below. As you can see, I try training the model (the training data and its size remain constant), the GPU runs out of memory, but after I call CUDA.reclaim() and try training again, it succeeds.
These appear to be bugs: CUDA.reclaim() should not need to be called explicitly, and after setting all variables to nothing and calling reclaim(), the expected behaviour is for GPU memory usage to return to its resting level.
If it is relevant, I am currently using an NVIDIA GTX 1050 Ti.
Is it possible that the compiled functions from BERTModule are taking up memory on the GPU? If so, is there a way I can clear some of those functions from memory?
Memory handling and GC integration have changed significantly, so I don't think this issue as reported here is still relevant. If the problem persists on CUDA.jl#master, feel free to open a new issue!