Out of memory error #18
Comments
Hi, @nullterminated. The standard models were designed to fit on GPUs with 4GB of memory or more - that is why you're running out of memory. If you decrease the batch size, you should be able to run just about any network you want with 3GB of GPU memory. It will just take a bit longer.
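As a rough, hypothetical illustration of why reducing the batch size helps: activation (and gradient) memory in Caffe scales roughly linearly with the batch size, so halving it roughly halves the largest allocations. The feature-map shapes in the sketch below are made-up placeholders, not measurements from this issue.

```python
# Back-of-envelope sketch: forward activation memory grows linearly with
# batch size. Shapes are illustrative placeholders, not AlexNet's exact
# shapes; gradients and parameters add further overhead on top of this.
def activation_bytes(batch_size, feature_map_shapes, bytes_per_value=4):
    return sum(batch_size * c * h * w * bytes_per_value
               for c, h, w in feature_map_shapes)

shapes = [(96, 55, 55), (256, 27, 27), (384, 13, 13), (384, 13, 13), (256, 13, 13)]
for bs in (100, 50):
    print("batch size %d: ~%.0f MB of forward activations"
          % (bs, activation_bytes(bs, shapes) / 1024**2))
```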
I've added a feature that should help avoid these issues in the future. I'm going to go ahead and close this issue, trusting that changing the batch size will fix your problem for now. Please re-open it if you still have an issue here.
That does the trick. It appears the default value is a batch size around 100. I made that 50 and now it is training without memory issues. Thanks!
After I changed the batch size I still ran into the same issue. I then switched the mode from GPU to CPU and it worked. The problem in my case is that my GPU can't hold the parameters of my net.
Same problem... Hardware from the output.log:
What is your input image size?
The biggest one is 20.5 MB (3008x3952).
Those are pretty large images, and fully convolutional networks are quite memory-hungry. Note that if you are using images of variable sizes, you need to set the batch size to 1 anyway (this is already set in FCN-Alexnet from the semantic segmentation example).

You could try resizing your images to a smaller size, if that does not destroy too much information. That is the first thing I would try.

Another solution is to take random crops from the bigger images. The labels need to be cropped in the same way, though, so the usual crop parameter in Caffe's data layer cannot be used. A Python layer would be suitable to perform the cropping in the context of a quick experiment.

Another solution would be to increase the stride of the first convolutional layer to reduce the size of its output feature map. However, you then need to make corresponding changes in the deconvolutional layer of the network and calculate the new offset to apply in the final crop layer.
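To make the random-crop suggestion concrete, here is a minimal, untested sketch of a Caffe Python layer that applies the same random crop to an image blob and its label blob. The class name, crop size, and blob ordering are illustrative assumptions, not part of the DIGITS example.

```python
import random
import caffe

class RandomCropLayer(caffe.Layer):
    """Crops bottom[0] (image) and bottom[1] (label) at the same random offset."""

    def setup(self, bottom, top):
        # Crop size is hard-coded here for brevity; it could also be read
        # from self.param_str in a real experiment.
        self.crop_h, self.crop_w = 512, 512

    def reshape(self, bottom, top):
        n, c = bottom[0].data.shape[:2]
        top[0].reshape(n, c, self.crop_h, self.crop_w)
        ln, lc = bottom[1].data.shape[:2]
        top[1].reshape(ln, lc, self.crop_h, self.crop_w)

    def forward(self, bottom, top):
        h, w = bottom[0].data.shape[2:]
        # Pick one offset and apply it to both image and label so they stay aligned.
        y = random.randint(0, h - self.crop_h)
        x = random.randint(0, w - self.crop_w)
        top[0].data[...] = bottom[0].data[:, :, y:y + self.crop_h, x:x + self.crop_w]
        top[1].data[...] = bottom[1].data[:, :, y:y + self.crop_h, x:x + self.crop_w]

    def backward(self, top, propagate_down, bottom):
        # No gradient is needed when this sits right after the data layers.
        pass
```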
Thanks! Will first try tiling the data set into smaller FOVs.
Smaller tiles solved the issue.
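For reference, the tiling can be done with a few lines of Python before building the dataset. This is only a sketch, assuming Pillow is available and using non-overlapping tiles with any remainder at the right/bottom edge dropped; label images would need to be cut identically so the tiles stay aligned.

```python
import os
from PIL import Image

def tile_image(path, out_dir, tile=512):
    """Cut one large image into non-overlapping tile x tile crops."""
    img = Image.open(path)
    w, h = img.size
    base = os.path.splitext(os.path.basename(path))[0]
    os.makedirs(out_dir, exist_ok=True)
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            crop = img.crop((x, y, x + tile, y + tile))
            crop.save(os.path.join(out_dir, "%s_%d_%d.png" % (base, y, x)))
```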
Note that during inference you should be able to use the original image size (as demonstrated in the binary segmentation example) - up to a limit of course: inference needs about a third of the GPU memory required for training.
Thanks! Will keep that in mind.
I want to train the GoogLeNet model with 1024 x 1024 images but it runs out of memory. I then resized the images to 800 x 800 with a batch size of 10 and it worked, but the accuracy was only about 80%. With 680 x 680 images and a batch size of 20 it still worked, and the accuracy reached 90%. It seems the batch size influences the accuracy. Is that right?
Although my error looks similar to issue #3, I thought I should open this as a separate issue. I chose AlexNet as my model and left all settings at their defaults. I have a 970M with 3GB of memory. The output on my terminal says:
2015-03-22 14:01:32 [20150322-140130-3826] [DEBUG] Train Caffe Model task queued.
2015-03-22 14:01:32 [20150322-140130-3826] [INFO ] Train Caffe Model task started.
2015-03-22 14:01:33 [20150322-140130-3826] [DEBUG] memory required: 793 MB
2015-03-22 14:01:34 [20150322-140130-3826] [DEBUG] memory required: 793 MB
2015-03-22 14:01:47 [20150322-140130-3826] [DEBUG] Network accuracy #0: 73.4714
2015-03-22 14:01:47 [20150322-140130-3826] [ERROR] Train Caffe Model: Check failed: error == cudaSuccess (2 vs. 0) out of memory
2015-03-22 14:01:48 [20150322-140130-3826] [ERROR] Train Caffe Model task failed with error code -6
From the caffe_output.log, I see:
I0322 14:01:34.870321 4054 solver.cpp:42] Solver scaffolding done.
I0322 14:01:34.870345 4054 solver.cpp:222] Solving
I0322 14:01:34.870350 4054 solver.cpp:223] Learning Rate Policy: step
I0322 14:01:34.870358 4054 solver.cpp:266] Iteration 0, Testing net (#0)
I0322 14:01:47.719110 4054 solver.cpp:315] Test net output #0: accuracy = 0.734714
I0322 14:01:47.719143 4054 solver.cpp:315] Test net output #1: loss = 1.53392 (* 1 = 1.53392 loss)
F0322 14:01:47.932499 4054 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x7fe678bebc6c (unknown)
@ 0x7fe678bebbb8 (unknown)
@ 0x7fe678beb5ba (unknown)
@ 0x7fe678bee551 (unknown)
@ 0x7fe67901bb4b caffe::SyncedMemory::mutable_gpu_data()
@ 0x7fe67901c893 caffe::Blob<>::mutable_gpu_diff()
@ 0x7fe67904605f caffe::CuDNNPoolingLayer<>::Backward_gpu()
@ 0x7fe678f2d158 caffe::Net<>::BackwardFromTo()
@ 0x7fe678f2d211 caffe::Net<>::Backward()
@ 0x7fe678f49ca1 caffe::Solver<>::Step()
@ 0x7fe678f4a72f caffe::Solver<>::Solve()
@ 0x40610f train()
@ 0x40412b main
@ 0x7fe6780f2ec5 (unknown)
@ 0x404775 (unknown)
@ (nil) (unknown)