Out of memory error #18

Closed · nullterminated opened this issue Mar 22, 2015 · 13 comments

@nullterminated

Although my error looks similar to issue #3, I thought I should open this as a separate issue. I chose AlexNet as my model and left all settings at their defaults. I have a 970M with 3GB of memory. The output on my terminal says:

2015-03-22 14:01:32 [20150322-140130-3826] [DEBUG] Train Caffe Model task queued.
2015-03-22 14:01:32 [20150322-140130-3826] [INFO ] Train Caffe Model task started.
2015-03-22 14:01:33 [20150322-140130-3826] [DEBUG] memory required: 793 MB
2015-03-22 14:01:34 [20150322-140130-3826] [DEBUG] memory required: 793 MB
2015-03-22 14:01:47 [20150322-140130-3826] [DEBUG] Network accuracy #0: 73.4714
2015-03-22 14:01:47 [20150322-140130-3826] [ERROR] Train Caffe Model: Check failed: error == cudaSuccess (2 vs. 0) out of memory
2015-03-22 14:01:48 [20150322-140130-3826] [ERROR] Train Caffe Model task failed with error code -6

From the caffe_output.log, I see:

I0322 14:01:34.870321 4054 solver.cpp:42] Solver scaffolding done.
I0322 14:01:34.870345 4054 solver.cpp:222] Solving
I0322 14:01:34.870350 4054 solver.cpp:223] Learning Rate Policy: step
I0322 14:01:34.870358 4054 solver.cpp:266] Iteration 0, Testing net (#0)
I0322 14:01:47.719110 4054 solver.cpp:315] Test net output #0: accuracy = 0.734714
I0322 14:01:47.719143 4054 solver.cpp:315] Test net output #1: loss = 1.53392 (* 1 = 1.53392 loss)
F0322 14:01:47.932499 4054 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x7fe678bebc6c (unknown)
@ 0x7fe678bebbb8 (unknown)
@ 0x7fe678beb5ba (unknown)
@ 0x7fe678bee551 (unknown)
@ 0x7fe67901bb4b caffe::SyncedMemory::mutable_gpu_data()
@ 0x7fe67901c893 caffe::Blob<>::mutable_gpu_diff()
@ 0x7fe67904605f caffe::CuDNNPoolingLayer<>::Backward_gpu()
@ 0x7fe678f2d158 caffe::Net<>::BackwardFromTo()
@ 0x7fe678f2d211 caffe::Net<>::Backward()
@ 0x7fe678f49ca1 caffe::Solver<>::Step()
@ 0x7fe678f4a72f caffe::Solver<>::Solve()
@ 0x40610f train()
@ 0x40412b main
@ 0x7fe6780f2ec5 (unknown)
@ 0x404775 (unknown)
@ (nil) (unknown)

@lukeyeager
Member

Hi, @nullterminated. The standard models were designed to fit on GPUs with 4GB of memory or more - that is why you're running out of memory. If you decrease the batch size, you should be able to run just about any network you want with 3GB of GPU memory. It will just take a bit longer.
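
For reference, here is a minimal sketch of one way to lower the batch size by editing the network description programmatically (the file names, the value of 50, and the assumption that the data layers use Caffe's newer `layer { type: "Data" }` format are placeholders; in DIGITS the batch size can also simply be changed in the web form):

```python
# Sketch: lower the batch size in a Caffe train_val.prototxt.
# Assumes the newer "layer { type: 'Data' }" format; paths and values are placeholders.
from google.protobuf import text_format
from caffe.proto import caffe_pb2

net = caffe_pb2.NetParameter()
with open('train_val.prototxt') as f:
    text_format.Merge(f.read(), net)

for layer in net.layer:
    if layer.type == 'Data':
        layer.data_param.batch_size = 50  # pick a value that fits in GPU memory

with open('train_val_smaller_batch.prototxt', 'w') as f:
    f.write(text_format.MessageToString(net))
```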

@lukeyeager
Member

I've added a feature that should help avoid these issues in the future. I'm going to go ahead and close this issue, trusting that changing the batch size fixes your problem for now. Please re-open it if you still have an issue here.

@nullterminated
Author

That does the trick. It appears the default batch size is around 100; I changed it to 50 and now it is training without memory issues. Thanks!

@EnQing626

I hit the same issue even after changing the batch size. Then I switched the mode from GPU to CPU and it worked. In my case, the problem is that my GPU can't hold the parameters of my net.

@klaimane

Same problem...
I set the batch size to 1 manually, but it didn't work.
Are there any other solutions?

Hardware: Tesla K80 (#0)
Memory: 7.56 GB / 11.2 GB (67.2%)
GPU Utilization: 98%
Temperature: 41 °C

Process #159052
CPU Utilization: 113.0%
Memory: 1.4 GB (0.6%)

from the output.log:

I1110 10:41:47.261068 158591 net.cpp:761] Ignoring source layer upscore_21classes
I1110 10:41:47.261715 158591 caffe.cpp:251] Starting Optimization
I1110 10:41:47.261730 158591 solver.cpp:279] Solving
I1110 10:41:47.261734 158591 solver.cpp:280] Learning Rate Policy: step
I1110 10:41:47.263772 158591 solver.cpp:337] Iteration 0, Testing net (#0)
I1110 10:41:55.251052 158591 solver.cpp:404] Test net output #0: accuracy = 0
I1110 10:41:55.251092 158591 solver.cpp:404] Test net output #1: loss = 3.04452 (* 1 = 3.04452 loss)
I1110 10:41:55.762611 158591 solver.cpp:228] Iteration 0, loss = 3.04452
I1110 10:41:55.762641 158591 solver.cpp:244] Train net output #0: loss = 3.04452 (* 1 = 3.04452 loss)
I1110 10:41:55.762686 158591 sgd_solver.cpp:106] Iteration 0, lr = 0.0001
I1110 10:42:03.353044 158591 solver.cpp:228] Iteration 4, loss = 2.76947
I1110 10:42:03.353076 158591 solver.cpp:244] Train net output #0: loss = 2.76947 (* 1 = 2.76947 loss)
I1110 10:42:03.353085 158591 sgd_solver.cpp:106] Iteration 4, lr = 0.0001
I1110 10:42:08.067147 158591 solver.cpp:228] Iteration 8, loss = 2.11253
I1110 10:42:08.067178 158591 solver.cpp:244] Train net output #0: loss = 2.11253 (* 1 = 2.11253 loss)
I1110 10:42:08.067185 158591 sgd_solver.cpp:106] Iteration 8, lr = 0.0001
I1110 10:42:13.411054 158591 solver.cpp:228] Iteration 12, loss = 1.45452
I1110 10:42:13.411083 158591 solver.cpp:244] Train net output #0: loss = 1.45452 (* 1 = 1.45452 loss)
I1110 10:42:13.411092 158591 sgd_solver.cpp:106] Iteration 12, lr = 0.0001
F1110 10:42:19.409853 158591 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x7ffff1b519fd google::LogMessage::Fail()
@ 0x7ffff1b537cc google::LogMessage::SendToLog()
@ 0x7ffff1b515ec google::LogMessage::Flush()
@ 0x7ffff1b540de google::LogMessageFatal::~LogMessageFatal()
@ 0x7ffff742b821 caffe::SyncedMemory::to_gpu()
@ 0x7ffff742ab89 caffe::SyncedMemory::mutable_gpu_data()
@ 0x7ffff72a6642 caffe::Blob<>::mutable_gpu_data()
@ 0x7ffff7409926 caffe::BaseConvolutionLayer<>::backward_gpu_gemm()
@ 0x7ffff745c27b caffe::DeconvolutionLayer<>::Forward_gpu()
@ 0x7ffff73070f5 caffe::Net<>::ForwardFromTo()
@ 0x7ffff7307467 caffe::Net<>::Forward()
@ 0x7ffff741f737 caffe::Solver<>::Step()
@ 0x7ffff741fff9 caffe::Solver<>::Solve()
@ 0x40a47b train()
@ 0x40752c main
@ 0x7fffe9ec1b15 __libc_start_main
@ 0x407d9d (unknown)

@gheinrich
Contributor

What is your input image size?

@klaimane

The biggest one is 20.5 MB (3008x3952).
I'm using variable sizes.

@gheinrich
Contributor

Those are pretty large images, and fully convolutional networks are quite memory hungry. Note that if you are using images of variable sizes, you need to set the batch size to 1 anyway (this is already set in FCN-AlexNet from the semantic segmentation example).

You could try resizing your images to a smaller size, if that does not destroy too much information. That is the first thing I would try.
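
As a rough illustration, here is a minimal offline resize sketch with Pillow (the file names and the scale factor are placeholders; note that segmentation label maps should use nearest-neighbour interpolation so class indices are not blended):

```python
# Sketch: shrink an image/label pair before building the dataset.
from PIL import Image

scale = 4  # e.g. 3008x3952 -> 752x988

image = Image.open('image.png')
label = Image.open('label.png')

new_size = (image.width // scale, image.height // scale)
image.resize(new_size, Image.BILINEAR).save('image_small.png')
# Nearest-neighbour for the label map so class indices stay intact.
label.resize(new_size, Image.NEAREST).save('label_small.png')
```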

Another solution is to take random crops from the bigger images. You need the labels to be cropped in the same way, though, so the usual crop parameter in Caffe's data layer cannot be used. A Python layer would be suitable to perform the cropping in the context of a quick experiment.
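
For example, a minimal sketch of such a Python layer (the class name, module, and crop size are placeholders, and it assumes batch size 1 with images at least as large as the crop):

```python
import caffe
import numpy as np

class RandomCropLayer(caffe.Layer):
    """Crop the same random window from an image blob and its label blob."""

    def setup(self, bottom, top):
        if len(bottom) != 2 or len(top) != 2:
            raise Exception('Expects two bottoms (image, label) and two tops.')
        # Placeholder crop size; could instead be parsed from self.param_str.
        self.crop_h, self.crop_w = 512, 512

    def reshape(self, bottom, top):
        n, c = bottom[0].data.shape[:2]
        top[0].reshape(n, c, self.crop_h, self.crop_w)
        top[1].reshape(n, bottom[1].data.shape[1], self.crop_h, self.crop_w)

    def forward(self, bottom, top):
        h, w = bottom[0].data.shape[2:]
        y = np.random.randint(0, h - self.crop_h + 1)
        x = np.random.randint(0, w - self.crop_w + 1)
        # Apply the same window to image and label so they stay aligned.
        top[0].data[...] = bottom[0].data[:, :, y:y + self.crop_h, x:x + self.crop_w]
        top[1].data[...] = bottom[1].data[:, :, y:y + self.crop_h, x:x + self.crop_w]

    def backward(self, top, propagate_down, bottom):
        pass  # data augmentation only; no gradient to propagate
```

It would be wired in between the data layer and the rest of the network with a layer of type "Python", using python_param to point at the module and the class name above.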

Another solution would be to increase the stride of the first convolutional layer to reduce the size of its output feature map. However you then need to make corresponding changes in the deconvolutional layer of the network and you need to calculate the new offset to apply in the final Crop layer, which isn't difficult but is a bit tedious.

@klaimane

Thanks! I will first try tiling the data set into smaller FOVs.

@klaimane

Smaller tiles solved the issue.

@gheinrich
Contributor

Note that during inference you should be able to use the original image size (as demonstrated in the binary segmentation example) - up to a limit of course: inference needs about a third of the GPU memory required for training.

@klaimane

Thanks! Will keep that in mind.

@MRCSTJM

MRCSTJM commented Apr 15, 2018

I want to train the GoogLeNet model on 1024 x 1024 images, but it runs out of memory. If I resize the images to 800 x 800 with a batch size of 10, it works, but the accuracy is only about 80%. If I resize them to 680 x 680 with a batch size of 20, it still works and the accuracy reaches about 90%. It seems the batch size influences the accuracy. Is that right?
