
train_net.bin crash when solving with own data (out of memory) #241

Closed
johnswan opened this issue Mar 19, 2014 · 3 comments


@johnswan

Hi,

This issue is similar to #26, but because it happens during solving, the suggested fix doesn't work.

When I use my own data, train_net.bin crashes with a cudaMalloc error during solving. Since issue #26 points to the GPU running out of memory and suggests reducing the number of images per batch, I tried that by changing my _val.prototxt and _train.prototxt, to no avail. As the crash occurs during solving, I also varied the parameters in _solver.prototxt. The crash happens in exactly the same place every time, suggesting that the changes have no effect:

I0319 01:17:16.370738 2094641536 net.cpp:162] Collecting Learning Rate and Weight Decay.
I0319 01:17:16.370748 2094641536 net.cpp:156] Network initialization done.
I0319 01:17:16.370816 2094641536 solver.cpp:36] Solver scaffolding done.
I0319 01:17:16.370861 2094641536 solver.cpp:47] Solving CaffeNet
F0319 01:17:18.464231 2094641536 syncedmem.cpp:45] Check failed: (cudaMalloc(&gpu_ptr_, size_)) == cudaSuccess (2 vs. 0)
*** Check failure stack trace: ***
@ 0x10a6fea7e google::LogMessage::Fail()
@ 0x10a6fe325 google::LogMessage::SendToLog()
@ 0x10a6fe7b7 google::LogMessage::Flush()
@ 0x10a7011dd google::LogMessageFatal::~LogMessageFatal()
@ 0x10a6fed73 google::LogMessageFatal::~LogMessageFatal()
@ 0x10498fe20 caffe::SyncedMemory::to_gpu()
@ 0x10498fb9e caffe::SyncedMemory::mutable_gpu_data()
@ 0x104960d57 caffe::Blob<>::mutable_gpu_data()
@ 0x10496af49 caffe::ConvolutionLayer<>::Forward_gpu()
@ 0x104981d93 caffe::Net<>::ForwardPrefilled()
@ 0x10498b146 caffe::Solver<>::Solve()
@ 0x10495104b main
@ 0x7fff8c05f7e1 start
@ 0x2 (unknown)
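
For reference, the batch size I've been reducing is set in the data layer of my _train.prototxt and _val.prototxt, roughly like this (old-style layer format; the source path, mean file, and values below are just placeholders, not my actual settings):

layers {
  layer {
    name: "data"
    type: "data"
    source: "my_train_leveldb"        # placeholder path to the leveldb
    meanfile: "my_mean.binaryproto"   # placeholder mean file
    batchsize: 64                     # the value I keep reducing
    cropsize: 227
    mirror: true
  }
  top: "data"
  top: "label"
}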

Thanks
John

@shelhamer
Member

Have you looked at #183? All the problems like this that we've seen so far have been fixed by one of the following:

  • reducing memory usage, like you have tried
  • updating the driver
  • otherwise re-configuring the GPU (fan speed, power issues, etc.)

@johnswan
Author

Hi,

I got past the problem by trial and error.

I found that if I set:
batchsize: 4

then the solver will run. However, this seems to be a sweet spot. If I set any other value, then either I get the error above, or the solver will run but the loss will rocket straight up to 87.3365 and stay there.

In fact, even with batchsize: 4, the loss will often go to 87.3365. I usually have to perform several runs before it settles down.

Example:
I0320 01:10:28.377015 2088923520 solver.cpp:65] Iteration 160, loss = 1.96365
I0320 01:10:35.063892 2088923520 solver.cpp:207] Iteration 180, lr = 0.01
I0320 01:10:35.064174 2088923520 solver.cpp:65] Iteration 180, loss = 5.61734
I0320 01:10:41.647512 2088923520 solver.cpp:207] Iteration 200, lr = 0.01
I0320 01:10:41.647795 2088923520 solver.cpp:65] Iteration 200, loss = 87.3365
I0320 01:10:48.203501 2088923520 solver.cpp:207] Iteration 220, lr = 0.01
I0320 01:10:48.203774 2088923520 solver.cpp:65] Iteration 220, loss = 87.3365

When this happens, I restart and sometimes things will settle down (seemingly randomly):
I0320 01:46:00.294448 2088923520 solver.cpp:65] Iteration 1980, loss = 0.53467
I0320 01:46:04.025198 2088923520 solver.cpp:207] Iteration 2000, lr = 0.01
I0320 01:46:04.025532 2088923520 solver.cpp:65] Iteration 2000, loss = 0.872246
I0320 01:46:04.025542 2088923520 solver.cpp:87] Iteration 2000, Testing net
I0320 01:46:28.027384 2088923520 solver.cpp:114] Test score #0: 0.5735
I0320 01:46:28.027421 2088923520 solver.cpp:114] Test score #1: 1.32002

Is this normal?

By the way, I'm using Mac OS X 10.8.5 with the latest CUDA driver 5.5.47 and GPU driver 8.16.76 310.40.00.20f04. I have a MacBook Pro 15" Retina with 16 GB RAM and an NVIDIA GeForce GT 650M with 1 GB of VRAM.

Thanks
John

@shelhamer
Member

Training in this way isn't feasible: the loss is diverging due to the variance of the gradient. The variance is too high with a minibatch size of only four, since the expectation is taken over only those four instances at a time. This could be somewhat tempered by lowering the learning rate, but training would never make progress at such a rate. Training the ImageNet model requires ~3.5 GB of video memory (or 3 GB if a concurrent testing net isn't used).
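
For concreteness, both of those knobs live in the solver prototxt; a rough sketch (the net file name and every value here are only illustrative, not recommended settings):

train_net: "my_train.prototxt"   # placeholder net definition
# leaving out test_net / test_iter / test_interval skips the concurrent
# testing net, which is what brings the requirement down toward 3 GB
base_lr: 0.001                   # lowered from 0.01; tempers the variance, but training would crawl
lr_policy: "step"
gamma: 0.1
stepsize: 100000
momentum: 0.9
weight_decay: 0.0005
display: 20
max_iter: 450000
snapshot: 10000
snapshot_prefix: "my_caffenet"   # placeholder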

If at all possible, you could try scaling your data to smaller sizes so that more will fit in memory and you can increase the minibatch size.
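
For example (numbers purely illustrative, and assuming the leveldb is rebuilt from images resized to a smaller resolution), the relevant data-layer fields would be something like:

# inside the data layer of the train/val prototxt:
cropsize: 112     # smaller crops shrink every blob downstream
batchsize: 32     # so a larger minibatch can fit on a 1 GB card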

For further understanding of why the variance matters and what the minibatch size has to do with it, read [1].

[1] L. Bottou. Stochastic gradient descent tricks. In G. Montavon, G. B. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, Reloaded, Lecture Notes in Computer Science (LNCS 7700), pages 430–445. Springer, 2012.

@shelhamer changed the title from "train_net.bin crash when Solving with own data" to "train_net.bin crash when solving with own data (out of memory)" on Apr 19, 2014