train_net.bin crash when solving with own data (out of memory) #241
Comments
Have you looked at #183? All the problems like this we've seen so far have been fixed by either
Hi, I got past the problem by trial and error. I found that if I set:

then the solver will run. However, this seems to be a sweet spot. If I set any other value, then either I get the error above, or the solver runs but the loss rockets straight up to 87.3365 and stays there. In fact, even with batchsize: 4, the loss will often go to 87.3365, and I usually have to perform several runs before it settles down. Example:

When this happens, I restart and sometimes things settle down (seemingly randomly):

Is this normal?

By the way, I'm using Mac OS X 10.8.5 with the latest CUDA driver 5.5.47 and GPU driver 8.16.76 310.40.00.20f04. I have a MacBook Pro 15" Retina with 16 GB of RAM and an NVIDIA GeForce GT 650M with 1 GB of VRAM.

Thanks
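For reference: in the Caffe prototxt format of this era, the images-per-batch setting is the batchsize field of the data layer in the train and val net definitions; nothing in _solver.prototxt controls it. Below is a minimal sketch of the relevant fragment; the source path and the other field values are purely illustrative rather than taken from this issue, so check them against the imagenet example prototxts in your checkout.

layers {
  layer {
    name: "data"
    type: "data"
    source: "imagenet-train-leveldb"
    batchsize: 4
    cropsize: 227
    mirror: true
  }
  top: "data"
  top: "label"
}

The train and val prototxts each have their own data layer, so if a testing net runs alongside training, its batchsize has to be lowered separately.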
Training in this way isn't feasible: the loss is diverging due to the variance of the gradient. The variance is too high with a minibatch size of only four, since the expectation is taken over only those four instances at a time. This could be somewhat tempered by lowering the learning rate, but training would never make progress at such a rate. Training the ImageNet model requires ~3.5 GB of video memory (or 3 GB if a concurrent testing net isn't used). If at all possible, you could try scaling your data to smaller sizes so that more will fit in memory and you can increase the minibatch size. For further understanding of why the variance matters and what the minibatch size has to do with it, read [1].

[1] L. Bottou. Stochastic gradient tricks. In G. Montavon, G. B. Orr, and K.-R. Müller, editors, Neural Networks, Tricks of the Trade, Reloaded, Lecture Notes in Computer Science (LNCS 7700), pages 430–445. Springer, 2012.
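As a rough sketch of why the batch size matters here, assume the per-example gradients are i.i.d. with variance \sigma^2 (a standard simplification, not something measured on this model), where \ell is the per-example loss, \theta the parameters, and B the minibatch size. The minibatch gradient estimate averages B of them:

\hat{g}_B = \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta \ell(x_i; \theta), \qquad \operatorname{Var}\left[\hat{g}_B\right] = \frac{\sigma^2}{B}

so, for example, going from a batch of 256 down to a batch of 4 multiplies the gradient variance by 64, which is consistent with the divergence described above.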
Hi,
This issue is similar to #26, but because it happens during solving, the suggested fix doesn't work.
When I use my own data, train_net.bin crashes with the cuda_malloc error during solving. Since issue #26 indicates the GPU is running out of memory and suggests reducing the number of images per batch, I tried this by changing my _val.prototxt and _train.prototxt, to no avail. As the crash occurs during solving, I also varied the parameters in _solver.prototxt, but the same failure occurs in exactly the same place, suggesting that these changes have no effect.
I0319 01:17:16.370738 2094641536 net.cpp:162] Collecting Learning Rate and Weight Decay.
I0319 01:17:16.370748 2094641536 net.cpp:156] Network initialization done.
I0319 01:17:16.370816 2094641536 solver.cpp:36] Solver scaffolding done.
I0319 01:17:16.370861 2094641536 solver.cpp:47] Solving CaffeNet
F0319 01:17:18.464231 2094641536 syncedmem.cpp:45] Check failed: (cudaMalloc(&gpu_ptr_, size_)) == cudaSuccess (2 vs. 0)
*** Check failure stack trace: ***
@ 0x10a6fea7e google::LogMessage::Fail()
@ 0x10a6fe325 google::LogMessage::SendToLog()
@ 0x10a6fe7b7 google::LogMessage::Flush()
@ 0x10a7011dd google::LogMessageFatal::~LogMessageFatal()
@ 0x10a6fed73 google::LogMessageFatal::~LogMessageFatal()
@ 0x10498fe20 caffe::SyncedMemory::to_gpu()
@ 0x10498fb9e caffe::SyncedMemory::mutable_gpu_data()
@ 0x104960d57 caffe::Blob<>::mutable_gpu_data()
@ 0x10496af49 caffe::ConvolutionLayer<>::Forward_gpu()
@ 0x104981d93 caffe::Net<>::ForwardPrefilled()
@ 0x10498b146 caffe::Solver<>::Solve()
@ 0x10495104b main
@ 0x7fff8c05f7e1 start
@ 0x2 (unknown)
Thanks
John