Check failed: error == cudaSuccess (77 vs. 0) #598

Closed
gaozunqi opened this issue Feb 24, 2016 · 15 comments

Comments

@gaozunqi

Everything is OK, but this happened... (the message below is copied from the 'caffe_output.log' file):

I0224 18:20:47.449920 6492 solver.cpp:314] Iteration 0, Testing net (#0)
F0224 18:20:47.715102 6492 cudnn_conv_layer.cu:56] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7f18f808cea4 (unknown)
@ 0x7f18f808cdeb (unknown)
@ 0x7f18f808c7bf (unknown)
@ 0x7f18f808fa35 (unknown)
@ 0x7f18f8a1c119 caffe::CuDNNConvolutionLayer<>::Forward_gpu()
@ 0x7f18f88fc002 caffe::Net<>::ForwardFromTo()
@ 0x7f18f88fc127 caffe::Net<>::ForwardPrefilled()
@ 0x7f18f89fa923 caffe::Solver<>::Test()
@ 0x7f18f89fb0a6 caffe::Solver<>::TestAll()
@ 0x7f18f8a0319f caffe::Solver<>::Step()
@ 0x7f18f8a03e7e caffe::Solver<>::Solve()
@ 0x408602 train()
@ 0x4052eb main
@ 0x7f18f758ca40 (unknown)
@ 0x4059b9 _start
@ (nil) (unknown)

@gheinrich
Contributor

Hi @gaozunqi, can you check the versions of your tools:

$ dpkg -s cuda
Package: cuda
...
Version: 7.5-18
Depends: cuda-7-5 (= 7.5-18)

$ dpkg -s libcudnn4
Package: libcudnn4
...
Version: 4.0.7

$ dpkg -s caffe-nv
Package: caffe-nv
...
Version: 0.14.2-1

If you have those versions, can you create an issue on NVIDIA/Caffe with details on your network topology and system (GPU, etc.)?

@gaozunqi
Author

@gheinrich

go@go-Lenovo-Erazer-Z400:/usr/local/lib$ dpkg -s cuda
Package: cuda
Status: install ok installed
Priority: optional
Section: devel
Installed-Size: 8
Maintainer: cudatools <[email protected]>
Architecture: amd64
Version: 7.0-28
Depends: cuda-7-0 (= 7.0-28)

When I run 'dpkg -s libcudnn4' and 'dpkg -s caffe-nv', it says those packages are not installed (I don't know why; maybe because I did not install them from a *.deb package?).
In fact, my cuDNN version is "cudnn-7.0-linux-x64-v3.0-rc" and I installed it with commands like these:

sudo cp lib* /usr/local/cuda/lib64/
sudo cp cudnn.h /usr/local/cuda/include/
cd /usr/local/cuda/lib64/
sudo ln -s libcudnn.so.7.0.58 libcudnn.so.7.0
sudo ln -s libcudnn.so.7.0 libcudnn.so

And my Caffe fork is the master version.

@gaozunqi
Author

gaozunqi commented Mar 1, 2016

@gheinrich hello, I have re-installed DIGITS and caffe-nv following the documentation in DIGITS, and it works! Thank you.
But another problem is:
Test net output #0: accuracy = 0.0435789
Test net output #1: loss = -nan (* 1 = -nan loss)

The accuracy has a value, but the loss is -nan.
Have you met this problem?

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@lukeyeager hi, could you help me with the problem above?
Test net output #0: accuracy = 0.0435789
Test net output #1: loss = -nan (* 1 = -nan loss)

@gheinrich
Contributor

Hi @gaozunqi, what type of neural network are you training? Have you enabled mean image subtraction?

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

I want to train the task in "http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html": the Flickr Style data with the network from the official Caffe example.
I only changed the network's ImageData layer to the Data type so that it can read data from the LMDB in DIGITS. The official network structure is in "caffe_home/models/finetune_flickr_style/train_val.prototxt", and I selected image mean subtraction in DIGITS.
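
For reference, a minimal pycaffe NetSpec sketch of what such a swapped-in Data layer might look like; DIGITS generates the actual layer itself, and 'train_db' and 'mean.binaryproto' are placeholder paths, not the real ones from this job:

import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
# Data layer reading from an LMDB, with mean image subtraction enabled
n.data, n.label = L.Data(
    batch_size=64,
    backend=P.Data.LMDB,
    source='train_db',                                   # placeholder LMDB path
    transform_param=dict(mean_file='mean.binaryproto'),  # placeholder mean file
    ntop=2)
print(n.to_proto())  # prints the equivalent prototxt for the layer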

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@gheinrich

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@gheinrich I also trained an MNIST model, and it's OK; no problem happens.

@gheinrich
Contributor

@gaozunqi I notice this example is using a lower initial learning rate (0.001) than the default in DIGITS (0.01). You might want to try that learning rate, or even lower if the loss keeps diverging.
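
For reference, a minimal sketch of lowering base_lr with Caffe's protobuf bindings, assuming a local solver.prototxt (in DIGITS you would simply change the learning rate field in the model form; the file name here is a placeholder):

from caffe.proto import caffe_pb2
from google.protobuf import text_format

solver = caffe_pb2.SolverParameter()
with open('solver.prototxt') as f:   # placeholder solver definition
    text_format.Merge(f.read(), solver)

solver.base_lr = 0.001               # the lower initial learning rate suggested above

with open('solver.prototxt', 'w') as f:
    f.write(text_format.MessageToString(solver))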

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@gheinrich Yeah! You are right! Thanks.
I turned the lr to 0.001, and it works. Could you tell me why? And sometimes when I use the AdaGrad method to train the model, the loss increases instead of decreasing; do you know the reason?

@gheinrich
Contributor

Test net output #1: loss = -nan (* 1 = -nan loss) means the loss function has diverged (nan means 'not a number'; it is what the computer reports once the loss has blown up past anything it can represent).

During learning, at the end of back propagation you know the direction in which you need to move in order to reduce the loss. However, since you are doing batched learning, you don't want to move exactly to that target, since the target might suit the particular batch you're learning on but not the others. So you want to make a small step in the right direction so that, after learning from many batches, you get closer to a solution that fits the entire dataset.

The learning rate is a measure of how large a step you're making. If the learning rate is too high, you will make large steps which might take you further from the optimal target.

This is similar to playing golf: if you're close to the hole but push the ball too hard, the ball might end up further from the hole than it was initially. If you keep pushing too hard you will end up infinitely far away from the hole.
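
A toy sketch (plain Python, not Caffe code) of the step-size argument above: minimizing f(x) = x^2 with gradient descent converges for a small learning rate, but a learning rate that is too large overshoots the minimum and the loss grows at every step.

def gradient_descent(lr, steps=10, x=1.0):
    # gradient of x**2 is 2*x; take 'steps' gradient steps of size lr
    for _ in range(steps):
        x = x - lr * 2.0 * x
    return x * x  # final loss

print(gradient_descent(lr=0.1))  # ~0.01: each step shrinks the loss
print(gradient_descent(lr=1.5))  # ~1e6: each step overshoots and the loss grows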

@gheinrich
Contributor

See http://caffe.berkeleyvision.org/tutorial/solver.html for information on the different types of solvers in Caffe.
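
As a rough sketch of one of those solvers, the AdaGrad update scales each weight's step by the history of its squared gradients, so a base learning rate tuned for plain SGD can behave quite differently (and sometimes diverge) with AdaGrad; this is a simplified illustration, not Caffe's implementation:

import numpy as np

def adagrad_update(w, grad, history, lr=0.01, eps=1e-8):
    history += grad ** 2                       # accumulate squared gradients per weight
    w -= lr * grad / (np.sqrt(history) + eps)  # per-weight scaled step
    return w, history

w, history = np.zeros(3), np.zeros(3)
w, history = adagrad_update(w, np.array([0.5, -2.0, 0.1]), history)
print(w)  # on the first update each weight moves by roughly lr, regardless of gradient magnitude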

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@gheinrich thanks for your patience. I understand the theory you explained in paragraphs 2, 3 and 4, but I still don't know why the loss = -nan.
In my opinion, if I raise the lr, even though I move further from the optimal target, the loss should just increase as training goes on; why does it turn into -nan? Or does that mean the loss is very large?

@gheinrich
Contributor

If the loss keeps increasing because you're using too high a learning rate, eventually it will become larger than any number that can be represented by a float. At that point, it will show nan.
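
A rough illustration in Python/NumPy (not Caffe code): Caffe stores the loss in 32-bit floats, so a loss that keeps growing eventually exceeds the largest representable float32 (about 3.4e38), overflows to inf, and further arithmetic on inf (for example inf - inf inside a log/softmax term) produces nan.

import numpy as np

loss = np.float32(1.0)
with np.errstate(over='ignore', invalid='ignore'):
    for step in range(100):
        loss = loss * np.float32(10.0)  # pretend each step makes the loss 10x worse
        if np.isinf(loss):
            print("step %d: loss overflowed to %s" % (step, loss))
            break
    print(loss - loss)  # inf - inf = nan, i.e. 'not a number'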

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@gheinrich OK, your explanation is good. Thank you! :)
