Check failed: error == cudaSuccess (77 vs. 0) #598

Closed
gaozunqi opened this issue Feb 24, 2016 · 15 comments

Comments

@gaozunqi

Everything is OK, but this happened... (the message below is copied from the 'caffe_output.log' file):

I0224 18:20:47.449920 6492 solver.cpp:314] Iteration 0, Testing net (#0)
F0224 18:20:47.715102 6492 cudnn_conv_layer.cu:56] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7f18f808cea4 (unknown)
@ 0x7f18f808cdeb (unknown)
@ 0x7f18f808c7bf (unknown)
@ 0x7f18f808fa35 (unknown)
@ 0x7f18f8a1c119 caffe::CuDNNConvolutionLayer<>::Forward_gpu()
@ 0x7f18f88fc002 caffe::Net<>::ForwardFromTo()
@ 0x7f18f88fc127 caffe::Net<>::ForwardPrefilled()
@ 0x7f18f89fa923 caffe::Solver<>::Test()
@ 0x7f18f89fb0a6 caffe::Solver<>::TestAll()
@ 0x7f18f8a0319f caffe::Solver<>::Step()
@ 0x7f18f8a03e7e caffe::Solver<>::Solve()
@ 0x408602 train()
@ 0x4052eb main
@ 0x7f18f758ca40 (unknown)
@ 0x4059b9 _start
@ (nil) (unknown)

@gheinrich
Contributor

Hi @gaozunqi, can you check the versions of your tools:

$ dpkg -s cuda
Package: cuda
...
Version: 7.5-18
Depends: cuda-7-5 (= 7.5-18)

$ dpkg -s libcudnn4
Package: libcudnn4
...
Version: 4.0.7

$ dpkg -s caffe-nv
Package: caffe-nv
...
Version: 0.14.2-1

If you have those versions, can you create an issue on NVIDIA/Caffe with details on your network topology and system (GPU, etc.)?

@gaozunqi
Author

@gheinrich

go@go-Lenovo-Erazer-Z400:/usr/local/lib$ dpkg -s cuda
Package: cuda
Status: install ok installed
Priority: optional
Section: devel
Installed-Size: 8
Maintainer: cudatools <[email protected]>
Architecture: amd64
Version: 7.0-28
Depends: cuda-7-0 (= 7.0-28)

When I run 'dpkg -s libcudnn4' and 'dpkg -s caffe-nv', it says those packages are not installed (I don't know why; maybe because I did not install them from a *.deb package?).
In fact, my cuDNN version is "cudnn-7.0-linux-x64-v3.0-rc" and I installed it with commands like these:

sudo cp lib* /usr/local/cuda/lib64/
sudo cp cudnn.h /usr/local/cuda/include/
cd /usr/local/cuda/lib64/
sudo ln -s libcudnn.so.7.0.58 libcudnn.so.7.0
sudo ln -s libcudnn.so.7.0 libcudnn.so

And my Caffe fork is the master version.

@gaozunqi
Author

gaozunqi commented Mar 1, 2016

@gheinrich hello, I have re-installed DIGITS and caffe-nv following the documentation in DIGITS, and it works! Thank you.
But another problem is:
Test net output #0: accuracy = 0.0435789
Test net output #1: loss = -nan (* 1 = -nan loss)

The accuracy has a value, but the loss is -nan.
Have you met this problem?

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@lukeyeager hi, could you help me with the problem above?
Test net output #0: accuracy = 0.0435789
Test net output #1: loss = -nan (* 1 = -nan loss)

@gheinrich
Contributor

Hi @gaozunqi, what type of neural network are you training? Have you enabled mean image subtraction?

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

I want to train the task in "http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html": the Flickr Style data with the network from the official Caffe example.
I only changed the network's ImageData layer to the Data type so that it can read data from the LMDB in DIGITS. The official network structure is in "caffe_home/models/finetune_flickr_style/train_val.prototxt", and I selected image mean subtraction in DIGITS.
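
For reference, a minimal pycaffe NetSpec sketch of what such a swapped-in Data layer might look like; DIGITS generates the actual layer itself, and 'train_db' and 'mean.binaryproto' are placeholder paths, not the real ones from this job:

import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
# Data layer reading from an LMDB, with mean image subtraction enabled
n.data, n.label = L.Data(
    batch_size=64,
    backend=P.Data.LMDB,
    source='train_db',                                   # placeholder LMDB path
    transform_param=dict(mean_file='mean.binaryproto'),  # placeholder mean file
    ntop=2)
print(n.to_proto())  # prints the equivalent prototxt for the layer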

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@gheinrich

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@gheinrich I also trained an MNIST model, and it's OK; no problem happens.

@gheinrich
Contributor

@gaozunqi I notice this example is using a lower initial learning rate (0.001) than the default in DIGITS (0.01). You might want to try that learning rate, or even lower if the loss keeps diverging.
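
For reference, a minimal sketch of lowering base_lr with Caffe's protobuf bindings, assuming a local solver.prototxt (in DIGITS you would simply change the learning rate field in the model form; the file name here is a placeholder):

from caffe.proto import caffe_pb2
from google.protobuf import text_format

solver = caffe_pb2.SolverParameter()
with open('solver.prototxt') as f:   # placeholder solver definition
    text_format.Merge(f.read(), solver)

solver.base_lr = 0.001               # the lower initial learning rate suggested above

with open('solver.prototxt', 'w') as f:
    f.write(text_format.MessageToString(solver))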

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@gheinrich Yeah! You are right! Thanks.
I turned the lr to 0.001, and it works. Could you tell me why? And sometimes when I use the AdaGrad method to train the model, the loss increases instead of decreasing; do you know the reason?

@gheinrich
Contributor

Test net output #1: loss = -nan (* 1 = -nan loss) means the loss function has diverged (nan means 'not a number'; it is what the computer reports once the loss has blown up past anything it can represent).

During learning, at the end of back propagation you know the direction in which you need to move in order to reduce the loss. However, since you are doing batched learning, you don't want to move exactly to that target, since the target might suit the particular batch you're learning on but not the others. So you want to make a small step in the right direction so that, after learning from many batches, you get closer to a solution that fits the entire dataset.

The learning rate is a measure of how large a step you're making. If the learning rate is too high, you will make large steps which might take you further from the optimal target.

This is similar to playing golf: if you're close to the hole but push the ball too hard, the ball might end up further from the hole than it was initially. If you keep pushing too hard you will end up infinitely far away from the hole.
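
A toy sketch (plain Python, not Caffe code) of the step-size argument above: minimizing f(x) = x^2 with gradient descent converges for a small learning rate, but a learning rate that is too large overshoots the minimum and the loss grows at every step.

def gradient_descent(lr, steps=10, x=1.0):
    # gradient of x**2 is 2*x; take 'steps' gradient steps of size lr
    for _ in range(steps):
        x = x - lr * 2.0 * x
    return x * x  # final loss

print(gradient_descent(lr=0.1))  # ~0.01: each step shrinks the loss
print(gradient_descent(lr=1.5))  # ~1e6: each step overshoots and the loss grows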

@gheinrich
Contributor

See http://caffe.berkeleyvision.org/tutorial/solver.html for information on the different types of solvers in Caffe.
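
As a rough sketch of one of those solvers, the AdaGrad update scales each weight's step by the history of its squared gradients, so a base learning rate tuned for plain SGD can behave quite differently (and sometimes diverge) with AdaGrad; this is a simplified illustration, not Caffe's implementation:

import numpy as np

def adagrad_update(w, grad, history, lr=0.01, eps=1e-8):
    history += grad ** 2                       # accumulate squared gradients per weight
    w -= lr * grad / (np.sqrt(history) + eps)  # per-weight scaled step
    return w, history

w, history = np.zeros(3), np.zeros(3)
w, history = adagrad_update(w, np.array([0.5, -2.0, 0.1]), history)
print(w)  # on the first update each weight moves by roughly lr, regardless of gradient magnitude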

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@gheinrich thanks for your patience. I understand the theory you explained in paragraphs 2, 3 and 4, but I still don't know why the loss = -nan.
In my opinion, if I raise the lr, even though I move further from the optimal target, the loss should just increase as training goes on; why does it turn into -nan? Or does that mean the loss is very large?

@gheinrich
Contributor

If the loss keeps increasing because you're using too high a learning rate, eventually it will become larger than any number that can be represented by a float. At that point, it will show nan.
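
A rough illustration in Python/NumPy (not Caffe code): Caffe stores the loss in 32-bit floats, so a loss that keeps growing eventually exceeds the largest representable float32 (about 3.4e38), overflows to inf, and further arithmetic on inf (for example inf - inf inside a log/softmax term) produces nan.

import numpy as np

loss = np.float32(1.0)
with np.errstate(over='ignore', invalid='ignore'):
    for step in range(100):
        loss = loss * np.float32(10.0)  # pretend each step makes the loss 10x worse
        if np.isinf(loss):
            print("step %d: loss overflowed to %s" % (step, loss))
            break
    print(loss - loss)  # inf - inf = nan, i.e. 'not a number'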

@gaozunqi
Author

gaozunqi commented Mar 3, 2016

@gheinrich OK, your explanation is good. Thank you! :)
