Gradient tests fail on Samsung Chromebook 2 #28

Open
psyhtest opened this issue Apr 29, 2016 · 9 comments

@psyhtest

psyhtest commented Apr 29, 2016

@naibaf7

With USE_GREENTEA := 1 in the build configuration, I see lots of Caffe test failures on Samsung Chromebook 2 (ARM Cortex-A15 CPU, ARM Mali-T628 GPU) with this fork (latest commit 04503ee).

What they all seem to have in common is the word "Gradient" in their name. For example:

$ ./build/test/test_all.testbin --gtest_filter=*TestBNLLGradient
Setting to use device 0
Note: Google Test filter = *TestBNLLGradient
[==========] Running 4 tests from 4 test cases.
[----------] Global test environment set-up.
[----------] 1 test from NeuronLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] NeuronLayerTest/0.TestBNLLGradient
[       OK ] NeuronLayerTest/0.TestBNLLGradient (16 ms)
[----------] 1 test from NeuronLayerTest/0 (16 ms total)

[----------] 1 test from NeuronLayerTest/1, where TypeParam = caffe::CPUDevice<double>
[ RUN      ] NeuronLayerTest/1.TestBNLLGradient
[       OK ] NeuronLayerTest/1.TestBNLLGradient (15 ms)
[----------] 1 test from NeuronLayerTest/1 (16 ms total)

[----------] 1 test from NeuronLayerTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN      ] NeuronLayerTest/2.TestBNLLGradient
[       OK ] NeuronLayerTest/2.TestBNLLGradient (497 ms)
[----------] 1 test from NeuronLayerTest/2 (497 ms total)

[----------] 1 test from NeuronLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] NeuronLayerTest/3.TestBNLLGradient
./include/caffe/test/test_gradient_check_util.hpp:184: Failure
The difference between computed_gradient and estimated_gradient is 1.5616071734673349, which exceeds threshold_ * scale, where
computed_gradient evaluates to 1.5616071734673349,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.015616071734673349.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 1.2703554366494645; objective+ = 3.0355741737564483; objective- = 3.0355741737564483
...
[  FAILED  ] NeuronLayerTest/3.TestBNLLGradient, where TypeParam = caffe::GPUDevice<double> (530 ms)
[----------] 1 test from NeuronLayerTest/3 (531 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test cases ran. (1061 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NeuronLayerTest/3.TestBNLLGradient, where TypeParam = caffe::GPUDevice<double>

The suspicious line is:

computed_gradient evaluates to 1.5616071734673349,
estimated_gradient evaluates to 0,

but sometimes I see the reverse: computed_gradient evaluates to 0, while estimated_gradient evaluates to a non-zero value.

This happens both for float and double tests.

Any ideas?
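
For readers following the log, here is a minimal Python sketch of the comparison the failure message describes. It is not the code from test_gradient_check_util.hpp. The central-difference formula and the scale rule (the larger gradient magnitude, never below 1) are assumptions that happen to reproduce the logged numbers. STEPSIZE is not printed in the log, and THRESHOLD is inferred from the logged ratio, so both are assumptions as well.

```python
def estimated_gradient(objective_plus, objective_minus, stepsize):
    """Central finite difference: (f(x + h) - f(x - h)) / (2h)."""
    return (objective_plus - objective_minus) / (2.0 * stepsize)


def gradient_check(computed, estimated, threshold):
    """Return (passed, difference, tolerance) for a single feature."""
    # Assumed scale rule: relative to the larger gradient magnitude, at least 1.
    scale = max(abs(computed), abs(estimated), 1.0)
    tolerance = threshold * scale
    difference = abs(computed - estimated)
    return difference <= tolerance, difference, tolerance


# Values copied from the failing GPUDevice<double> run above.
objective_plus = 3.0355741737564483
objective_minus = 3.0355741737564483
computed = 1.5616071734673349

STEPSIZE = 1e-2   # assumption: the step size is not shown in the log
THRESHOLD = 1e-2  # assumption: inferred from 0.0156... / 1.5616...

estimated = estimated_gradient(objective_plus, objective_minus, STEPSIZE)
passed, difference, tolerance = gradient_check(computed, estimated, THRESHOLD)
print(estimated, passed, difference, tolerance)
```

Under these assumptions the estimate is exactly 0 because objective+ and objective- are identical in the log, meaning the perturbed input produced the same objective both times. The sketch only restates that arithmetic; it does not say why the objective failed to change.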

@naibaf7
Owner

naibaf7 commented Apr 29, 2016

I'd need to know if there is a pattern on the data index:
(top_id, top_data_id, blob_id, feat_id)=0,0,0,0;
Can you find that out? Or just post some more index + values of the failures.
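
One way to look for such a pattern is to pull the index tuple out of every debug line and tally each field. The following is a rough sketch, assuming the debug lines always have the format seen in this thread. The function names are hypothetical, and the sample lines are copied from failures reported in this thread.

```python
import re
from collections import Counter

# Matches the index tuple printed by the gradient checker's debug line.
DEBUG_RE = re.compile(
    r"debug: \(top_id, top_data_id, blob_id, feat_id\)="
    r"(-?\d+),(-?\d+),(-?\d+),(-?\d+)"
)
FIELDS = ("top_id", "top_data_id", "blob_id", "feat_id")


def failure_indices(log_text):
    """Return one (top_id, top_data_id, blob_id, feat_id) tuple per debug line."""
    return [tuple(int(g) for g in m.groups()) for m in DEBUG_RE.finditer(log_text)]


def tally_by_field(indices):
    """Count how often each value occurs in each of the four index fields."""
    return {name: Counter(idx[i] for idx in indices) for i, name in enumerate(FIELDS)}


sample_log = """\
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 1.2703554366494645; objective+ = 3.0355741737564483; objective- = 3.0355741737564483
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,1,0; feat = 0.97146224975585938; objective+ = -1.53898024559021; objective- = -1.53898024559021
debug: (top_id, top_data_id, blob_id, feat_id)=0,1,1,0; feat = 0.97146224975585938; objective+ = -1.1870282888412476; objective- = -1.1870282888412476
"""

indices = failure_indices(sample_log)
for field, counts in tally_by_field(indices).items():
    print(field, dict(counts))
```

To run it on a full log, replace sample_log with the contents of the log file. Three lines are far too few to show a pattern, so the sample only demonstrates the parsing.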

Since I don't have one of those GPUs myself, it is a bit difficult to track down this problem.
Otherwise, the runtests pass on Intel, AMD and NVIDIA chips.

@295988101

@psyhtest I'm going to run Caffe on ARM, using OpenCL or CUDA. I tried another OpenCL version of Caffe, but in the end I failed because the cross-compilation was too complex. Could you tell me whether you have successfully run Caffe on ARM (with OpenCL or CUDA)? It would be very kind of you to share some details. My mail is [email protected]

Best regards.

@jyegerlehner

jyegerlehner commented May 6, 2016

@psyhtest The example you provide shows that the test passes on the GPU for floats and that the same test fails on the GPU for doubles. Are all the failed tests double precision only, and only on the GPU? If so, this suggests to me that perhaps the Mali GPU does not support double, which is optional under the spec. Search for CL_DEVICE_DOUBLE_FP_CONFIG and clGetDeviceInfo, which is how the GPU indicates whether it supports double.
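
The question of which device and precision combinations fail can be answered from the test log itself. Below is a small Python sketch, assuming the Google Test output format shown earlier in this thread. The function name is hypothetical. Google Test prints each failure once after the test and again in the final summary, so the sketch counts each test name only once.

```python
import re
from collections import Counter

# "[  FAILED  ] <test name>, where TypeParam = <type>" with an optional "(N ms)".
FAILED_RE = re.compile(
    r"^\[\s+FAILED\s+\] (\S+), where TypeParam = (.+?)(?: \(\d+ ms\))?$",
    re.MULTILINE,
)


def failed_by_type_param(log_text):
    """Count distinct failed tests per TypeParam (device and precision)."""
    type_of_test = {}
    for match in FAILED_RE.finditer(log_text):
        # Keyed by test name so the repeated summary line is not counted twice.
        type_of_test[match.group(1)] = match.group(2)
    return Counter(type_of_test.values())


sample_log = """\
[       OK ] NeuronLayerTest/2.TestBNLLGradient (497 ms)
[  FAILED  ] NeuronLayerTest/3.TestBNLLGradient, where TypeParam = caffe::GPUDevice<double> (530 ms)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NeuronLayerTest/3.TestBNLLGradient, where TypeParam = caffe::GPUDevice<double>
"""

for type_param, count in failed_by_type_param(sample_log).items():
    print(type_param, count)
```

If the tally for a full log contained only caffe::GPUDevice<double>, that would fit the missing-double theory; any caffe::GPUDevice<float> entries would argue against it.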

@psyhtest
Author

psyhtest commented May 6, 2016

@naibaf7
I'll attach a full log showing the Gradient-related failures shortly.

@zhenghuitian
Yes, I gave up on AMD's port of Caffe too because it used OpenCL 1.2 and C++ templates in kernels. But I did manage to run Caffe with a couple of patches to clBLAS v2.4. I'm also working on support for CLBlast. Our vision is to create an open framework for optimising CNNs on embedded platforms, which is outlined in our IWOCL abstract. All comments and contributions are welcome!

@jyegerlehner
I strongly suspect that Mali does support double precision, as I was managing the OpenCL compiler team at ARM when it was implemented :). But perhaps I wasn't doing my job properly, and this omission somehow wasn't detected by conformance testing?.. :)

@jyegerlehner

@psyhtest Hah hah OK I guess that rules that out. I thought it was a rather parsimonious theory though.

@295988101

@psyhtest Thank you for your answer. It does help me a lot. I am preparing to use CLBlast instead of clBLAS because my ARM device does not have an AMD GPU. I am reading your IWOCL abstract. Thank you again.

@psyhtest
Author

@naibaf7

Please see the full (compressed) log from running the following command:

LD_LIBRARY_PATH=/data/install/lib-openblas-v0.2.18/lib:$LD_LIBRARY_PATH \
/data/caffe-naibaf7/build/test/test_all.testbin --gtest_filter=*Gradient* \
> /chronos_downloads/caffe-naibaf7.6c0fbdc.Gradient.log 2>&1
...
[==========] 494 tests from 138 test cases ran. (39528323 ms total)
[  PASSED  ] 384 tests.
[  FAILED  ] 110 tests, listed below:
...

Also attached is my Makefile.config.

@psyhtest
Author

I also observed a similar failure on Odroid-XU3 (similar chip to Chromebook 2 but with the Mali driver v4.0, rather than v6.0):

[ RUN      ] DeconvolutionLayerTest/2.TestGradient
./include/caffe/test/test_gradient_check_util.hpp:184: Failure
The difference between computed_gradient and estimated_gradient is 2, which exceeds threshold_ * scale, where
computed_gradient evaluates to 2,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0020000000949949026.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,1,0; feat = 0.97146224975585938; objective+ = -1.53898024559021; objective- = -1.53898024559021
./include/caffe/test/test_gradient_check_util.hpp:184: Failure
The difference between computed_gradient and estimated_gradient is 2, which exceeds threshold_ * scale, where
computed_gradient evaluates to 2,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0020000000949949026.
debug: (top_id, top_data_id, blob_id, feat_id)=0,1,1,0; feat = 0.97146224975585938; objective+ = -1.1870282888412476; objective- = -1.1870282888412476

It is, however, much more intermittent. (I have not been able to reproduce it since.)

@naibaf7
Owner

naibaf7 commented May 11, 2016

@psyhtest
Yes. I am currently checking whether there are obvious parts of the code/kernels that could be problematic on these devices. After that I would like to do actual tests on the hardware.

4 participants