Gradient tests fail on Samsung Chromebook 2 #28

psyhtest · 2016-04-29T17:20:40Z

With USE_GREENTEA := 1 in the configuration, I see many Caffe test failures on a Samsung Chromebook 2 (ARM Cortex-A15 CPU, ARM Mali-T628 GPU) with this fork (latest commit 04503ee).

What they all seem to have in common is the word "Gradient" in their names. For example:

$ ./build/test/test_all.testbin --gtest_filter=*TestBNLLGradient
Setting to use device 0
Note: Google Test filter = *TestBNLLGradient
[==========] Running 4 tests from 4 test cases.
[----------] Global test environment set-up.
[----------] 1 test from NeuronLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] NeuronLayerTest/0.TestBNLLGradient
[       OK ] NeuronLayerTest/0.TestBNLLGradient (16 ms)
[----------] 1 test from NeuronLayerTest/0 (16 ms total)

[----------] 1 test from NeuronLayerTest/1, where TypeParam = caffe::CPUDevice<double>
[ RUN      ] NeuronLayerTest/1.TestBNLLGradient
[       OK ] NeuronLayerTest/1.TestBNLLGradient (15 ms)
[----------] 1 test from NeuronLayerTest/1 (16 ms total)

[----------] 1 test from NeuronLayerTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN      ] NeuronLayerTest/2.TestBNLLGradient
[       OK ] NeuronLayerTest/2.TestBNLLGradient (497 ms)
[----------] 1 test from NeuronLayerTest/2 (497 ms total)

[----------] 1 test from NeuronLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] NeuronLayerTest/3.TestBNLLGradient
./include/caffe/test/test_gradient_check_util.hpp:184: Failure
The difference between computed_gradient and estimated_gradient is 1.5616071734673349, which exceeds threshold_ * scale, where
computed_gradient evaluates to 1.5616071734673349,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.015616071734673349.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 1.2703554366494645; objective+ = 3.0355741737564483; objective- = 3.0355741737564483
...
[  FAILED  ] NeuronLayerTest/3.TestBNLLGradient, where TypeParam = caffe::GPUDevice<double> (530 ms)
[----------] 1 test from NeuronLayerTest/3 (531 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test cases ran. (1061 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NeuronLayerTest/3.TestBNLLGradient, where TypeParam = caffe::GPUDevice<double>

The suspicious lines are:

computed_gradient evaluates to 1.5616071734673349,
estimated_gradient evaluates to 0,

Sometimes I see the reverse: computed_gradient evaluates to 0, while estimated_gradient evaluates to a non-zero value.

This happens both for float and double tests.

Any ideas?
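
For readers following along: the check that fails at test_gradient_check_util.hpp:184 compares the layer's analytic gradient with a numerical estimate built from objective+ and objective-. The Python sketch below is not Caffe's code; it is a minimal re-creation under stated assumptions. The objective is taken to be twice the BNLL output, because that reproduces the logged objective (about 3.0356) and computed_gradient (about 1.5616) for feat = 1.2703554366494645. The scale rule max(|computed|, |estimated|, 1) and the threshold of 1e-2 are inferred from the logged threshold_ * scale. The step size of 1e-2 and all function names are hypothetical. The sketch illustrates that an estimate of exactly 0 is what a central difference returns when objective+ and objective- are identical, for example if the perturbed input never reaches the forward pass.

```python
import math


def bnll(x):
    """BNLL activation: log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def objective(x):
    # Assumption: the checker's objective is 2 * (top value). The factor 2
    # was chosen because it reproduces the numbers in the log above.
    return 2.0 * bnll(x)


def analytic_gradient(x):
    # d/dx [2 * log(1 + exp(x))] = 2 * sigmoid(x)
    return 2.0 / (1.0 + math.exp(-x))


def estimate_gradient(forward, x, stepsize=1e-2):
    """Central finite difference: (f(x + h) - f(x - h)) / (2h)."""
    positive = forward(x + stepsize)   # plays the role of objective+
    negative = forward(x - stepsize)   # plays the role of objective-
    return (positive - negative) / (2.0 * stepsize)


def gradients_agree(computed, estimated, threshold=1e-2):
    # Assumed scale rule, inferred from the logged threshold_ * scale.
    scale = max(abs(computed), abs(estimated), 1.0)
    return abs(computed - estimated) <= threshold * scale


feat = 1.2703554366494645  # value taken from the failing log line

computed = analytic_gradient(feat)

# Healthy case: the forward pass sees the perturbed input.
healthy = estimate_gradient(objective, feat)

# Hypothetical faulty case: the forward pass keeps using the old input,
# so objective+ == objective- and the estimate collapses to 0.
stale = estimate_gradient(lambda _x: objective(feat), feat)

print("computed gradient:", computed)
print("healthy estimate :", healthy, gradients_agree(computed, healthy))
print("stale estimate   :", stale, gradients_agree(computed, stale))
```

Under these assumptions, the logged objective+ and objective- both match the unperturbed objective for feat. That is only a hint about where to look (whether the perturbed value reaches the device before the forward pass), not a diagnosis.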


naibaf7 · 2016-04-29T22:37:14Z

I'd need to know whether there is a pattern in the data indices:
(top_id, top_data_id, blob_id, feat_id)=0,0,0,0;
Can you find that out? Or just post some more indices and values from the failures.

Not having one of those GPUs myself, it is a bit difficult to track down this problem.
Otherwise, the runtests pass on Intel, AMD and nVidia chips.
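
One way to look for the index pattern asked about above is to pull every (top_id, top_data_id, blob_id, feat_id) tuple out of the test log and count how often each value occurs per field. The sketch below assumes the debug lines keep the format shown in this thread. The function names and the per-field summary are illustrative choices, not part of Caffe. The sample text reuses the three debug lines quoted in this thread.

```python
import re
from collections import Counter

FIELDS = ("top_id", "top_data_id", "blob_id", "feat_id")

# Matches e.g. "(top_id, top_data_id, blob_id, feat_id)=0,1,1,0;"
DEBUG_RE = re.compile(
    r"\(top_id, top_data_id, blob_id, feat_id\)="
    r"(-?\d+),(-?\d+),(-?\d+),(-?\d+)"
)


def failure_indices(log_text):
    """Return one tuple of ints per debug line found in the log text."""
    return [tuple(int(g) for g in m.groups())
            for m in DEBUG_RE.finditer(log_text)]


def summarize(indices):
    """Count the values seen in each field across all failures."""
    return {name: Counter(idx[pos] for idx in indices)
            for pos, name in enumerate(FIELDS)}


# Debug lines copied from the failures quoted in this thread.
sample_log = """\
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 1.2703554366494645; objective+ = 3.0355741737564483; objective- = 3.0355741737564483
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,1,0; feat = 0.97146224975585938; objective+ = -1.53898024559021; objective- = -1.53898024559021
debug: (top_id, top_data_id, blob_id, feat_id)=0,1,1,0; feat = 0.97146224975585938; objective+ = -1.1870282888412476; objective- = -1.1870282888412476
"""

indices = failure_indices(sample_log)
for name, counts in summarize(indices).items():
    print(name, dict(counts))
```

On a real run, the contents of the redirected test log would replace sample_log. Three lines are far too small a sample to conclude anything, so the summary only becomes meaningful over the full set of failures.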

295988101 · 2016-04-30T05:41:28Z

@psyhtest I'm planning to run Caffe on ARM using OpenCL or CUDA. I tried another OpenCL version of Caffe, but gave up because the cross-compilation was too complex. Have you managed to run Caffe on ARM (with OpenCL or CUDA)? It would be very kind of you to share some details. My mail is [email protected]

Best regards.

jyegerlehner · 2016-05-06T08:07:53Z

@psyhtest The example you provide shows the test passes on the GPU for floats and the same test fails on the GPU for doubles. Are all the failed tests only double precision and on the GPU? If so, this suggests to me perhaps the MALI GPU does not support double, which is optional under the spec. Search for CL_DEVICE_DOUBLE_FP_CONFIG and clGetDeviceInfo, which is how the GPU indicates if it supports double.

psyhtest · 2016-05-06T16:05:40Z

@naibaf7
I'll attach a full log showing the Gradient related failures shortly.

@zhenghuitian
Yes, I gave up on AMD's port of Caffe too because it used OpenCL 1.2 and C++ templates in kernels. But I did manage to run Caffe with a couple of patches to clBLAS v2.4. I'm also working on support for CLBlast. Our vision is to create an open framework for optimising CNNs on embedded platforms, which is outlined in our IWOCL abstract. All comments and contributions are welcome!

@jyegerlehner
I strongly suspect that Mali does support double precision, as I was managing the OpenCL compiler team at ARM when it was implemented :). But perhaps I wasn't doing my job properly, and this omission somehow wasn't detected by conformance testing?.. :)

jyegerlehner · 2016-05-06T17:46:09Z

@psyhtest Hah hah OK I guess that rules that out. I thought it was a rather parsimonious theory though.

295988101 · 2016-05-07T09:15:08Z

@psyhtest Thank you for your answer. It helps me a lot. I am preparing to use CLBlast instead of clBLAS because my ARM device does not have an AMD GPU. I am reading your IWOCL abstract. Thank you again.

psyhtest · 2016-05-10T08:56:21Z

@naibaf7

Please see the full (compressed) log from running the following command:

LD_LIBRARY_PATH=/data/install/lib-openblas-v0.2.18/lib:$LD_LIBRARY_PATH \
/data/caffe-naibaf7/build/test/test_all.testbin --gtest_filter=*Gradient* \
> /chronos_downloads/caffe-naibaf7.6c0fbdc.Gradient.log 2>&1
...
[==========] 494 tests from 138 test cases ran. (39528323 ms total)
[  PASSED  ] 384 tests.
[  FAILED  ] 110 tests, listed below:
...

Also attached is my Makefile.config.

psyhtest · 2016-05-11T06:58:42Z

I also observed a similar failure on Odroid-XU3 (similar chip to Chromebook 2 but with the Mali driver v4.0, rather than v6.0):

[ RUN      ] DeconvolutionLayerTest/2.TestGradient
./include/caffe/test/test_gradient_check_util.hpp:184: Failure
The difference between computed_gradient and estimated_gradient is 2, which exceeds threshold_ * scale, where
computed_gradient evaluates to 2,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0020000000949949026.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,1,0; feat = 0.97146224975585938; objective+ = -1.53898024559021; objective- = -1.53898024559021
./include/caffe/test/test_gradient_check_util.hpp:184: Failure
The difference between computed_gradient and estimated_gradient is 2, which exceeds threshold_ * scale, where
computed_gradient evaluates to 2,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0020000000949949026.
debug: (top_id, top_data_id, blob_id, feat_id)=0,1,1,0; feat = 0.97146224975585938; objective+ = -1.1870282888412476; objective- = -1.1870282888412476

It is, however, much more intermittent. (I could not reproduce it since.)

naibaf7 · 2016-05-11T10:44:53Z

@psyhtest
Yes, I am currently checking whether there are obvious parts of the code/kernels that could be problematic on these devices. After that, I would like to do actual tests on the hardware.

psyhtest mentioned this issue May 12, 2016

CLBlast support #32

Merged
