This repository has been archived by the owner on Apr 18, 2023. It is now read-only.

support new webgl backend. #297

Merged: 11 commits, Dec 7, 2018

Conversation

@GreyZzzzzzXh (Author)

No description provided.

@@ -20,16 +20,14 @@ class ImageClassificationModel {
throw Error('Fails to initialize neural network context');
}
this._nn = nnNative;
} else if (this._backend === 'WASM' || this._backend === 'WebGL2') {
} else if (this._backend === 'WASM' || this._backend === 'WebGL') {
Author
I renamed all WebGL2 to WebGL in polyfill and examples.

if (this._backend === 'WebGL2') {
options.useWebGL2 = true;
}
options.backend = this._backend;
this._model = await this._nn.createModel(options);
Author
Replaced options.useWebGL2 with options.backend.
For polyfill, options.backend should be the string 'WASM' or 'WebGL'.
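A minimal sketch of the revised option handling described above (the function name is hypothetical, not from the PR): the old boolean useWebGL2 flag is replaced by a backend string that the polyfill expects to be 'WASM' or 'WebGL'.

```javascript
// Hypothetical sketch: build polyfill options from a backend name.
// Replaces the old `options.useWebGL2 = true` flag with a string field.
function buildPolyfillOptions(backend) {
  const options = {};
  if (backend === 'WASM' || backend === 'WebGL') {
    options.backend = backend; // polyfill expects 'WASM' or 'WebGL'
  }
  return options;
}
```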

package.json Outdated
@@ -35,7 +35,8 @@
"ndarray-ops": "^1.2.2",
"ndarray-squeeze": "^1.0.2",
"selenium-webdriver": "^3.1.0",
"webpack": "^3.5.5"
"webpack": "^3.5.5",
"@tensorflow/tfjs-core": "^0.13.6"
Author
cd webml-polyfill
npm install

@GreyZzzzzzXh (Author)

I changed the backend name to 'WebGL' and revised some interfaces.
@huningxin, @Wenzhao-Xiang, please help review, thanks!
These changes may affect the tests and benchmark, please take a look, @BruceDai @ibelem.

@BruceDai (Contributor)

BruceDai commented Nov 7, 2018

@GreyZzzzzzXh yes, we will update the test cases & benchmark tests for the WASM / WebGL polyfill backends, thanks

@Wenzhao-Xiang (Contributor)

@GreyZzzzzzXh Seems inception_v3 and squeezenet for image_classification_tflite have some issues on my machine:
squeezenet:

Label Probability
admiral 30.07%
refrigerator 24.34%
wine bottle 10.36%

inception_v3:

Label Probability
fountain 100.00%
toilet tissue 0.00%
bolete 0.00%

But the results were correct on @GreyZzzzzzXh's and @pinzhenx's machines with the same code.
Does tensorflow.js have specific hardware requirements?

@Wenzhao-Xiang (Contributor)

Also, the results differ between the PC and the mobile phone (OnePlus 3T, Chrome) with the WebGL backend. For example, with mobilenet_v1:
PC:

Label Probability
bee eater 95.59%
partridge 1.73%
indigo bunting 1.71%

Mobile phone:

Label Probability
bee eater 95.41%
partridge 1.69%
indigo bunting 1.67%

It seems like the same issue Wenyao described in his graduation paper:

On mobile devices, WebGL uses 16-bit signed integers to store data by default, while on the computer,
WebGL uses 32-bit signed integers to store data by default.

@Wenzhao-Xiang Wenzhao-Xiang requested review from Wenzhao-Xiang and removed request for Wenzhao-Xiang November 7, 2018 07:02
@GreyZzzzzzXh (Author)

@Wenzhao-Xiang, thanks for testing. There are still some issues on mobile devices.

On mobile devices, WebGL uses 16-bit signed integers to store data by default, while on the computer,
WebGL uses 32-bit signed integers to store data by default.

Maybe it is like this, but if so we would have to revise the tfjs source code.

It may take some time to figure out, so maybe we should keep the WebGL2 backend and name this new backend TFJS temporarily.

@huningxin (Contributor)

It may take some time to figure out, so.. maybe we should still remain WebGL2 backend and name this new backend TFJS temporarily.

Can we make a reproducible test case for the TensorFlow.js team? For example, put it at a publicly accessible link on your GitHub pages. We can file a bug with TF.js, and I can help bridge to the TF.js folks to look into it.

@GreyZzzzzzXh (Author)

GreyZzzzzzXh commented Nov 8, 2018

I hosted the modified code on my GitHub pages.
Visit https://greyzzzzzzxh.github.io/webml to see the results.

Test by CTS: https://greyzzzzzzxh.github.io/webml/test/index.html?backend=webgl&grep=CTS

@huningxin (Contributor)

I host the modified code on my github pages.
Visit https://greyzzzzzzxh.github.io/webml-examples to see the results.

Thanks for doing that!

Besides the examples link, could you please give the steps to reproduce the issue? For example: which example, which backend, which model? Please also give the expected results, the incorrect result when the issue happens, and the test configuration, e.g. Chrome version (from chrome://version/), GPU driver version (from chrome://gpu/), etc. Please consult @BruceDai for this kind of bug reporting. Thanks!

@GreyZzzzzzXh (Author)

@huningxin, OK, I'll do more testing and provide detailed info next week, thanks!

@GreyZzzzzzXh (Author)

GreyZzzzzzXh commented Nov 14, 2018

The results on the mobile side differ from those on the PC side.

Test Env:
Chrome Version: 70.0.3538.80 (official build) (32-bit)
Platform: Android 8.1.0, Pixel 2
GPU driver version: 258.0
tfjs-core version: 0.13.10

Expected Result:
tested on PC platform, e.g.

  • Linux ubuntu 16.04, Chrome 69.0.3497.100 (Official Build) (64-bit)
  • Windows 10, Chrome 70.0.3538.102 (Official Build) (32-bit)

Mobilenet v1:

# Label Probability
1 bee eater 95.59%
2 jacamar 1.73%
3 brambling 1.71%

Mobilenet v2:

# Label Probability
1 bee eater 84.14%
2 indigo bunting 1.07%
3 brambling 0.76%

Inception v3:

# Label Probability
1 bee eater 96.31%
2 partridge 0.11%
3 indigo bunting 0.04%

Squeezenet:

# Label Probability
1 bee eater 96.71%
2 goldfinch 1.77%
3 ladybug 0.45%

[screenshot from 2018-11-21 17-02-28]

Actual Result:

The output doesn't match the expected result.

Mobilenet v1:

# Label Probability
1 bee eater 95.41%
2 brambling 1.69%
3 jacamar 1.67%

Mobilenet v2:

# Label Probability
1 bee eater 84.86%
2 indigo bunting 0.94%
3 brambling 0.83%

Inception v3:

# Label Probability
1 bee eater 99.22%
2 partridge 0.11%
3 indigo bunting 0.04%

Squeezenet:

# Label Probability
1 bee eater 96.88%
2 goldfinch 1.80%
3 ladybug 0.39%

[screenshot from 2018-11-21 17-02-42]

How to Reproduce:

@GreyZzzzzzXh (Author)

Results and reproduction steps are described above, please take a look @huningxin .

@huningxin (Contributor)

@GreyZzzzzzXh, thanks! How about the expected results? And please list the devices that can deliver the expected results. That would also be helpful.

@GreyZzzzzzXh (Author)

How about the expected results

The data in the first four tables is the expected result.

the devices which can deliver expected results

PC platform, e.g.

  • Linux ubuntu 16.04, Chrome 69.0.3497.100 (Official Build) (64-bit)
  • Windows 10, Chrome 70.0.3538.102 (Official Build) (32-bit)

@GreyZzzzzzXh (Author)

Besides, visit https://greyzzzzzzxh.github.io/webml/test/index.html?backend=webgl&grep=CTS for case testing.

computer side:

passes: 127
failures: 6

But many cases fail on the mobile device:

passes: 71
failures: 62

@huningxin (Contributor)

@GreyZzzzzzXh , please complete the actual results in #297 (comment). Thanks!

BTW, when testing on the device with incorrect results, are there any errors reported in console?

@GreyZzzzzzXh (Author)

please complete the actual results in #297 (comment)

Done.

when testing on the device with incorrect results, are there any errors reported in console?

No errors or warnings are reported in the console.

@GreyZzzzzzXh (Author)

GreyZzzzzzXh commented Nov 21, 2018

Besides, tested on the tfjs-converter mobilenet demo, tensorflow.js still shows different precision between the computer side and the mobile phone side.

Test Env:
tfjs version: 0.13.3
tfjs-core version: 0.13.8
Windows 10, Chrome 70.0.3538.77 (official build) (64-bit)
Android 8.1.0, Pixel 2, Chrome 70.0.3538.80 (official build) (32-bit)

Results:

Windows 10:
[image]

Android 8.1.0:
[screenshot_20181121-212017]

@huningxin (Contributor)

Thanks for these details!

@huningxin (Contributor)

@GreyZzzzzzXh , could you please take a look at tensorflow/tfjs#265. It sounds like tfjs uses float16 on mobile. Is that the root cause?

@GreyZzzzzzXh (Author)

It sounds like tfjs uses float16 on mobile.

Thanks! I will investigate this.

@GreyZzzzzzXh (Author)

This precision issue can be fixed by upgrading the GLSL version to 300 es.
WebGL 1.0 doesn't work for now, so we just use the WebGL 2.0 backend and still call it WebGL2.

@huningxin (Contributor)

yes i'll record the changes in README.md, and now i import modified tfjs-core in src/nn/webgl2/tfjs-core for use.

According to GreyZzzzzzXh/tfjs-core@39a56bf, could you please explain the root cause and your method of "upgrade GLSL to version 300 es"?

I found you checked tfjs-core into the repo; I would suggest avoiding that. Let's figure out what needs to be fixed in tfjs-core, then propose a fix to the tfjs-core repo.

@GreyZzzzzzXh (Author)

GreyZzzzzzXh commented Nov 29, 2018

the root cause

In general, WebGL 1.0 and WebGL 2.0 use different versions of the shading language (GLSL 100 for WebGL 1.0 and GLSL 300 es for WebGL 2.0). But in tfjs, only GLSL 100 is used as the shading language for both WebGL 1.0 and 2.0.

The inaccuracy on the phone seems to be because the GLSL version also has an impact on precision. After I changed GLSL 100 to 300 es, floats achieve 32-bit precision on the phone, where originally it was only 16 bits (tested with tf.ENV.backend.floatPrecision()).

your method of "upgrade GLSL to version 300 es"

I added some comments about how to upgrade GLSL version in GreyZzzzzzXh/tfjs-core@39a56bf.
There are five main places that need to be changed:

  1. Declare the shading language version in the shader code as #version 300 es.
  2. Replace attribute with in.
  3. Replace varying with in/out.
  4. Replace texture2D with texture.
  5. There is no built-in variable gl_FragColor in GLSL300es, so we need to define an out variable for the output.

Besides, I set precision highp sampler2D; and made some changes related to the round() function.
See GreyZzzzzzXh/tfjs-core@39a56bf for more detail.
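As a hedged illustration (the shader below is a made-up minimal example, not taken from the tfjs source), the five steps above applied to a fragment shader look like this:

```javascript
// Hypothetical minimal fragment shader in GLSL 100, as WebGL 1.0 expects:
const glsl100 = `precision highp float;
varying vec2 resultUV;
uniform sampler2D source;
void main() {
  gl_FragColor = texture2D(source, resultUV);
}`;

// The same shader upgraded to GLSL 300 es, following steps 1-5 above:
const glsl300 = `#version 300 es
precision highp float;
precision highp sampler2D;
in vec2 resultUV;
uniform sampler2D source;
out vec4 outputColor;
void main() {
  outputColor = texture(source, resultUV);
}`;
```

Note the version directive on the first line, varying replaced by in, texture2D replaced by texture, and the explicit out variable standing in for gl_FragColor.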

@GreyZzzzzzXh (Author)

Let's figure out what need to be fixed in tfjs-core. Then propose a fix to tfjs-core repo.

Yeah, the best way is for the tfjs team to solve this problem. But if it takes a while to fix, I suggest importing the modified tfjs for temporary use.

@pinzhenx (Contributor)

I found something that might help:
https://medium.com/@invicticide/patching-an-npm-dependency-without-going-completely-insane-aa0b110a80c

@huningxin (Contributor)

Yeah, the best way is to solve this problem by tfjs team

Please go ahead and file a bug and open a PR to tfjs-core with your solution.

I suggest to import the modified tfjs for temporary use.

Please don't import the source code. If we want to maintain a version of tfjs-core, you can publish your version on npm and npm install from there.

originally only 16 bits (tested with tf.ENV.backend.floatPrecision()).

Probably we can report this out; then test cases can handle the lower-precision backend differently. @BruceDai
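A minimal sketch of what "handle the lower-precision backend differently" could look like (the helper name and epsilon values are assumptions, not from the PR): tf.ENV.backend.floatPrecision(), mentioned above, reports 32 on desktop and 16 on many mobile GPUs, so test cases could loosen their comparison tolerance accordingly.

```javascript
// Hypothetical helper: pick a comparison tolerance based on the backend's
// reported float precision (32 or 16 bits).
function toleranceFor(precisionBits) {
  // Looser epsilon on float16 devices, tight epsilon on float32 devices.
  return precisionBits >= 32 ? 1e-5 : 1e-2;
}
```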

@GreyZzzzzzXh (Author)

I published the modified tfjs-core on npm. Now we can install it via npm install. Thanks @huningxin and @pinzhenx.

Next I will improve this fix and open a PR to tfjs-core.

@GreyZzzzzzXh GreyZzzzzzXh changed the title support new webgl2 backend. support new webgl backend. Dec 3, 2018
prepareModel() {
this._model._operands.forEach(operand => {
if (utils.isTensor(operand.type)) {
let type = this._getOperandType(operand.type);
Contributor
const?

output.buffer.set(operand.dataSync());
});

// console.log(tf.memory());
Contributor
remove this comment?

}

inputs.forEach(input => {
let operand = this._operands[input.index];
Contributor
const?


inputs.forEach(input => {
let operand = this._operands[input.index];
let inputTensor = tf.tensor(input.buffer, operand.shape, operand.dtype);
Contributor
const?


switch(op) {
case OperationCode.ADD: {
let in1 = operands[inputs[0]];
Contributor
const?

let input = operands[inputs[0]];
let targetShape = operands[inputs[1]];
let output = operands[outputs[0]];
output.assign(input.reshape(targetShape.dataSync()));
Contributor
Do we need to do dataSync? I understand it leads to a memory read-back from GPU to CPU, which is bad for performance.

Author
According to the definition of reshape in the NN API, targetShape is a 1-D tensor whose values are stored on the GPU. But for tf.reshape, the target shape should be an array of integers.
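A small self-contained sketch of the mismatch described above (the tensor stand-in is hypothetical): the NN API delivers the target shape as a 1-D tensor whose values must be read back with dataSync() before they can be passed to tf.reshape, which expects a plain integer array.

```javascript
// Hypothetical stand-in for a 1-D int32 shape tensor living on the GPU.
const targetShapeTensor = {
  dataSync: () => new Int32Array([3, 2]), // read values back to the CPU
};

// tf.reshape wants a plain array of integers, hence the dataSync() call:
const targetShape = Array.from(targetShapeTensor.dataSync());
```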

output.assign(input.reshape(targetShape.dataSync()));
} break;
case OperationCode.CONCATENATION: {
if (outputs.length < 1 || inputs.length < 2) {
Contributor
simplify the condition?

Author

Since tfjs gives detailed error messages, can we remove this check?

let bias = operands[inputs[2]];
let activation = FuseFunctionMap.get(operands[inputs[3]].value[0]);
let output = operands[outputs[0]];
let batchSize = input.shape[0];
Contributor
src/nn/webgl/WebGLModel.js: resolved review threads
@huningxin (Contributor)

@GreyZzzzzzXh, I finished my review with some comments. Please take a look. Thanks!

@GreyZzzzzzXh (Author)

I made some changes, PTAL, thanks! @huningxin

@huningxin (Contributor)

The Travis CI failure is due to a "chrome installation error". @ibelem, could you or someone please take a look?

@ibelem (Member)

ibelem commented Dec 7, 2018

@huningxin Please feel free to merge this PR, since the failure is a Travis CI issue and the build passed with AppVeyor. We have reported the Travis CI issue upstream and are also trying other workarounds.

@huningxin (Contributor)

Thanks for the great work. Looks good to me!

6 participants