improve perf of packed depthwise conv2d #4909

pyu10055 · 2021-04-08T03:44:03Z

This is achieved in two ways:

1) Eliminate unnecessary texture reads when padding is odd and dilation is 1.
2) Localize the dot-product accumulation inside the loop, which reduces cache misses on the GPU.

Observed a 30% improvement for certain models with a 5x5 filter, and a 100% gain on a mid-range Android phone.
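A minimal sketch of what "localizing" the accumulation could look like in a shader-source generator. This is a hypothetical simplification, not the code from `conv_packed_gpu_depthwise.ts`: the function names are invented for illustration, the `xR{r}C{c}` input texels are assumed to be loaded earlier in the shader, and the real shader combines swizzled components rather than doing a plain component-wise product.

```typescript
// Hypothetical, simplified shader-source generators (names are illustrative).
// `getW` mirrors the accessor name seen in the review snippets; `xR{r}C{c}`
// is assumed to be an input texel that was already loaded.

// Variant 1: emit every weight read first, then every accumulation.
function upFrontTaps(filterHeight: number, filterWidth: number): string {
  let decls = '';
  let sums = '';
  for (let r = 0; r < filterHeight; r++) {
    for (let c = 0; c < filterWidth; c++) {
      decls += `vec4 wR${r}C${c} = getW(${r}, ${c}, d1, q);\n`;
      sums += `dotProd += xR${r}C${c} * wR${r}C${c};\n`;
    }
  }
  return decls + sums;
}

// Variant 2: emit each weight read right next to the accumulation that
// consumes it, so the value is used as soon as it has been fetched.
function localizedTaps(filterHeight: number, filterWidth: number): string {
  let body = '';
  for (let r = 0; r < filterHeight; r++) {
    for (let c = 0; c < filterWidth; c++) {
      body += `vec4 wR${r}C${c} = getW(${r}, ${c}, d1, q);\n`;
      body += `dotProd += xR${r}C${c} * wR${r}C${c};\n`;
    }
  }
  return body;
}
```

Both variants emit the same statements; only the order differs. Whether the reordering helps on a given GPU and driver would need to be measured.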

The accuracy issue is still happening. On Mac, both the packed and unpacked shaders show the same accuracy differences compared to the CPU, but their results match each other.
On Windows, the unpacked shader matches the CPU result, while the packed version differs.
However, the error rate seems comparable to the conv2d vs. CPU error rate.
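One way to put numbers on such a comparison is to compute the maximum absolute and relative error between two output buffers. The helper below is a sketch under that assumption; `compareOutputs` and `ErrorStats` are hypothetical names, and the PR does not say which metric was used.

```typescript
interface ErrorStats {
  maxAbs: number;  // largest absolute difference
  maxRel: number;  // largest difference relative to the expected value
}

// Compares two equally sized outputs, e.g. a WebGL result against a CPU result.
function compareOutputs(
    actual: ArrayLike<number>, expected: ArrayLike<number>): ErrorStats {
  if (actual.length !== expected.length) {
    throw new Error('Outputs must have the same length.');
  }
  let maxAbs = 0;
  let maxRel = 0;
  for (let i = 0; i < actual.length; i++) {
    const abs = Math.abs(actual[i] - expected[i]);
    // Guard against division by zero when the expected value is 0.
    const denom = Math.max(Math.abs(expected[i]), Number.EPSILON);
    maxAbs = Math.max(maxAbs, abs);
    maxRel = Math.max(maxRel, abs / denom);
  }
  return {maxAbs, maxRel};
}
```

Running the same helper on packed vs. CPU and conv2d vs. CPU outputs would give two figures that can be compared directly.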

refer #1679

annxingyuan

Wowwwww ping i'm so impressed... and really glad to see this flag finally get turned on... 2 years later!!!

Reviewable status: complete! 1 of 1 approvals obtained (waiting on @lina128 and @pyu10055)

tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 182 at r1 (raw file):

              } else {
                if (nextTexelOffset === 1) {
                  mainLoop += `

Could you add a note about the logic behind this branching?

tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 282 at r1 (raw file):

            mainLoop += `
              vec4 wTexelR${r}C${c + 1} = getW(${r}, ${c + 1}, d1, q);
              dotProd += vec4(xR${r}C${c + 1}.xy * wTexelR${r}C${c + 1}.xz, xR${

I guess creating temporary vars for the xy/zw didn't resolve the Windows issue?

tfjs-backend-webgl/src/flags_webgl.ts, line 66 at r1 (raw file):

/** Whether we will pack the depthwise conv op. */
// TODO: https://github.com/tensorflow/tfjs/issues/1679
ENV.registerFlag('WEBGL_PACK_DEPTHWISECONV', () => true);

🥳

pyu10055

Reviewable status: complete! 1 of 1 approvals obtained (waiting on @annxingyuan, @lina128, and @pyu10055)

tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 182 at r1 (raw file):

Previously, annxingyuan (Ann Yuan) wrote…

Could you add a note about the logic behind this branching?

added

tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 282 at r1 (raw file):

Previously, annxingyuan (Ann Yuan) wrote…

I guess creating temporary vars for the xy/zw didn't resolve the Windows issue?

No, I tried many combinations: temp variables, even accumulating the 4 channels separately with 4 floats. I have not yet found a way to improve accuracy.
But there is an interesting observation: I used a fixed weight, and when the weight is a power of 2, no accuracy loss is observed. I am not sure
if that is caused by the multiplication or the addition.
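For reference, a small CPU-side sketch of why power-of-2 weights behave specially on the multiplication side. It emulates float32 with `Math.fround`; `productIsExact` is a hypothetical helper, and the sketch says nothing about how a particular GPU rounds.

```typescript
// Returns true when the float32 product of x and w needs no rounding.
function productIsExact(x: number, w: number): boolean {
  const xf = Math.fround(x);
  const wf = Math.fround(w);
  // The double-precision product of two float32 values is exact (at most
  // 48 significand bits), so any difference below comes from rounding the
  // product back to float32.
  const exact = xf * wf;
  return Math.fround(exact) === exact;
}

const samples = [0.1, 0.3, 1.7, 123.456];

// Multiplying by a power of 2 only changes the exponent (barring overflow
// or underflow), so every product is exact.
const exactWithPow2 = samples.every((x) => productIsExact(x, 0.25));

// A weight such as 0.1 generally forces a rounding step in the product.
const exactWithTenth = samples.every((x) => productIsExact(x, 0.1));
```

With power-of-2 weights the multiplications introduce no rounding error, which leaves only the additions as a source of error in that case. That is consistent with the observation but does not by itself show which operation causes the loss for other weights.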

lina128

Thank you Ping, this is great!

Reviewable status: complete! 2 of 1 approvals obtained (waiting on @annxingyuan and @pyu10055)

tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 66 at r1 (raw file):

     */
    for (let r = 0; r < filterHeight; r++) {
      for (let texelC = 0; texelC < (texelsAcross / 2 + 1); texelC++) {

Will this condition be calculated in every iteration?
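In JavaScript the loop condition is evaluated before every iteration, so the expression is recomputed unless the engine optimizes it away. A sketch of hoisting the bound, assuming only the variable names from the quoted loop; `countTexelIterations` is a hypothetical stand-in for the code that emits shader text.

```typescript
// `filterHeight` and `texelsAcross` are taken from the quoted loop; the
// counter stands in for the shader-string emission in the loop body.
function countTexelIterations(
    filterHeight: number, texelsAcross: number): number {
  // Evaluate the bound once instead of recomputing
  // `texelsAcross / 2 + 1` in the inner loop condition.
  const texelBound = texelsAcross / 2 + 1;
  let iterations = 0;
  for (let r = 0; r < filterHeight; r++) {
    for (let texelC = 0; texelC < texelBound; texelC++) {
      iterations++;
    }
  }
  return iterations;
}
```

Since this loop runs on the CPU while the shader source is being generated, not per pixel on the GPU, the saving is likely small either way.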

tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 182 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

added

I couldn't see the note.

tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 282 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

No, I tried many combinations: temp variables, even accumulating the 4 channels separately with 4 floats. I have not yet found a way to improve accuracy.
But there is an interesting observation: I used a fixed weight, and when the weight is a power of 2, no accuracy loss is observed. I am not sure
if that is caused by the multiplication or the addition.

Can you add a note about why we do the localized dot product? I'm guessing this may help other shaders that have similar logic.

lina128

Reviewable status: complete! 2 of 1 approvals obtained (waiting on @annxingyuan and @pyu10055)

tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 360 at r2 (raw file):

        // Initializing dotProd with a small epsilon seems to reduce GPU accuracy loss.
        vec4 dotProd = vec4(0.000000000000001);

This is counterintuitive: accuracy is improved by imposing a small error. Curious why :)
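As a point of reference, here is how the same initialization behaves under IEEE 754 float32 on the CPU, emulated with `Math.fround`. This is only a sketch: `accumulate` is a hypothetical helper, the terms are arbitrary, and it does not attempt to explain the GPU observation.

```typescript
// Sums `terms` in emulated float32, starting from `start`.
function accumulate(start: number, terms: number[]): number {
  let sum = Math.fround(start);
  for (const t of terms) {
    sum = Math.fround(sum + Math.fround(t));
  }
  return sum;
}

// The value used in the shader; it is representable (nonzero) in float32.
const epsilon = Math.fround(0.000000000000001);

const terms = [0.5, 0.25, 0.125];
const fromZero = accumulate(0, terms);
const fromEpsilon = accumulate(epsilon, terms);

// The epsilon is far below half a float32 ulp of 0.5, so it is absorbed by
// the first addition and both sums are identical.
const sameResult = fromZero === fromEpsilon;
```

In strict float32 arithmetic the epsilon disappears as soon as a term of ordinary magnitude is added, so on the CPU it imposes no lasting error at all.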

improve perf of packed depthwise conv2d, seems it has also resolved the accuracy issue, enabled the flag by default

c0cf886

pyu10055 requested review from lina128 and annxingyuan April 8, 2021 03:44

google-cla bot added the cla: yes label Apr 8, 2021

added comment

3fbab15

annxingyuan approved these changes Apr 8, 2021

pyu10055 commented Apr 8, 2021

lina128 approved these changes Apr 8, 2021

pyu10055 added 5 commits April 8, 2021 13:24

added comments for the optimizations;

29fff9b

added more checks for unused channels

8b940bb

added small initialization to dotProd vec4

b249ef0

Merge branch 'master' into pack_depthwise

7e284b9

fix boundary for even padding

5ebdaca

lina128 approved these changes Apr 12, 2021

Merge branch 'master' into pack_depthwise

c5e0bfc

lina128 merged commit 8cb707b into master Apr 12, 2021

wingman-jr-addon added a commit to wingman-jr-addon/wingman_jr that referenced this pull request Jul 19, 2021

Disabling flag related to TF.js tensorflow/tfjs#4909 - improves performance

503aaf0

This was referenced Jul 19, 2021

Check 3.4.0 with WEBGL_PACK_DEPTHWISECONV=false - good wingman-jr-addon/wingman_jr#135

Closed

WEBGL_PACK_DEPTHWISECONV=true seems to cause significant first inference performance drop #5343

Closed
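For readers who hit the regression described in the linked issues, a sketch of turning the flag back off. It is written against a minimal interface so it stands alone; the assumption is that the object returned by TensorFlow.js's `tf.env()` has a compatible `set(flagName, value)` method. `FlagEnv` and `disablePackedDepthwiseConv` are hypothetical names.

```typescript
// Minimal structural type for the part of the environment used here.
// Assumption: `tf.env()` in TensorFlow.js satisfies this shape.
interface FlagEnv {
  set(flagName: string, value: boolean): void;
}

// Reverts to the unpacked depthwise conv shader by disabling the flag that
// this PR enabled by default.
function disablePackedDepthwiseConv(env: FlagEnv): void {
  env.set('WEBGL_PACK_DEPTHWISECONV', false);
}
```

With TensorFlow.js loaded, this would be called as `disablePackedDepthwiseConv(tf.env())`, presumably before the first inference so that the shaders are compiled with the intended setting.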

pyu10055 deleted the pack_depthwise branch October 11, 2021 18:29
