Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve perf of packed depthwise conv2d #4909

Merged
merged 8 commits into from
Apr 12, 2021
Merged

improve perf of packed depthwise conv2d #4909

merged 8 commits into from
Apr 12, 2021

Conversation

pyu10055
Copy link
Collaborator

@pyu10055 pyu10055 commented Apr 8, 2021

This is achieved in two ways:

  1. eliminate unnecessary read of the texture when padding is odd and dilation is 1)
  2. localized the dot prod accumulation inside the loop, this reduce the cache miss on GPU. observed 30% improvement for certain models with 5x5 filter and 100% gain on android mid end phone.

The accuracy issue is still happening, which on mac both pack and unpack shaders have the same accuracy issues comparing to CPU, but their result matches.
On windows, the unpack shader matches with CPU result, which pack version differs.
But the error rate seems to be compatible with the conv2d vs cpu error rate.

refer #1679
To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.


This change is Reviewable

…he accuracy issue, enabled the flag by default
Copy link
Contributor

@annxingyuan annxingyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wowwwww ping i'm so impressed... and really glad to see this flag finally get turned on... 2 years later!!!

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @lina128 and @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 182 at r1 (raw file):

              } else {
                if (nextTexelOffset === 1) {
                  mainLoop += `

Could add a note about the logic behind this branching?


tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 282 at r1 (raw file):

            mainLoop += `
              vec4 wTexelR${r}C${c + 1} = getW(${r}, ${c + 1}, d1, q);
              dotProd += vec4(xR${r}C${c + 1}.xy * wTexelR${r}C${c + 1}.xz, xR${

I guess creating temporary vars for the xy/zw didn't resolve the windows issue?


tfjs-backend-webgl/src/flags_webgl.ts, line 66 at r1 (raw file):

/** Whether we will pack the depthwise conv op. */
// TODO: https://github.com/tensorflow/tfjs/issues/1679
ENV.registerFlag('WEBGL_PACK_DEPTHWISECONV', () => true);

🥳

Copy link
Collaborator Author

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @annxingyuan, @lina128, and @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 182 at r1 (raw file):

Previously, annxingyuan (Ann Yuan) wrote…

Could add a note about the logic behind this branching?

added


tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 282 at r1 (raw file):

Previously, annxingyuan (Ann Yuan) wrote…

I guess creating temporary vars for the xy/zw didn't resolve the windows issue?

no, I tried many combinations, temp variable, even accumulate the 4 channels separately with 4 floats. I have not yet find a way to improve accuracy.
but there are some interesting observation, I have used a fixed weight, when the weight is power of 2, no accuracy loss is observed. I am not sure
if that is caused by the multiplication or the addition.

Copy link
Collaborator

@lina128 lina128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Ping, this is great!

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @annxingyuan and @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 66 at r1 (raw file):

     */
    for (let r = 0; r < filterHeight; r++) {
      for (let texelC = 0; texelC < (texelsAcross / 2 + 1); texelC++) {

Will this condition be calculated in every iteration?


tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 182 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

added

I couldn't see the note.


tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 282 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

no, I tried many combinations, temp variable, even accumulate the 4 channels separately with 4 floats. I have not yet find a way to improve accuracy.
but there are some interesting observation, I have used a fixed weight, when the weight is power of 2, no accuracy loss is observed. I am not sure
if that is caused by the multiplication or the addition.

Can you add a note about why do the localized dot production, I'm guessing this may help other shaders that have similar logic?

Copy link
Collaborator

@lina128 lina128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @annxingyuan and @pyu10055)


tfjs-backend-webgl/src/conv_packed_gpu_depthwise.ts, line 360 at r2 (raw file):

        //intialize dotProd with a small epsilon seems to reduce GPU accuracy loss.
        vec4 dotProd = vec4(0.000000000000001);

This is counterintuitive, accuracy is improved by imposing a small error. Curious why :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants