
[perf] improve shader compilation for WebGL with KHR_parallel_shader_compile extension #5205

Closed
pyu10055 opened this issue Jun 9, 2021 · 18 comments · Fixed by #5240
Assignees
Labels
type:bug Something isn't working

Comments

@pyu10055
Collaborator

pyu10055 commented Jun 9, 2021

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests, and build/installation issues on GitHub.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow.js):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow.js installed from (npm or script link):
  • TensorFlow.js version (use command below): 3.7.0
  • Browser version: NA
  • Tensorflow.js Converter Version: NA

Describe the current behavior
The initial inference on the current TFJS WebGL backend is much slower than subsequent inferences, which is caused by shader compilation and texture allocation.

Describe the expected behavior
With the KHR_parallel_shader_compile extension, there is a chance to speed up shader compilation and reduce the initial inference time.
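For reference, a minimal sketch of how the extension is typically used (the helper name is illustrative, not TFJS internals): the program is linked as usual, but completion is polled via COMPLETION_STATUS_KHR instead of blocking on the first use.

```ts
// Minimal sketch, not TFJS code: link a program without blocking the main
// thread, polling KHR_parallel_shader_compile for completion.
function linkProgramNonBlocking(
    gl: WebGL2RenderingContext, vs: WebGLShader, fs: WebGLShader,
    onReady: (program: WebGLProgram) => void): void {
  const ext = gl.getExtension('KHR_parallel_shader_compile');
  const program = gl.createProgram()!;
  gl.attachShader(program, vs);
  gl.attachShader(program, fs);
  gl.linkProgram(program);  // returns immediately; the driver may compile on worker threads

  const poll = () => {
    // COMPLETION_STATUS_KHR reports whether linking finished, without stalling the pipeline.
    if (!ext || gl.getProgramParameter(program, ext.COMPLETION_STATUS_KHR)) {
      onReady(program);
    } else {
      requestAnimationFrame(poll);  // not done yet; check again next frame
    }
  };
  poll();
}
```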
Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate
the problem. If possible, please share a link to Colab/CodePen/any notebook.

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

@pyu10055 pyu10055 added the type:bug Something isn't working label Jun 9, 2021
@pyu10055
Collaborator Author

pyu10055 commented Jun 9, 2021

cc @qjia7

@qjia7
Contributor

qjia7 commented Jun 10, 2021

@pyu10055 Will someone be assigned to this issue, or do you need any help from us?

@pyu10055
Collaborator Author

pyu10055 commented Jun 10, 2021

@qjia7 If you have bandwidth, we would love to have you help with the initial investigation. As of today, our shader compilations are performed at per-op execution time. It would be interesting to see how the extension would fit into this scenario.

@qjia7
Contributor

qjia7 commented Jun 11, 2021

There are several things we can try:

  1. Simply apply the KHR_parallel_shader_compile extension and don't let shader compilation block the main process, so that it can partially hide the data-upload time. Previously the process was: upload data to GPU -> compile shader -> run. Now the two steps, uploading data to the GPU and compiling the shader, can run in parallel.
  2. Use uniforms instead of constant values so that shader generation doesn't depend on runtime shapes (see the sketch after this list). We can then get the static shader string of each op before it's executed. This is also the ideal scenario for KHR_parallel_shader_compile, since multiple shaders can truly compile in parallel. However, in this scenario we don't know how many shaders need to be pre-compiled; compiling all of them may not make sense, since a user may only use a few. One approach is to pre-compile some widely/frequently used shaders, like conv2d, matmul, depthwiseConv2d, add, relu, and leave the others to be compiled during execution. The advantage of uniforms is not only for KHR_parallel_shader_compile: it also greatly reduces shader variants, which means the total number of shaders is greatly reduced as well.
  3. The issue with step 2 is that the backend doesn't know the model information, so it doesn't know the set of ops to be executed. One idea is that the upper level, which can see the model and the whole graph, could provide a path to tell the backend 'Hi, these ops will be executed, can you precompile them?'. In that case we could do more in the backend, not only compilation but also execution optimization.

We'd love to try 1 and 2, but 3 needs your help since it will change the upper framework. What do you think?
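To illustrate point 2, a sketch of the difference (the GLSL is illustrative, not the actual TFJS shader source): baking a shape in as a constant forces one compiled program per shape, while passing it as a uniform lets a single program serve every shape.

```ts
// Illustrative sketch only, not the actual TFJS shader generator.

// Shape baked in as a constant: every new input shape produces a new shader
// string, and therefore a new compile.
const constShapeSource = (size: number) => `#version 300 es
precision highp float;
uniform sampler2D A;
out vec4 result;
void main() {
  const int SIZE = ${size};  // changes per shape -> new program variant
  result = texelFetch(A, ivec2(SIZE - 1, 0), 0);
}`;

// Shape passed as a uniform: one shader string (one compiled program) covers
// all shapes; only a cheap gl.uniform1i upload changes per dispatch.
const uniformShapeSource = `#version 300 es
precision highp float;
uniform sampler2D A;
uniform int size;  // set via gl.uniform1i at dispatch time
out vec4 result;
void main() {
  result = texelFetch(A, ivec2(size - 1, 0), 0);
}`;
```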

@vladmandic
Contributor

vladmandic commented Jun 12, 2021

just my $0.02...

first - i LOVE this proposal! This is probably the biggest issue with WebGL nowadays as slow app startup turns users away.

(1) doesn't do much due to how tfjs shader compilation is structured
(3) is really interesting and should be doable without any changes to existing code: you could extend the GraphModel class to add a warmup method which can be executed optionally
(and to make it clean, such a warmup method should call an equivalent method in the backend; for any backend other than WebGL, it would simply return immediately)

enumerating ops in GraphModel once it's loaded is easy and fast:

```ts
// Note: loadGraphModel returns a Promise, so it must be awaited.
const model: GraphModel = await tf.loadGraphModel('test/model.json');
const ops: Record<string, Array<string>> = {};
// Walk every node in the executor's graph and bucket op names by category.
for (const op of Object.values(model.executor.graph.nodes) as Array<{category: string, op: string}>) {
  if (!ops[op.category]) ops[op.category] = [];
  if (!ops[op.category].includes(op.op)) ops[op.category].push(op.op);
}
console.log('ops used by model:', ops);
```

output:

```
ops used by model: {
  graph: [ 'Const', 'Placeholder', 'Identity' ],
  convolution: [ '_FusedConv2D', 'FusedDepthwiseConv2dNative', 'DepthwiseConv2dNative', 'Conv2D', 'MaxPool' ],
  arithmetic: [ 'Mul', 'Add', 'FloorDiv', 'FloorMod', 'Sub' ],
  basic_math: [ 'Relu6', 'Relu', 'Sigmoid' ],
  reduction: [ 'Mean' ],
  image: [ 'ResizeBilinear' ],
  slice_join: [ 'ConcatV2', 'GatherV2', 'StridedSlice' ],
  transformation: [ 'Reshape', 'Cast', 'ExpandDims' ],
  logical: [ 'Equal' ],
  evaluation: [ 'TopKV2' ]
}
```
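A hypothetical warmup helper along these lines (the function and its zero-input assumption are mine, not an existing TFJS API) could be as small as:

```ts
import * as tf from '@tensorflow/tfjs';

// Hypothetical sketch: run one throwaway inference on a zero-filled input so
// every WebGL shader the model needs is compiled before real data arrives.
async function warmup(model: tf.GraphModel, inputShape: number[]): Promise<void> {
  const dummy = tf.zeros(inputShape);
  const result = await model.executeAsync(dummy) as tf.Tensor | tf.Tensor[];
  tf.dispose([dummy, result]);  // free the throwaway tensors
}
```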

@pyu10055
Collaborator Author

pyu10055 commented Jun 14, 2021

@qjia7 I agree with @vladmandic that options 2 and 3 look crucial to getting a performance gain from parallel compilation.

Similar to the warm-up run, the graph model could have a compilation step, and the engine should have a compile API alongside the current execution API, to avoid any texture upload.
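Purely as a sketch of what such an API might look like (all names hypothetical; nothing like this exists in TFJS at the time of writing):

```ts
import * as tf from '@tensorflow/tfjs';

// Hypothetical API shape, for discussion only.
interface CompilableBackend {
  // Compile (and cache) the shader programs these ops would need,
  // without uploading any textures or executing the graph.
  precompile(ops: string[]): Promise<void>;
}

// A GraphModel "compile" step could enumerate the graph's ops and hand the
// deduplicated set to the backend ahead of the first real inference.
async function compileModel(model: tf.GraphModel, backend: CompilableBackend): Promise<void> {
  // Accesses the internal executor the same way as the snippet above.
  const nodes = Object.values((model as any).executor.graph.nodes) as Array<{op: string}>;
  await backend.precompile([...new Set(nodes.map(n => n.op))]);
}
```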

@wingman-jr-addon

(Non-technical comment: I write a browser plugin that basically blocks browser functionality for 10 seconds during model loading, so I'm quite happy to hear about performance improvement ideas here and plan to watch the progress eagerly! Thanks!)

@qjia7 qjia7 self-assigned this Jun 15, 2021
@qjia7
Contributor

qjia7 commented Jun 15, 2021

Thanks for your input. I will take a look at step 2 (use uniforms instead of constant values) as a start.

qjia7 added a commit to qjia7/tfjs that referenced this issue Jun 22, 2021
PERF
Fix tensorflow#5205

This PR adds the shapes uniforms support and enables it for unary/binary
ops.
pyu10055 pushed a commit that referenced this issue Jul 1, 2021
FEATURE
* webgl: Add shapes uniforms to reduce shader compilation time

PERF
Fix #5205

This PR adds the shapes uniforms support and enables it for unary/binary
ops.

* fix the bot failure

* Add annotation for the key composition.

* address comments

* Disable shapes uniforms by default and enable it in integration test

@wingman-jr-addon

@qjia7 Thanks for your hard work! I was so excited to give this a try as I saw TF.js 3.8.0 was released! My plugin is still back on 2.7.0, so I did a quick upgrade.
Unfortunately, somewhere between 2.7.0 and 3.8.0 the performance for model load plus first inference became much worse. I've linked details in the issue above, but the overall time went from about 9 seconds to 13.5 seconds just from the TF.js version change, so it doesn't look like I'll be upgrading quite yet.

What kind of performance numbers were others here seeing from this PR? (also @vladmandic @pyu10055 )

@qjia7 qjia7 reopened this Jul 19, 2021
@qjia7
Contributor

qjia7 commented Jul 19, 2021

@wingman-jr-addon This issue isn't finished; it may have been closed by accident. Currently, shape uniforms are disabled by default: you need to set WEBGL_USE_SHAPES_UNIFORMS to true to use them (see the snippet below). So far only unary/binary ops have applied it; the others are on the way, for example conv2d in #5297. I don't expect it to bring a big perf regression since it's disabled by default. Can you share a reproducible example with me? I can double-check whether it's related to the uniforms changes.
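For reference, enabling the flag through the standard TFJS flag API looks like this:

```ts
import * as tf from '@tensorflow/tfjs';

// Opt in to shape uniforms before the first inference; the flag is off by default.
tf.env().set('WEBGL_USE_SHAPES_UNIFORMS', true);
```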

@wingman-jr-addon

Thank you for the detailed explanation @qjia7 - if it's hidden behind a flag, I'm guessing that this regression has nothing to do with your recent work. Based on that, let me do some bisecting on versions and see if I can narrow the cause down a bit further and then provide a minimal reproduction either here or in an appropriate issue.

@wingman-jr-addon

@qjia7 Through bisection I've narrowed it down to a change that occurred between 3.3.0 and 3.4.0. I'll do some more looking but that means it is definitely not related to this functionality.

@vladmandic
Contributor

I've tested this on my notebook with 3 different models of medium-high complexity:

| Model | DataSet | WEBGL_PACK_DEPTHWISECONV | WEBGL_USE_SHAPES_UNIFORMS | Warmup | Execution | Note |
|---|---|---|---|---|---|---|
| Inception-v4 | ImageNet | True | False | 11.2sec | 42ms | Default |
| Inception-v4 | ImageNet | False | False | 10.8sec | 45ms | |
| Inception-v4 | ImageNet | False | True | 10.8sec | 45ms | |
| Inception-v4 | ImageNet | True | True | 11.2sec | 42ms | |
| SSD/MobileNet-v2 | OpenImages | True | False | 14.7sec | 2.1sec | Default |
| SSD/MobileNet-v2 | OpenImages | False | False | 13.3sec | 2.2sec | |
| SSD/MobileNet-v2 | OpenImages | False | True | 12.7sec | 2.1sec | |
| SSD/MobileNet-v2 | OpenImages | True | True | 13.6sec | 2.1sec | |
| EfficientDet-D4 | CoCo | True | False | 23.1sec | 12.9sec | Default |
| EfficientDet-D4 | CoCo | False | False | 16.1sec | 14.5sec | |
| EfficientDet-D4 | CoCo | False | True | 15.9sec | 14.0sec | |
| EfficientDet-D4 | CoCo | True | True | 21.1sec | 13.0sec | |

All-in-all:

  • WEBGL_USE_SHAPES_UNIFORMS helps to significantly reduce warmup with NO negative impact on subsequent inference
  • WEBGL_PACK_DEPTHWISECONV increases warmup too much even if subsequent inference is slightly faster

As it is, I'll be setting WEBGL_USE_SHAPES_UNIFORMS=True and WEBGL_PACK_DEPTHWISECONV=False in my projects: even with uniforms enabled (which does help), warmup is still too slow.

Note: Chrome does extensive shader caching between sessions, so simple page reload is not sufficient and full browser restart is needed between tests

@wingman-jr-addon

Thank you @vladmandic for your much more thorough analysis. I'm sure that took quite some time. I'll be watching over on the issue where you cross-posted as we look at this issue specifically.

@vladmandic
Contributor

see #5689 for fully reproducible code and additional performance notes.

@rthadur
Contributor

rthadur commented Sep 24, 2022

The related PR has been merged, closing this issue. Thank you.

@rthadur rthadur closed this as completed Sep 24, 2022
