
WEBGL_PACK_DEPTHWISECONV=true seems to cause significant first inference performance drop #5343

Closed
wingman-jr-addon opened this issue Jul 19, 2021 · 17 comments · Fixed by #5714
Labels
comp:webgpu type:bug Something isn't working

Comments

@wingman-jr-addon


System information

ideapad FLEX5-1570
Processor: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz (2.90 GHz)
Installed RAM: 16.0 GB (15.9 GB usable)
System type: 64-bit operating system, x64-based processor
Pen and touch: Pen and touch support with 10 touch points

Current behavior - Upgrading from 3.3.0 to 3.4.0 caused a major performance drop in load + first-inference time: 3.3.0 takes about 8.8s, while 3.4.0 takes about 14.4s. It pains me to report a bug related to WEBGL_PACK when so much work has gone into this feature, but setting WEBGL_PACK_DEPTHWISECONV=false on 3.4.0 restores the performance seen in 3.3.0. The regression with default flags also exists in at least 3.6.0 and 3.8.0. (This was found while bisecting an upgrade from 2.7.0 to 3.8.0 to pick up the shader compilation performance improvements started in #5205.)

Expected behavior - 3.4.0 with the default WEBGL_PACK_DEPTHWISECONV=true offers performance similar to or better than 3.3.0.
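
(For reference, the workaround is a one-line flag override; a minimal sketch, assuming it runs before the model is loaded and warmed up:)

    // Sketch: opt out of the packed depthwise conv shader before the first inference,
    // so the unpacked shaders are the ones that get compiled.
    tf.env().set('WEBGL_PACK_DEPTHWISECONV', false);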

Minimal reproduction: wingman-jr-addon/wingman_jr#136
Note this is a Firefox plugin, but TF.js is loaded via a content tab rather than in the background context, so it should behave much like a normal browsing context.

Attached is the output from Firefox's about:support, which includes more detailed graphics information that may be relevant to the matter at hand.
FF90_about_support.txt

@jinjingforever
Collaborator

Hi Ahmed,

I think you probably have more knowledge about the webgl backend, so I'm assigning this to you :) Please take a look when you have a chance. Really appreciate it!

@wingman-jr-addon
Author

(Also, I see that I neglected to link to the PR that led me to this issue - #4909. I don't know that the bulk of the work in that PR is the cause, but the change to the flag's default value is what triggered the regression.)

@pyu10055
Collaborator

@wingman-jr-addon If the initialization time is longer than before, there are two possible causes:

  1. Some configurations of the depthwise conv2d op require more packed shaders to be compiled than the unpacked depthwise kernel. But I don't see any setup that would cause that.
  2. The packed shader takes much longer to compile than the unpacked version.

Item 2 might be browser specific. Can you help verify whether this behavior occurs in both Firefox and Chrome web pages? Thanks

@wingman-jr-addon
Author

Thanks for the idea to try, @pyu10055. I changed the plugin to run in a web page and then tested across Firefox 90 and Chrome 92.
In general, Chrome had significantly better performance, but there was still a performance gap between the flag being on and off for reload times. I ran each test by opening the browser, loading the web page, recording the time, then hitting refresh and noting the times after that. The first load times tended to be much higher than subsequent ones in Chrome. Also, I only ran one browser at a time during testing in case of any GPU contention.
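
(For context, each number below was collected roughly as in the following sketch; the model URL and input shape are placeholders, not the actual model from this issue:)

    // Rough timing sketch: measure load + first inference together.
    const t0 = performance.now();
    const model = await tf.loadGraphModel('https://example.com/model/model.json'); // placeholder
    const warmupInput = tf.zeros([1, 224, 224, 3]);                                // placeholder shape
    const result = model.predict(warmupInput);
    await result.data(); // wait for shader compilation + GPU execution to finish
    console.log(`load + first inference: ${((performance.now() - t0) / 1000).toFixed(1)}s`);
    tf.dispose([warmupInput, result]);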

Here are the raw times in seconds:
FF 90, defaults: 14.5, 13.5, 14.2
FF 90, flag=false: 9.6, 8.9, 9.1, 8.5
Chrome 92, defaults: 13.9, 7.0, 6.8, 6.9, 7.0
Chrome 92, flag=false: 14.9, 5.2, 4.8

I reloaded the last Chrome test and tried it several more times, seeing reload times of 4.5-5.5 seconds.

So I'm not sure what to make of the Chrome results - whether I should trust the reload numbers or run a batch of "first run" tests instead. For Firefox, the results are slower across the board but consistent across reloads.

Let me know what you think. Thanks!

@pyu10055
Collaborator

pyu10055 commented Jul 23, 2021

My guess is that Chrome has better caching of the shaders across page reloads.
One thing you could try, not necessarily related to depthwise conv2d, is using uniform variables for unary and binary ops:

tf.env().set('WEBGL_USE_SHAPES_UNIFORMS', true);

We have observed a significant reduction in initial loading time.

@wingman-jr-addon
Author

Thanks for the tip - I am seeing about a 0.5-1.0s reduction in load time on FF 90!

But getting back to the matter at hand - any clue why we might be seeing such a performance gap with WEBGL_PACK_DEPTHWISECONV=true?

It could be that it just depends on hardware and mine is not the primary target; I'd just like to make sure there isn't some other issue lurking.

@qjia7
Contributor

qjia7 commented Jul 27, 2021

@wingman-jr-addon @pyu10055 My guess is that WEBGL_PACK_DEPTHWISECONV=true brings more shaders to compile. You can compare the size of binaryCache in backend_webgl.ts between WEBGL_PACK_DEPTHWISECONV=true and WEBGL_PACK_DEPTHWISECONV=false to see how many shaders are added (they are probably encode/decode matrix shaders; if that's the case, PR #5297, which uses uniforms in the encode/decode matrix programs, may resolve your issue). On the other side, we should check why so many encode/decode shaders are introduced. Can they be avoided or reduced? In my opinion, if we take the packed path, it is best to pack the data only at the beginning and unpack it only at the end, avoiding encode/decode of the data in the middle. Maybe we can check whether there is any chance to optimize this.
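
(A rough way to do that comparison from user code might look like the sketch below. Note that binaryCache is a private field of the WebGL backend, so reading it this way is an assumption and may not work across versions; comparing warmup time with the flag toggled is the safer fallback.)

    // Sketch: count compiled programs after the first inference, once per flag value.
    // `binaryCache` is internal to backend_webgl.ts - accessing it here is an assumption.
    tf.env().set('WEBGL_PACK_DEPTHWISECONV', true); // run again with false and compare
    await tf.setBackend('webgl');
    // ... load the model and run the first inference here ...
    const backend = tf.backend();                   // the MathBackendWebGL instance
    const cache = backend.binaryCache || {};        // internal shader/program cache (assumption)
    console.log('compiled programs:', Object.keys(cache).length);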

@vladmandic
Contributor

vladmandic commented Jul 29, 2021

Cross-posting from #5205

I've tested this on my notebook with 3 different models of medium-high complexity

| Model | DataSet | WEBGL_PACK_DEPTHWISECONV | WEBGL_USE_SHAPES_UNIFORMS | Warmup | Execution | Note |
|---|---|---|---|---|---|---|
| Inception-v4 | ImageNet | True | False | 11.2sec | 42ms | Default |
| Inception-v4 | ImageNet | False | False | 10.8sec | 45ms | |
| Inception-v4 | ImageNet | False | True | 10.8sec | 45ms | |
| Inception-v4 | ImageNet | True | True | 11.2sec | 42ms | |
| SSD/MobileNet-v2 | OpenImages | True | False | 14.7sec | 2.1sec | Default |
| SSD/MobileNet-v2 | OpenImages | False | False | 13.3sec | 2.2sec | |
| SSD/MobileNet-v2 | OpenImages | False | True | 12.7sec | 2.1sec | |
| SSD/MobileNet-v2 | OpenImages | True | True | 13.6sec | 2.1sec | |
| EfficientDet-D4 | CoCo | True | False | 23.1sec | 12.9sec | Default |
| EfficientDet-D4 | CoCo | False | False | 16.1sec | 14.5sec | |
| EfficientDet-D4 | CoCo | False | True | 15.9sec | 14.0sec | |
| EfficientDet-D4 | CoCo | True | True | 21.1sec | 13.0sec | |

All-in-all:

  • WEBGL_USE_SHAPES_UNIFORMS helps to significantly reduce warmup with NO negative impact on subsequent inference
  • WEBGL_PACK_DEPTHWISECONV increases warmup too much even if subsequent inference is slightly faster

As it is, I'll be setting WEBGL_USE_SHAPES_UNIFORMS=True and WEBGL_PACK_DEPTHWISECONV=False in my projects: even with uniforms enabled (which does help), warmup is still too slow.

Note: Chrome does extensive shader caching between sessions, so a simple page reload is not sufficient and a full browser restart is needed between tests.

@wingman-jr-addon
Author

@rthadur Would you agree the awaiting response label can probably be removed from this issue now?

@vladmandic
Contributor

@ahmedsabie @qjia7 @rthadur @pyu10055 Any updates on this? As you can see, WEBGL_PACK_DEPTHWISECONV=True (which is the default value) has a massive negative performance impact - and it's gotten far worse in newer versions of TFJS.

This is a major regression and has seen very few updates.

And yes, using WEBGL_USE_SHAPES_UNIFORMS is much better, but: a) it's not a solution, it's an alternative; b) it's not widely implemented; c) almost nobody knows about it.

rthadur assigned pyu10055 and unassigned ahmedsabie on Sep 23, 2021
@vladmandic
Contributor

See #5689 for fully reproducible code and additional performance notes.

@pyu10055
Collaborator

pyu10055 commented Oct 8, 2021

@vladmandic @wingman-jr-addon I think this could be because the packed depthwise conv2d shader is much larger than the unpacked version. That could be related to the filter size, since the packed version unrolls the loop over the filter width into code. Can you share the filter size of the depthwise conv2d ops in your model?
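
(One rough way to list those, assuming the model can be loaded as a Layers model - graph models would need a different approach, and the URL is a placeholder:)

    // Sketch: print depthwise conv2d kernel shapes.
    // Depthwise kernels have shape [filterHeight, filterWidth, inChannels, channelMultiplier].
    const model = await tf.loadLayersModel('https://example.com/model/model.json'); // placeholder
    for (const w of model.weights) {
      if (w.name.includes('depthwise')) {
        console.log(w.name, w.shape);
      }
    }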

And the other question: we have a way to make the initial warmup non-UI-blocking, basically by yielding the JS thread and removing all blocking GL calls (parallel shader compilation), but the overall warmup time might still be similar. Would this behavior be helpful for your use cases?

@vladmandic
Contributor

vladmandic commented Oct 8, 2021

Can you share the filter size of the depthwise conv2d ops in your model?

I have seen the same behavior in almost every off-the-shelf model.
Just pick any from TFHub - I've provided performance data for Inception-v4, EfficientDet-D4, EfficientNet-B5, MobileNet-v2.
The only model that doesn't have the problem is the ancient MobileNet-v2.
(And since I have an automated test for this, I can reproduce it with any given model.)

What is the intended benefit of the packed conv2d shader? I don't see much benefit from it:

  • a few percent faster inference (never more than 2-5%, hardly worth it)
  • a few hundred percent slower warmup (50-400% on average, which is massive)

And the other question: we have a way to make the initial warmup non-UI-blocking, basically by yielding the JS thread and removing all blocking GL calls (parallel shader compilation), but the overall warmup time might still be similar. Would this behavior be helpful for your use cases?

Yes and no :)

  • Yes for overall usage, as any reduction of blocking calls is very welcome.
  • No, as I already use web workers in most cases anyway, so my UI is not really blocked - but since the app requires the models to work, it isn't much use until warmup has finished.

@pyu10055
Collaborator

pyu10055 commented Oct 8, 2021

We have seen significant performance gains from the packed depthwise conv2d shader on mobile, especially Android devices.
Thank you for the insights; we will see if we can avoid the slow start-up time it introduces.
As for UI blocking: even with a web worker the UI can be blocked, since Chrome has a single GL command queue and any blocking GL call prevents new GL commands from being flushed to the GPU.

@vladmandic
Contributor

Thanks @pyu10055,

Perhaps as a start, you could make it conditional: WEBGL_PACK_DEPTHWISECONV = isMobile()?
It doesn't solve the problem, it just means fewer people are immediately affected.
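
Something along these lines (a sketch; the user-agent check is a hypothetical stand-in, not an existing TF.js helper):

    // Sketch: only enable the packed depthwise shader on mobile, where it shows a benefit.
    const isMobileUA = /Android|iPhone|iPad|Mobile/i.test(navigator.userAgent); // hypothetical check
    tf.env().set('WEBGL_PACK_DEPTHWISECONV', isMobileUA);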

I just tried on Android. Yes, the inference performance difference on Android is visible (unlike on desktop).
IMO the negative impact of slow warmup, which causes users to simply give up and close the app, still outweighs the performance benefit during actual inference.

Re: UI blocking - true, if there is any GL usage elsewhere. Anyhow, if you can make it non-blocking, that is very much welcome.

And when will WEBGL_USE_SHAPES_UNIFORMS become a default?


@vladmandic
Contributor

Closing the loop after testing with today's code in the main branch:

  • warmup is now about 2x faster
  • there is no material difference whether WEBGL_PACK_DEPTHWISECONV is enabled or disabled, so that issue is resolved
  • note that enabling WEBGL_USE_SHAPES_UNIFORMS performs much better (2x faster warmup) regardless of packing (and the packing improvements make it even faster)!

webgl default

  • warmup initial 53sec in 3.9.0 -> 32sec in main branch
  • warmup cached 15sec -> 12sec

webgl with uniforms enabled

  • warmup initial 23sec -> 18sec
  • warmup cached 14sec -> 5sec

@pyu10055 please consider enabling uniforms by default
