
Idea: OpenCL via CLBlast? #173

Open · misutoneko opened this issue Nov 23, 2022 · 8 comments
Labels: ideas (Interesting ideas for experimentation)

@misutoneko commented Nov 23, 2022

Hi,

Nice project! Thanks for your work. I wish I had better hardware to make use of it :D

I haven't seen anyone mention CLBlast here yet.
It provides a wrapper so that CLBlast can be used as a drop-in replacement for OpenBLAS.
I've now tried this: it works, and it was easy enough even for me :D
But to get the best performance you'd need to tune it a lot more, I'd guess.

Here's my naïve patch in case anyone wants to play with it:
whisper.cpp_CLBlast.patch.gz

Note that the patch simply replaces the existing OpenBLAS implementation.
Also, CLBlast needs to be compiled with -DNETLIB=ON to enable the wrapper.
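Roughly, the steps look like this. This is only a sketch: the WHISPER_OPENBLAS make flag is an assumption about whisper.cpp's Makefile of this era, and it presumes the patch above has already been applied.

```sh
# Build and install CLBlast with the Netlib-compatible BLAS API enabled
git clone https://github.com/CNugteren/CLBlast
cd CLBlast && mkdir build && cd build
cmake .. -DNETLIB=ON
make -j"$(nproc)" && sudo make install

# Rebuild whisper.cpp through its BLAS code path, which the patch
# redirects from OpenBLAS to CLBlast (WHISPER_OPENBLAS=1 is assumed
# to be the Makefile's OpenBLAS switch; adjust for your setup)
cd /path/to/whisper.cpp
make clean
WHISPER_OPENBLAS=1 make -j
```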

@j1nx commented Nov 23, 2022

Interesting. It recently got mentioned here by @StuartIanNaylor:
#89 (comment)

What hardware are you running this on, if I may ask? I ask because I wonder whether you have proper Vulkan GPU support that could make use of https://github.com/kpet/clvk, since CLBlast appears to be supported on it: https://github.com/kpet/clvk/blob/main/docs/supported-applications.md

(I'm also wondering what the status of the Mesa3D Vulkan Broadcom driver for the RPi4 is, and whether the above might help for those systems as well. Most likely again more a proof of concept, but it would be cool to have some sort of GPU-enabled deep learning to play with on the RPi4.)

@StuartIanNaylor commented Nov 23, 2022

I have a Rock5b with a Mali G610, which as an SBC & SoC is a bit bleeding edge, since Mesa/Panfrost support stops at the Mali G57 (Valhall, OpenGL ES / OpenGL 3.1).

I have been doing some testing with the G610. Currently it only has a Rockchip blob driver, but using the OpenCL drivers with https://github.com/StuartIanNaylor/rock5b-wav2letter-bench it works, and for ML it is about equivalent to the CPU.
That repo is just the ArmNN examples with some fixes; you just need to point OpenCL at your OpenGL driver (I have forgotten the .so name for the VC6).
Also, I think the ArmNN example doesn't use GpuAcc on the Pi not because it might not work, but because the GPU isn't a Mali; it may just need an OpenCL ML driver, but that is just a guess. The Pi has been OpenGL compliant for a while, hasn't it?

If it isn't, you can probably run the tests in https://github.com/KhronosGroup/OpenCL-CTS if ArmNN fails.

But yeah, this might be really interesting.

https://cnugteren.github.io/clblast/clblast.html

@StuartIanNaylor commented Nov 23, 2022

If I install with -DNETLIB=ON and then add your patch, performance is terrible.
But if I check:

cat /sys/devices/platform/fb000000.gpu/utilisation
7
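To watch the load over a whole run rather than a single sample, a simple polling loop over the same sysfs node works (the path is the RK3588's Mali G610 node from above; other SoCs will differ):

```sh
# Print Mali GPU utilisation once per second while a benchmark runs
while sleep 1; do
    cat /sys/devices/platform/fb000000.gpu/utilisation
done
```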

7% load is all I am getting (15% at max), so I guess it probably needs someone with much better knowledge than me.
I don't think simply substituting CLBlast for OpenBLAS will work anyway; that is not the point. It takes someone with knowledge of the model to find where parallelism can take place across both CLBlast and OpenBLAS, so that we are working on CPU & GPU at the same time rather than merely substituting one for the other.
I know from ArmNN that even with my current bad driver the G610 MP4 and the CPU are approximately equivalent.

If I scaled my GPU from the 7% load up to 100%, that is x14; divide the times by 14 and yeah, it is about CPU-equivalent.
So even if I got this installed correctly and tuned, it would only ever match the CPU. But that isn't the point: the point is to split the work across GPU and CPU in threads so the computation runs in parallel.

That is far beyond my ability, but if clear parallelism exists then with the G610 about x2 is possible. Well, a bit less due to inefficiencies, and because the code may be similar to the ArmNN load, which put about 7% load on the CPU, so it would steal that.
I am thinking that because a transformer has a clear partition into encoder & decoder, or can even be split by layers as the TFLite delegate can do, maybe it is possible, but it is far out of my realm.

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
|  |  | NEON BLAS | tiny | 8 | 236.65 | 35018.07 |
|  |  | NEON BLAS | base | 8 | 335.71 | 67945.20 |
|  |  | NEON BLAS | small | 8 | 641.83 | 263145.69 |

Whilst the normal optimised CPU path:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
| rk3588 | Debian11 | NEON | tiny | 8 | 232.45 | 2768.78 |
| rk3588 | Debian11 | NEON | base | 8 | 308.36 | 6374.82 |
| rk3588 | Debian11 | NEON | small | 8 | 626.23 | 25784.05 |
| rk3588 | Debian11 | NEON | medium | 8 | 1667.23 | 86026.82 |
| rk3588 | Debian11 | NEON | large | 8 | 4307.16 | 161328.59 |
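Tables in this format come from whisper.cpp's bench tool, by the way; a minimal sketch of the invocation (model path and thread count are just examples):

```sh
# Build and run the benchmark tool; it prints the Load/Encode timings
# used in the tables above
make bench
./bench -m models/ggml-tiny.bin -t 8
```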

@misutoneko (Author) commented Nov 23, 2022

Thanks! Yes, you're right, it needs some serious work.
In fact, the CLBlast docs recommend against using the wrapper because of the hampered performance.
It's really just a convenience for getting already-existing code running.
But I just wanted to get the idea out, since there are folks out there who might be able to do something about it :D

As I mentioned, I'm a bit hardware-challenged myself, but sure, Vulkan would also be good if the hardware supports it.
CLBlast can apparently be used even with OpenCL 1.1 level GPUs, so that'd be part of the charm for anyone stuck with older hardware.
I ran the tests with a GTX 660, which I don't think even has 2 GB of VRAM, because it can't run the small model without core-dumping :(
The base.en model seems fine, however.

EDIT: It seems this GPU only has 1.5 GB of VRAM, and that causes even the base.en model to crash sometimes (often, actually). So yeah, for older GPUs to be feasible, VRAM usage would have to be lowered quite a lot.
On the bright side, I've noticed the AVX/AVX2 support really helps a lot! It means I can now use the large model with just the CPU. Fortunately my CPU is a slightly more modern model than my GPU ;)

@StuartIanNaylor commented Nov 23, 2022

PS: the ability is there. I looked at my power meter after the test, confused that it was reading nearly 10 watts, and then remembered that on my other screen (which had power-saved) I was still running the streaming version of whisper :)
So unknowingly I did have two instances running at the same time, CPU and GPU.

I am more interested in embedded, and my results are not bad: running whisper takes about 5 watts, whilst the GPU could be as low as 1.5 watts, since it seems to draw about a third as much when running similar tasks.

I have forgotten what my RTX 3050 got with the full version of whisper; I bought it because it's only 140 watts! In Nvidia's crazy-wattage world.
It's amazing what the likes of the M1, and even $150 embedded SBCs such as the RK3588 Rock5b, deliver for the watts they use, and what @ggerganov has running.

@ggerganov added the ideas (Interesting ideas for experimentation) label Nov 23, 2022
@marmistrz (Contributor) commented:
For me, CLBlast provides a ~12.5% speedup compared to vanilla whisper.cpp

Vanilla whisper:

whisper_print_timings:     fallbacks =  11 p /  20 h
whisper_print_timings:     load time =   184.15 ms
whisper_print_timings:      mel time =  1010.46 ms
whisper_print_timings:   sample time =  2715.31 ms /  2306 runs (    1.18 ms per run)
whisper_print_timings:   encode time = 23960.04 ms /    11 runs ( 2178.19 ms per run)
whisper_print_timings:   decode time = 17336.37 ms /  2266 runs (    7.65 ms per run)
whisper_print_timings:    total time = 45225.82 ms

CLBlast:

whisper_print_timings:     fallbacks =   8 p /  23 h
whisper_print_timings:     load time =    93.86 ms
whisper_print_timings:      mel time =   950.52 ms
whisper_print_timings:   sample time =  2599.31 ms /  2202 runs (    1.18 ms per run)
whisper_print_timings:   encode time = 20387.99 ms /    11 runs ( 1853.45 ms per run)
whisper_print_timings:   decode time = 16615.79 ms /  2170 runs (    7.66 ms per run)
whisper_print_timings:    total time = 40669.39 ms


OpenBLAS:
whisper_print_timings:     fallbacks =  33 p /  77 h
whisper_print_timings:     load time =   132.15 ms
whisper_print_timings:      mel time =   948.88 ms
whisper_print_timings:   sample time =  9583.82 ms /  7918 runs (    1.21 ms per run)
whisper_print_timings:   encode time = 113946.63 ms /    35 runs ( 3255.62 ms per run)
whisper_print_timings:   decode time = 121515.83 ms /  7790 runs (   15.60 ms per run)
whisper_print_timings:    total time = 246182.56 ms

Test configuration:

  • i5-8365U / UHD Graphics 620, Arch Linux
  • whisper.cpp 1.2.0

@ilovefreesw commented:
> For me, CLBlast provides a ~12.5% speedup compared to vanilla whisper.cpp
> […timings quoted from @marmistrz's comment above…]

That's odd. Mine is the opposite: it's two times slower than vanilla. I'm on an Intel UHD 630 on Windows 11.

@nullr0ute commented:
Out of interest, what's the command people are using to get the above stats? I'm looking at various options via CLBlast and would be interested in being able to provide comparable perf feedback :)
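For what it's worth, the whisper_print_timings blocks earlier in the thread are what the main example prints at the end of any normal transcription run; a minimal sketch (model and sample paths are just examples):

```sh
# The whisper_print_timings summary is printed after a normal run
./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8
```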
