
Idea: OpenCL via CLBlast? #173

Open · misutoneko opened this issue Nov 23, 2022 · 8 comments
Labels: ideas (Interesting ideas for experimentation)

@misutoneko commented Nov 23, 2022

Hi,

Nice project! Thanks for your work. I wish I had better hardware to make use of it :D

I haven't seen anyone mention CLBlast here yet.
It provides a wrapper so that CLBlast can be used as a drop-in replacement for OpenBLAS.
I've now tried this: it works, and it was easy enough even for me :D
But to get the best performance you'd need to tune it a lot more, I'd guess.

Here's my naïve patch in case anyone wants to play with it:
whisper.cpp_CLBlast.patch.gz

Note that the patch simply replaces the existing OpenBLAS implementation.
Also, CLBlast needs to be compiled with -DNETLIB=ON to enable the wrapper.
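Roughly, the steps look like this. This is only a sketch: the WHISPER_OPENBLAS make flag is an assumption about whisper.cpp's Makefile of this era, and it presumes the patch above has already been applied.

```sh
# Build and install CLBlast with the Netlib-compatible BLAS API enabled
git clone https://github.com/CNugteren/CLBlast
cd CLBlast && mkdir build && cd build
cmake .. -DNETLIB=ON
make -j"$(nproc)" && sudo make install

# Rebuild whisper.cpp through its BLAS code path, which the patch
# redirects from OpenBLAS to CLBlast (WHISPER_OPENBLAS=1 is assumed
# to be the Makefile's OpenBLAS switch; adjust for your setup)
cd /path/to/whisper.cpp
make clean
WHISPER_OPENBLAS=1 make -j
```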

@j1nx commented Nov 23, 2022

Interesting. It recently got mentioned here by @StuartIanNaylor:
#89 (comment)

What hardware are you running this on, if I may ask? I ask because I wonder whether you have proper Vulkan GPU support that could make use of https://github.com/kpet/clvk, since CLBlast appears to be supported on it: https://github.com/kpet/clvk/blob/main/docs/supported-applications.md

(I'm also wondering what the status of the Mesa3D Vulkan Broadcom driver for the RPi4 is, and whether the above might help for those systems as well. Most likely again more a proof of concept, but it would be cool to have some sort of GPU-enabled deep learning to play with on the RPi4.)

@StuartIanNaylor commented Nov 23, 2022

I have a Rock5b with a Mali G610, which as an SBC & SoC is a bit bleeding edge, since Mesa/Panfrost support stops at the Mali G57 (Valhall, OpenGL ES / OpenGL 3.1).

I have been doing some testing with the G610. Currently it only has a Rockchip blob driver, but using the OpenCL drivers with https://github.com/StuartIanNaylor/rock5b-wav2letter-bench it works, and for ML it is about equivalent to the CPU.
That repo is just the ArmNN examples with some fixes; you just need to point OpenCL at your OpenGL driver (I have forgotten the .so name for the VC6).
Also, I think the ArmNN example doesn't use GpuAcc on the Pi not because it might not work, but because the GPU isn't a Mali; it may just need an OpenCL ML driver, but that is just a guess. The Pi has been OpenGL compliant for a while, hasn't it?

If it isn't, you can probably run the tests in https://github.com/KhronosGroup/OpenCL-CTS if ArmNN fails.

But yeah, this might be really interesting.

https://cnugteren.github.io/clblast/clblast.html

@StuartIanNaylor commented Nov 23, 2022

If I install with -DNETLIB=ON and then add your patch, performance is terrible.
But if I check:

cat /sys/devices/platform/fb000000.gpu/utilisation
7
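To watch the load over a whole run rather than a single sample, a simple polling loop over the same sysfs node works (the path is the RK3588's Mali G610 node from above; other SoCs will differ):

```sh
# Print Mali GPU utilisation once per second while a benchmark runs
while sleep 1; do
    cat /sys/devices/platform/fb000000.gpu/utilisation
done
```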

7% load is all I am getting (15% at max), so I guess it probably needs someone with much better knowledge than me.
I don't think simply substituting CLBlast for OpenBLAS will work anyway; that is not the point. It takes someone with knowledge of the model to find where parallelism can take place across both CLBlast and OpenBLAS, so that we are working on CPU & GPU at the same time rather than merely substituting one for the other.
I know from ArmNN that even with my current bad driver the G610 MP4 and the CPU are approximately equivalent.

If I scaled my GPU from the 7% load up to 100%, that is x14; divide the times by 14 and yeah, it is about CPU-equivalent.
So even if I got this installed correctly and tuned, it would only ever match the CPU. But that isn't the point: the point is to split the work across GPU and CPU in threads so the computation runs in parallel.

That is far beyond my ability, but if clear parallelism exists then with the G610 about x2 is possible. Well, a bit less due to inefficiencies, and because the code may be similar to the ArmNN load, which put about 7% load on the CPU, so it would steal that.
I am thinking that because a transformer has a clear partition into encoder & decoder, or can even be split by layers as the TFLite delegate can do, maybe it is possible, but it is far out of my realm.

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
|  |  | NEON BLAS | tiny | 8 | 236.65 | 35018.07 |
|  |  | NEON BLAS | base | 8 | 335.71 | 67945.20 |
|  |  | NEON BLAS | small | 8 | 641.83 | 263145.69 |

Whilst the normal optimised CPU path:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
| rk3588 | Debian11 | NEON | tiny | 8 | 232.45 | 2768.78 |
| rk3588 | Debian11 | NEON | base | 8 | 308.36 | 6374.82 |
| rk3588 | Debian11 | NEON | small | 8 | 626.23 | 25784.05 |
| rk3588 | Debian11 | NEON | medium | 8 | 1667.23 | 86026.82 |
| rk3588 | Debian11 | NEON | large | 8 | 4307.16 | 161328.59 |
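Tables in this format come from whisper.cpp's bench tool, by the way; a minimal sketch of the invocation (model path and thread count are just examples):

```sh
# Build and run the benchmark tool; it prints the Load/Encode timings
# used in the tables above
make bench
./bench -m models/ggml-tiny.bin -t 8
```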

@misutoneko (Author) commented Nov 23, 2022

Thanks! Yes, you're right, it needs some serious work.
In fact, the CLBlast docs recommend against using the wrapper because of the hampered performance.
It's really just a convenience for getting already-existing code running.
But I just wanted to get the idea out, since there are folks out there who might be able to do something about it :D

As I mentioned, I'm a bit hardware-challenged myself, but sure, Vulkan would also be good if the hardware supports it.
CLBlast can apparently be used even with OpenCL 1.1 level GPUs, so that'd be part of the charm for anyone stuck with older hardware.
I ran the tests with a GTX 660, which I don't think even has 2 GB of VRAM, because it can't run the small model without core-dumping :(
The base.en model seems fine, however.

EDIT: It seems this GPU only has 1.5 GB of VRAM, and that causes even the base.en model to crash sometimes (often, actually). So yeah, for older GPUs to be feasible, VRAM usage would have to be lowered quite a lot.
On the bright side, I've noticed the AVX/AVX2 support really helps a lot! It means I can now use the large model with just the CPU. Fortunately my CPU is a slightly more modern model than my GPU ;)

@StuartIanNaylor commented Nov 23, 2022

PS: the ability is there. I looked at my power meter after the test, confused that it was reading nearly 10 watts, and then remembered that on my other screen (which had power-saved) I was still running the streaming version of whisper :)
So unknowingly I did have two instances running at the same time, CPU and GPU.

I am more interested in embedded, and my results are not bad: running whisper takes about 5 watts, whilst the GPU could be as low as 1.5 watts, since it seems to draw about a third as much when running similar tasks.

I have forgotten what my RTX 3050 got with the full version of whisper; I bought it because it's only 140 watts! In Nvidia's crazy-wattage world.
It's amazing what the likes of the M1, and even $150 embedded SBCs such as the RK3588 Rock5b, deliver for the watts they use, and what @ggerganov has running.

@ggerganov added the ideas (Interesting ideas for experimentation) label Nov 23, 2022
@marmistrz (Contributor) commented:
For me, CLBlast provides a ~12.5% speedup compared to vanilla whisper.cpp

Vanilla whisper:

whisper_print_timings:     fallbacks =  11 p /  20 h
whisper_print_timings:     load time =   184.15 ms
whisper_print_timings:      mel time =  1010.46 ms
whisper_print_timings:   sample time =  2715.31 ms /  2306 runs (    1.18 ms per run)
whisper_print_timings:   encode time = 23960.04 ms /    11 runs ( 2178.19 ms per run)
whisper_print_timings:   decode time = 17336.37 ms /  2266 runs (    7.65 ms per run)
whisper_print_timings:    total time = 45225.82 ms

CLBlast:

whisper_print_timings:     fallbacks =   8 p /  23 h
whisper_print_timings:     load time =    93.86 ms
whisper_print_timings:      mel time =   950.52 ms
whisper_print_timings:   sample time =  2599.31 ms /  2202 runs (    1.18 ms per run)
whisper_print_timings:   encode time = 20387.99 ms /    11 runs ( 1853.45 ms per run)
whisper_print_timings:   decode time = 16615.79 ms /  2170 runs (    7.66 ms per run)
whisper_print_timings:    total time = 40669.39 ms


OpenBLAS:
whisper_print_timings:     fallbacks =  33 p /  77 h
whisper_print_timings:     load time =   132.15 ms
whisper_print_timings:      mel time =   948.88 ms
whisper_print_timings:   sample time =  9583.82 ms /  7918 runs (    1.21 ms per run)
whisper_print_timings:   encode time = 113946.63 ms /    35 runs ( 3255.62 ms per run)
whisper_print_timings:   decode time = 121515.83 ms /  7790 runs (   15.60 ms per run)
whisper_print_timings:    total time = 246182.56 ms

Test configuration:

  • i5-8365U / UHD Graphics 620, Arch Linux
  • whisper.cpp 1.2.0

@ilovefreesw commented:
> For me, CLBlast provides a ~12.5% speedup compared to vanilla whisper.cpp
> […timings quoted from @marmistrz's comment above…]

That's odd. Mine is the opposite: it's two times slower than vanilla. I'm on an Intel UHD 630 on Windows 11.

@nullr0ute commented:
Out of interest, what's the command people are using to get the above stats? I'm looking at various options via CLBlast and would be interested in being able to provide comparable perf feedback :)
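For what it's worth, the whisper_print_timings blocks earlier in the thread are what the main example prints at the end of any normal transcription run; a minimal sketch (model and sample paths are just examples):

```sh
# The whisper_print_timings summary is printed after a normal run
./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8
```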
