
Recommendations for performance when running whisper.cpp on VPS? #524

Open · jalustig opened this issue Feb 22, 2023 · 3 comments
Labels: performance (CPU and memory usage - results and comparisons)

@jalustig commented Feb 22, 2023

I'm experimenting with running whisper at scale on a VPS cluster, but I'm not getting good performance; it is quite slow even on dedicated CPU hardware. Here are the CPU stats that ./main outputs:

system_info: n_threads = 2 / 2 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

Is the lack of BLAS one potential reason why it's slow? I have also specifically built it with OpenBLAS, but for some reason it isn't actually running with BLAS (the output above still shows BLAS = 0).
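For reference, a typical OpenBLAS rebuild looks something like the sketch below. The WHISPER_OPENBLAS Makefile flag and the bundled samples/jfk.wav test file are assumptions based on whisper.cpp checkouts from around this period, so check the README of your version.

```bash
# Install the OpenBLAS development package first (Debian/Ubuntu name).
sudo apt-get install libopenblas-dev

# Clean rebuild with the OpenBLAS flag assumed here (WHISPER_OPENBLAS).
make clean
WHISPER_OPENBLAS=1 make -j

# Verify: the system_info line should now report BLAS = 1.
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```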

@nxtreaming

I use the following VPS configuration; it reaches about 90% of realtime, i.e. 36 minutes of audio takes about 40 minutes to process:

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

@ggerganov ggerganov added the performance CPU and memory usage - results and comparisons label Feb 27, 2023
@ggerganov (Owner)

I don't think you can do anything at the moment to improve the performance.
In the future, quantised models might be useful for such use cases, so keep track of progress in #540

@sigaloid commented Jul 4, 2023

In my experience it's very hard to improve performance without offloading to a GPU. Even throwing dramatically more cores at it does not help: an AMD EPYC 7532 at 128/128 threads runs no faster than at 12/128.
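One way to see this plateau for yourself is a quick thread sweep. This is a sketch assuming the -t/--threads flag of ./main and the bundled samples/jfk.wav, and it relies on the timing summary whisper.cpp prints (to stderr) at the end of each run.

```bash
# Sweep thread counts and compare the total time whisper.cpp reports.
# Assumes ./main, a downloaded ggml model, and the bundled JFK sample.
for t in 2 4 8 16 32 64; do
    echo "=== threads: $t ==="
    ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t "$t" 2>&1 \
        | grep "total time"
done
```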

The sweet spot is probably 6-8 cores per instance, quantized if accuracy allows, and scaling the workload out across a cluster.
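As a sketch of that setup, hedged on two assumptions: the quantize tool that later whisper.cpp versions ship for producing q5_0 models, and GNU parallel for fanning input files out across a fixed number of worker slots.

```bash
# Quantize the model once (the quantize tool and the q5_0 type are
# assumptions based on later whisper.cpp releases; adjust to your checkout).
./quantize models/ggml-large.bin models/ggml-large-q5_0.bin q5_0

# Run four 8-thread instances side by side with GNU parallel,
# writing one .txt transcript per input file.
parallel -j 4 ./main -m models/ggml-large-q5_0.bin -t 8 -otxt -f {} ::: audio/*.wav
```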

I wish whisper.cpp scaled better; there is some performance discussion in #200, and hopefully this can be improved over time.

Even when running with three Titan RTXs, a 24-core Xeon E5-2643 v4 setup, and 512 GB of RAM, I only get 1.66x realtime for the large model (~7 min for ~13 min of audio). If nothing else, this shows you cannot simply throw more resources at it to speed it up.

Compared to openai/whisper proper, the Python implementation handles offloading to CUDA devices much more efficiently: the same machine running openai/whisper reaches about 7x realtime.

Compared to openai/whisper on CPU, however, whisper.cpp pulls ahead by a long shot. I don't have exact numbers, but openai/whisper manages roughly 0.33x realtime, if I recall, on the large model.
