
Streaming discussion #7

Open
Moelf opened this issue Apr 26, 2023 · 2 comments

Moelf commented Apr 26, 2023

So there's not really a streaming API; it's more of a proof of concept from https://github.com/ggerganov/whisper.cpp/blob/master/examples/stream/stream.cpp

the main idea is this:

  1. you start with some buffer (the audio_async class is a thin wrapper around a circular buffer)
    https://github.com/ggerganov/whisper.cpp/blob/70567eff232773d6786c91585d040f53c36b87a4/examples/common-sdl.h#L15

  2. in the !use_vad case, you simply wait until enough audio is available, and audio.get(params.length_ms, pcmf32) dumps it into the float32 vector pcmf32

  3. run whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size()) normally

  4. use whisper_full_n_segments(ctx) and whisper_full_get_segment_text(ctx, i) normally

  5. the only difference is that afterwards you want to add the tokens from the last full segment into wparams.prompt_tokens for the next segment

The general idea of the audio buffer is to pad n seconds (n < 30) out to 30 s, so as you speak you're running inference on 1 s of speech + 29 s of silence, then 2 s + 28 s of silence, etc., depending on how large step_ms is.
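A minimal self-contained sketch of this padding idea (pad_to_window is a hypothetical helper, not whisper.cpp code; in practice you would pass the padded window, or even the unpadded samples, straight to whisper_full):

```cpp
#include <cstddef>
#include <vector>

// 16 kHz mono float samples, 30 s fixed context -- the shape Whisper expects.
constexpr int kSampleRate = 16000;
constexpr int kWindowSec  = 30;

// Hypothetical helper: take n seconds of captured audio (n < 30) and pad the
// tail with zeros ("silence") up to a full 30 s inference window.
std::vector<float> pad_to_window(const std::vector<float>& audio) {
    std::vector<float> window(audio);
    window.resize(static_cast<std::size_t>(kSampleRate) * kWindowSec, 0.0f);
    return window;
}
```

So with step_ms = 1000, each step the speech prefix grows by one second while the silence padding shrinks by one second.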


In the use_vad case, we have more pcmf32-related vectors for swapping audio data around (a sliding window):
https://github.com/ggerganov/whisper.cpp/blob/70567eff232773d6786c91585d040f53c36b87a4/examples/stream/stream.cpp#L162-L164

pcmf32 and friends are the actual sample buffers you copy to and from for direct use.
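The sliding-window bookkeeping can be sketched like this (names echo pcmf32 / pcmf32_old / pcmf32_new from stream.cpp, but next_window and the explicit n_keep argument are illustrative, not the example's actual code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Keep the last n_keep samples of the previous inference window and prepend
// them to the freshly captured samples, so consecutive windows overlap and
// words spanning a window boundary aren't cut in half.
std::vector<float> next_window(const std::vector<float>& pcmf32_old,
                               const std::vector<float>& pcmf32_new,
                               std::size_t n_keep) {
    const std::size_t take = std::min(n_keep, pcmf32_old.size());
    std::vector<float> pcmf32(pcmf32_old.end() - take, pcmf32_old.end());
    pcmf32.insert(pcmf32.end(), pcmf32_new.begin(), pcmf32_new.end());
    return pcmf32;
}
```

In stream.cpp the overlap size is derived from keep_ms; here it is passed in directly to keep the sketch small.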

aviks (Owner) commented Apr 26, 2023

I know @jpsamaroo was experimenting along these lines.

jpsamaroo (Contributor) commented May 9, 2023

As a quick overview of what I implemented:

I use PortAudio.jl to provide the input stream in 4-second increments, and write it into a rotating buffer 5 seconds in total length (these periods are configurable; they just seem to work for me). I convert all audio to 16 kHz with this code from the README:

using SampledSignals  # provides SampleBuf, SampleBufSink, SampleBufSource

# Whisper expects 16kHz sample rate and Float32 data; `s` is the input SampleBuf
sout = SampleBuf(Float32, 16000, round(Int, length(s)*(16000/samplerate(s))), nchannels(s))
write(SampleBufSink(sout), SampleBufSource(s))  # Resample
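For comparison, the same rate conversion can be sketched in C++ as naive linear interpolation (illustrative only; SampledSignals.jl and production code use proper filtered resampling to avoid aliasing):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive linear-interpolation resampler from src_rate to dst_rate (16 kHz by
// default, which is what Whisper expects). For each output sample, find its
// fractional position in the input and blend the two neighboring samples.
std::vector<float> resample_linear(const std::vector<float>& in,
                                   double src_rate, double dst_rate = 16000.0) {
    if (in.empty()) return {};
    const std::size_t n_out =
        static_cast<std::size_t>(std::llround(in.size() * dst_rate / src_rate));
    std::vector<float> out(n_out);
    for (std::size_t i = 0; i < n_out; ++i) {
        const double pos   = i * src_rate / dst_rate;
        const std::size_t i0 = static_cast<std::size_t>(pos);
        const std::size_t i1 = std::min(i0 + 1, in.size() - 1);
        const double frac  = pos - static_cast<double>(i0);
        out[i] = static_cast<float>((1.0 - frac) * in[i0] + frac * in[i1]);
    }
    return out;
}
```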

All of this happens continuously in one task, and a copy of the 5-second buffer is pushed into a Channel every time we sample 4 seconds from the PortAudio stream. In another task, I then transcribe with Whisper using the maximum number of threads (will post a PR for wparams configuration momentarily), append the result to an ever-growing string (the full transcription from start to end), and print the latest result.
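The rotating-buffer half of that pipeline can be sketched in C++ (the actual implementation is Julia; RotatingBuffer and its methods are illustrative names, and snapshot() stands in for the copy that gets pushed into the Channel):

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// A rotating (ring) buffer of audio samples: new chunks are appended and the
// oldest samples dropped so at most `capacity` samples are retained
// (e.g. 5 s * 16000 for a 5-second buffer of 16 kHz audio).
class RotatingBuffer {
public:
    explicit RotatingBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Append a captured chunk (e.g. 4 s of samples), evicting the oldest
    // samples once the buffer exceeds its capacity.
    void push(const std::vector<float>& chunk) {
        for (float s : chunk) buf_.push_back(s);
        while (buf_.size() > capacity_) buf_.pop_front();
    }

    // Copy of the current contents -- what the transcription task consumes.
    std::vector<float> snapshot() const {
        return std::vector<float>(buf_.begin(), buf_.end());
    }

private:
    std::size_t capacity_;
    std::deque<float> buf_;
};
```

In the Julia version the producer task does the push after each PortAudio read and puts the snapshot on the Channel; the consumer task blocks on the Channel and runs Whisper.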

Gist here: https://gist.github.com/jpsamaroo/aff348ae04f392f1e8683b59cbe6bda7

One thing you'll notice is the BAD THINGS HAPPENED marker, which flags something I've observed: sometimes Whisper gets "stuck" and just repeats the same transcription (usually after about 60 seconds of this "streaming" transcription). It's probably just something I'm doing wrong, but if anyone has ideas on why it happens, I'd love to know!
