A possible solution to Whisper hallucination #679

KaiserChr · 2022-12-13T17:01:10Z

KaiserChr
Dec 13, 2022

Hi all,

I was having trouble with whisper creating "ghost transcripts" at the end of a given sound file. These often consist of repeats and shuffles of the text of previous chunks and turned out to be quite detrimental to the overall quality of the transcript.

I looked into this a little and found a possible solution:
The parameter: 'condition_on_previous_text'
This is set to True by default and helps whisper (to my understanding) to keep the context going between chunks.

My working hypothesis is, that the problem arises if the last chunk is short (like a couple of seconds) compared to the text initializing the next chunks transcription. Then the models seems to somehow have a problem with disambiguation and starts "seeing things".

A more elegant solution than just setting condition_on_previous_text to False would be something like this (not properly debugged yet):

After line 178 of whisper/transcribe.py exchange:

decode_options["prompt"] = all_tokens[prompt_reset_since:]

with something like this:

# Lucid Whisper
lucid_threshold = 0.3 #current working name for a threshold to determine permissible chunk length for a healthy transcript
if ((seek + N_FRAMES) / num_frames < 1.0) or (seek == 0):  # first chunk, ergo no context or next chunk will be fully within num_frames ergo should be fine
    decode_options["prompt"] = all_tokens[prompt_reset_since:]
else:  # next chunk will not be fully within num_frames i.e. last chunk, calculate lucid_score
    lucid_score = (num_frames - seek) / N_FRAMES
    if lucid_score < lucid_threshold and "prompt" in decode_options: #Lucid Score below threshold, erasing context!
        decode_options["prompt"] = []
    else: #Lucid Score above threshold, keeping context!
        decode_options["prompt"] = all_tokens[prompt_reset_since:]

Im still tinkering and debugging, but so far it worked for me and I just wanted to post it in case it helps anybody else!

p.s.
I was encountering the problem when working with german audio on the medium and large models

EDIT: 14.12.2022 further debugged and formatted code a bit better

UPDATE: 21.12.2022 Approach using VAD, see my post below

abodacs · 2022-12-13T18:00:47Z

abodacs
Dec 13, 2022

Thanks @KaiserChr
I tried to a bit formate your solution

#current working name for a threshold to determine permissible chunk length for a healthy transcript
lucid_threshold = 0.3 
#first chunk, ergo no context or next chunk will be fully within num_frames
if ((seek+N_FRAMES)/num_frames < 1.0) or (seek == 0): 
    if "prompt" in decode_options:
        decode_options["prompt"] = all_tokens[prompt_reset_since:]
    else: #next chunk is not first chunk and will not be fully within num_frames i.e. last chunk, calculate lucid_score
        lucid_score = (num_frames - seek) / N_FRAMES
if lucid_score < lucid_threshold and "prompt" in decode_options:
    decode_options["prompt"] = []
else:
    decode_options["prompt"] = all_tokens[prompt_reset_since:]

2 replies

KaiserChr Dec 13, 2022
Author

Thank you very much!

KaiserChr Dec 14, 2022
Author

Small update: I further debugged and updated the code in the initial post, so the one you kindly formatted for me (looked up how to do it better myself in the meantime) is deprecated.

dgoryeo · 2022-12-14T11:13:29Z

dgoryeo
Dec 14, 2022

Well done @KaiserChr! What is the right step to suggest merging this into main Whisper code? Would be great to get a solution like this as early as possible.

1 reply

KaiserChr Dec 14, 2022
Author

Thank you for the kind words!
I'm still debugging and the fix suggested above doesn't totally eliminate the problem in all cases (but seems to make it significantly better).
Still trying to figure out why the problem still persists in some cases, and will update the initial post with my findings!

crecos · 2022-12-16T00:08:42Z

crecos
Dec 16, 2022

I tried using this solution:

But the output is still the same:

[00:00.000 --> 00:02.000]  You
[00:30.000 --> 00:32.000]  You
[01:00.000 --> 01:02.000]  You
[01:30.000 --> 01:32.000]  You
[02:00.000 --> 02:02.000]  You
[02:30.000 --> 02:32.000]  You
[03:00.000 --> 03:02.000]  You
[03:08.000 --> 03:10.000]  You
[03:16.000 --> 03:18.000]  You
[03:18.000 --> 03:20.000]  You
[03:29.000 --> 03:31.000]  You
[03:31.000 --> 03:33.000]  You
[03:42.000 --> 03:44.000]  You
[03:44.000 --> 03:46.000]  You
[03:46.000 --> 03:48.000]  You
[03:59.000 --> 04:01.000]  You

I'm doing something wrong?

C:\Users\whisper\AppData\Roaming\Python\Python310\Scripts\whisper.exe C:\whisper\quiet-place.mp3 --language English --model large

1 reply

KaiserChr Dec 16, 2022
Author

Have you tried setting 'condition_on_previous_text' to False?

If with this setting the problem still persists, my (possible) fix unfortunately wont help, bc its just a way to do the same thing, not generally but only for chunks that, as to my current understanding, may cause the problem.

My working hypothesis is that, when the problem is underdefined (in that there is too little information provided by the encoder) the decoder may show an aberrant behavior. Especially for very short or very quiet chunks this seems to be the case.
So far I have not managed to eliminate the problem, only alleviating it for some (most) of my samples.

I am still trying different approaches to push the model "over the brink" in a reproducible manner to verify my hypothesis.

crecos · 2022-12-16T11:02:10Z

crecos
Dec 16, 2022

Yes, I didn't set

--condition_on_previous_text False

in the previous example.

I have now launched Whisper with:

whisper.exe C:\tmp-whisper\quiet-place.mp3 --language English --model large --condition_on_previous_text False

but the result is still not good:

[23:45.500 --> 24:12.500]  Come a little bit closer, hear what I have to say.
[24:12.500 --> 24:35.500]  Just like children sleeping, we could dream this night away.
[24:35.500 --> 24:42.940]  There's a full moon rising, let's go dancing in the light.
[24:42.940 --> 24:58.740]  We know where the music's playing, let's go out and feel the night.
[25:12.940 --> 25:14.940]  You
[25:42.940 --> 25:52.940]  ♪♪
[25:52.940 --> 26:02.960]  ♪♪
[26:02.960 --> 26:12.960]  ♪♪
[26:12.960 --> 26:22.980]  ♪♪
[26:52.980 --> 26:54.980]  You
[27:52.980 --> 27:53.980]  Thank you.
[27:53.980 --> 27:54.980]  Thank you.
[27:54.980 --> 27:55.980]  Thank you.
[27:55.980 --> 27:57.980]  Thank you.
[28:25.980 --> 28:35.980]  ♪♪
[28:35.980 --> 28:46.000]  ♪♪
[28:46.000 --> 28:48.340]  It's alright, it's alright.
[29:16.000 --> 29:26.000]  ♪♪
[29:26.000 --> 29:36.020]  ♪♪
[29:36.020 --> 29:41.020]  and
[29:41.020 --> 29:46.020]  and
[29:46.020 --> 29:51.020]  and
[29:51.020 --> 29:56.020]  and
[29:56.020 --> 30:01.020]  and

Thank you for your effort

0 replies

whicks1 · 2022-12-16T15:33:34Z

whicks1
Dec 16, 2022

In my shop we hacked together a python script to clean up the vtt output to make it a bit more normative (always add 00: hour timestamps), add cue IDs, shorten the line lengths, add some NOTE metadata and misc other changes. For this problem we read the file into the webvtt library and compare the text of each cue to the previous. If they are exactly equal, we drop the current and move on to the next. Not perfect by any stretch and you still will have the 1st ghost but it beats having the endless repeats.

2 replies

dgoryeo Dec 17, 2022

Thanks @whicks1 , by any chance would you be able to share your script?

silvacarl2 Dec 18, 2022

question: couldnt it detect really poor transcription but just summarizing all of the match probabilitiea and if it is less than 75%, throw out the transcription?

i am trying to find a way that if the recording is really bad, instead of just making stuff up it returns nothing.

fleek · 2022-12-19T14:01:05Z

fleek
Dec 19, 2022

The best way is to split the audio file up into audio slices using VAD and feeding into whisper. There is a parameter "condition_on_previous_text", set it to false to force it to forget, the problem is that when this parameter is true, it remembers what it output previously and if the current output cannot produce anything, it will just use the last output. With the parameter set to false, it will forget and if it cannot decipher the output it will just output blank.

But I find that it is still best to just transcribe slices of audio, especially if it is conversational, because every statement itself will express an idea unlike long essays when the whole paragraph expresses an idea.

Sometimes being able to infer from the previous statement may not be the best idea for conversations.

Remember whisper is trained using mostly subtitles from youtube, so you will get funny outputs sometimes.

5 replies

silvacarl2 Dec 19, 2022

thx, testing it out now!

dgoryeo Dec 19, 2022

@silvacarl2 , take a look at this implementation with VAD: it does a reasonably good job:
https://colab.research.google.com/github/ANonEntity/WhisperWithVAD/blob/main/WhisperWithVAD.ipynb

rudymohammadbali Jul 3, 2023

@dgoryeo Does this work with 2 and 3 seconds of audio files? because I'm making real-time transcription.

dgoryeo Jul 4, 2023

@iamironman0 , I don't believe so --I think one shouldn't need VAD for any short audio segments.

rudymohammadbali Jul 4, 2023

@dgoryeo thanks for replying. however is there a solution for short audio files?

silvacarl2 · 2022-12-19T21:14:58Z

silvacarl2
Dec 19, 2022

yeah this is a riot, trying it against a few difficult cases now./

12 replies

fleek Dec 21, 2022

options = {"task":"transcribe","language":"English","fp16":fp16,'no_speech_threshold':0.1, "condition_on_previous_text": False, "logprob_threshold": -1.00, "without_timestamps":True}
result = whisper.transcribe(wmodel, audio=s['fname'], verbose=False, **options)

silvacarl2 Dec 21, 2022

that might work. the other idea i had was if the overall average of the transcription words is less than say 65%, it is probably wrong or somewhat wrong.

silvacarl2 Dec 21, 2022

,"fp16":fp16 - is there an fp32?

'no_speech_threshold':0.1 - this looks interesting, need to find more on it.

"condition_on_previous_text": False - yup checking that out.

"logprob_threshold": -1.00 - what does this do?

fleek Dec 21, 2022

,"fp16":fp16 - is there an fp32?

'no_speech_threshold':0.1 - this looks interesting, need to find more on it.

"condition_on_previous_text": False - yup checking that out.

"logprob_threshold": -1.00 - what does this do?

fp16 is boolean, it can be set to true only if running on GPU, otherwise set it to false. It is more to save memory on the gpu if your gpu has little memory. If you set to true and running on CPU, it will still default to fp32 but will throw out some warning message to say that fp16 is only available on GPU.

no_speech_threshold is to tell whisper that as long as whisper thinks there is 10% chance of no-speech, just treat it as no speech. Too high a value and whisper will think there is speech it cannot figure out and it will just throw some random rubbish. I find it good to set a low value especially if the audio file has music or lots of noise.

logprob_threshold, I can't remember that this really means, it has been sometime since I spent time tweaking the parameters. I think it has to do something with the logmel processing.

fleek Dec 21, 2022

I have uploaded my code to https://github.com/fleek/VADtransciber, you can give it a try.

silvacarl2 · 2022-12-20T19:45:07Z

silvacarl2
Dec 20, 2022

cool, will have a look at that.

0 replies

silvacarl2 · 2022-12-21T00:36:11Z

silvacarl2
Dec 21, 2022

as soon as i get a chance to test this, i will post here. we have 10M samples to work from LOL.

0 replies

KaiserChr · 2022-12-21T08:42:16Z

KaiserChr
Dec 21, 2022
Author

Hi,

I looked into the problem some more and found out a couple of things:
Firstly, we also implemented a version using VAD (right now its webrtcvad but we are looking to upgrade to Silero as I heared a lot of good things about it), as long passages of silence seem to confuse the decoder.
Secondly, as our goal right now is a real time solution using the medium model for quality (in german), I played around with the parameters and found that these:

        kwargs['language'] = 'de'
        kwargs['verbose'] = True
        kwargs['task'] = 'transcribe'
        kwargs['temperature'] = 0
        kwargs['best_of'] = None
        kwargs['beam_size'] = None
        kwargs['patience'] = None
        kwargs['length_penalty'] = None
        kwargs['suppress_tokens'] = "-1"
        kwargs['initial_prompt'] = None
        kwargs['condition_on_previous_text'] = False 
        kwargs['fp16'] = True #for GPU 
        kwargs['compression_ratio_threshold'] = 2.4
        kwargs['logprob_threshold'] = -0.5
        kwargs['no_speech_threshold'] = 0.2

seem to work really well IF you use Cuda (for medium model), as the performance on CPU was so slow that the queue was filling up in an unacceptable manner, because the model sometimes tries to understand a silent part of the audio multiple times (a behavior I tried to curb with the parameters given above), and without GPU that is just too slow.

Further debugging makes me believe the decoder is the source for this slowdown, as the _main_loop in the decoding.py submodule takes a long time feeding the tokens and the current audio_features to the decoder.

With GPU this loop is so fast that no significant slowdown is noticable, albeit at a large hardware cost.

Once we have a working version with Silero I plan to put the code we use up here on Github so anyone interested can take a look.

Please note that this version does not require the code I posted on the start of the Thread, as with a VAD typically utterances are much less than 30 seconds and therefore its seems more stable to deactivate condition_on_previous_text altogether, because VAD also seems to help a lot getting the model to ignore silent passages (by not presenting them).

27 replies

fleek Dec 30, 2022

Problem is PowerShell. Under CMD all running. Now I testing

Try cmder, it is the only shell that I use

crecos Dec 30, 2022

Yes, under cmd and cmder VADtranscriber working without errors in terminal. But if i run VADtranscriber with cuda parameter process is handled by the CPU, not the GPU. I dont know why...

Whisper on GPU working correctly
whisper.exe C:\VAD\en.mp3 --language English --model large

Second problem is that the json output is created but does not contain any words
en_chunk.json.txt

en_temp.wav is converted correctly and contains voice.

crecos Dec 30, 2022

Run cmder as administrator help.

C:\VAD
λ VADtranscriber.py
Converting MP4 to Wav
work/eminem_temp.wav
Loading Silero VAD model
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to C:\Users\whisper/.cache\torch\hub\master.zip
Read Wav file
getting speech timestamps
generate chunk list
Loading Diarization Model
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83.3M/83.3M [00:08<00:00, 9.72MB/s] Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.92k/1.92k [00:00<00:00, 122kB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.53M/5.53M [00:03<00:00, 1.75MB/s] Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 129k/129k [00:00<00:00, 434kB/s] Loading Audio
Could not load symbol cublasGetSmCountTarget from cublas64_11.dll. Error code 127
Diarise
idx=0 start=0.5s stop=4.7s SPEAKER_01
idx=1 start=5.4s stop=10.5s SPEAKER_01
idx=2 start=11.2s stop=14.5s SPEAKER_01
idx=3 start=16.6s stop=23.0s SPEAKER_01
idx=4 start=25.1s stop=40.5s SPEAKER_00
idx=5 start=45.5s stop=53.0s SPEAKER_00
idx=6 start=53.9s stop=69.1s SPEAKER_00
idx=7 start=69.0s stop=73.7s SPEAKER_01
idx=8 start=76.8s stop=79.2s SPEAKER_01
idx=9 start=80.4s stop=86.5s SPEAKER_01
idx=10 start=87.5s stop=90.0s SPEAKER_01
idx=11 start=102.5s stop=131.0s SPEAKER_00
idx=12 start=131.6s stop=134.0s SPEAKER_00
idx=13 start=135.5s stop=146.5s SPEAKER_00
idx=14 start=146.5s stop=150.9s SPEAKER_01
idx=15 start=153.8s stop=156.4s SPEAKER_01
idx=16 start=157.6s stop=162.0s SPEAKER_01
idx=17 start=163.1s stop=167.5s SPEAKER_01
idx=18 start=168.2s stop=168.5s SPEAKER_01
idx=19 start=180.0s stop=211.3s SPEAKER_00
idx=20 start=212.1s stop=227.8s SPEAKER_00
idx=21 start=229.3s stop=230.0s SPEAKER_01
idx=22 start=235.1s stop=236.4s SPEAKER_01
idx=23 start=237.6s stop=240.5s SPEAKER_01
idx=24 start=246.0s stop=247.0s SPEAKER_01
Populate Speakers
s['idx']=2:  s['start']/SAMPLING_RATE=16.862  sp['start']-0.2=16.3628125   | s['end']/SAMPLING_RATE=18.3  sp['end']+0.2=23.1921875
s['idx']=3:  s['start']/SAMPLING_RATE=18.398  sp['start']-0.2=16.3628125   | s['end']/SAMPLING_RATE=21.276  sp['end']+0.2=23.1921875
Loading Silero VAD model
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to C:\Users\whisper/.cache\torch\hub\master.zip
Forming speech slices
Loading Whisper medium.en model
100%|█████████████████████████████████████| 1.42G/1.42G [01:21<00:00, 18.8MiB/s]
Doing transcription and create srt
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 988/988 [00:04<00:00, 239.83frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 460/460 [00:01<00:00, 233.94frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 143/143 [00:00<00:00, 154.43frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 287/287 [00:01<00:00, 164.58frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 191/191 [00:01<00:00, 145.09frames/s] 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4886/4886 [00:45<00:00, 107.19frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 518/518 [00:02<00:00, 188.07frames/s] 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1507/1507 [00:05<00:00, 274.68frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 259/259 [00:01<00:00, 174.79frames/s] 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 141.84frames/s] 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4876/4876 [00:44<00:00, 108.48frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 518/518 [00:02<00:00, 192.88frames/s] 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1334/1334 [00:05<00:00, 248.12frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 278/278 [00:01<00:00, 197.89frames/s] 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5126/5126 [00:41<00:00, 122.24frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 575/575 [00:02<00:00, 211.87frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 470/470 [00:02<00:00, 233.25frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 854/854 [00:03<00:00, 255.76frames/s] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 268/268 [00:01<00:00, 188.77frames/s] 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:00<00:00, 132.07frames/s] vad/eminem_00019.wav: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [03:00<00:00,  9.04s/it,  you] -----------------Transcription Complete-------------------

but the output file is still empty:

[
  {
    "start": 992,
    "end": 159168,
    "idx": 0,
    "text": "",
    "fname": "vad/eminem_00000.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 179168,
    "end": 252864,
    "idx": 1,
    "text": "",
    "fname": "vad/eminem_00001.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 269792,
    "end": 292800,
    "idx": 2,
    "text": "",
    "fname": "vad/eminem_00002.wav",
    "speaker": "<Speaker2> "
  },
  {
    "start": 294368,
    "end": 340416,
    "idx": 3,
    "text": "",
    "fname": "vad/eminem_00003.wav",
    "speaker": "<Speaker2> "
  },
  {
    "start": 352736,
    "end": 383424,
    "idx": 4,
    "text": "",
    "fname": "vad/eminem_00004.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 403424,
    "end": 1185216,
    "idx": 5,
    "text": "",
    "fname": "vad/eminem_00005.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 1189856,
    "end": 1272768,
    "idx": 6,
    "text": "",
    "fname": "vad/eminem_00006.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 1285088,
    "end": 1526208,
    "idx": 7,
    "text": "",
    "fname": "vad/eminem_00007.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 1544672,
    "end": 1586112,
    "idx": 8,
    "text": "",
    "fname": "vad/eminem_00008.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 1606112,
    "end": 1618368,
    "idx": 9,
    "text": "",
    "fname": "vad/eminem_00009.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 1641440,
    "end": 2421696,
    "idx": 10,
    "text": "",
    "fname": "vad/eminem_00010.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 2426336,
    "end": 2509248,
    "idx": 11,
    "text": "",
    "fname": "vad/eminem_00011.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 2521568,
    "end": 2735040,
    "idx": 12,
    "text": "",
    "fname": "vad/eminem_00012.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 2781152,
    "end": 2825664,
    "idx": 13,
    "text": "",
    "fname": "vad/eminem_00013.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 2879456,
    "end": 3699648,
    "idx": 14,
    "text": "",
    "fname": "vad/eminem_00014.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 3701216,
    "end": 3793344,
    "idx": 15,
    "text": "",
    "fname": "vad/eminem_00015.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 3801056,
    "end": 3876288,
    "idx": 16,
    "text": "",
    "fname": "vad/eminem_00016.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 3888608,
    "end": 4025280,
    "idx": 17,
    "text": "",
    "fname": "vad/eminem_00017.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 4060640,
    "end": 4103616,
    "idx": 18,
    "text": "",
    "fname": "vad/eminem_00018.wav",
    "speaker": "<Speaker?>"
  },
  {
    "start": 4120544,
    "end": 4131264,
    "idx": 19,
    "text": "",
    "fname": "vad/eminem_00019.wav",
    "speaker": "<Speaker?>"
  }
]

I will probably look for the problem in:

Could not load symbol cublasGetSmCountTarget from cublas64_11.dll. Error code 127

crecos Dec 30, 2022

DLL library exists, I have no idea why the problem occurs

C:\VAD
λ ls -l "c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin" | grep cublas
-rwxr-xr-x 1 whisper 197121  88668160 zář  7 00:12 cublas64_11.dll*
-rwxr-xr-x 1 whisper 197121 544070144 zář  7 00:12 cublasLt64_11.dll*

C:\VAD

crecos Dec 30, 2022

OK. Working.

Only speaker identification is not perfect

0
00:00:00,062 --> 00:00:09,948
<Speaker?> Just gonna stand there and watch me burn But that's alright because I like the waiter

1
00:00:11,198 --> 00:00:15,804
<Speaker?> Just gonna stand there and hear me cry

2
00:00:16,862 --> 00:00:18,300
<Speaker2> It's alright

3
00:00:18,398 --> 00:00:21,276
<Speaker2> Cause I love the way you lie

4
00:00:22,046 --> 00:00:23,964
<Speaker?> Love the way you lie

5
00:00:25,214 --> 00:01:14,076
<Speaker?> I can't tell you what it really is I can only tell you what it feels like And right now there's a steel knife in my windpipe I can't breathe but I still fight while I can fight As long as the wrong feels right it's like I'm in flight High off a log, drunk from my hate It's like I'm huffin' paint and I love her the more I suffer I suffocate and right before I'm about to drown She resuscitates me, she fuckin' hates me And I love it, wait, where you goin'? I'm leavin' you No you ain't come back, we're runnin' right back Here we go again, it's so insane It's going good, it's going great I'm Superman with the wind in his back She's slow as land, but when it's bad it's awful I feel so ashamed, I snap, who's that dude? I don't even know his name, I laid hands on him I never stoop so low again, I guess I don't know my own strength Just gonna stand there and watch me burn

6
00:01:14,366 --> 00:01:19,548
<Speaker?> But that's alright because I like the way it hurts

7
00:01:20,318 --> 00:01:35,388
<Speaker?> Just gonna stand there and hear me cry Well that's alright, because I love the way you lie I love the way you lie

8
00:01:36,542 --> 00:01:39,132
<Speaker?> Love the way you lie

9
00:01:40,382 --> 00:01:41,148
<Speaker?> you

10
00:01:42,590 --> 00:02:31,356
<Speaker?> You ever loved somebody so much you can barely breathe when you're with them you meet and neither one of you even know what hit them got that warm fuzzy feeling get them chills used to get them now you're getting fucking sick of looking at them you swore you'd never hit them never do nothing to hurt them now you're in each other's face spewing venom in your words when you spit them you bush pull each other's hair scratch clobbed hit them throw them down pen'ems are lost in the moments when you're in them Mr. Raised it to go Rick controls your boat so they say it best to call you separate ways guess that they don't know you cuz today that was yesterday Yesterday is over, it's a different day. It's not like broken records playing over, but you promised them. Next time you show it straight, you don't get another chance. Life is no Nintendo game, but you lot of game. Now you get to watch her leave out the window. Guess that's why they call it windowpane. Just gonna stand there and watch me burn.

11
00:02:31,646 --> 00:02:36,828
<Speaker?> But I saw her run because I liked the way it hurts

12
00:02:37,598 --> 00:02:50,940
<Speaker?> Just gonna stand there and hear me cry But that's alright, because I love the way you lie I love the way you lie

13
00:02:53,822 --> 00:02:56,604
<Speaker?> Love the way you lie

14
00:02:59,966 --> 00:03:51,228
<Speaker?> Now I know we said things, did things that we didn't mean And we fall back into the same patterns, same routine But your temper's just as bad as mine is, you're the same as me When it comes to love you're just as blinded Baby please come back, it wasn't you, baby it was me Maybe our relationship isn't as crazy as it seems Maybe that's what happens when a tornado meets a volcano All I know is I love you too much to walk away though Come inside, pick up your bags off the sidewalk Don't you hear sincerity? And my voice when I talk told you this is my fault me in the eyeball Next time I'm pissed, I lay my fist at the drywall Next time, there will be no next time I apologize even though I know it's lies I'm tired of the games, I just want her back I know I'm alive, she ever tries to fucking leave again I'ma tie her to the bed and set this house on fire Just gonna stand there and watch me burn

15
00:03:51,326 --> 00:03:57,084
<Speaker?> Well that's alright because I like the way it hurts

16
00:03:57,566 --> 00:04:02,268
<Speaker?> Just gonna stand there and hear me cry

17
00:04:03,038 --> 00:04:11,580
<Speaker?> It's alright because I love the way you lie I love the way you lie

18
00:04:13,790 --> 00:04:16,476
<Speaker?> Love the way you lie

19
00:04:17,534 --> 00:04:18,204
<Speaker?> you

silvacarl2 · 2022-12-21T16:12:16Z

silvacarl2
Dec 21, 2022

I am sorting out the worst case recordings we have to be able to test this, will post back when done.

2 replies

silvacarl2 Dec 21, 2022

I also think there must be a top_p parameter somepalce, which could be lowered a bit to help as well.

fleek Dec 22, 2022

I also think there must be a top_p parameter somepalce, which could be lowered a bit to help as well.

top_p as in ? probability setting?

silvacarl2 · 2022-12-22T16:21:59Z

silvacarl2
Dec 22, 2022

correct. it may not be possible be we are also looking at the inference side of whisper to see which settings could be optimized.

0 replies

KaiserChr · 2022-12-29T08:57:26Z

KaiserChr
Dec 29, 2022
Author

I looked into the best combination of parameters some more and right now find utterance splitting via a vad is the most stable solution for clean chunking. The current version of the code I use for real time transcription in german is using gpu and the medium model, as this gives a very nice combination of speed and precision. Find the code attached below if you want to try it for yourself!

# asr using whisper and Silero-VAD (https://github.com/snakers4/silero-vad)
# structure based on the very nice work of Oliver Guhr over at https://github.com/oliverguhr/wav2vec2-live

import pyaudio
import numpy as np
import threading
import time
from sys import exit
from queue import Queue
import matplotlib.pylab as plt
import wave
import whisper
import struct
import multiprocessing
import torch

filename = 'audio_provided.wav' #for Debugging: save the audiostream that was provided to whisper after sending through queue
filename_orig = 'audio_recorded.wav' #for Debugging: save the audiostream that was actually recorded pre sending.

class Realtime_Whisper():
    exit_event = threading.Event()

    def __init__(self, model_name, device_name="default"):
        self.model_name = model_name
        self.device_name = device_name

    def stop(self):
        """stop the asr process"""
        Realtime_Whisper.exit_event.set()
        self.asr_input_queue.put("close")
        print("asr stopped")


    def start(self):
        """start the asr process"""
        manager = multiprocessing.Manager()

        self.asr_output_queue = Queue()
        self.asr_input_queue = Queue()

        self.visualization_input_queue = manager.Queue() #currently not used, the queue is still in for convenience...

        self.asr_process = threading.Thread(target=Realtime_Whisper.asr_process, args=(
            self.model_name, self.asr_input_queue, self.asr_output_queue,))
        self.asr_process.daemon = True
        self.asr_process.start()

        time.sleep(5)  # start vad after asr model is loaded

        self.vad_process = threading.Thread(target=Realtime_Whisper.vad_process, args=(
            self.device_name, self.asr_input_queue, self.visualization_input_queue, ))
        self.vad_process.daemon = True
        self.vad_process.start()

        #Debug optional visualization
        #self.visualization_process = multiprocessing.Process(target=Realtime_Whisper.plot_stream, args=(
        #    self.visualization_input_queue,))
        # self.visualization_process = threading.Thread(target=Realtime_Whisper.plot_stream, args=(
        #     self.visualization_input_queue,))
        #self.visualization_process.daemon = True
        #self.visualization_process.start()

    def int2float(sound):
        """convert the wav pcm16 format to one suitable for silero vad"""
        _sound = np.copy(sound)  # may be not necessary
        #abs_max = np.abs(_sound).max()
        abs_max = 32767
        _sound = _sound.astype('float32')
        if abs_max > 0:
            _sound *= 1 / abs_max
        _sound = _sound.squeeze()  # depends on the use case
        return _sound


    def plot_stream(instream):
        """plot audio stream via matplotlib"""
        CHUNK = 160
        CHANNELS = 1
        RATE = 16000

        fig, ax = plt.subplots()
        x = np.arange(0, 2 * CHUNK, 2)

        line, = ax.plot(x, np.random.rand(CHUNK), 'r')
        ax.set_ylim(-20000, 20000)
        ax.set_xlim(0, CHUNK)
        fig.show()

        while True:
            data = instream.get()
            dataInt = struct.unpack(str(CHUNK) + 'h', data)
            line.set_ydata(dataInt)
            fig.canvas.draw()
            fig.canvas.flush_events()

    def vad_process(device_name, asr_input_queue, vis_input_queue):
        """voice activity detection using silero-vad"""
        model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                      model='silero_vad',
                                      force_reload=False,
                                      onnx=False)
        (get_speech_timestamps,
         save_audio,
         read_audio,
         VADIterator,
         collect_chunks) = utils

        #not sure this is useful, but I leave it in for now...
        vad_iterator = VADIterator(model)

        audio = pyaudio.PyAudio()
        FORMAT = pyaudio.paInt16
        CHANNELS = 1
        RATE = 16000
        FRAME_DURATION = 60
        CHUNK = int(RATE * FRAME_DURATION / 1000)
        SPEECH_PROB_THRESHOLD = 0.2 # This probably needs a bit of tweaking

        microphones = Realtime_Whisper.list_microphones(audio)
        selected_input_device_id = Realtime_Whisper.get_input_device_id(
            device_name, microphones)
        print('input device id')
        print(microphones)
        print(selected_input_device_id)

        #this should be the default mic, tweak as needed ...
        selected_input_device_id = 1

        stream = audio.open(input_device_index=selected_input_device_id,
                            format=FORMAT,
                            channels=CHANNELS,
                            rate=RATE,
                            input=True,
                            frames_per_buffer=CHUNK)

        #framebuffer for queue
        frames = b''
        #masterframebuffer for saving the data send to asr
        masterframes_asr = b''

        last_speech_prob = 0

        while True:
            if Realtime_Whisper.exit_event.is_set():
                break
            frame = stream.read(CHUNK, exception_on_overflow=False)

            frame_tensor = torch.from_numpy(Realtime_Whisper.int2float(np.frombuffer(frame, dtype=np.int16)))

            speech_prob = model(frame_tensor, RATE).item()

            #turn this on for debugging and tweaking the threshold...
            #print(speech_prob)

            #accumulate frames in frame buffer if speech is detected and the total length is < 30 sec (max size of whisper chunk)
            if speech_prob > SPEECH_PROB_THRESHOLD and len(frames) < 480000: #THIS NEEDS TO BE LOOKED AT AGAIN MAYBE A FULL 30s WHISPER CHUNK IS TOO MUCH
                frames += frame
            #if there was speech and now there is none (i.e. an utterance has finished or the max length is exceeded, write to queue
            elif (speech_prob <= SPEECH_PROB_THRESHOLD < last_speech_prob) or (len(frames) >= 480000):
                asr_input_queue.put(frames)

                masterframes_asr += frames
                frames = b''

            last_speech_prob = speech_prob

        stream.stop_stream()
        stream.close()
        audio.terminate()
        # Open and Set the data of the WAV file
        file = wave.open(filename_orig, 'wb')
        file.setnchannels(1)
        file.setsampwidth(2)
        file.setframerate(16000)

        # Write and Close the File
        file.writeframes(b''.join(np.frombuffer(masterframes_asr, dtype=np.int16)))
        file.close()


    def asr_process(model_name, in_queue, output_queue):
        """transcribe using whisper"""
        model = whisper.load_model(model_name, device='cuda') #use cuda for everything > base model

        #with current settings always excepts to 0, but left in to play around with the setting...
        temperature_increment_on_fallback = 0
        temperature = 0
        try:
            temperature = tuple(np.arange(temperature, 1.0 + 1e-6, temperature_increment_on_fallback))
        except:
            temperature = 0

        kwargs = {}
        kwargs['language'] = 'de'
        kwargs['verbose'] = True
        kwargs['task'] = 'transcribe'
        kwargs['temperature'] = temperature
        kwargs['best_of'] = None
        kwargs['beam_size'] = None
        kwargs['patience'] = None
        kwargs['length_penalty'] = None
        kwargs['suppress_tokens'] = "-1"
        kwargs['initial_prompt'] = None
        kwargs['condition_on_previous_text'] = False  # seems source of false Transcripts
        kwargs['fp16'] = True #set false if using cpu
        kwargs['compression_ratio_threshold'] = None #2.4
        kwargs['logprob_threshold'] = None #-1.0 #-0.5
        kwargs['no_speech_threshold'] = None #0.6 #0.2

        #masterframes = ''
        masterframes = b''

        while True:
            audio_file = in_queue.get()

            if audio_file == "close":
                break

            print("\nlistening to your beautiful voice\n")
            masterframes += audio_file

            audio_tensor = torch.from_numpy(Realtime_Whisper.int2float(np.frombuffer(audio_file, dtype=np.int16)))

            result = model.transcribe(audio_tensor, **kwargs)

            if result != "":
                output_queue.put(result["segments"])

        # Open and Set the data of the WAV file
        file = wave.open(filename, 'wb')
        file.setnchannels(1)
        file.setsampwidth(2)
        file.setframerate(16000)

        file.writeframes(b''.join(np.frombuffer(masterframes, dtype=np.int16)))
        file.close()

    def get_input_device_id(device_name, microphones):
        for device in microphones:
            if device_name in device[1]:
                return device[0]

    def list_microphones(pyaudio_instance):
        info = pyaudio_instance.get_host_api_info_by_index(0)
        numdevices = info.get('deviceCount')

        result = []
        for i in range(0, numdevices):
            if (pyaudio_instance.get_device_info_by_host_api_device_index(0, i).get('maxInputChannels')) > 0:
                name = pyaudio_instance.get_device_info_by_host_api_device_index(
                    0, i).get('name')
                result += [[i, name]]
        return result

    def get_last_text(self):
        """returns the text, sample length and inference time in seconds."""
        return self.asr_output_queue.get()

if __name__ == "__main__":
    print("Live ASR")

    #param is model size
    asr = Realtime_Whisper("medium")


    asr.start()

    last_text = 'Start'

    try:
        while True:
            lastresult = asr.get_last_text()
            for segment in lastresult:
                print('ID: ' + str(segment['id']) + ' START: ' + str(round(segment['start'], 1)) + ' END: ' + str(round(segment['end'], 1)) + ' TEXT: ' + segment['text'])
    except KeyboardInterrupt:
        asr.stop()
        exit()

2 replies

dgoryeo Dec 29, 2022

@KaiserChr , thanks for the findings. Do I read it right that by default you set both beam and temperatiure to zero, i.e deterministic?

KaiserChr Dec 30, 2022
Author

This is correct, as the medium model seems to perform well with it.

larsschwarz · 2023-01-07T17:22:26Z

larsschwarz
Jan 7, 2023

Not sure this is the same root cause, but besides the duplicates everything starting at 28:26:00 is total rubbish (translates to "subtitles on behalf of ZDF"). Obviously the source audio does not include someone saying anything about subtitles. I used the small model on CLI. Haven't tried --condition_on_previous_text yet, but trying the medium model now.

6 replies

larsschwarz Jan 7, 2023

It contains actual dialogue. Running the transcribe on the same file earlier worked fine (using the same config) and resulted into a correct result. Feels like this behavior is caused by temporary memory issues or something? The previous run also did not produce these duplicate lines (26:59, 27:59 and 28:07).

fleek Jan 7, 2023

It contains actual dialogue. Running the transcribe on the same file earlier worked fine (using the same config) and resulted into a correct result. Feels like this behavior is caused by temporary memory issues or something? The previous run also did not produce these duplicate lines (26:59, 27:59 and 28:07).

That's the exact behaviour of AI, you can run through the same audio with the same config 200 times and each time will yield different results, it is kind of like the mood of the AI, ahahaha

larsschwarz Jan 7, 2023

I don't mind slightly different results, but this just seems broken 🤷‍♀️

fleek Jan 7, 2023

I don't mind slightly different results, but this just seems broken 🤷‍♀️

I can't remember where I saw it, but if you can retain the same seed value, you can guarantee same results.

arvog Jan 10, 2023

I can confirm this behaviour for German language videos ending with music or silence. Transcripts end with lines like "Copyright WDR 2021 Untertitel im Auftrag des ZDF, 2017 Untertitel im Auftrag des ZDF, 2017" or "Untertitel der Amara.org-Community.." that have no basis in any actual utterances, but are apparently artifacts from subtitle files the model was trained on.

EtienneAb3d · 2023-01-11T13:54:48Z

EtienneAb3d
Jan 11, 2023

My small contribution to this subject: each of the following reduces the amount of "hallucinated texts" (but the problem still occurs).

use beam_search=5 (time consuming!)
pre-filter pur-blank parts with this command:

ffmpeg -y -i [PathToSoundFile] -af silenceremove=start_periods=1:stop_periods=-1:start_threshold=-50dB:stop_threshold=-50dB:start_silence=0.1:stop_silence=0.1 [PathToFilteredSoundFile]

0 replies

EtienneAb3d · 2023-01-12T07:17:04Z

EtienneAb3d
Jan 12, 2023

Loudness normalization may also be useful, by adding -filter:a loudnorm in the ffmpeg pure-blank filtering:

ffmpeg -y -i [PathToSoundFile] -af "silenceremove=start_periods=1:stop_periods=-1:start_threshold=-50dB:stop_threshold=-50dB:start_silence=0.1:stop_silence=0.1, loudnorm" [PathToFilteredSoundFile]

PS: updated. With ffmpeg multiple filters should be provided at once comma separated.

4 replies

dgoryeo Jan 12, 2023

Do I understand correctly that ffmpeg pure-silence filtering will trim chunks of the audio and therefore affects the timing/sync of the transcription? If that is the case, is there a way to reconstruct the timing/sync?

EtienneAb3d Jan 12, 2023

Yes, it will remove silence parts from the audio file, thus change the timing.
I do not see a simple way to get back the timing sync.
A way to restore it would be to process the file twice, (A) with + (B) without silence parts removal, and compare both results to take the time stamps from (B) to project them to (A). It could work quite well on most cases (with twice the amount of charge).

dgoryeo Jan 12, 2023

I'm wondering may be one approach would be to use something like this:
#467

identify the silent parts, and excluse them from feeding to Whisper.

But then again, wouldn't a VAD achieve the same outcome?

EtienneAb3d Jan 12, 2023

But then again, wouldn't a VAD achieve the same outcome?

Certainly, it should. But, in my experiment, Silero VAD doesn't remove them all properly, especially at beg/end parts, where the observed hallucinations are more likely to occur.

EtienneAb3d · 2023-01-14T16:19:44Z

EtienneAb3d
Jan 14, 2023

As said above, hallucinations are certainly due to bad sync between sounds and texts in subtitle training data (and copyright added to them). Thus, they are more probably at the beg or end of the output.

I observed that hallucinations are very depend on very particular sound configurations. Certainly, if the input is changed a bit, it is highly probable that the hallucination disappears.

My new idea is to add markers, easily recognized by Whisper, at the beg and end of the sound. If the markers are well obtained in the beg and end of the output, simply remove them. If not, something was wrong, certainly an hallucination added, try the same inverting the markers (to try something a bit different). If the markers are still not properly obtained in the output, try with the original sound (to avoid trouble possibly added by the markers).

If timestamps are needed, it's easy to restore them knowing the time length of the markers.

Here is the code:

SAMPLING_RATE = 16000
beam_size=1

result = transcribeMARK(pathIn, lng, mode=1)

def transcribeMARK(path: str,lng: str,mode = 1):
    pathIn = path

    noMarkRE = "^(ar|he|hi|ru|zh)$"
    if(lng != None and re.match(noMarkRE,lng)):
    	#Need special voice marks
    	mode = 0;
    
    if(mode == 1):
        mark1="WOK-MRK.wav"
        mark2="OKW-MRK.wav"
    elif(mode == 2):
        mark2="WOK-MRK.wav"
        mark1="OKW-MRK.wav"
        
    if(mode != 0):
        try:
            pathMRK = pathIn+".MRK"+".wav"
            aCmd = "ffmpeg -y -i "+mark1+" -i "+pathIn+" -i "+mark2+" -filter_complex \"[0:a][1:a][2:a]concat=n=3:v=0:a=1[a]\" -map \"[a]\" -c:a pcm_s16le -ar "+str(SAMPLING_RATE)+" "+pathMRK+" > "+pathMRK+".log 2>&1"
            os.system(aCmd)
            pathIn = pathMRK
        except:
             print("Warning: can't add markers")

    try:
        if beam_size > 1:
            decode_options = dict(language=lng,beam_size=beam_size)
        else:
            decode_options = dict(language=lng)
        
        transcribe_options = dict(task="transcribe", **decode_options)
        result = model.transcribe(pathIn,**transcribe_options)
    except:
         result = {}
         result["text"] = ""
         return result
    
    if(mode == 0):
        #Direct
        return result
    
    if(mode == 1):
        if(re.match(r"^ *Whisper[., ]+(ok|okay)([., ]+(ok|okay))?[., ]+Whisper[.] *$", result["text"], re.IGNORECASE)):
            #Empty sound ?
            return transcribeMARK(path, opts, mode=2)
        
        if(re.match(r"^ *Whisper[.,] (ok|okay)[.] .* (ok|okay)[.,] whisper[.] *$", result["text"], re.IGNORECASE)):
            #GOOD!
            result["text"] = re.sub(r"(^ *Whisper[.,] (ok|okay)[.] | (ok|okay)[.,] whisper[.] *$)", "", result["text"], 2, re.IGNORECASE)
            return result

        return transcribeMARK(path, lng, mode=2)
    
    if(mode == 2):
        if(re.match(r"^ *Whisper[., ]+(ok|okay)([., ]+(ok|okay))?[., ]+Whisper[.] *$", result["text"], re.IGNORECASE)):
            #Empty sound ?
            result["text"] = ""
            return result
        
        if(re.match(r"^ *(ok|okay)[.,] whisper[.] .* Whisper[.,] (ok|okay)[.] *$", result["text"], re.IGNORECASE)):
            #GOOD!
            result["text"] = re.sub(r"(^ *(ok|okay)[.,] whisper[.] | Whisper[.,] (ok|okay)[.] *$)", "", result["text"], 2, re.IGNORECASE)
            return result

        return transcribeMARK(path, lng, mode=0)

Used markers (GitHub does not accept attached WAV files):
https://neurospell.com/WOK-MRK.wav
https://neurospell.com/OKW-MRK.wav

7 replies

dgoryeo Feb 4, 2023

Thanks @EtienneAb3d , wow, this sounds promising. And thanks for the tip on artificial voices. I was just going ahead to use one of those TTS generators for my target language (Japanese).

Do I undertand correctly that you insert the markers in each VAD cuts/chunks, right?

EtienneAb3d Feb 4, 2023

I do not chunk the sound, I just remove the noise and blank parts (to get only one file to be processed). In my application (see NeuroSpell), speakers are often making a pause in their dictations to think about what they need to say, and then they are continuing the same sentence. But, yes, if you are chunking the sound (to get several files), you need to add markers before/after each chunk.

Additional tip: to get a better recognition of the markers above, you may add "Whisper, ok." and "Ok, Whisper" in the initial_prompt. But, if you add them alone, Whisper consider it's certainly a repetition that should be removed in the final result, and you won't get them properly in the output. In order to make this working, you need to add an initial_prompt like this (of course, in the language to be processed):

Whisper, Ok.
Whatever 2 or 3 short pertinent sentences. 
Ok, Whisper.
Whisper, Ok. 
Whatever 2 or 3 short pertinent sentences. 
Ok, Whisper.
Whisper, Ok. 
An improbable text that can't be a start of what has to be recognized at the beginning of the sound 
to avoid it to be seen as a repetition to be removed.

The initial_prompt should not be longer than about 100 words (or it will be silently truncated by Whisper).

dgoryeo Feb 4, 2023

Thanks @EtienneAb3d . I'll give it a try during the week, will let you know of the result. Happy Sunday!

FurkanGozukara Feb 5, 2023

@EtienneAb3d do you have a modified whisper script that we can use your code directly?
or your are just using ffmpeg to change audio?
i didnt understand how your code works i want to test

EtienneAb3d Feb 5, 2023

@FurkanGozukara

For the silence removing and loudness normalization, I'm using ffmpeg, with the code here:
A possible solution to Whisper hallucination #679 (comment)
For noise removing, you need to explore the use of silero-vad:
https://github.com/snakers4/silero-vad
For the marks method, the code is nearly usable here (just updated with few improvements), with the provided sound file (see links):
A possible solution to Whisper hallucination #679 (comment)
To get better results with the marks, you may add an initial_prompt as explained here:
A possible solution to Whisper hallucination #679 (reply in thread)

For the marks method, in short:

Adds a sound saying "Whisper, Ok" at the beginning of the file to be processed, and "Ok, Whisper" at the end.
Transcribes the new file (mode=1), and checks if the corresponding text is properly found at beg/end of the result. If yes, assumes the transcription is good (no hallucination added).
If the marks are not properly found at beg/end of the text, assumes something was wrong, and tries the same inverting beg/end marks (mode=2).
If the marks are still not properly found at beg/end, assumes something was still wrong, and tries without marks (mode=0).

ExtReMLapin · 2023-02-08T14:15:31Z

ExtReMLapin
Feb 8, 2023

~~Whiper.cpp seems to have this problem fixed ?~~

Comparing whisper.cpp to this repo :

I took an audio file with no noise at the end and concatened itself to it (so same file *2 in other words), works correctly with whiper.cpp, hallucinations with whisper openai python repo

Edit :

We tested on a much longer audio files, there is MORE hallucinations so they didn't fix it

3 replies

EtienneAb3d Feb 9, 2023

These hallucinations are highly random, so, it's not really surprising you are experimenting different results with both code versions.
The only way to fix this is certainly to train a completely new model on curated data.

dgoryeo Feb 9, 2023

@EtienneAb3d , back to your original script --removing silence by ffmpeg, would you have a script to recalculate the correct transcription time lines after cutting/trimming? Thanks!

EtienneAb3d Feb 9, 2023

It's possible I've read something about this: there is perhaps an ffmpeg option to output silence timestamps. Search for this on this forum, or on the ffmpeg documentation.

EtienneAb3d · 2023-02-11T03:57:01Z

EtienneAb3d
Feb 11, 2023

@dgoryeo If processing time is not a problem for you, here is perhaps the ultimate way to get a good SRT.

Do all we explained above to get a TXT transcription as clean and accurate as possible, without hallucination:
a- use VAD to chunk the sound file, removing noise parts.
b- use ffmpeg to remove silent parts from chunks, and apply a loudness normalization.
c- use the markers procedure above, possibly doing several processing of some chunks. Certainly you will need different markers for each language, having a native speaker to create it. Use a pertinent initial_prompt to get these markers as well recognized as possible, as explained above.
d- use Whisper large model, with beam_search=5.
Do a synchronization of Whisper timestamps over the clean TXT using my WhisperTimeSync tool on the original sound, as explained here:
https://github.com/EtienneAb3d/WhisperTimeSync

:-)

33 replies

EtienneAb3d Feb 17, 2023

@skanda1005
WhisperHallu updated:

input/output languages may now be different (to get a translation).
I provide some prompt examples that can be automatically used without the need to redefine new ones.
hindi is now properly configured (but I get poor results for the recognition of markers with the medium model if hindi is the output language on your test file, and I don't know why, the Faster model throws an error)

@crecos
The goal of WhisperHallu is to provide with an as clean as possible transcribed text. It's very easy for you to put this text in a JSON string.

dgoryeo Feb 17, 2023

Hi @EtienneAb3d, I just tested WhisperHallu on several audio files and thought of sharing my observations so far:

Number of files tested: 5;
Durations: fIles from 2 min- to 40min;
Laguage - Japanese;
Speaker mix: one file with 3 speakers, 2 files with 2 speakers, and 2 files with single speaker;
Music mix: 3 files: music=True, and 2 files music=False;
Dialogue mix: a mix of normal, whispering and loud voices;
All files were down mixed to mono and 16Khz before processing.

The output of all tests were successful in that there were no hallucination and repetitive words. Well done !!!

The speed was faster than I had experienced in my other Whisper and VAD implementations. This is though in all 5 test the Hallu code was using standard Whisper (no ct2-transformers).
There were some missing dialgue but I haven't yet found out if there is any common condition/situation among those.
Few times out of memory (both RAM and cuda) happened. Usually when I wanted to run tests right after each other --colab environment.
The log files, and the print messages were very helpful to follow the execution and adapt the code.

I tested one of the files (duration 2min) with WhisperTimeSync. The instruction did say that the code is not ready for Japanese language but I just wanted to give it a try. The result was better than I expected --The code was able to align more than 80% of the subs correctly. The other 20% were either grouped wrongly with the previous line, or had a wrong timing --too early.

In terms of missing dialogues, I suspect that lowering the VAD threshold might help. My guess is that since the audio is already de-silenced, and has gone through compressor, a lower VAD threshold would be sufficient.

A quick question: I couldn't figure out why the code always picks up the standard Whisper and never the ct2-trasnformers. Is that because of the GPU flag?

EtienneAb3d Feb 17, 2023

Feel free to experiment with parameters, and give me your feedback.
WhisperHallu is just testing the implementation installed, if standard is installed it will take it. You should simply not install the standard one, only the Faster one.

EtienneAb3d Feb 17, 2023

The instruction did say that the code is not ready for Japanese language but I just wanted to give it a try. The result was better than I expected --The code was able to align more than 80% of the subs correctly. The other 20% were either grouped wrongly with the previous line, or had a wrong timing --too early.

This is because the current tokenizer is cutting text on spaces only. This is not sufficient for Asian texts. To get a more precise result, it should cut on words.

EtienneAb3d Feb 17, 2023

I released a new version of WhisperTimeSync where you specify the language. For Asian languages, it tokenizes on all chars. You should get a much more precise result.

EtienneAb3d · 2023-02-24T16:13:55Z

EtienneAb3d
Feb 24, 2023

A new version of WhisperHallu is available. It adds Deezer Spleeter to extract voices and eliminate noises.

8 replies

dgoryeo Aug 15, 2023

@malcolmosh , how do you remove common errors? I'd be keen if you could share any of the code. Thnaks.

jthompson22 Aug 28, 2023

@dgoryeo - I'm curious why use both Demucs this Silero VAD. Aren't these doing equivalent things?

EtienneAb3d Aug 29, 2023

@jthompson22
Silero VAD is detecting parts where there are voices. You can then cut these parts, but they are kept unchanged, with the full original sound.
Demucs is splitting the file in tracks, one of them is the voice cleaned from other possible kind of sounds.

utility-aagrawal Sep 13, 2023

@malcolmosh , By any chance, can you share your code? I am looking for an open-source solution to this problem but haven't found anything yet. I am also using Pyannote's diarization pipeline with whisper-large-v2 transcription for my use case. I am looking for a solution to improve transcription quality by removing hallucinations. Thank you!!

malcolmosh Sep 13, 2023

Sorry, I can't share the code since we're building it for a client, but I can point you in the right direction. We've actually started using faster-whisper and its integrated VAD (Silero-VAD). Since faster-whisper is 4-10X quicker, we've been able to use the full default set of parameters (beam_size = 5, best_of = 5, temperatures = (0, 0.2, 0.4, etc.) in our pipelines. Using any VAD to remove non-speech segments and then passing the data through the accelerated model with all of the fallback options is almost a foolproof solution. We also tuned the no_speech_threshold and the compression_ratio_threshold.

conan1024hao · 2023-10-27T06:55:14Z

conan1024hao
Oct 27, 2023

options = {
    "task": "transcribe",
    "language": "english",
    "no_speech_threshold": 0.1,
    "condition_on_previous_text": False,
    "without_timestamps": True
}

works for me.

0 replies

FurkanGozukara · 2023-10-27T23:37:21Z

FurkanGozukara
Oct 27, 2023

options = {
    "task": "transcribe",
    "language": "english",
    "no_speech_threshold": 0.1,
    "condition_on_previous_text": False,
    "without_timestamps": True
}

works for me.

"condition_on_previous_text": False, will probably degrade quality

0 replies

dgoryeo · 2023-10-28T00:51:16Z

dgoryeo
Oct 28, 2023

Is an "optimal" value for no_speech_threshold independent of the source language?

0 replies

canonex · 2023-11-23T07:55:52Z

canonex
Nov 23, 2023

Me during an online Blender class, hoping it didn't crash:
https://github.com/openai/whisper/assets/1444378/68136024-f139-4f89-9a8a-73737700bea3

Meanwhile, Whisper...
`01:31:21.720 --> 01:31:44.240
usate spero che si ripigli un attimo ho schiacciato qualche tasto che non dovevo

01:31:44.240 --> 01:31:44.880
no

01:32:14.240 --> 01:32:17.240
Sottotitoli creati dalla comunità Amara.org

01:32:44.240 --> 01:32:47.240
Sottotitoli creati dalla comunità Amara.org

01:33:14.240 --> 01:33:17.240
Sottotitoli creati dalla comunità Amara.org

01:33:44.240 --> 01:33:47.240
Sottotitoli creati dalla comunità Amara.org

01:34:14.240 --> 01:34:17.240
Sottotitoli creati dalla comunità Amara.org

01:34:44.240 --> 01:34:47.240
Sottotitoli creati dalla comunità Amara.org

01:35:14.240 --> 01:35:17.240
Sottotitoli creati dalla comunità Amara.org

01:35:18.240 --> 01:35:21.240
Sottotitoli creati dalla comunità Amara.org

01:35:22.240 --> 01:35:25.240
Sottotitoli creati dalla comunità Amara.org

01:35:26.240 --> 01:35:29.240
Sottotitoli creati dalla comunità Amara.org

01:35:30.240 --> 01:35:33.240
Sottotitoli creati dalla comunità Amara.org

01:35:34.240 --> 01:35:37.240
Sottotitoli creati dalla comunità Amara.org

01:35:38.240 --> 01:35:41.240
Sottotitoli creati dalla comunità Amara.org

...
`

and so on.

Language is Italian.

Thank you,
Riccardo

1 reply

rodrigoGA Nov 24, 2023

with spanish add the same text

rodrigoGA · 2023-11-24T19:32:17Z

rodrigoGA
Nov 24, 2023

Noise removal + VAD + remove segments with low likelihood of speech is working for me.

The latter was not mentioned in this thread and it was what has given me a significant improvement in the results, it is mentioned here #928 (comment)
In my case, I am discarding all segments with no_speech_prob > 0.72.

1 reply

dgoryeo Nov 24, 2023

@rodrigoGA , what do you use for noise removal? I've just started looking into UVR5 models UltimateVocalRemoveGUI, but haven't found a way to automate the flow.

iichii1 · 2023-11-28T00:06:30Z

iichii1
Nov 28, 2023

Hi everyone, this is the first time I'm commenting here but I've been following this topic for some time, and I believe I achieved a very good result in my code, I combined several techniques already mentioned in this topic and others that perhaps haven't been thought of yet. I apologize for some sentences in Portuguese but I believe I have translated all the variables into English, so everyone can understand the code. Anyone who wants to test it will have to adapt it to their own code as this is just an excerpt from my code, my idea is to show how I achieved a good result and almost 100% avoid Whisper's hallucinations and keep it fast enough to make transcriptions In real time. I want to thank the people at Faster Whisper for the excellent work converting it to fast16.

class AudioTranscriber:

FORMAT = pyaudio.paInt16
DEVICE = 0  # 4 RTX Voice
CHANNELS = 1
RATE = 16000
FRAME_DURATION = 30
FRAME_SIZE = int(RATE * FRAME_DURATION / 1000)
VOLUME = 300.0 # 600.0

def __init__(self, model_size='medium', text_callback=None):
    self.model = WhisperModel(model_size, device="cuda", compute_type="float16")
    self.vad = webrtcvad.Vad(1)
    self.p = pyaudio.PyAudio()
    self.stream = self.p.open(input_device_index=self.DEVICE, format=self.FORMAT, channels=self.CHANNELS, rate=self.RATE, input=True, frames_per_buffer=self.FRAME_SIZE)
    self.audio_queue = queue.Queue()
    self.transcribing = True
    self.text_callback = text_callback
    self.audio_available = threading.Event()
    self.capture_enabled = threading.Event()
    self.capture_enabled.set()
    
def save_audio_to_file(self, audio_data, file_path):
    with wave.open(file_path, 'wb') as wf:
        print(file_path)
        wf.setnchannels(self.CHANNELS)
        wf.setsampwidth(self.p.get_sample_size(self.FORMAT))
        wf.setframerate(self.RATE)
        wf.writeframes(audio_data)
        
def is_audio_loud_enough(self, frame, threshold=VOLUME):
    """ Checks whether the audio frame has sufficient amplitude. """
    audio_frame = np.frombuffer(frame, dtype=np.int16)
    amplitude = np.mean(np.abs(audio_frame))
    #print(amplitude)
    return amplitude > threshold

def capture_audio(self):
    print("Iniciando captura de áudio...")
    audio_buffer = io.BytesIO()
    temp_buffer = io.BytesIO()
    speech_detected = False
    last_speech_time = time.time()

    num_frames_speech = 0
    num_frames_speech_threshold = 10
    silence_buffer_time = 0.2
    last_voice_time = time.time()

    while True:
        frame = self.stream.read(self.FRAME_SIZE)
        is_voice = self.vad.is_speech(frame, self.RATE) and self.is_audio_loud_enough(frame)

        current_time = time.time()
        temp_buffer.write(frame)

        if is_voice:
            last_voice_time = current_time
            num_frames_speech += 1
            print((num_frames_speech))
            if num_frames_speech >= num_frames_speech_threshold:
                speech_detected = True
                last_speech_time = current_time

        if speech_detected:
            if current_time - last_voice_time > silence_buffer_time:
                # Processa o áudio capturado após o período de silêncio
                speech_detected = False
                audio_data = temp_buffer.getvalue()
                if audio_data:
                    self.audio_queue.put(audio_data)
                    self.audio_available.set()
                num_frames_speech = 0
                temp_buffer = io.BytesIO()

        else:
            if current_time - last_voice_time > silence_buffer_time:
                # Reseta num_frames_speech após um período de silêncio
                num_frames_speech = 0
                temp_buffer = io.BytesIO()
            
def transcribe_from_file(self, file_path):
    segments, info = self.model.transcribe(file_path, language='en', beam_size=5, vad_filter=False,
                                        vad_parameters=dict(min_silence_duration_ms=100),
                                        condition_on_previous_text=False, repetition_penalty=1)
    start_time = time.time()
    for segment in segments:
        print(f"{AZUL}[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}{PADRAO}")

        modified_text = filtro_de_fala(segment.text.strip())

The "speech_filter" function serves to filter some hallucinations, and hide the output, however with my latest modifications, this was no longer necessary. But whoever wants to use it, follow below.

def filtro_de_fala(texto):
frases_invalidas = [
"sorry, don't ask me if i asked about this question right, and i mean the lightening",
"don't forget to like, commentary, and subscribe to the channel",
"e se inscreva no canal e ative o sininho",
"thank you very much. thank you very much",
"legendas pela comunidade de amara.org",
"legendas pela comunidade dos amara.org",
"legendas pela comunidade des amara.org",
"Legendas pela comunidade das Amara.org",
"Legendas pela comunidade da Amara.org",
"Legendas pela comunidade do Amara.org",
"Legendas pela comunidade Amara.org",
"thank you very much for watching!",
"thank you very much for watching",
"i'll see you on the next video",
"improvável probably a good show",
"thank you for your watching",
"bye, ladies and gentlemen",
"see you in the next video",
"thank you for watching",
"i'll see you next time",
"he was gonna catch it",
"thank you. bye bye",
"Thanks for watching!",
"thanks for watching",
"e se inscreva no canal",
"thanks for watching!",
"thank you very much",
"it's no good to me",
"see you next video",
"see you next time",
"Tchau, galera!",
"Tchau tchau!",
"thanks mate",
"Thank you.",
"thank you",
"Valeu.",
"Tchau!",
"Obrigado.",
"hello,",
"E aí",
"Oi!",
"you",
"bye"
]

12 replies

EtienneAb3d Dec 16, 2023

@ottokarr
You can get the same kind of pre-processing with WhisperHallu:
https://github.com/EtienneAb3d/WhisperHallu

ottokarr Dec 16, 2023

@EtienneAb3d

Thank you very much for the WhisperHallu application.

However, even though your explanations might be clear for a specialist (which I am not) I am unable to figure out how to write the code to import the Whisper AI model and give the right instruction for the transcription.

If you have a few minutes, could you please give me a simple example of the sentence to write if, for example, I would like to make a written transcription of a russian video in using a Whisper AI «Large » model ?

For your information, until now I was used to obtain srt and vtt files in writing the following :

!whisper "PATH/RUSSIAN VIDEO.mp4" --task transcribe --model large --language ru

Thank you for your time and consideration in this matter (if you got the time).

EtienneAb3d Dec 17, 2023

@ottokarr
WhisperHallu is an experimental code. You will need some minimal developer skills to use it.
A Python code sample is provided at the bottom of the ReadMe page.

Purfview Dec 17, 2023

@iichii1
I see you are using webrtc, shouldn't silero be better?

iichii1 Feb 14, 2024

Normally yes, but I didn't have a good result with it, so I tested webrtc, and it also has a good processing speed, which is essential in my application.

Purfview · 2023-12-21T12:37:04Z

Purfview
Dec 21, 2023

There is a bug introduced with #1279, it can make bad hallucination loops.

There is a bugfix -> #1903 [not merged at the moment]

0 replies

fabswt · 2024-04-06T06:37:33Z

fabswt
Apr 6, 2024

just a quick thought: since this tends to happen on short audios, checking the duration per word might be a way to detect anomalies.

for example, when i get:

Thank you for watching!

for a recording of duration 0.128 second then it is definitely an hallucination.

0 replies

LaurinmyReha · 2024-09-09T07:58:29Z

LaurinmyReha
Sep 9, 2024

@fabswt

We also thought along those lines and specifically created a finetune with better timestamps and dedicated and trained alignment heads to make hallucination detection even more robust. Feel free to check it out.

https://github.com/nyrahealth/CrisperWhisper

0 replies

StarMariner · 2024-09-30T11:02:37Z

StarMariner
Sep 30, 2024

I have no solution .
Whisper needs a lobotomy, in the simplest terms.
Is there no way to map tokens that produce silent hallucinations in the first place and remove them? Or re-train whisper with silent copyright removed. There shouldn't be copyright material in the training anyway , in my opinion.
I have silent audio on purpose for meditations, 1 min to 15 minutes silence then vocals. Its 0 silence, no dB noise.
My only solution is to grep and sed a dictionary, post transcribe.

1 reply

misutoneko Oct 6, 2024

Have you tried the --suppress_tokens trick?
It doesn't work for everything but maybe try it with the medium model and see what happens?
I've posted about that before, quite a while ago (in whisper-timestamped git and here) so just search the forums.
EDIT: Couple of threads here:
linto-ai/whisper-timestamped#107
#1873 (reply in thread)

And yes, you can find the hallucinated tokens e.g. checking the JSON, but the problem is that they're legit tokens such as "Thank" (token 1044) or "I" (token 286) so you can't just disable them. Maybe re-balance is possible using just suppressing code but seems quite tricky.
Hmmm actually it would be nice it there was some command-line switch to emphasize or de-emphasize individual tokens or sets of tokens. Could come in handy sometimes...

A possible solution to Whisper hallucination #679

Replies: 33 comments · 135 replies

KaiserChr Dec 13, 2022 Author

KaiserChr Dec 14, 2022 Author

KaiserChr Dec 14, 2022 Author

KaiserChr Dec 16, 2022 Author

KaiserChr Dec 21, 2022 Author

KaiserChr Dec 29, 2022 Author

KaiserChr Dec 30, 2022 Author

Replies: 33 comments 135 replies

KaiserChr Dec 13, 2022
Author

KaiserChr Dec 14, 2022
Author

KaiserChr Dec 14, 2022
Author

KaiserChr Dec 16, 2022
Author

KaiserChr
Dec 21, 2022
Author

KaiserChr
Dec 29, 2022
Author

KaiserChr Dec 30, 2022
Author