Disfluencies and filler words are being transcribed on longer form audio #569

filmo · 2023-11-19T18:29:10Z

filmo
Nov 19, 2023

On conversational audio files (two speakers conducting informal interviews. One interviewer/One Interviewee) I have been noticing the disfluencies like um and uh end up being transcribed. Particularly if the audio is 'difficult'.

For example I will get generated text like "I, uh, went out and, um, got the car, um, warmed up"

Likewise for 'false starts' and filler words such as "So I like ate the burger and like it was good and like, you know, I'm not hungry any more" where "like" and ", you know," are being used as filler words/speech mannerisms by the speaker.

I have noticed this is generally only a problem with files longer than about 10 minutes in length. In fact if I cut a 30-minute file into 3 separate 10-minute pieces, the individual pieces don't exhibit this problem. There may be None or 1 disfluency in the separate 10-minute chunks and 10s or 100s in the 30-minute transcript. Likewise the transcription of 'junk/filler' words is greatly reduced in the 10-minute chunks vs longer form.

My suspicion is that on a longer piece of audio, it get 'off-track' and then spirals out of control in terms of generating the disfluencies.

I'm currently generating transcripts with 5 & 8 beams and no initial_prompt with the other setting set to defaults as given in the WhisperModel.

I see that there's a 'prompt_reset_on_temperature' but I don't understand the intuition behind it.

What parameters should I be adjusting to avoid the introduction of the disfluencies as much as possible? Ideally something that replicates the capabilities on the shorter 10-minute chunks but that can be run for the entire duration of the audio without having to break them down into smaller pieces. (30 minutes, 60, 90, 180 etc.)

Thank you in advance for you time and any help you can provide.

Additional Context:

Audio files are conversational in nature with participants often interjecting on top of each other.
Audio files are often poorly recorded phone interviews. But this issues happens on all types of audio (good record quality or poor)
The 'ums and uhs' will generally start to appear randomly in a file for no apparent reason (other than it's further into the file than ~10 minutes. Sometimes they will then 'disappear' after another few minutes, sometimes not.
This problem is generally worse on poor audio quality files than good audio quality.

blackpolarz · 2023-11-24T09:11:50Z

blackpolarz
Nov 24, 2023

Firstly, are the disfluencies presented in the audio file?
If yes, it is likely that whisper captures these disfluencies and transcript them as it is. You might be able to remove them by prompting or by post processing.
If no, there might be some hallucination within whisper where it assumes some of the audio distortions as filler words. That can be an effect from the temperature(highly unlikely) or due to the whisper model's training data(which you can't modify).

When you mentioned that it only happens with files longer than 10mins, it makes me suspect that it might be the cascaded effect from the prefix of the audio. The easiest way to reduce it is by prompting a complete sentence as initial prompt to structure the sentence output. (This is a hit or miss method.)
Another easy method to implement is by applying a very strict VAD (one of the parameters) that filters out any pauses in the speech. However, this isn't perfect and there can be still some distortions that cannot be removed. You will need to play with the parameters to see which number will be most suitable for your use case.

Faster-whisper work by using beam search then greedy algorithm if beam search does not give a good result. Temperature controls the level of creativity for the algorithm where a higher temperature leads to more hallucinations. prompt reset on temperature basically sets the threshold where the prompt is removed if it exceeds a certain level of "creativity". In your case, as it is highly unlikely that these distortions occur due to temperature, prompt reset on temperature is unlikely to help you. You can set it to 0 to ensure that prompt resets each time.

3 replies

filmo Nov 24, 2023
Author

Thank you for your guidance.

'Yes' there are ums and uhs in the dialog. The audio is 'conversational interviews' and they (like many conversational speakers) tend tend to use lots of ums and uhs in their dialog.

It was my understanding that Whisper was trained to omit these. And it generally does on short files. (sub 10 minutes). I agree that there's some sort of cascading effect that triggers it. Something 'triggers' and it usually only happens on longer form audio. The first N minutes of the transcript will be 'clean' then something changes and all of a sudden every uttered um and uh is transcribed.

I've tried adding an initial prompt, which does help with spelling of proper nouns, but doesn't seem to influence the amount (or lack thereof) of disfluencies.

Increasing the length penalty has helped quite a bit on most files, but on some it makes no difference. I think the intuition here is that a length penalty makes a beam such as "I ate the meal" more "probable" than "I, um, ate the, uh, meal" because the disfluency version gets penalized for being 'longer'

I've tried also playing around with patience (default = 1) as detailed in "Beam Decoding with Controlled Patience" which the authors tested against Machine Translation and News Article Summarization. My thought was perhaps applying their finding from Summarization (which prefers concision) would also help reduce disfluencies and 'junk words' such as "I like ran over here and, you know, dropped my backpack" where "like" and ", you know," are common 'junk'/filler phrases. In general I found that adjusting patience didn't make a discernible difference.

I have not played with the VAD parameters. Thank you for suggesting I do so.

With regards to post processing, we have had good success automatically removing the following:

'Um' or 'Uh' at the beginning of a sentence
all ', uh,' can be removed entirely

The problem with also just using text processing on ', um,' when it's contained within a sentence it that to make the sentence grammatically it needs to be be entirely removed, replaced with a single comma, or replaced with a dash '-' indicating change of thought. All of which require context.

One other thing I was thinking of exploring was adding the token for 'um' to the excluded list, but the problem with that is the same token is used to form words like 'umbrella' and 'umbilical'. If there was some way to isolate and penalize the token for 'um' when its followed by either a SPACE or commas would be ideal but it doesn't seem like Whisper is set up to handle multi-token or "token sequence" exclusions. I don't know if there's a way to simply adjust the token probability rather than setting it to infinitely improbable. That might help as it would still allow generation of words/token-sets like 'umbrella'

blackpolarz Nov 25, 2023

One thing I want to clarify is that whisper is not trained to omit disfluencies. In fact, it is not trained to suppress disfluencies but rather it is less likely to transcribe disfluencies due to the large training data used that happened to not have these disfluencies. As whisper gets more and more accurate, these disfluencies may be more prominent (which would be problematic in your use case).

Increasing length penalty does help to some extent but it is more of a stop gap method rather than a fix. (Btw your intuition is right about the length penalty.)

With regards to post processing, I actually worked on something similar though what I tried to omit are the common hallucinations that appear in Whisper rather than the disfluencies(partly due to my use case of translation rather than transcription). However, the same method can apply for your case.

In my case, instead of messing around with the tokens which may or may not influence the generation of text, what I did was to post process it by running it through a loop which splits each word by either its spacing/punctuation. Then check if the word exist in my list of filtered words. To assist you, I will just give you a simple code and you can modify from there to post process the filtered words.

import string 

strings = ["I, uh,"," went out and, um, got the car",", um,"," warmed up","without um an umbrella"]
filtered_words = ["uh","um"]
result = []
for i in range(0, len(strings)):
    
    mod_strings = strings[i].translate(str.maketrans('', '', string.punctuation))
    current_string = mod_strings.strip()
    current_words = current_string.split()
    
    for word in current_words:
        if word.lower() not in filtered_words:
            result.append(word)

sentence = ' '.join(result)

print(sentence)

The output is "I went out and got the car warmed up without an umbrella". You can further improve by preserving the punctuations, but this is just a simple example and would not affect words like 'umbrella' or 'umbilical'.

blackpolarz Nov 25, 2023

Just to add on, this is also partly the reason why I am pushing for negative prompts as a feature in the official whisper repo which can remove some of these disfluencies, hallucinations or provide censorship. To add negative prompts to faster-whisper would require me to interfere directly with ctranslate2 which I am unfortunately not capable of.

KasimSahil · 2024-07-06T02:01:50Z

KasimSahil
Jul 6, 2024

Disfluencies and filler words can be removed only after the post processing of Transcription. Please let me know that if still this needs to be resolved. We can apply the Disfluencies and filler words of dictionary scanned on the transcript and remove.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Disfluencies and filler words are being transcribed on longer form audio #569

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 3 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Disfluencies and filler words are being transcribed on longer form audio #569

filmo Nov 19, 2023

Replies: 2 comments · 3 replies

blackpolarz Nov 24, 2023

filmo Nov 24, 2023 Author

blackpolarz Nov 25, 2023

blackpolarz Nov 25, 2023

KasimSahil Jul 6, 2024

filmo
Nov 19, 2023

Replies: 2 comments 3 replies

blackpolarz
Nov 24, 2023

filmo Nov 24, 2023
Author

KasimSahil
Jul 6, 2024