-
Firstly, are the disfluencies present in the audio file itself? Since you mention it only happens with files longer than 10 minutes, I suspect it may be a cascading effect from the prefix of the audio. The easiest way to reduce it is to supply a complete sentence as the `initial_prompt` to structure the sentence output (this is a hit-or-miss method). faster-whisper works by using beam search, falling back to a greedy algorithm if beam search does not give a good result. Temperature controls the level of "creativity" of the algorithm, where a higher temperature leads to more hallucinations. `prompt_reset_on_temperature` sets the threshold at which the running prompt is discarded once decoding exceeds a certain level of "creativity". In your case, since it is highly unlikely that these distortions are caused by temperature, `prompt_reset_on_temperature` is unlikely to help you; you can set it to 0 to ensure the prompt resets each time.
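As a sketch, here is how those knobs map onto faster-whisper's `WhisperModel.transcribe()` keyword arguments. The prompt text and the threshold value are illustrative, not recommendations from the library:

```python
# The parameters discussed above, as keyword arguments for faster-whisper's
# WhisperModel.transcribe(). Prompt text and values are illustrative.
transcribe_options = {
    "beam_size": 5,
    # A complete, well-formed sentence to anchor the output style:
    "initial_prompt": "The interview transcript is written in clean, complete sentences.",
    # Drop the accumulated prompt whenever decoding "creativity" exceeds this
    # threshold; 0 resets it essentially every time, as suggested above.
    "prompt_reset_on_temperature": 0.0,
}

# Usage (assuming faster-whisper is installed and `interview.wav` exists):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("small")
#   segments, info = model.transcribe("interview.wav", **transcribe_options)
```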
-
Disfluencies and filler words can be removed in post-processing of the transcript: scan the transcript against a dictionary of disfluencies and filler words and remove the matches. Please let me know if this still needs to be resolved.
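A minimal sketch of that dictionary scan in plain Python — the filler list and cleanup patterns are illustrative assumptions, not a complete dictionary:

```python
import re

# Illustrative dictionary of disfluencies/fillers; a real list would be larger.
FILLERS = {"um", "uh", "er", "you know"}

def remove_fillers(text: str, fillers=FILLERS) -> str:
    """Strip filler tokens (and their surrounding commas) from a transcript."""
    # Longest phrases first so "you know" is tried before shorter tokens.
    pattern = "|".join(re.escape(f) for f in sorted(fillers, key=len, reverse=True))
    text = re.sub(rf",?\s*\b(?:{pattern})\b,?", "", text, flags=re.IGNORECASE)
    text = re.sub(r"\s{2,}", " ", text)          # collapse leftover double spaces
    text = re.sub(r"\s+([.,!?])", r"\1", text)   # no space before punctuation
    return text.strip()

print(remove_fillers("I, uh, went out and, um, got the car, um, warmed up"))
```

Note that context-dependent fillers such as "like" are harder: a plain dictionary scan cannot tell filler use from content use ("I like pizza"), so those would need a more careful pass.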
-
On conversational audio files (two speakers conducting informal interviews: one interviewer, one interviewee) I have been noticing that disfluencies like "um" and "uh" end up being transcribed, particularly if the audio is 'difficult'.
For example I will get generated text like "I, uh, went out and, um, got the car, um, warmed up"
Likewise for 'false starts' and filler words such as "So I like ate the burger and like it was good and like, you know, I'm not hungry any more" where "like" and ", you know," are being used as filler words/speech mannerisms by the speaker.
I have noticed this is generally only a problem with files longer than about 10 minutes. In fact, if I cut a 30-minute file into 3 separate 10-minute pieces, the individual pieces don't exhibit this problem: there may be none or one disfluency in the separate 10-minute chunks but tens or hundreds in the 30-minute transcript. Likewise, the transcription of junk/filler words is greatly reduced in the 10-minute chunks vs. the longer form.
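For reference, the chunking I'm doing amounts to offsets like the following (a sketch; the actual audio slicing, e.g. via ffmpeg or pydub, is omitted):

```python
# Split a long recording into ~10-minute chunks, with a small overlap so
# words cut at a boundary can be stitched back together afterwards.
def chunk_offsets(total_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0):
    """Yield (start, end) second offsets covering the whole file."""
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        yield (start, end)
        if end >= total_s:
            break
        start = end - overlap_s  # step back a little for the overlap

print(list(chunk_offsets(1800.0)))  # offsets for a 30-minute file
```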
My suspicion is that on a longer piece of audio, the model gets 'off track' and then spirals out of control in terms of generating disfluencies.
I'm currently generating transcripts with 5 and 8 beams and no `initial_prompt`, with the other settings left at the WhisperModel defaults.
I see that there's a `prompt_reset_on_temperature` option but I don't understand the intuition behind it.
What parameters should I be adjusting to avoid introducing disfluencies as much as possible? Ideally something that replicates the quality of the shorter 10-minute chunks but can be run over the entire duration of the audio (30, 60, 90, 180 minutes, etc.) without having to break it into smaller pieces.
Thank you in advance for your time and any help you can provide.