Distil Whisper models: Missing words and repetitions in transcription #59

Open

go-run-jump opened this issue Aug 16, 2024 · 4 comments

@go-run-jump

I've been experimenting with the distil whisper models as an alternative to the standard whisper models. While I was able to successfully integrate the distil models, I'm experiencing some issues with the transcription quality.

Current Behavior:

  • Approximately two sentences into the transcription, issues start to occur
  • Parts of sentences are missing from the transcription (not necessarily full sentences)
  • Some words or short phrases are occasionally transcribed multiple times
  • The pattern of missing or repeated content varies and is inconsistent
  • I have tried with VAD enabled and disabled; it makes no difference

Expected Behavior:

  • Continuous, accurate transcription without missing words or repetitions
  • Performance similar to standard whisper models in terms of transcription quality

Additional Information:

  • The distil models are running about twice as fast as the standard models
  • When transcription occurs correctly, the quality seems to be on par with standard models
  • To add support for distil whisper models, the following lines need to be added to config_schema.yaml (see the sketch after this list):
    - distil-small.en
    - distil-medium.en
    - distil-large-v2.en
    - distil-large-v3.en
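
For reference, here is a sketch of how those entries might slot into config_schema.yaml. The surrounding keys (`model_options`, `model`, `options`) are assumptions for illustration, not necessarily the repo's actual schema; only the four distil names above come from my change:

```yaml
# Hypothetical layout -- match the key names to the actual structure
# of config_schema.yaml; only the distil entries are the new additions.
model_options:
  model:
    value: base
    options:
      - base                 # existing entries kept as-is
      - distil-small.en      # new distil entries
      - distil-medium.en
      - distil-large-v2.en
      - distil-large-v3.en
```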

Questions:

  1. Has anyone else tested the distil whisper models and experienced these issues?
  2. Are there any known factors that might be responsible for this inconsistent transcription behavior?
  3. Are there any recommended settings or configurations that might help resolve these issues while maintaining the speed advantage of the distil models?

Any input or suggestions would be greatly appreciated, as the speed improvements of the distil models are significant.

Environment:

  • Operating System: Manjaro Linux with Gnome
  • Python version: 3.11
  • Branch: main (commit 71c03f6)

Steps to Reproduce:

  1. Add the distil whisper model options to config_schema.yaml as mentioned above
  2. Select one of the distil whisper models in the settings
  3. Attempt to use voice input for an extended period (more than two sentences)
  4. Observe the resulting transcription for missing words and repetitions
@dariox1337
Contributor

dariox1337 commented Aug 21, 2024

I've tested the distil-small model. It works without issues for me. Granted, I mostly use the "hold to record" mode, and therefore dictate one sentence at a time. I tried dictating a couple of sentences, and still didn't notice any issues. However, it feels weird not seeing what you say for a long time. Anyway, can you suggest a phrase that often fails to transcribe properly for you?

```
Recording...
Recording finished. Size: 260640 samples, Duration: 16.29 seconds
Transcribing...
Transcription completed in 0.51 seconds. Post-processed line:  I have been experimenting with the distilled whisper models as an alternative to the standard whisper models. While I was able to successfully integrate the distilled models, I am experiencing some issues with transcription quality.
```

NOTE: I'm using a heavily edited fork, so consider my observations as relating to the underlying libraries rather than to WhisperWriter. You can try my fork if you feel like it: https://github.com/dariox1337/whisper-writer. To use distil models with this fork, simply download a faster-distil-whisper model from HF and set the folder in "model path".
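
For anyone trying that route, here is a minimal sketch of pointing faster-whisper at a locally downloaded model directory; the path is a placeholder for whatever you set in "model path", and the transcribe options are illustrative defaults rather than the fork's actual settings:

```python
from faster_whisper import WhisperModel

# Placeholder path to a converted faster-distil-whisper model from HF
model = WhisperModel("/path/to/faster-distil-whisper-large-v3",
                     device="cpu", compute_type="int8")

# transcribe() returns a lazy generator of segments plus run metadata
segments, info = model.transcribe("recording.wav", beam_size=5)
print(f"Detected language: {info.language}")
for segment in segments:
    print(segment.text)
```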

@go-run-jump
Author

@dariox1337 I have identified what is responsible for the decreased quality of the distil whisper models: they appear to be more susceptible to defects in the source audio. On my machine, running Linux, the audio captured by the sounddevice library plays back faster than real time, with skips and some flapping noises on top. I replaced sounddevice with PyAudio, and after that the quality of the distil models is exactly what you would expect. No issues.
If this is happening unnoticed for more people (which is likely, since the original whisper models seem to handle it very well and you can't select the distil models without changing the code) and is thus quietly reducing transcription quality, it might be beneficial to replace sounddevice or find out what is responsible for its misbehavior.
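
For context, the swap amounts to replacing the sounddevice capture with a plain PyAudio input stream along these lines. This is a minimal sketch (16 kHz mono int16, arbitrary buffer size), not WhisperWriter's actual recording code:

```python
import wave
import pyaudio

RATE, CHUNK = 16000, 1024  # 16 kHz mono, the rate Whisper expects

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * 5))]  # ~5 s

stream.stop_stream()
stream.close()

# Dump the capture to a WAV file so speed-ups and skips are audible
with wave.open("capture.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))

p.terminate()
```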

@go-run-jump
Author

go-run-jump commented Sep 5, 2024

@dariox1337 Actually, it seems this behavior only happens in your fork, the one you're suggesting to merge in #61. Why the distil models don't work for me in the current state of the software remains unclear.

@dariox1337
Contributor

@go-run-jump As I said in the PR, the faster-than-real-time audio might be because the sample rate isn't set correctly somewhere (I don't know where). The skipping and crackling are a mystery to me; I couldn't reproduce either issue.
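
One way to test the sample-rate theory without touching WhisperWriter's internals: time a fixed-length sounddevice recording against the wall clock. If the device actually runs at a different rate than requested, the recording finishes noticeably early or late. This is just a diagnostic sketch, nothing from the codebase:

```python
import time
import sounddevice as sd

RATE, SECONDS = 16000, 5  # the rate the app assumes downstream

start = time.monotonic()
audio = sd.rec(int(RATE * SECONDS), samplerate=RATE, channels=1, dtype="int16")
sd.wait()  # block until the requested number of frames is captured
elapsed = time.monotonic() - start

# Capturing 5 s of frames should take ~5 s of wall time; a large gap
# suggests the device is running at a different effective sample rate.
print(f"requested {SECONDS}s, wall clock {elapsed:.2f}s, "
      f"{len(audio) / RATE:.2f}s of audio at {RATE} Hz")
```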

Anyway, even though SoundDevice worked without issues for me, I rewrote the audio recording code to use PyAudio as well since it was already used for "beep on completion." The code is in the main branch of my fork.
