Repetition in recordings #72

Open
gudrob opened this issue May 10, 2024 · 4 comments

Comments


gudrob commented May 10, 2024

So far everything has been working out of the box, so thank you for this great plugin!

Issue:
I'm having problems with repetition. Recognition is good, but the same sentence is repeated over and over.

What I have tried:
From what I can see in the whisper documentation, the entropy threshold should fix this.
But changing the value seems to have no effect.
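For context, here is a minimal sketch of how an entropy threshold can detect repetition. It assumes the check is the Shannon entropy of token frequencies over the most recent tokens, compared against the threshold. The helper names `token_entropy` and `looks_repetitive` and the 32-token window are hypothetical choices for illustration, not the plugin's or whisper's actual code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Shannon entropy (natural log) of the token frequencies in the last
// `window` tokens. Hypothetical helper, for illustration only.
double token_entropy(const std::vector<int>& tokens, std::size_t window = 32) {
    const std::size_t n = std::min(window, tokens.size());
    if (n == 0) return 0.0;

    // Count how often each token id occurs in the window.
    std::map<int, int> counts;
    for (std::size_t i = tokens.size() - n; i < tokens.size(); ++i) {
        ++counts[tokens[i]];
    }

    double entropy = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / static_cast<double>(n);
        entropy -= p * std::log(p);
    }
    return entropy;
}

// True when the recent output has lower entropy than the threshold,
// i.e. a few tokens dominate the window (assumed meaning of the setting).
bool looks_repetitive(const std::vector<int>& tokens, double threshold) {
    return token_entropy(tokens) < threshold;
}
```

Under these assumptions, a 32-token window has an entropy of at most ln 32 ≈ 3.47, so a threshold of 5 would flag every full window and a threshold of 0 would flag none. Whether the plugin applies the value this way is not confirmed in this thread.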

entropy 2.8, default

image

entropy 5

image

entropy 0

image

If anything, higher values make recognition less precise.

Is this related to the other problem regarding Voice Activity Detection?
I have tried changing the VAD threshold as well, but that seems to do nothing.

I have also tried using a larger whisper model, but that yields the same results, only slower.

gudrob changed the title from "Repetition in recordings, e" to "Repetition in recordings" on May 10, 2024
Author

gudrob commented May 11, 2024

So I replaced the Capture Effect of the audio bus with a Record Effect. I used linear interpolation to resample the data I got from GetRecording() from 48000 Hz to 16000 Hz. This works with an astounding accuracy of ~95% (I am not a native English speaker). No repetition, and it even recognizes names correctly.
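A minimal sketch of the linear-interpolation resampling described above, assuming mono float samples. The function name `resample_linear` is hypothetical and this is not the code from the linked repository. No low-pass filter is applied before downsampling, so content above half the target rate can alias.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Resample mono float samples from src_rate to dst_rate by linear
// interpolation between the two nearest source samples.
// Hypothetical helper, for illustration only.
std::vector<float> resample_linear(const std::vector<float>& input,
                                   int src_rate, int dst_rate) {
    if (input.empty() || src_rate <= 0 || dst_rate <= 0) return {};

    // Number of output samples, rounded down.
    const std::size_t out_len = static_cast<std::size_t>(
        static_cast<double>(input.size()) * dst_rate / src_rate);
    // Distance in source samples between consecutive output samples.
    const double step = static_cast<double>(src_rate) / dst_rate;

    std::vector<float> output(out_len);
    for (std::size_t i = 0; i < out_len; ++i) {
        const double pos = static_cast<double>(i) * step;  // fractional source index
        const std::size_t i0 = static_cast<std::size_t>(pos);
        const std::size_t i1 = std::min(i0 + 1, input.size() - 1);  // clamp at the end
        const float frac = static_cast<float>(pos - static_cast<double>(i0));
        output[i] = input[i0] + (input[i1] - input[i0]) * frac;
    }
    return output;
}
```

For 48000 Hz to 16000 Hz the step is exactly 3, so this keeps every third sample. Calling `resample_linear(samples, 48000, 16000)` on one second of audio returns 16000 samples.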

While this approach works for me, I just couldn't get the sample capture implementation to work.

Collaborator

Ughuuu commented May 13, 2024

Interesting, this sounds like it could be an issue with how I am doing the interpolation. This plugin currently uses libsamplerate for that, as seen here: https://github.com/V-Sekai/godot-whisper/blob/main/src/speech_to_text.cpp#L32

The resample function also exposes an InterpolatorType:

	enum InterpolatorType {
		SRC_SINC_BEST_QUALITY = 0,
		SRC_SINC_MEDIUM_QUALITY = 1,
		SRC_SINC_FASTEST = 2,
		SRC_ZERO_ORDER_HOLD = 3,
		SRC_LINEAR = 4,
	};

By default it's set to SRC_SINC_FASTEST:

var resampled = resample(_accumulated_frames, SpeechToText.SRC_SINC_FASTEST)

You could also try setting it to SRC_SINC_BEST_QUALITY to see if anything changes. If not, the approach you took is pretty good as well. If you want, you can make a new scene with it and open a PR for others to try (if not, I might if I get some time).

Author

gudrob commented May 14, 2024

@Ughuuu I have implemented this in C# here: https://github.com/gudatr/godot-ai-rpg/blob/main/scripts/SpeechRecognizer.cs but it differs greatly from the project's examples. I tried writing the code in GDScript, but I must admit that I am too inexperienced with it, especially if the implementation needs to be close to the samples, and I currently have no motivation to learn it, sorry.

Collaborator

Ughuuu commented May 14, 2024

No worries, thanks for this, it's great! If anything, it's a sample people can look at if they want to do the resampling manually. I'm also busy, but I might take a stab at it in the future.
