WOW Great extension! The best TTS extension out there! Here are some code fixes for auto play and installation! #3

Open
RandomInternetPreson opened this issue Nov 19, 2023 · 13 comments

Comments

@RandomInternetPreson

RandomInternetPreson commented Nov 19, 2023

Firstly, thank you for taking the time to do this!!! OMG it's fast, does perfect inflections, this is eleven labs quality on my local machine AMAZING!!!!!

Here is some information to make the extension work a bit better, I'm on a windows machine so my experience might be unique to that.

  1. Auto-play keeps trying to play every audio clip in the history. To fix this, change this:

    def history_modifier(history):
        if len(history["internal"]) > 0:
            history["visible"][-1] = [
                history["visible"][-1][0],
                history["visible"][-1][1].replace(
                    "controls autoplay>", "controls>")
            ]
        return history

to this:

    def history_modifier(history):
        if len(history["internal"]) > 0:
            history["visible"][-1] = [
                history["visible"][-1][0],
                history["visible"][-1][1].replace(
                    "controls autoplay style="height: 30px;">", "controls style="height: 30px;">")
            ]
        return history

  2. The initial loading of the extension was not successful. This is because the folder that is created in the oob extensions directory has hyphens in its name. Users need to rename the folder from:

text-generation-webui-xtts

to:

text_generation_webui_xtts
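The rename matters because the web UI loads each extension with a Python import statement (the traceback later in this thread shows exec(f"import extensions.{name}.script")), and a hyphen is not allowed in a module name. Here is a minimal sketch of that check; the helper names is_importable_name and to_importable_name are made up for illustration and are not part of the web UI:

```python
import keyword


def is_importable_name(folder_name: str) -> bool:
    """Return True if the folder name is usable as a module name in an import statement."""
    return folder_name.isidentifier() and not keyword.iskeyword(folder_name)


def to_importable_name(folder_name: str) -> str:
    """Hypothetical helper: replace hyphens with underscores."""
    return folder_name.replace("-", "_")


if __name__ == "__main__":
    original = "text-generation-webui-xtts"
    print(is_importable_name(original))                      # hyphens are rejected
    print(to_importable_name(original))                      # suggested folder name
    print(is_importable_name(to_importable_name(original)))  # underscores are accepted
```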

Seriously amazing stuff, thank you again for integrating this into oobabooga. I will do a pr just to have a copy to mess around with, but I'll direct people to this repo.

@allenhs

allenhs commented Nov 19, 2023

Thanks!

I had to make a change to get it to work right in Linux for me.

I changed:

"controls autoplay style="height: 30px;">", "controls style="height: 30px;">")

to:

'controls autoplay style="height: 30px;">', 'controls style="height: 30px;">'

I used chatgpt to help me make the fix. It works for me, but I don't know how correct this change is.

@RandomInternetPreson
Author

RandomInternetPreson commented Nov 19, 2023

What that bit of code is doing is replacing the strings inside the log file and removing the "autoplay" attribute.

Your code embeds the source location of the .wav files slightly differently from the original barkTTS code, as you can see in your format_html function:

    def format_html(audiofiles):
        if params["combine"]:
            autoplay = "autoplay" if params["autoplay"] else ""
            combined = combine(audiofiles)
            time_label = audiofiles[0].split("/")[-1].split("_")[0]
            sf.write(f"{this_dir}/generated/{time_label}_combined.wav",
                     combined, 24000)
            return f'<audio src="file/{this_dir}/generated/{time_label}_combined.wav" controls {autoplay} style="height: 30px;">'
        else:
            string = ""
            for audiofile in audiofiles:
                string += f''
            return string

You can see the string the code fix addresses: controls style="height: 30px;">

so we are making sure we are changing this from

"controls autoplay style="height: 30px;">"

to

"controls style="height: 30px;">"

in the history of the conversation with the AI so it doesn't keep autoplaying.
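For reference, here is a self-contained sketch of the fix with the quoting sorted out (single quotes on the outside so the double quotes inside the HTML attribute don't end the Python string). The history dictionary at the bottom is a made-up example, not real output from the web UI:

```python
AUTOPLAY_TAG = 'controls autoplay style="height: 30px;">'
PLAIN_TAG = 'controls style="height: 30px;">'


def history_modifier(history):
    """Strip autoplay from the latest visible reply so old clips are not replayed."""
    if len(history["internal"]) > 0:
        user_msg, bot_msg = history["visible"][-1]
        history["visible"][-1] = [user_msg, bot_msg.replace(AUTOPLAY_TAG, PLAIN_TAG)]
    return history


if __name__ == "__main__":
    # Made-up history entry, for illustration only.
    example = {
        "internal": [["hi", "hello"]],
        "visible": [["hi", '<audio src="file/example.wav" controls autoplay style="height: 30px;">']],
    }
    print(history_modifier(example)["visible"][-1][1])
```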

@RandomInternetPreson
Author

I edited the .py file in my fork for you to reference if you need it:

https://github.com/RandomInternetPreson/text_generation_webui_xtt_Alts/blob/main/script.py

wow this works so incredibly well!

@RandomInternetPreson
Author

Sorry to keep peppering you here in this issue, but just wanted to let you know that I'd be okay if you wanted to reference my fork here: https://github.com/RandomInternetPreson/text_generation_webui_xtt_Alts
for folks installing the extension for windows.

@erew123

erew123 commented Nov 19, 2023

I'll close my other issue on here, but I can confirm that on a 100% fresh install of Text-Gen-WebUI on Windows, I did the following:

Run a command prompt
cd text-generation-webui (wherever you have it stored on your disk)
cmd_windows.bat (cmd_windows.bat will activate your environment. Linux and Mac options are there too)
cd extensions
git clone https://github.com/kanttouchthis/text_generation_webui_xtts
cd text_generation_webui_xtt_Alts
pip install -r requirements.txt
pip install TTS --no-dependencies

cd back up to the text-generation-webui folder.
Run Start_windows.bat

Agree to the license and let it download the other files it needs.
(Ensure it's activated on the "Session" tab, then apply/restart.)

With all that done, it's running fine! :) No audio repeats etc.

One thing I do notice, it keeps the generated audio in \text-generation-webui\extensions\text_generation_webui_xtt_Alts\generated so that may need clean up from time to time.

I'm sure the changes will get merged back into the original on here at some point!

Thanks for everyone's help and work on this!

@erew123

erew123 commented Nov 20, 2023

A quick note on speed vs quality etc., as it's not mentioned anywhere else. I notice the sample audio voice file used to generate audio is about 7 seconds long, mono (not stereo), PCM S16 LE, with a sample rate of 22050Hz and 16 bits per sample.

I'm guessing there are a few factors that may speed up processing.

  • Keeping it the lower quality like the original file.
  • Fewer seconds in length (I think somewhere it says you need 4 to 12 seconds as a sample)

I tried a very simple test using a 22050Hz sample voice and a 44100Hz sample voice (9 second mono sample).

22050Hz > Processing time: 59.185802936553955
44100Hz > Processing time: 125.19529104232788

This was generating the same amount of speech. It's not highly scientific, and it wasn't run over thousands of tests. But it would appear that if you want to use your favourite celebrity voice, you should get a high-quality sample, make it mono, drop its sample rate to 22050Hz and keep it around the 4-9 second mark. (I suspect a shorter voice sample will probably be faster.)
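If you want to check a voice sample against the properties listed above (channels, sample rate, bits per sample, length), Python's standard wave module can read them from an uncompressed PCM .wav file. This is only a sketch; describe_wav is a made-up helper name and the path in the usage comment is a placeholder:

```python
import wave


def describe_wav(path):
    """Return channel count, sample rate, bit depth and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


# Usage (placeholder path):
# print(describe_wav("my_voice_sample.wav"))
```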

@fbradcdsc

Followed the steps but it still gives me a

ERROR:Failed to load the extension "text_generation_webui_xtt_Alts".
Traceback (most recent call last):
  File "C:\text-generation-webui\modules\extensions.py", line 36, in load_extensions
    exec(f"import extensions.{name}.script")
  File "<string>", line 1, in <module>
  File "C:\text-generation-webui\extensions\text_generation_webui_xtt_Alts\script.py", line 1, in <module>
    from TTS.api import TTS
ModuleNotFoundError: No module named 'TTS'

When restarting the webui after activating it in the session tab

@RandomInternetPreson
Author

If you are using Windows, follow these instructions; I've made a video to go with them. They will show you how to install TTS.

https://github.com/RandomInternetPreson/text_generation_webui_xtt_Alts/tree/main#installation-windows

@kanttouchthis
Owner

Sorry to keep peppering you here in this issue, but just wanted to let you know that I'd be okay if you wanted to reference my fork here: https://github.com/RandomInternetPreson/text_generation_webui_xtt_Alts for folks installing the extension for windows.

Thanks for your help!

One thing I do notice, it keeps the generated audio in \text-generation-webui\extensions\text_generation_webui_xtt_Alts\generated so that may need clean up from time to time.

I added an option to delete old files on startup in the config.json
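For anyone who wants to clean the folder by hand instead, here is a rough sketch of an age-based cleanup. This is not the extension's actual implementation and the thread doesn't show the config.json option, so the function name and the max_age_days parameter are assumptions made up for this example:

```python
import time
from pathlib import Path


def delete_old_wavs(folder, max_age_days=7.0, now=None):
    """Delete .wav files in `folder` last modified more than `max_age_days` ago.

    Returns the sorted names of the deleted files. Other file types are left alone.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    deleted = []
    for wav in Path(folder).glob("*.wav"):
        if wav.is_file() and wav.stat().st_mtime < cutoff:
            wav.unlink()
            deleted.append(wav.name)
    return sorted(deleted)


# Usage (placeholder path):
# delete_old_wavs(r"extensions\text_generation_webui_xtt_Alts\generated", max_age_days=3)
```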

@kanttouchthis
Owner

A quick note on speed vs quality etc., as it's not mentioned anywhere else. I notice the sample audio voice file used to generate audio is about 7 seconds long, mono (not stereo), PCM S16 LE, with a sample rate of 22050Hz and 16 bits per sample.

I'm guessing there are a few factors that may speed up processing.

  • Keeping it the lower quality like the original file.
  • Fewer seconds in length (I think somewhere it says you need 4 to 12 seconds as a sample)

I tried a very simple test using a 22050Hz sample voice and a 44100Hz sample voice (9 second mono sample).

22050Hz > Processing time: 59.185802936553955
44100Hz > Processing time: 125.19529104232788

This was generating the same amount of speech. It's not highly scientific, and it wasn't run over thousands of tests. But it would appear that if you want to use your favourite celebrity voice, you should get a high-quality sample, make it mono, drop its sample rate to 22050Hz and keep it around the 4-9 second mark. (I suspect a shorter voice sample will probably be faster.)

The model outputs 24 kHz mono files, so I presume that is the ideal format for samples as well. I could potentially write code to automatically resample the input files.
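As a rough idea of what such resampling could look like, here is a standard-library-only sketch. It assumes 16-bit mono PCM input and uses plain linear interpolation with no anti-aliasing filter, so it is a naive stand-in rather than what the extension does; a dedicated audio library would likely give better quality. The function names are made up for this example:

```python
import array
import sys
import wave


def resample_linear(samples, src_rate, dst_rate):
    """Resample a sequence of numbers by linear interpolation (no anti-aliasing)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def resample_wav(src_path, dst_path, dst_rate=24000):
    """Convert a 16-bit mono PCM WAV file to `dst_rate` (24 kHz by default)."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        src_rate = src.getframerate()
        pcm = array.array("h", src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    values = resample_linear(pcm, src_rate, dst_rate)
    resampled = array.array("h", (int(round(v)) for v in values))
    if sys.byteorder == "big":
        resampled.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(dst_rate)
        dst.writeframes(resampled.tobytes())


# Usage (placeholder paths):
# resample_wav("voice_44100.wav", "voice_24000.wav")
```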

@RandomInternetPreson
Author

Yeass! You got the repo fixed up, thank you again for making this. It is one of the last missing pieces for AI interactions, the speed and quality is above everything else.

@fbradcdsc

Alright I got it to work! The problem was I installed TTS in textgen and not in the base environment

@kanttouchthis
Owner

Alright I got it to work! The problem was I installed TTS in textgen and not in the base environment

As long as you have textgen activated when running the webui that shouldn't be an issue
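A quick way to check which environment is actually in use is to ask the running interpreter where it lives and whether it can find the TTS package. This sketch uses only the standard library; module_available is a made-up helper name. Run it from the same prompt you use to start the webui (for example after cmd_windows.bat):

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python executable:", sys.executable)
    print("TTS importable:", module_available("TTS"))
```

If the second line prints False, the package was most likely installed into a different environment than the one currently activated.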
