New Tokenizer Loading #276
Hi @IIEleven11, I'm just going to document how it's used in the code / my rough understanding of it. When the "BPE Tokenizer" option is selected, it creates a bpe_tokenizer-vocab.json alongside the dataset. In the Stage 2 training code, when starting up training, we look for the existence of that file. The primary/standard tokenizer is the base model's vocab.json.
Then, when we get as far as initialising the actual trainer, we load that BPE tokenizer file rather than the original vocab.json. So, two things at this point: 1) I am unable to re-create either of the crashes/errors you reported.
So I'm not sure that those errors and the BPE tokenizer are related issues. You can delete the generated bpe_tokenizer-vocab.json if you want to rule it out. 2) This specific bit of code (the addition of the BPE tokenizer) was not code I had added/written. It has been added over two PRs, with the last one being Alltalkbeta enhancements #255. I am not personally claiming to be an expert in the tokenizer setup. So what I am thinking is that I will see if the original submitter of the PR is free/able to get involved in this conversation/issue, and between us we can discuss whether there is an issue with the way it's operating and bounce some ideas around. Does that sound reasonable? Thanks |
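To make the loading behaviour described above concrete, here is a minimal sketch (not the actual AllTalk code) of how the Stage 2 startup could pick between a generated BPE tokenizer and the base model's vocab.json. The bpe_tokenizer-vocab.json filename comes from this thread; the directory arguments are assumptions:

```python
import os


def pick_tokenizer_file(dataset_dir: str, base_model_dir: str) -> str:
    """Return the vocab file to hand to the trainer.

    Prefers a dataset-specific BPE tokenizer if one was generated,
    otherwise falls back to the base model's vocab.json.
    """
    custom = os.path.join(dataset_dir, "bpe_tokenizer-vocab.json")  # name used in this thread
    if os.path.isfile(custom):
        return custom
    return os.path.join(base_model_dir, "vocab.json")
```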
Hmmm, I see... I didn't notice the merge with the original vocab. I'm going to run 4 quick epochs and check again; I may have grabbed the wrong vocab.json. I've been looking into solving this, and the size of the embeddings layer is just the number of key:value pairs in the vocab.json. So I think an easy way to confirm would be to check the created vocab.json's key/value pair count and compare it with the base model's count. They should be different because we're adding vocabulary. I'll check here shortly |
Ok, so they are the same: 6680. This is the vocab pulled from the base model xttsv2_2.0.3 folder. I could be wrong about the size here, though. Maybe that number of key/value pairs must stay the same? |
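For reference, here is a rough way to do the key/value-pair comparison mentioned above. It assumes the files are HuggingFace tokenizers-style JSON with the token-to-id map under model.vocab (it falls back to treating the file as a flat map), and the paths are placeholders:

```python
import json


def vocab_size(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # tokenizers-style files keep the token->id map under model.vocab;
    # otherwise treat the whole file as a flat token->id map.
    vocab = data.get("model", {}).get("vocab", data)
    return len(vocab)


base = vocab_size("xttsv2_2.0.3/vocab.json")              # placeholder paths
merged = vocab_size("finetune/bpe_tokenizer-vocab.json")
print(base, merged, "difference:", merged - base)
```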
My knowledge of this specific aspect is 50/50 at best and only what I've personally read up on, so feel free to take my answer with a pinch of salt. My belief/understanding from https://huggingface.co/learn/nlp-course/en/chapter6/5 and https://huggingface.co/docs/transformers/en/tokenizer_summary was that the BPE tokenizer was purely a "during training" thing. From the first link:
So that should cover it using the original vocab.json and then adding the BPE tokenizer vocab for training. Looking over other code related to BPE tokenizers, I haven't seen any code that specifically alters the original tokenizer during finetuning, bar training an absolutely brand-new model from the ground up. So my belief/understanding was that this BPE tokenizer just extends the vocab.json during the process of finetuning the XTTS model, but shouldn't/doesn't alter the original vocab.json, as that's required to maintain compatibility. As I say, this is my current understanding. Although I'm not saying this is a 100% confirmation, based on the question I've also thrown the finetuning code and other documentation (transformers etc.) at ChatGPT to see what we get there as an answer. The answer there was:

- Training only: the BPE tokenizer defined in bpe_tokenizer-vocab.json would most likely be used only during the training process. This is especially true if it's a custom tokenizer created specifically for the fine-tuning dataset. It allows the model to handle specific vocabulary or tokenization needs of the fine-tuning data without permanently altering the base model's vocabulary.
- Not merged into the final vocab.json: I would not expect the custom BPE tokenizer to be merged into the final vocab.json of the model. The reason is maintaining compatibility: keeping the original vocabulary allows the fine-tuned model to remain compatible with the base model's tokenizer, which is crucial for interoperability and using the model in various downstream tasks.
- Separate tokenizer for inference: if the custom tokenizer is essential for using the fine-tuned model correctly, I would expect it to be saved separately and used in conjunction with the model during inference, rather than being merged into the model's main vocabulary.
- Possible exceptions: there might be cases where merging vocabularies is desirable, such as when the fine-tuning introduces critical new tokens that are essential for the model's new capabilities.

If there is something missing, an issue, or a better way to do something, I'm all ears or happy to figure it out. |
Hmm, that section on "Maintaining Compatibility" I am not sure I can totally agree with. The entire point of training the new tokenizer is to add vocabulary to the base model, making the fine-tuned model more capable of the sounds within our dataset. During inference, if the model has a different vocab.json file, it means it's tokenizing differently than what it was trained on, right? Or it lacks some of the tokenizing rules that the model had during training, so it would in some cases not know how to pronounce certain words or tokens. The size of the model, and therefore the potential impact on performance, I can agree with to some degree, but is that not just a side effect of having a larger vocabulary? I am not sure anything can be done about that at this level of development/use. I am going to ask a friend of mine that's a little more informed on this specific thing and see what he says. |
Sure! Sounds good :) Just FYI, for the last 6 weeks I've had to travel on/off for a family situation and I will be travelling again soon. This can limit my ability to review/test/change code and obviously to respond in any meaningful way. So if/as/when you respond and you don't hear back from me for a bit, it's because I'm travelling, but I will respond when I can. |
Ok, so the person I talked to is the author of the Tortoise voice-cloning repo. XTTSv2 is essentially a child of the Tortoise model, which makes a lot of the code interchangeable or the same. He had this to say: "The error you're getting is size related and is most likely from the model. Are you using the latest XTTS2 model? It should have a larger text embedding table than 3491. It's saying that your new vocab size is too large for the text embeddings, so you're getting a shape mismatch. For Tortoise, my observation is you can train on a tokenizer with fewer tokens than the specified size of the weight, but not more, and more is what seems to be happening here. Oh, also, using the BPE tokenizer addition for training only doesn't make sense, so if this is the implementation, it would be incorrect." This is essentially what I was attempting to get at. So what I think is happening is that during the merging of the vocab.json files either nothing is happening, or for some reason it's still using the original vocab. I can attempt to do this and send a PR. It is maybe going to be a fair amount of work to implement, though. I was reading the commits and saw you had someone else helping with this portion; I am curious about their opinion too, if we could maybe ping him if possible? Here is the script to resize the base Tortoise model: https://github.com/JarodMica/ai-voice-cloning/blob/token_expansion/src/expand_tortoise.py |
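Based on the shape-mismatch explanation above, a quick way to inspect your checkpoint yourself might look like the following sketch. The model.pth path is a placeholder and the exact state-dict key names can vary between checkpoints, so it simply searches for anything embedding-related:

```python
import torch

ckpt = torch.load("xttsv2_2.0.3/model.pth", map_location="cpu")  # placeholder path
state = ckpt.get("model", ckpt)

# Print the shape of anything that looks like the GPT text embedding or text head;
# the first dimension is the number of token rows the checkpoint was built for.
for name, tensor in state.items():
    if "text_embedding" in name or "text_head" in name:
        print(name, tuple(tensor.shape))

# Rule of thumb from the quote above: a vocab *smaller* than that row count trains
# fine, while a vocab *larger* than it produces the shape-mismatch error.
```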
Hi @IIEleven11, I'm currently travelling (as mentioned earlier), so short replies from me atm. I have contacted @bmalaski on the original PR and asked if they get a chance to look over this incident and throw in any comments. Thanks |
No problem, sounds good. I've been testing out a few things; I'll let you know how it goes. Travel safe |
Yeah, when I made that originally I was using knowledge from text tokenizers, which is what I am familiar with. Digging into the code, it's wrong. I have a change we can test to see if that works better, where it creates a BPE tokenizer with the new words and appends them to the original. This can be loaded later without issues, but I don't have time to test how it changes training and generation atm, as I have also been travelling. |
Ok, I have some updates too. You're right: the original tokenization process you're using creates a new vocab that does not follow the structure of the base model. This results in a smaller vocab, and the model ends up speaking gibberish. Also, I successfully wrote the script that expands the base model's embedding layer according to the newly trained/custom vocab.json. So that new script you wrote for the vocab.json, combined with mine to expand the original model, should be all we need to straighten this out. The only question is how @erew123 wants to integrate it exactly. I can send a PR that just pushes a new script, "expand_xtts.py", to the /alltalk_tts/system/ft_tokenizer folder. I will go over the specifics of the script in the PR. |
@bmalaski @IIEleven11 Thanks to both of you on this. @bmalaski, happy to look at that code, but I can't see a recent update on your GitHub. @IIEleven11, happy to go over a PR and test it. I'm guessing this is something that needs to be run at the end of training OR when the model is compacted and moved to its final folder? |
Ah, yeah, sorry, I ran into some issues during testing. I think I've got them, though. I also ended up writing both scripts: one to merge the vocab and one to expand the model. I am going to go through and make verbose comments for you, so you will find more details within them. Give me about an hour and then you'll see the PR. This is all prior to fine-tuning. We are making a new tokenizer/vocab during the whisper transcription phase. The new vocabulary needs to be merged with the base model's vocabulary. This of course makes it bigger, so we need to expand the embeddings layer of the base model so it's capable of fine-tuning with all of that extra vocab. |
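This is not the expand_xtts.py from the PR, but a minimal sketch of the embedding-expansion idea described above: grow the text-embedding (and matching output-head) rows to the merged vocab size while leaving the existing rows untouched. The key names, paths, and new vocab size are assumptions:

```python
import torch


def expand_rows(weight: torch.Tensor, new_rows: int) -> torch.Tensor:
    """Grow a (vocab, dim) matrix to new_rows, keeping the existing rows and
    initialising the extra rows to the mean of the old embeddings."""
    old_rows, _ = weight.shape
    if new_rows <= old_rows:
        return weight
    extra = weight.mean(dim=0, keepdim=True).repeat(new_rows - old_rows, 1)
    return torch.cat([weight, extra], dim=0)


ckpt = torch.load("xttsv2_2.0.3/model.pth", map_location="cpu")  # placeholder path
state = ckpt.get("model", ckpt)
new_vocab_size = 7200  # placeholder: len() of the merged vocab.json

# Assumed key names; inspect your checkpoint and adjust if they differ.
for key in ("gpt.text_embedding.weight", "gpt.text_head.weight"):
    if key in state:
        state[key] = expand_rows(state[key], new_vocab_size)

bias_key = "gpt.text_head.bias"
if bias_key in state and state[bias_key].shape[0] < new_vocab_size:
    bias = state[bias_key]
    state[bias_key] = torch.cat([bias, bias.new_zeros(new_vocab_size - bias.shape[0])])

torch.save(ckpt, "expanded_model.pth")
```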
Ok, I just trained a model and we have a small issue. My merge script has some flawed logic. It's actually a problem I took lightly because it seemed easy enough at first. But if anyone's interested and wants to attempt to solve it, here is what needs to happen. We have our base XTTSv2 model (I'm going to reference the 2.0.3 version here only). This model comes with a vocab.json. The entirety of that base model vocab.json needs to stay intact; this is the most important part. When we make a new bpe_tokenizer.json it doesn't follow this exact mapping of vocab and merges, but it will have some of the same keys and values. So what we want to do is:
So if a key already exists in the base model's vocab.json, we keep that key and its original ID exactly as it is; only keys that don't exist yet get appended, with new IDs continuing on from the end of the base vocabulary.
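A minimal sketch of that merge rule (keep the base mapping intact, append only genuinely new tokens with fresh IDs) could look like this. It assumes both files are tokenizers-style JSON with the vocab under model.vocab, and it leaves the merges list untouched, which is handled separately as noted below:

```python
import json


def merge_vocabs(base_path: str, new_path: str, out_path: str) -> None:
    """Keep every base token/id exactly as-is; append only genuinely new
    tokens, numbering them on from the end of the base vocabulary."""
    with open(base_path, encoding="utf-8") as f:
        base = json.load(f)
    with open(new_path, encoding="utf-8") as f:
        new = json.load(f)

    base_vocab = base["model"]["vocab"]        # token -> id, base mapping preserved
    new_tokens = new["model"]["vocab"].keys()

    next_id = max(base_vocab.values()) + 1
    for token in new_tokens:
        if token not in base_vocab:
            base_vocab[token] = next_id
            next_id += 1

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(base, f, ensure_ascii=False, indent=2)
```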
Next is the merges. I'll be working on this too, just thought I'd share it with an update. I've attached the original 2.0.3 vocab.json for convenience. We can test the outcome by comparing the output of the base model's vocab.json and our newly merged vocab.json:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="/path/to/vocab.json")

# Tokenize
sample_sentence = "This is a test sentence."
tokens = tokenizer.tokenize(sample_sentence)
print(tokens)
```
|
Nvm I did it. |
@IIEleven11 I will take a look at / test the PR as soon as I have my home PC in front of me :) |
For what it's worth, here is what I was testing:
I found that merging in the new merges led to speech issues, with slurred words. |
Yes, so my new merge script, while having no apparent errors, also led to slurred speech and gibberish. So... there is most certainly some nuance we're missing. |
It's the tokenizer.py script I think. We need to make it more aggressive with what it decides to tokenize. |
Ok, so to update: the merge and expand scripts both work as expected, but the creation of the new vocab.json needs to be done the same way that Coqui did it. It would help if someone could locate the script Coqui used to create the vocab.json (I believe it was originally named tokenizer.json). I thought Coqui's tokenizer.py would create the vocab.json, but this isn't the case. I could also be overlooking something; if anyone has any insight, please let me know. I trained a new model and the slurred speech is gone, but it has an English accent for no apparent reason. Something to note is that this issue was brought up with the 2.0.3 model by a couple of people. I have trained on this dataset before, though, and this is the first I am seeing of it, which I would think points to a potential tokenizer problem. |
@IIEleven11 There is a tokenizer.json in the Tortoise scripts: https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/utils/assets/tortoise/tokenizer.json, not sure if that's the one. Or could it be this file: https://github.com/coqui-ai/TTS/blob/dev/tests/inputs/xtts_vocab.json |
Ahh, good find! I have been looking for this forever. I overlooked it because I figured they would at least put it in a different location. That's what I get for assuming; thank you, though. With this I can hopefully get some answers. |
@IIEleven11 I'll hang back on merging anything from the PR in case you find anything. I have just merged one small update from @bmalaski, who found a much faster whisper model which, after testing, really speeds up the initial step one for dataset generation, so just so you are aware, that got merged in. |
I successfully removed all non-English characters from the 2.0.3 tokenizer and trained a model that produced a very clean result, without an accent. This process was a bit more complex and nuanced than I thought it would be, though. Because of this, I think the best option is to give the script a variable that uses either the 2.0.2 or 2.0.3 vocab and base model for the merge and expansion, depending on the choice of the end user, while making sure to mention the potential the 2.0.3 model has for artifacting/an accent of some sort. As for the change in whisper model: the quality of this whole process really relies upon the quality of that transcription. If whisper fails to accurately transcribe the dataset, then when we create a vocab from that dataset it will have a very damaging effect on our fine-tuned model. So while there are most definitely faster options than whisper large-v3, and I can appreciate a balance of speed and accuracy, I don't feel we have room to budge: a poor tokenizer would negate the entire fine-tuning process. Unless of course it is faster and more accurate. |
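For anyone curious, the non-English filtering criterion could be sketched roughly like this (the path and the word-boundary markers are assumptions). Note it only shows the filter itself; in practice the surviving token IDs also need re-packing and the merges pruning to match, which is part of the nuance mentioned above:

```python
import json
import string

ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " ")


def is_english(token: str) -> bool:
    # Strip common BPE word-boundary markers (if present), then keep only
    # tokens made of plain ASCII letters/digits/punctuation.
    stripped = token.replace("Ġ", " ").replace("▁", " ")
    return all(ch in ALLOWED for ch in stripped)


with open("xttsv2_2.0.3/vocab.json", encoding="utf-8") as f:  # placeholder path
    data = json.load(f)

vocab = data["model"]["vocab"]
kept = {tok: idx for tok, idx in vocab.items() if is_english(tok)}
print(f"kept {len(kept)} of {len(vocab)} tokens")
```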
Hey, in this discussion thread I have talked about premade datasets and the BPE tokenizer: is it possible to train the BPE tokenizer from pre-made transcripts instead of using whisper? |
Yes, it doesn't matter what you use to transcribe the audio. The script I wrote will just strip it of all formatting and look at the words before making the vocab. |
Hey, what I'm saying is that I already have the transcripts prepared in advance, so I don't need to use Whisper and can go directly to the training. As I understand it, your script performs the transcription in step 1 using Whisper. By skipping that step, I wouldn't be training the BPE tokenizer either. What I wanted to know/ask for is a way to train the tokenizer in step 2, before the training, instead of in step 1 along with Whisper. As I explained in that thread, I already have a script that formats the transcriptions into the format accepted by AllTalk, so the only thing left would be to use those CSV transcriptions to train the tokenizer. I hope I've explained myself clearly 😅 |
None of my scripts have anything to do with transcription. This whole process would still need to be added into the webui appropriately; as of right now you would have to run each script separately. |
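If you already have transcripts, one possible way to train a standalone BPE tokenizer straight from a pipe-delimited metadata CSV is with the tokenizers library. This is a sketch, not the project's own script; the column layout, vocab size, and output filename are assumptions:

```python
import csv

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def texts_from_csv(path: str):
    # Assumed layout: audio_file|text|... (pipe-delimited); adjust to your format.
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|"):
            if len(row) >= 2:
                yield row[1]


tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=2000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(texts_from_csv("metadata_train.csv"), trainer=trainer)
tokenizer.save("bpe_tokenizer-vocab.json")
```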
🔴 If you have installed AllTalk in a custom Python environment, I will only be able to provide limited assistance/support. AllTalk draws on a variety of scripts and libraries that are not written or managed by myself, and they may fail, error or give strange results in custom built python environments.
🔴 Please generate a diagnostics report and upload the "diagnostics.log" as this helps me understand your configuration.
https://github.com/erew123/alltalk_tts/tree/main?#-how-to-make-a-diagnostics-report-file
diagnostics.log
Describe the bug
A clear and concise description of what the bug is.
The script doesn't load the new custom tokenizer
To Reproduce
Steps to reproduce the behaviour:
Set up the dataset, check "create new tokenizer", proceed through the process and begin training.
Screenshots
If applicable, add screenshots to help explain your problem.
Text/logs
If applicable, copy/paste in your logs here from the console.
config.json from last run. You can see the path/tokenizer loaded
config.json
I can share the output from the terminal when trying to load the incorrect tokenizer if you want, but it just prints the model keys and is a lot of text, then throws the embedding mismatch error.
Desktop (please complete the following information):
AllTalk was updated: [approx. date]. I installed it yesterday
Custom Python environment: [yes/no give details if yes] no
Text-generation-webUI was updated: [approx. date] no
Additional context
Add any other context about the problem here.
The solution is to just load the custom tokenizer instead of the base model tokenizer. You'll probably have to rename it to vocab.json.
We might have to resize the embeddings layer of the base model to accommodate the new embeddings.