Finetuning/Training code? #64
-
I see that the code used to train the model is not included in the repo. Are there any plans to publish it? It would be very useful for fine-tuning the model and getting benchmarks beyond the 0-shot transfer. Thank you very much!
-
Definitely, especially for languages other than English which have had much less training so far!
-
We currently don't have plans to release training/fine-tuning code, but there might be an implementation from the community soon.
-
Found some fine-tuning code on HF (not tested).
-
Fine-tuning code for Japanese kana: https://colab.research.google.com/drive/1P4ClLkPmfsaKn2tBbRp0nVjGMRKR-EWz?usp=sharing I haven't gone through the code in detail, but it appears to work.
-
I have a question about how to train with the prompt (not implemented by @k-washi as far as I understand). If I understand correctly, a training sample in the batch could look like this?

decoder_input = [PREV, p[0], p[1], ..., p[n], EN, TRANSCRIBE, NO_TIMESTAMPS, t[0], t[1], ..., t[m]]
label         = [  -1,   -1,  ...,   -1,      EN, TRANSCRIBE, NO_TIMESTAMPS, t[0], t[1], ..., t[m], EOT]

where p[0..n] are the tokens of the prompt (previous context), t[0..m] are the tokens of the target transcript, and -1 marks positions that are ignored when computing the loss.
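For what it's worth, here is a minimal sketch of how such a sample could be assembled with the openai-whisper tokenizer. This is only my assumption, not the official training code: build_prompt_sample, prompt_text and target_text are hypothetical names, and I include the full SOT sequence (<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>) between the prompt and the target rather than just EN/TRANSCRIBE.

```python
import torch
from whisper.tokenizer import get_tokenizer

IGNORE_INDEX = -1  # positions masked out of the cross-entropy loss


def build_prompt_sample(prompt_text: str, target_text: str):
    """Hypothetical helper: one (decoder_input, label) pair for prompt-conditioned training."""
    tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

    prompt_tokens = tokenizer.encode(" " + prompt_text.strip())   # p[0] ... p[n]
    target_tokens = tokenizer.encode(" " + target_text.strip())   # t[0] ... t[m]
    sot_sequence = list(tokenizer.sot_sequence_including_notimestamps)

    # <|startofprev|> + prompt + SOT sequence + target
    decoder_input = [tokenizer.sot_prev] + prompt_tokens + sot_sequence + target_tokens

    # Labels are the decoder inputs shifted left by one, with EOT appended.
    # The prompt region is masked with -1 so no loss is computed on the context.
    labels = decoder_input[1:] + [tokenizer.eot]
    num_masked = len(prompt_tokens)
    labels[:num_masked] = [IGNORE_INDEX] * num_masked

    return torch.tensor(decoder_input), torch.tensor(labels)
```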
-
Check out this blog for fine-tuning Whisper for multilingual ASR with Hugging Face Transformers: https://huggingface.co/blog/fine-tune-whisper It provides a step-by-step guide to fine-tuning, right from data preparation to evaluation 🤗 There's a Google Colab so you can also run it as a notebook 😉
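For readers who only want the shape of the training loop from that post, a heavily condensed sketch is below. train_dataset and data_collator are placeholders for the objects the blog builds during data preparation, and the hyperparameters are only indicative.

```python
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

# Processor pairs the feature extractor (audio -> log-Mel) with the tokenizer (text -> labels).
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=4000,
    fp16=True,
    predict_with_generate=True,
)

# train_dataset and data_collator are assumed to be prepared as in the blog post.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
)
trainer.train()
```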
-
@sanchit-gandhi
-
How is it possible to add a new layer on top of Whisper itself and change the task to something more specific? I still can't understand how to do that.
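Not an official recipe, but one common pattern is to take Whisper's encoder and put a small task-specific head on top of it. The sketch below assumes Hugging Face Transformers; WhisperWithClassifier, freezing the encoder, and the mean pooling are all illustrative choices.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel


class WhisperWithClassifier(nn.Module):
    """Hypothetical example: frozen Whisper encoder + a linear classification head,
    e.g. for a domain-specific audio-classification task."""

    def __init__(self, num_labels: int, checkpoint: str = "openai/whisper-base"):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(checkpoint).encoder
        for param in self.encoder.parameters():  # optionally keep Whisper frozen
            param.requires_grad = False
        self.head = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        # input_features: (batch, 80, 3000) log-Mel spectrograms from WhisperFeatureExtractor
        hidden = self.encoder(input_features).last_hidden_state  # (batch, frames, d_model)
        pooled = hidden.mean(dim=1)                               # simple mean pooling
        return self.head(pooled)                                  # (batch, num_labels)
```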
-
Whisper recognizes audio to text almost perfectly for me! But it doesn't know some very specific terms, names, and abbreviations from my domain. How can I supply Whisper with a vocabulary list, or train (fine-tune) it so that it recognizes special terms as well as general English words? How can the "initial_prompt" parameter help me with that?
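As far as I understand, initial_prompt only biases decoding for the current file (it is fed to the decoder as preceding context), so it can nudge Whisper toward your spellings but does not permanently teach it new words; for that you would fine-tune on in-domain data. A small sketch with the openai-whisper package, where the glossary string and file name are made up:

```python
import whisper

model = whisper.load_model("medium")

# Hypothetical domain terms: the prompt biases decoding toward this vocabulary
# for this transcription only; it does not modify the model weights.
domain_terms = "Kubernetes, etcd, kubelet, Istio, Prometheus, Grafana"

result = model.transcribe(
    "meeting.wav",
    initial_prompt=f"Glossary: {domain_terms}.",
)
print(result["text"])
```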
-
Hi!
-
@sanchit-gandhi How can I make sure that sequential fine-tuning works (fine-tuning on language X followed by fine-tuning on language Y)? I have tried fine-tuning on Language 1 and then on Language 2. Since during fine-tuning we are only learning to predict the tokens of the given language, it shouldn't affect the performance of other languages if they don't share a similar script. However, on evaluation with test data, the WER for Language 1 increased after fine-tuning its last checkpoint on Language 2, and I can see that the predictions come out in Language 2 instead of Language 1.
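Not a fix for the forgetting itself, but when evaluating Language 1 it may help to pin the language token at generation time so decoding cannot drift into Language 2. A rough sketch with Hugging Face Transformers, using Hindi as a stand-in for Language 1 and a zero tensor in place of real features:

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

checkpoint = "openai/whisper-small"  # replace with your checkpoint after the Language 2 stage
processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

# Force <|hi|><|transcribe|> at the start of generation so the decoder
# transcribes in Language 1 rather than drifting into Language 2.
forced_ids = processor.get_decoder_prompt_ids(language="hindi", task="transcribe")

# Real features would come from processor(audio_array, sampling_rate=16000,
# return_tensors="pt").input_features; a zero tensor keeps the sketch self-contained.
input_features = torch.zeros(1, 80, 3000)

predicted_ids = model.generate(input_features, forced_decoder_ids=forced_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```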
-
Regarding the Japanese kana notebook: does anyone have a small sample of the dataset format? (I won't be training on Japanese, so downloading the entire dataset just for reference seems excessive....)
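If it is only the expected schema you are after, one option is to stream a handful of rows instead of downloading the whole dataset. A sketch with 🤗 Datasets; the Common Voice Japanese config is just a guess at what the notebook uses, and it requires accepting the dataset terms on the Hub:

```python
from itertools import islice

from datasets import load_dataset

# Placeholder dataset/config: substitute whatever the kana notebook actually loads.
stream = load_dataset(
    "mozilla-foundation/common_voice_11_0", "ja", split="train", streaming=True
)

# Pull just a few rows to inspect the schema without downloading the full set.
for example in islice(stream, 3):
    print({key: type(value).__name__ for key, value in example.items()})
    print(example["sentence"])
```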
-
@sanchit-gandhi, is it possible to fine-tune for Japanese if my dataset is Japanese audio paired with English translations? Can one fine-tune just for task=translation? Or am I totally missing the right flow :) Thanks
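In principle that should work: with the Hugging Face fine-tuning flow, the main change from the transcription recipe is to build the processor with task="translate" and to use the English translations as the label text. A rough sketch, where the "audio" and "translation" column names are placeholders for your dataset:

```python
from transformers import WhisperProcessor

# Japanese audio in, English text out: the tokenizer will prefix <|ja|> <|translate|>.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Japanese", task="translate"
)


def prepare_example(batch):
    # "audio" and "translation" are placeholder column names.
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["translation"]).input_ids
    return batch
```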