
Fix Whisper Conversion Script: Correct decoder_attention_heads and _download function #26834

Merged: 14 commits merged into huggingface:main on Nov 7, 2023

Conversation

@zuazo (Contributor) commented Oct 16, 2023

What does this PR do?

This PR addresses two issues in the convert_openai_to_hf.py script so that it works correctly.

  1. It corrects the decoder_attention_heads value. This bug did not produce any error, but the converted models did not transcribe correctly.

  2. It also fixes the _download() function:

  • Adds the missing root argument to the _download() call; previously the script failed with:

```
$ python convert_openai_to_hf.py \
  --checkpoint_path tiny \
  --pytorch_dump_folder_path pytorch_model_hf.bin

Traceback (most recent call last):
  File "convert_openai_to_hf.py", line 184, in <module>
    convert_openai_whisper_to_tfms(args.checkpoint_path, args.pytorch_dump_folder_path)
  File "convert_openai_to_hf.py", line 133, in convert_openai_whisper_to_tfms
    original_checkpoint = _download(_MODELS[checkpoint_path])
TypeError: _download() missing 1 required positional argument: 'root'
```
  • Returns the download path instead of the raw model bytes; before this change, the script failed with:

```
$ python convert_openai_to_hf.py \
  --checkpoint_path tiny \
  --pytorch_dump_folder_path pytorch_model_hf.bin

100%|████████████████████████████████| 72.1M/72.1M [00:01<00:00, 41.8MiB/s]
Traceback (most recent call last):
  File "convert_openai_to_hf.py", line 185, in <module>
    convert_openai_whisper_to_tfms(args.checkpoint_path, args.pytorch_dump_folder_path)
  File "convert_openai_to_hf.py", line 137, in convert_openai_whisper_to_tfms
    dimensions = original_checkpoint["dims"]
TypeError: byte indices must be integers or slices, not str
```
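Concretely, the fix passes the download root through to _download() and loads the checkpoint from the resulting path. A minimal sketch of the corrected logic (mirroring the reviewed hunk further down; _download and _MODELS are defined in the script itself, and the wrapper name load_original_checkpoint is hypothetical):

```python
import torch


def load_original_checkpoint(checkpoint_path: str, root: str) -> dict:
    # `_download` and `_MODELS` come from convert_openai_to_hf.py;
    # `root` is the directory the checkpoint is downloaded to.
    if ".pt" not in checkpoint_path:
        checkpoint_path = _download(_MODELS[checkpoint_path], root)
    # Load from the returned path instead of indexing raw bytes.
    return torch.load(checkpoint_path, map_location="cpu")
```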

Before submitting

  • I've read the contributor guideline.
  • This was discussed/approved via a GitHub issue or the forum. Please add a link to it if that's the case.
  • I have updated the documentation with my changes where necessary.
  • I have written any new necessary tests.

Who can review?

@ArthurZucker (Collaborator) left a comment:

Thanks for the fix, I'm quite surprised that this went unnoticed for this long.
It seems like the docstring of the config is also wrong: the tiny model has 6 heads and 4 layers, not 6 layers and 4 heads (if you want to update this as well!)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Comment on lines 134 to 135:

```python
checkpoint_path = _download(_MODELS[checkpoint_path], root)
original_checkpoint = torch.load(checkpoint_path, map_location="cpu")
```
Collaborator:

Actually, with this we end up loading the checkpoint twice, which can be a problem for bigger models. Maybe just adding the root path can do the trick, WDYT?

Contributor Author:

Good point, we are reading the model file twice.

Previously, the _download() function returned raw bytes, not a dict, so the following code did not work:

```python
def _download(url: str, root: str) -> bytes:
    # [...]
    return model_bytes


def convert_openai_whisper_to_tfms(checkpoint_path, pytorch_dump_folder_path):
    if ".pt" not in checkpoint_path:
        original_checkpoint = _download(_MODELS[checkpoint_path])
    else:
        original_checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # Here, it raised the "byte indices must be integers or slices, not str" error mentioned above:
    dimensions = original_checkpoint["dims"]
    state_dict = original_checkpoint["model_state_dict"]
```

I have modified the _download() code to reuse the already-read model_bytes by wrapping them in io.BytesIO and then calling torch.load():

```python
def _download(url: str, root: str) -> dict:
    # [...]
    return torch.load(io.BytesIO(model_bytes))
```

Let me know if you know or find a better approach.
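For context, a minimal end-to-end sketch of a _download() along these lines (the real script also caches the file under root and verifies a SHA256 embedded in the URL; those details are elided and assumed here):

```python
import io
import os
import urllib.request

import torch


def _download(url: str, root: str) -> dict:
    os.makedirs(root, exist_ok=True)
    # Fetch the checkpoint bytes once.
    with urllib.request.urlopen(url) as source:
        model_bytes = source.read()
    # Wrap the bytes in BytesIO so torch.load deserializes from memory,
    # avoiding a second read of the file from disk.
    return torch.load(io.BytesIO(model_bytes), map_location="cpu")
```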

@zuazo (Contributor Author) commented Oct 17, 2023

Thanks for the review!

> Seems like the docstring of the config is also wrong as the tiny has 6 heads but 4 layers not 6 layers 4 heads (if you want to update this as well!)

Nice catch! Indeed, the default values are also wrong, including the d_model (width) value.

I updated the code with the following:

  • Fix the default values of the WhisperConfig to match the Tiny size (see the sketch below).
  • Add a docstring on top with a doctest.
  • Add a shebang and the executable (+x) permission.
  • Change the _download() logic to avoid loading the checkpoint twice (as you advised).

Let me know if you don't agree with any of the changes and thanks again for all the feedback!
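For reference, the corrected Tiny-sized defaults look roughly like this (a sketch showing only the affected WhisperConfig arguments):

```python
from transformers import WhisperConfig

# Tiny-sized defaults after the fix: 4 layers and 6 attention heads per
# stack, with a model width (d_model) of 384.
config = WhisperConfig(
    d_model=384,
    encoder_layers=4,
    encoder_attention_heads=6,
    decoder_layers=4,
    decoder_attention_heads=6,
)
```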

@ArthurZucker (Collaborator) left a comment:

Thanks. @sanchit-gandhi, the default config changes actually break the default initialization, but it's a bug fix. How do you feel about it?

Comment on lines 2 to 46:
"""Converts a Whisper model in OpenAI format to Hugging Face format.

Example:

```bash
# Converts the model from OpenAI to Hugging Face format:
python convert_openai_to_hf.py \
--checkpoint_path tiny \
--pytorch_dump_folder_path whisper-tiny-hf
```

```python
>>> import torchaudio
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from transformers.models.whisper.convert_openai_to_hf import convert_openai_whisper_to_tfms

>>> # Converts the model from OpenAI to Hugging Face format:
>>> convert_openai_whisper_to_tfms("tiny.en", "whisper-tiny.en-hf") # doctest: +IGNORE_RESULT

>>> # Select an audio file:
>>> audio_path = "https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav"

>>> # Load the Whisper model in Hugging Face format:
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("whisper-tiny.en-hf")
>>> model.config.forced_decoder_ids = None

>>> # Load the audio:
>>> waveform, sampling_rate = torchaudio.load(audio_path)

>>> # Use the model and processor to transcribe the audio:
>>> input_features = processor(
... waveform.squeeze().numpy(), sampling_rate=sampling_rate, return_tensors="pt"
... ).input_features

>>> # Generate token ids
>>> predicted_ids = model.generate(input_features)

>>> # Decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

>>> transcription[0]
' Chapter 16. I might have told you of the beginning of this liaison in a few lines'
```
"""
Collaborator:

Mmmm this should rather go in the whisper.md if not already there! I'd rather not have it here.

Contributor Author:

Sure! I moved it to the whisper.md documentation file.

I checked the file with doctest, and it passed.

@sanchit-gandhi (Contributor) left a comment:

Config changes LGTM, but I would be in favour of directly promoting the pre-trained models on the Hub, rather than the conversion script.

Note that we have already converted all official OpenAI checkpoints to Transformers, so users should be able to download them pre-trained from the Hub!

Here is a step-by-step guide to transcribing an audio sample using a pre-trained Whisper model:

```python
>>> import torchaudio
```
Contributor:

We should avoid using extra dependencies like torchaudio in our code snippets - could you maybe refactor this to only use dependencies in the Transformers library?

Contributor Author:

Changed to use datasets.load_dataset.
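For reference, the refactored snippet presumably looks something like this (a sketch based on the final commit message, which names the librispeech_asr_dummy dataset; the exact dataset id is an assumption):

```python
>>> from datasets import load_dataset
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration

>>> # Load a short dummy speech sample instead of a long remote audio file:
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]

>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

>>> input_features = processor(
...     sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
... ).input_features
>>> predicted_ids = model.generate(input_features)
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```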

```python
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration

>>> # Select an audio file:
>>> audio_path = "https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav"
```
Contributor:

Let's use a shorter audio file - this will take a long time to transcribe


```python
>>> # Load the Whisper model in Hugging Face format:
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
>>> model.config.forced_decoder_ids = None
```
Contributor:

No need - this API is deprecated now (cc @ArthurZucker)

Suggested change:

```diff
- >>> model.config.forced_decoder_ids = None
```

Contributor Author:

I didn't know. Removed.

```python
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

>>> transcription[0]
' Chapter 16. I might have told you of the beginning of this liaison in a few lines'
```
Contributor:

We get a different result here from the one we got before ("Chapter 16" only)? Is one of them incorrect?

Contributor Author:

I do not know why, but the transcriptions from HF and OpenAI do not match for this audio. To explore this further, I created a Kaggle notebook and transcribed the audio using Tiny.en from both libraries. This also includes what happens when we convert between formats using the script here (convert_openai_to_hf.py) and the one in #26854 (convert_hf_to_openai.py).

Here are the transcription results:

| Model | Transcription[:30] | SHA256 prefix |
| --- | --- | --- |
| OpenAI | `" Chapter 16 I might have told "` | `80e1a202` |
| Hugging Face | `" Chapter 16."` | `c9f30e2e` |
| OpenAI -> HF | `" Chapter 16. I might have told"` | `090c43de` |
| HF -> OpenAI | `" Chapter 16 I might have told "` | `80e1a202` |

From the results, transcriptions using the OpenAI library are consistent, even after conversion from the HF model, which suggests the conversion is accurate. However, the transcriptions using the HF library differ, both with your model at https://huggingface.co/openai/whisper-tiny.en and after converting the original tiny.en from OpenAI.

I don't know the reason for this; maybe some post-processing step. I have also not tested other model sizes or audio files.

Should I open a new ticket for this?
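For reproducibility, the SHA256 prefixes in the table were presumably computed along these lines (a sketch; the exact normalization applied to the strings is an assumption):

```python
import hashlib


def sha256_prefix(text: str, n: int = 8) -> str:
    # Identical prefixes mean the transcriptions are byte-for-byte equal.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:n]


# Example input only; not the full transcription that was hashed above.
print(sha256_prefix(" Chapter 16."))
```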

```python
' Chapter 16. I might have told you of the beginning of this liaison in a few lines'
```

This step is not usually required if we are using the models already [provided by OpenAI in the Hugging Face Hub](https://huggingface.co/openai).
Contributor:

Let's just promote this directly instead and remove the notes on the conversion script: there is no need for users to use the conversion script, since we've already converted all official Whisper checkpoints on the Hub: https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013

Contributor Author:

OK! Conversion example completely removed.

```
@@ -151,7 +155,7 @@ def convert_openai_whisper_to_tfms(checkpoint_path, pytorch_dump_folder_path):
    encoder_layers=dimensions["n_audio_layer"],
    encoder_attention_heads=dimensions["n_audio_head"],
    decoder_layers=dimensions["n_text_layer"],
    decoder_attention_heads=dimensions["n_text_state"],
```
Contributor:

Surprised we managed to convert the original checkpoints with this bug, @ArthurZucker 🤔 The state dicts surely wouldn't have matched? Maybe we hardcoded this before?

Collaborator:

Yeah, I think I hardcoded the values when converting and then later on made it automatic. I checked by actually re-running the script and saw that this was indeed a typo 🤣 but it's a good sign that no one else tried to convert the checkpoints!

@zuazo (Contributor Author) commented Oct 25, 2023:

Not sure about the history behind this, but looking at the blame, the script was correct at one point:

```python
decoder_attention_heads=dimensions["n_text_head"],
```

Then it was deleted and restored in #20600.

That is where the problem seems to come from, so the original script you used may have worked properly.

Collaborator:

Nice digging! Yep, I think I uploaded an old version that was a few commits behind.

@zuazo (Contributor Author) commented Oct 25, 2023

Thanks for the review, @sanchit-gandhi. I think I have made all the changes requested. Could you please check our comments on the transcription differences?

@sanchit-gandhi (Contributor) left a comment:

LGTM - thanks for fixing this @zuazo and for the explanation on the differences! Since the logits for a single forward pass are equivalent, this is most likely the accumulation of small numerical errors.
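A single-forward-pass equivalence check of the kind mentioned might look like this (a sketch; the local whisper-tiny.en-hf folder comes from the conversion example earlier in the thread, and the tolerance is an assumption):

```python
import torch
from transformers import WhisperForConditionalGeneration

# Compare the logits of the Hub checkpoint and a locally converted one
# on identical dummy inputs.
model_a = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
model_b = WhisperForConditionalGeneration.from_pretrained("whisper-tiny.en-hf")

input_features = torch.zeros(1, 80, 3000)  # dummy log-mel features
decoder_input_ids = torch.tensor([[model_a.config.decoder_start_token_id]])

with torch.no_grad():
    logits_a = model_a(input_features, decoder_input_ids=decoder_input_ids).logits
    logits_b = model_b(input_features, decoder_input_ids=decoder_input_ids).logits

print(torch.allclose(logits_a, logits_b, atol=1e-5))
```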

@ArthurZucker (Collaborator):

If you can just fix the conflict @zuazo we'll be able to merge!

@ArthurZucker (Collaborator) left a comment:

Thanks for the wait!

@ArthurZucker ArthurZucker merged commit 606d908 into huggingface:main Nov 7, 2023
18 checks passed
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request on Nov 19, 2023:

Fix Whisper Conversion Script: Correct decoder_attention_heads and _download function (huggingface#26834)

* Fix error in convert_openai_to_hf.py: "_download() missing 1 required positional argument: root"

* Fix error in convert_openai_to_hf.py: "TypeError: byte indices must be integers or slices, not str"

* Fix decoder_attention_heads value in convert_openai_to_hf.py.

Correct the assignment for `decoder_attention_heads` in the conversion script for the Whisper model.

* Black reformat convert_openai_to_hf.py file.

* Fix Whisper model configuration defaults (for Tiny).

- Correct encoder/decoder layers and attention heads count.
- Update model width (`d_model`) to 384.

* Add docstring to the convert_openai_to_hf.py script with a doctest

* Add shebang and +x permission to the convert_openai_to_hf.py

* convert_openai_to_hf.py: reuse the read model_bytes in the _download() function

* Move convert_openai_to_hf.py doctest example to whisper.md

* whisper.md: Add an inference example to the Conversion section.

* whisper.md: remove `model.config.forced_decoder_ids` from examples (deprecated)

* whisper.md: Remove "## Format Conversion" section; not used by users

* whisper.md: Use librispeech_asr_dummy dataset and load_dataset()