Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenAI to HF: Add large-v3 to conversion script #27335

Closed

Conversation

jstoone
Copy link

@jstoone jstoone commented Nov 7, 2023

What does this PR do?

It adds the new large-v3 to the script which converts OpenAI model to HF.

The URL is copied from the PR releasing large-v3: openai/whisper#1761

I can't find any tests related to the conversion script, so I've gone ahead and skipped that. Let me know, if I've missed something.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sanchit-gandhi Do you know what has changed, since the model checkpoint seems to be a byte array instead of a directory?

@jstoone jstoone marked this pull request as draft November 7, 2023 07:17
@jstoone jstoone changed the title Feature/convert whisper large v3 OpenAI to HF: Add large-v3 to conversion script Nov 7, 2023
@jstoone
Copy link
Author

jstoone commented Nov 7, 2023

I've converted PR to draft, as I'm currently getting the following exception:

Traceback (most recent call last):
  File "/Users/jstoone/Sites/huggingface/transformers/src/transformers/models/whisper/convert_openai_to_hf.py", line 185, in <module>
    convert_openai_whisper_to_tfms(args.checkpoint_path, args.pytorch_dump_folder_path)
  File "/Users/jstoone/Sites/huggingface/transformers/src/transformers/models/whisper/convert_openai_to_hf.py", line 137, in convert_openai_whisper_to_tfms
    dimensions = original_checkpoint["dims"]
                 ~~~~~~~~~~~~~~~~~~~^^^^^^^^
TypeError: byte indices must be integers or slices, not str

@amyeroberts
Copy link
Collaborator

Hi @jstoone, thanks for opening this PR and contributing to the library!

There's some other PRs relating to this

I believe #26834 tackles the issue being seen here. As #27336 is a more complete update - handling the addition of cantonese - then that will be the PR to be merged in.

cc @ArthurZucker for reference.

@jstoone
Copy link
Author

jstoone commented Nov 7, 2023

@amyeroberts That's amazing! The community moves so darn fast, it's amazing to see. I'll go ahead and close this one then. Thanks!

@jstoone jstoone closed this Nov 7, 2023
@jstoone jstoone deleted the feature/convert-whisper-large-v3 branch November 7, 2023 21:48
@gerrynjenny
Copy link

gerrynjenny commented Nov 8, 2023

@amyeroberts

As #27336 is a more complete update - handling the addition of cantonese - then that will be the PR to be merged in.

Maybe I'm not understanding you, but the PR appears to only remove mention of Cantonese

@amyeroberts
Copy link
Collaborator

amyeroberts commented Nov 8, 2023

@gerrynjenny The diff in the open PR removes instructions in the docs mentioning that you have to specify 100 languages. This is because it's now inferred from the checkpoint when converting. The PR previously included logic which added cantonese as a language in the tokenizer mapping. However that has already been merged in with #27338

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants