Add MQTTS #24142

Closed
2 tasks done
susnato opened this issue Jun 9, 2023 · 6 comments

Comments

@susnato
Contributor

susnato commented Jun 9, 2023

Model description

MQTTS is a text-to-speech model introduced in the paper A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech. The work explores the use of more abundant real-world data for building speech synthesizers. Its architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. The authors show that MQTTS outperforms existing TTS systems in several objective and subjective measures.

I would like to add this model to HF.

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

Implementation - https://github.com/b04901014/MQTTS
Checkpoints -

  1. Config - https://cmu.box.com/s/hvv06w3yr8mob4csjjaigu5szq2qcjab
  2. Quantize - https://cmu.box.com/s/966rcxkyjps80p7thu0r6lo2udk1ezdm
  3. Transformer model - https://cmu.box.com/s/xuen9o8wxsmyaz32a65fu25cz92a2jei
@susnato
Contributor Author

susnato commented Jun 9, 2023

cc: @sanchit-gandhi and @ArthurZucker

@sanchit-gandhi
Contributor

I think this is a cool model - whether it outperforms Bark (#24086) is up for debate. My only concerns are:

  1. The NC (non-commercial) license, which is not very permissive
  2. The low visibility of the original repo: with only 130 GH stars, it seems the community is not very excited about the model (and is thus unlikely to use it in the library)

While the voice prompting feature would be cool and inference much faster than a hierarchical transformer model like Bark, I think the lack of visibility / excitement around the model means it would be a big effort to add, with maybe little usage as a result.

cc @Vaibhavs10 who has had more experience with MQTTS, @ylacombe who's adding Bark and @hollance who's adding VITS MMS

What do you all think?

@Vaibhavs10
Member

IMO adding MQTTS doesn't make as much sense, purely from a licensing standpoint. Plus, it uses a non-standard quantizer, which makes it difficult to maintain (primarily because it would be used only for MQTTS).

I think a more ambitious idea would be to add tortoise-tts - https://github.com/neonbjb/tortoise-tts (it was released a while back but is still the king). The original repo is not as optimised, so with the transformers bells and whistles we could make sure that it works faster and better?

Another idea would be to add StyleTTS - https://github.com/yl4579/StyleTTS, the results are quite promising and given there is training code as well, it opens up the opportunity to train a bigger model.

@sanchit-gandhi
Contributor

sanchit-gandhi commented Jun 28, 2023

Tortoise TTS would probably go in the diffusers repo, since we could build it as a diffusion pipeline with a transformer encoder. The purpose of diffusers is more pure performance (which is not the objective of transformers), so it would be a good fit there.

Would you like to open a feature request for Tortoise TTS on the diffusers repo and tag myself and @Vaibhavs10? We can then discuss how feasible a new pipeline addition would be!

@susnato
Contributor Author

susnato commented Jun 28, 2023

Thanks a lot for all the insights!

Also, I opened an issue for Tortoise TTS on the diffusers repo: huggingface/diffusers#3891

@sanchit-gandhi
Contributor

Perfect, thanks @susnato! Going to close this then since we're in agreement that MQTTS is not a good addition for transformers. Tortoise TTS issue in diffusers: huggingface/diffusers#3891
