MusicGen: Add Stereo Model #27084
Conversation
@@ -75,6 +75,9 @@ class MusicgenDecoderConfig(PretrainedConfig):
            The number of parallel codebooks forwarded to the model.
        tie_word_embeddings(`bool`, *optional*, defaults to `False`):
            Whether input and output word embeddings should be tied.
        audio_channels (`int`, *optional*, defaults to 1
Following the EnCodec naming here:
transformers/src/transformers/models/encodec/configuration_encodec.py
Lines 50 to 51 in d7cb5e1
audio_channels (`int`, *optional*, defaults to 1):
    Number of channels in the audio data. Either 1 for mono or 2 for stereo.
Note that the EnCodec model used is still 1-channel (mono) - it's just the MusicGen model that works in a 2-channel fashion.
Thanks, looks very clean 😉
Is the interleaved format forced by backwards compatibility? Otherwise, storing the codebooks in two lists or tuples would be better IMO (kind of like how audio is stored, no?)
The original model is designed to predict them in an interleaved way:
We could change this to predict left first, then right:
This would require re-shaping the LM head weights and duplicating the pattern mask along the row dimension. Overall, I think the complexity would be similar to the interleaved approach we have now. But predicting the two sets of codebooks as two separate tuples would break compatibility with the existing mono MusicGen, or otherwise complicate the code, since we'd need different inputs / sampling logic depending on whether we're mono or stereo.
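For intuition, here is a small NumPy sketch (illustrative only, not the transformers implementation) showing that the interleaved layout and a left-then-right layout carry the same information, differing only by a row permutation:

```python
import numpy as np

# Illustrative sketch: 4 codebooks per channel, short toy sequences.
num_codebooks, seq_len = 4, 3
left = np.arange(num_codebooks * seq_len).reshape(num_codebooks, seq_len)
right = left + 100

# Interleaved layout: rows ordered [L0, R0, L1, R1, L2, R2, L3, R3]
interleaved = np.empty((2 * num_codebooks, seq_len), dtype=left.dtype)
interleaved[0::2] = left
interleaved[1::2] = right

# Left-then-right layout: rows ordered [L0, L1, L2, L3, R0, R1, R2, R3]
stacked = np.concatenate([left, right], axis=0)

# Either layout recovers the per-channel codebooks with simple slicing.
assert np.array_equal(interleaved[0::2], stacked[:num_codebooks])  # left channel
assert np.array_equal(interleaved[1::2], stacked[num_codebooks:])  # right channel
```

Either way the model still emits 8 codebooks per step; the choice only changes how the rows are ordered and masked.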
Awesome, thanks for explaining!
* [MusicGen] Add stereo model
* safe serialization
* Update src/transformers/models/musicgen/modeling_musicgen.py
* split over 2 lines
* fix slow tests on cuda
What does this PR do?
The original MusicGen model generates mono (1-channel) outputs. It does this by predicting a set of 4 codebooks at each generation step:
After generating, the sequence of predicted codebooks is passed through the EnCodec model to get the final waveform.
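As a rough sketch of how all 4 codebooks can be predicted at each step (the function name and pad value below are hypothetical, not the actual transformers implementation), MusicGen uses a delay pattern in which codebook k is shifted right by k steps:

```python
import numpy as np

PAD = -1  # hypothetical placeholder for the pad/mask token id

def build_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """Shift codebook k right by k steps, padding the gaps.

    codes: (num_codebooks, seq_len)
    returns: (num_codebooks, seq_len + num_codebooks - 1)
    """
    num_codebooks, seq_len = codes.shape
    out = np.full((num_codebooks, seq_len + num_codebooks - 1), PAD, dtype=codes.dtype)
    for k in range(num_codebooks):
        out[k, k : k + seq_len] = codes[k]
    return out

codes = np.arange(8).reshape(4, 2)  # 4 codebooks, 2 timesteps
delayed = build_delay_pattern(codes)
```

With the delay applied, position t of the delayed sequence contains codebook 0 at step t, codebook 1 at step t-1, and so on, which lets one forward pass emit a token for every codebook.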
This PR adds the MusicGen stereo model. It works by predicting two sets of codebooks at each step. One set of codebooks corresponds to the left channel, the other set corresponds to the right channel. The sets of codebooks are interleaved as follows:
After generating, the sequence of generated codebooks is partitioned into its left/right parts, and each sequence is then passed through EnCodec to get the left/right waveform respectively.
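A minimal sketch of this post-processing step (the function names are hypothetical stand-ins, not the actual transformers API):

```python
import numpy as np

def split_stereo_codebooks(codes: np.ndarray):
    """Partition interleaved rows [L0, R0, L1, R1, ...] into left/right halves.

    codes: (2 * num_codebooks, seq_len)
    """
    return codes[0::2], codes[1::2]

def decode_stereo(codes: np.ndarray, decode_with_encodec) -> np.ndarray:
    """Decode each channel separately and stack into a stereo waveform.

    `decode_with_encodec` is a hypothetical stand-in for the mono EnCodec
    decode step; it maps (num_codebooks, seq_len) -> (num_samples,).
    """
    left, right = split_stereo_codebooks(codes)
    return np.stack([decode_with_encodec(left), decode_with_encodec(right)])  # (2, num_samples)
```

Note that the EnCodec model itself stays mono; it is simply invoked once per channel.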