Hello, I want to use a Latent Diffusion Model (LDM) to regenerate identity embeddings for speech synthesis. The identity embedding is concatenated with a HuBERT embedding and fed into a decoder for speech synthesis.
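For reference, a minimal sketch of what I mean by the decoder input (the HuBERT feature dimension and frame count here are just illustrative, not my exact shapes):

```python
import torch

# Illustrative shapes only -- the HuBERT feature dimension (768) and frame count (250) are assumptions.
hubert_feats = torch.randn(1, 250, 768)   # (B, T, hubert_dim) content features
identity = torch.randn(1, 128)            # (B, 128) identity embedding

# Broadcast the identity vector across the frame axis and concatenate channel-wise,
# so every decoder frame sees the same identity information.
identity_tiled = identity.unsqueeze(1).expand(-1, hubert_feats.size(1), -1)  # (B, T, 128)
decoder_input = torch.cat([hubert_feats, identity_tiled], dim=-1)            # (B, T, 896)
```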
To train the LDM, I want to use speaker and emotion embeddings as conditions. I simply concatenate these two embeddings and feed them into the UNet as encoder hidden states, so they act as cross-attention conditioning. Is this a valid approach?
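To make the setup concrete, here is a minimal sketch of the kind of conditioned denoiser I have in mind (the class, dimensions, and layer layout are illustrative assumptions, not my actual model; it keeps speaker and emotion as two separate cross-attention tokens rather than one concatenated vector):

```python
import torch
import torch.nn as nn

class CrossAttnDenoiser(nn.Module):
    """Toy denoiser for a 128-dim identity vector, conditioned via cross-attention."""
    def __init__(self, dim=128, cond_dim=256, n_heads=4):
        super().__init__()
        # cond_dim assumes speaker and emotion embeddings share the same size.
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.cond_proj = nn.Linear(cond_dim, dim)        # project condition tokens to model dim
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.out = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x_t, t, spk_emb, emo_emb):
        # x_t: (B, 128) noisy identity embedding, t: (B,) diffusion timestep
        # spk_emb, emo_emb: (B, cond_dim) conditioning vectors
        cond = torch.stack([spk_emb, emo_emb], dim=1)    # (B, 2, cond_dim) -> two condition tokens
        cond = self.cond_proj(cond)                      # (B, 2, dim)
        h = x_t + self.time_mlp(t.float().unsqueeze(-1)) # add timestep embedding
        h = h.unsqueeze(1)                               # (B, 1, dim) single query token
        attn_out, _ = self.attn(h, cond, cond)           # cross-attend over the conditions
        return self.out((h + attn_out).squeeze(1))       # predicted noise, (B, 128)

# Usage with my batch size of 64 (random tensors just to show the shapes):
model = CrossAttnDenoiser()
x_t = torch.randn(64, 128)
t = torch.randint(0, 1000, (64,))
spk, emo = torch.randn(64, 256), torch.randn(64, 256)
noise_pred = model(x_t, t, spk, emo)                     # (64, 128)
```

In this sketch the two conditions stay separate tokens so cross-attention can weight them independently; concatenating them channel-wise into a single token (as I currently do) should also work in principle, as long as the cross-attention dimension matches.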
The LDM trained and produced some results. However, when I use the identity embeddings generated by the LDM, the final decoded speech is pretty bad. Is the model simply not learning enough? Or are the small value differences in the generated embeddings destroying the identity information?
The identity embedding has shape (1, 128), i.e. a 128-dimensional vector. The batch size is currently 64.
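To separate the two possible causes, I was thinking of comparing a generated embedding against the ground-truth one before decoding. A small sketch of that check (assuming the reference embedding is available for the same utterance; the threshold mentioned in the comment is just a guess):

```python
import torch
import torch.nn.functional as F

def embedding_drift(generated, reference):
    """Compare a generated identity embedding with the ground-truth one.

    generated, reference: tensors of shape (1, 128). Returns cosine similarity
    and L2 distance, which indicate how far the LDM output has drifted from
    the embedding the decoder was trained on.
    """
    cos = F.cosine_similarity(generated, reference, dim=-1).item()
    l2 = torch.norm(generated - reference, dim=-1).item()
    return cos, l2

# If cosine similarity is high (e.g. > 0.95) but the decoded speech is still bad,
# the decoder is probably sensitive to small perturbations of the embedding;
# if similarity is low, the LDM itself is under-trained or mis-conditioned.
```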