Hello, I want to use a Latent Diffusion Model (LDM) to regenerate identity embeddings for speech synthesis. The identity embedding is concatenated with a HuBERT embedding and fed into a decoder for speech synthesis.
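For reference, a minimal sketch of what I mean by the decoder input (the HuBERT feature dimension and frame count here are just illustrative, not my exact shapes):

```python
import torch

# Illustrative shapes only -- the HuBERT feature dimension (768) and frame count (250) are assumptions.
hubert_feats = torch.randn(1, 250, 768)   # (B, T, hubert_dim) content features
identity = torch.randn(1, 128)            # (B, 128) identity embedding

# Broadcast the identity vector across the frame axis and concatenate channel-wise,
# so every decoder frame sees the same identity information.
identity_tiled = identity.unsqueeze(1).expand(-1, hubert_feats.size(1), -1)  # (B, T, 128)
decoder_input = torch.cat([hubert_feats, identity_tiled], dim=-1)            # (B, T, 896)
```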
To train the LDM, I want to use speaker and emotion embeddings as conditions. I simply concatenate these two embeddings and feed them into the UNet as encoder hidden states, so they act as cross-attention conditioning. Is this a valid approach?
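To make the setup concrete, here is a minimal sketch of the kind of conditioned denoiser I have in mind (the class, dimensions, and layer layout are illustrative assumptions, not my actual model; it keeps speaker and emotion as two separate cross-attention tokens rather than one concatenated vector):

```python
import torch
import torch.nn as nn

class CrossAttnDenoiser(nn.Module):
    """Toy denoiser for a 128-dim identity vector, conditioned via cross-attention."""
    def __init__(self, dim=128, cond_dim=256, n_heads=4):
        super().__init__()
        # cond_dim assumes speaker and emotion embeddings share the same size.
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.cond_proj = nn.Linear(cond_dim, dim)        # project condition tokens to model dim
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.out = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x_t, t, spk_emb, emo_emb):
        # x_t: (B, 128) noisy identity embedding, t: (B,) diffusion timestep
        # spk_emb, emo_emb: (B, cond_dim) conditioning vectors
        cond = torch.stack([spk_emb, emo_emb], dim=1)    # (B, 2, cond_dim) -> two condition tokens
        cond = self.cond_proj(cond)                      # (B, 2, dim)
        h = x_t + self.time_mlp(t.float().unsqueeze(-1)) # add timestep embedding
        h = h.unsqueeze(1)                               # (B, 1, dim) single query token
        attn_out, _ = self.attn(h, cond, cond)           # cross-attend over the conditions
        return self.out((h + attn_out).squeeze(1))       # predicted noise, (B, 128)

# Usage with my batch size of 64 (random tensors just to show the shapes):
model = CrossAttnDenoiser()
x_t = torch.randn(64, 128)
t = torch.randint(0, 1000, (64,))
spk, emo = torch.randn(64, 256), torch.randn(64, 256)
noise_pred = model(x_t, t, spk, emo)                     # (64, 128)
```

In this sketch the two conditions stay separate tokens so cross-attention can weight them independently; concatenating them channel-wise into a single token (as I currently do) should also work in principle, as long as the cross-attention dimension matches.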
The LDM trained and produced some results. However, when I use the identity embeddings generated by the LDM, the final decoded speech is pretty bad. Is the model simply not learning enough? Or are the small value differences in the generated embeddings destroying the identity information?
The identity embedding has shape (1, 128), i.e. a 128-dimensional vector. The batch size is currently 64.
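To separate the two possible causes, I was thinking of comparing a generated embedding against the ground-truth one before decoding. A small sketch of that check (assuming the reference embedding is available for the same utterance; the threshold mentioned in the comment is just a guess):

```python
import torch
import torch.nn.functional as F

def embedding_drift(generated, reference):
    """Compare a generated identity embedding with the ground-truth one.

    generated, reference: tensors of shape (1, 128). Returns cosine similarity
    and L2 distance, which indicate how far the LDM output has drifted from
    the embedding the decoder was trained on.
    """
    cos = F.cosine_similarity(generated, reference, dim=-1).item()
    l2 = torch.norm(generated - reference, dim=-1).item()
    return cos, l2

# If cosine similarity is high (e.g. > 0.95) but the decoded speech is still bad,
# the decoder is probably sensitive to small perturbations of the embedding;
# if similarity is low, the LDM itself is under-trained or mis-conditioned.
```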