Questions on text2video? #25
While trying to figure out how to adapt the framework for text2video synthesis, I found that the SpatialTemporalUNet has 8 input channels, as shown in this line:
Then I checked the pipeline inference code and found that the denoising input is actually a concatenation of the noise and the image latents:
My question is: how do we obtain the image_latents if we only use text as input when training a text2video model?
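For reference, here is a minimal sketch of how the 8-channel input is assembled and one common workaround for the text-only case. The tensor names and shapes are illustrative, and the zero-conditioning trick is an assumption about how one *could* handle text2video, not necessarily what this repo does:

```python
import torch

# Illustrative shapes: (batch, 4 latent channels, H/8, W/8);
# the temporal dimension is omitted for clarity.
noisy_latents = torch.randn(2, 4, 32, 32)   # the latents being denoised

# Image-conditioned case: image_latents come from VAE-encoding the
# conditioning frame.
image_latents = torch.randn(2, 4, 32, 32)

# The UNet expects 8 input channels because the two tensors are
# concatenated along the channel dimension (4 + 4 = 8).
unet_input = torch.cat([noisy_latents, image_latents], dim=1)
assert unet_input.shape[1] == 8

# One common approach for pure text2video (an assumption, not confirmed
# for this codebase): replace image_latents with zeros so the
# conditioning channels carry no image information.
text2video_input = torch.cat(
    [noisy_latents, torch.zeros_like(image_latents)], dim=1
)
assert text2video_input.shape[1] == 8
```

If the model was only ever trained with real image_latents in those channels, zeroing them at inference would be out of distribution, which is presumably why the question matters for training.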
Have you made any progress on text2video recently?