Questions on text2video? #25

Open
hitsz-zuoqi opened this issue Jan 17, 2024 · 5 comments

Comments

@hitsz-zuoqi

While trying to figure out how to adapt the framework for text2video synthesis, I found that the SpatialTemporalUNet has 8 input channels, as defined in these lines:


    @register_to_config
    def __init__(
        self,
        sample_size: Optional[int] = None,
        in_channels: int = 8,
        out_channels: int = 4,
        down_block_types: Tuple[str] = (

Then I checked the pipeline's inference code and found that the denoising input is actually a concatenation of the noisy latents and the image latents:


# Concatenate image_latents over the channels dimension
# (latents are [batch, frames, channels, height, width], so channels is dim=2)
latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)

My question is: how do we obtain the image_latents if we only use text as input when training a text2video model?
Also, have you made any recent progress on text2video?
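
For context, my reading of the img2vid pipeline is that image_latents is obtained by VAE-encoding the conditioning image and repeating it across frames. A minimal sketch (illustrative names, not the pipeline's exact code):

    import torch

    def make_image_latents(vae, image, num_frames, do_classifier_free_guidance=True):
        # Encode the single conditioning image into a 4-channel latent.
        image_latents = vae.encode(image).latent_dist.mode()  # [B, 4, h, w]
        # Repeat the same latent for every frame of the video.
        image_latents = image_latents.unsqueeze(1).repeat(1, num_frames, 1, 1, 1)  # [B, F, 4, h, w]
        if do_classifier_free_guidance:
            # The unconditional branch gets all-zero image latents.
            image_latents = torch.cat([torch.zeros_like(image_latents), image_latents])
        return image_latents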

@pixeli99
Owner

This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. I've tried changing the conv_in of the UNet to 4 channels, but so far I haven't succeeded in training that way; the model can't generate normal videos (everything is a hazy expanse...).
If anyone has any suggestions, feel free to share them here, and I will give them a try.
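
For concreteness, the conv_in change looks roughly like the sketch below. One variant I haven't verified: instead of re-initializing the layer, keep the pretrained weights for the first 4 input channels, which (given the concat order in the pipeline) are the ones that saw the noisy latents:

    import torch
    import torch.nn as nn

    def shrink_conv_in(unet):
        # Replace the 8-channel conv_in with a 4-channel one, copying the
        # pretrained weights for the noisy-latent channels (the first 4).
        old_conv = unet.conv_in
        new_conv = nn.Conv2d(
            4,
            old_conv.out_channels,
            kernel_size=old_conv.kernel_size,
            padding=old_conv.padding,
        )
        with torch.no_grad():
            new_conv.weight.copy_(old_conv.weight[:, :4])
            new_conv.bias.copy_(old_conv.bias)
        unet.conv_in = new_conv
        # The model config's in_channels should be updated to 4 as well.
        return unet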

@hitsz-zuoqi
Author

> This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. I've tried changing the conv_in of the UNet to 4 channels, but so far I haven't succeeded in training that way; the model can't generate normal videos (everything is a hazy expanse...). If anyone has any suggestions, feel free to share them here, and I will give them a try.

Yes, I observed the same phenomenon with my modification. Some results from my fine-tuning on Objaverse look like this:
Prompt: "a desk"
[image: step_23500_val_img_0_a-desk]

Prompt: "a sofa"
[image: step_23500_val_img_0_a-sofa]

At the beginning of training, the sampling results were:
Prompt: "a desk"
[image: step_1_val_img_0_a-desk]
Prompt: "a sofa"
[image: step_1_val_img_0_a-sofa]

Judging from the training performance, changing the conv_in of the UNet to 4 channels makes it nearly equivalent to training from scratch for my task.
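
Another option I haven't tried yet (just a sketch): keep the pretrained 8-channel conv_in and feed zeros in place of image_latents during text2video training, so none of the pretrained weights are discarded:

    import torch

    batch, num_frames, channels, h, w = 1, 14, 4, 64, 64
    noisy_latents = torch.randn(batch, num_frames, channels, h, w)
    # All-zero stand-in for image_latents, matching the noisy latents' shape.
    zero_image_latents = torch.zeros_like(noisy_latents)
    # Same channel-wise concat as in the img2vid pipeline.
    model_input = torch.cat([noisy_latents, zero_image_latents], dim=2)  # [1, 14, 8, 64, 64]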

@liiiiiiiiil

> Yes, I observed the same phenomenon with my modification. Some results from my fine-tuning on Objaverse look like this: Prompt: "a desk" [image: step_23500_val_img_0_a-desk] Prompt: "a sofa" [image: step_23500_val_img_0_a-sofa]
> At the beginning of training, the sampling results were: Prompt: "a desk" [image: step_1_val_img_0_a-desk] Prompt: "a sofa" [image: step_1_val_img_0_a-sofa]
> Judging from the training performance, changing the conv_in of the UNet to 4 channels makes it nearly equivalent to training from scratch for my task.

The first two videos look very good. How did you do that?

@pixeli99
Owner

It looks like it's working well. May I ask how many steps this was trained for?

@CallMeFrozenBanana

> Yes, I observed the same phenomenon with my modification. Some results from my fine-tuning on Objaverse look like this: Prompt: "a desk" [image: step_23500_val_img_0_a-desk] Prompt: "a sofa" [image: step_23500_val_img_0_a-sofa]
> At the beginning of training, the sampling results were: Prompt: "a desk" [image: step_1_val_img_0_a-desk] Prompt: "a sofa" [image: step_1_val_img_0_a-sofa]
> Judging from the training performance, changing the conv_in of the UNet to 4 channels makes it nearly equivalent to training from scratch for my task.

It seems the text2video and img2video models use different latent spaces. By the way, which model are you fine-tuning on the Objaverse dataset? It looks like it works...?
